Journal of an errant developer

Troubleshooting MongoDB query performance

Troubleshooting database performance can be time consuming and usually requires deep expertise of the database engine. Fortunately, MongoDB is actually quite easy to troubleshoot for us common muggles. There are a lot of tools and sources of information available to understand and assess your performance (e.g. Cloud Manager, Kibana, mtools, …), but the most useful and widely available tool is by far the MongoDB log file.


MongoDB logs, a must-read!

Every slow query (>100ms by default) is profiled in the MongoDB logs, which makes the log file a rich source of information to assess the current state of your environment. In it, you will find information about potential issues & events (e.g. server restarts), connections & sessions (e.g. how many connections are open to the database) and query performance (e.g. how many slow queries are running).
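The 100ms threshold (slowms) can be checked and adjusted straight from the mongo shell. A minimal sketch, assuming you only want more operations to show up in the log without turning the full profiler on:

// Show the current profiling level and the slow operation threshold (slowms)
db.getProfilingStatus()

// Keep the profiler off (level 0) but lower the slow query threshold to 50ms
db.setProfilingLevel(0, 50)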

For example, every time the database detects a slow query, it will log a new entry with the following shape:

Here we have a slow query that took 806ms to complete. It was executed against the euph.health collection with the following command: db.euph.find({ myField: "Deesh, da must be a slow query, ya!" }). Along with the query shape, we have profiling information explaining how MongoDB executed it:

  • MongoDB performed a Collection Scan for this query (planSummary: COLLSCAN). A Collection Scan is the equivalent of a Full Table Scan, well it is actually much worse but that is a tale for another day
  • MongoDB had to fetch and analyze 42601 documents for this query (docsExamined:42601)
  • No documents were returned by this query (nreturned:0)
  • MongoDB yielded the query lock a few times (numYields:45). Typically, a query will yield its lock when it needs to wait for disk I/O, in order to avoid blocking other queries in the meantime.

The problem here is that MongoDB had to perform a Collection Scan. You can quickly fix this issue by creating the proper index on { myField: 1 }.
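For example, assuming the collection really is named euph as in the command quoted above (adapt the names to your own schema), the fix and its verification look like this:

// Build the missing index so the filter on myField no longer needs a collection scan
db.euph.createIndex({ myField: 1 })

// Re-run the query with explain to confirm the new plan uses the index (IXSCAN)
db.euph.find({ myField: "Deesh, da must be a slow query, ya!" }).explain("executionStats")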

If you need more information about the profiler output, I highly recommend reading https://docs.mongodb.com/v3.2/reference/database-profiler/.


mtools is your best friend!

Manually reading log files can become quite tedious as they tend to be quite large. There is a tool available to help you take advantage of these files and assess the health of your environment: mtools. It is actually a set of tools, not officially supported but mainly developed by the MongoDB support team, to quickly visualize, aggregate and filter log files. For example, you can use mloginfo --queries $logpath to group slow queries and highlight which ones consume the most resources, or you can use mplotqueries $logpath to turn your log file into a graph.

mloginfo mongod.log --queries

mplotqueries mongod.log
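If you want to zoom in on the slowest operations only, mtools also ships with mlogfilter. A quick sketch (the 1000ms threshold is an arbitrary choice):

# Keep only the log entries that took more than 1000ms
mlogfilter mongod.log --slow 1000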


For more information about mtools, I invite you to read: http://blog.mongodb.org/post/85123256973/introducing-mtools and https://github.com/rueckstiess/mtools/


Typical bottlenecks

Ok, so now we know how to read & explore log files and how to identify queries that are not performing as well as they could. If we detect a slow operation, we need to understand its cause in order to fix it. Let's try a differential diagnosis: what could cause slowness? Resource contention, lock contention, an inefficient execution plan, or network and disk latency.

Inefficient execution plan

This is the most common issue observed in the field. MongoDB needs proper indexes to efficiently search through the documents. When the database cannot use an index to support a query, or when the existing indexes are not optimal, MongoDB needs to fetch and examine additional documents. An efficient index for a query is easily identifiable: the number of documents returned by the query is equal to the number of documents fetched from the collection and to the number of index keys examined. You can spot an inefficient plan by looking at the plan summary in your log file, or by calling the explain('executionStats') method on your query: with an efficient plan, nReturned = totalKeysExamined = totalDocsExamined.

Example of an optimal index for a query:
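The original example is not reproduced here, so here is a rough sketch reusing the euph collection from above. With an index on { myField: 1 } and an equality filter on that same field, the three counters line up:

// With an index on { myField: 1 }, an equality filter on myField examines exactly the keys
// and documents it returns: expect nReturned == totalKeysExamined == totalDocsExamined
db.euph.find({ myField: "some value" }).explain("executionStats").executionStats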

Example of a sub-optimal index for a query:
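Again only a sketch (anotherField is a hypothetical field, not taken from the original post): if the query also filters on a field that is not in the index, MongoDB has to fetch and discard documents, so totalDocsExamined ends up larger than nReturned:

// Only { myField: 1 } exists: every document matching myField is fetched,
// then filtered on anotherField, so totalDocsExamined > nReturned
db.euph.find({ myField: "some value", anotherField: 42 }).explain("executionStats").executionStats

// A compound index covering both predicates would restore the 1:1:1 ratio
// db.euph.createIndex({ myField: 1, anotherField: 1 })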

Lock contention

Some operations might block other operations and impact the performance of your environment. For example, if you don't build your indexes in a rolling fashion but directly on the primary, the build will block all reads and writes on your database if the background option is omitted. This would generate a lot of slow queries, as they need to wait for the blocking operation to complete. Also, be aware that if you are using the MMAPv1 storage engine, write operations require a collection- or database-level lock, which means that you cannot perform two write operations on the same collection concurrently as they will need the same lock.
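For the MongoDB versions this post targets (pre-4.2), the background option is what prevents an index build on a live primary from blocking everything else for the whole build. A minimal sketch, with someField as a placeholder name:

// Foreground build (the default): blocks reads and writes on the database until it finishes
// db.euph.createIndex({ someField: 1 })

// Background build: slower, but other operations keep running while the index is built
db.euph.createIndex({ someField: 1 }, { background: true })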

This bottleneck is easily identifiable from MongoDB's log file or from the output of the profiler: the time spent waiting to acquire the required locks is reported when it is abnormally long. You can find this information under locks.*.timeAcquiringMicros.

Resource contention

Now that's becoming a little bit trickier. To detect resource contention, we need another tool, as MongoDB does not provide resource information in the profiler output or in the log file. If you suspect resource contention, you will need to monitor disk activity and CPU utilization. The most easily available and common tools for that are sar and iostat.

CPU contention is easily identifiable: your CPU is almost 0% idle. Disk contention is harder to spot. MongoDB writes to the data files in batches, so it is expected that the disk may be maxed out during these checkpoints (flushes); nonetheless, some operations still need to access the disk: reads might have to load the required pages into memory, and writes might have to wait for the journal to be flushed, depending on your write concern. I personally like to focus on the await column, the average time in milliseconds it takes for I/O requests to be served, as this metric has a direct impact on the latency of operations that require I/O. Memory contention is even trickier, as it manifests as excessive access to disk. To detect it, %iowait, the percentage of idle time during which the system had an outstanding disk I/O request, is a good indicator.
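As a quick sketch (the 2-second interval is arbitrary), the following commands expose those columns on most Linux systems:

# Extended per-device statistics every 2 seconds: watch the await and %util columns
iostat -x 2

# CPU utilization every 2 seconds: watch %idle and %iowait
sar -u 2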


Let’s practice

Case 1: sub-optimal execution plan

Here you can see that the database had to fetch 42601 documents (docsExamined) to return 0 documents (nreturned), which is a sign of a missing index. Creating the proper index should drastically improve performance.

Case 2: Blocking operations and lock contention

Here we can see that the slowness of the query is related to a lock, as this query waited 528ms to acquire it (timeAcquiringMicros: { r: 528902 }). So what is causing this delay? If we use mplotqueries, we notice that every slow query correlates with another operation: db.eval.

Hum, suspicious, we should look at the documentation! Touché: by default, the eval command takes a global write lock. Using the nolock flag in our eval command should fix this slowness.
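As an illustration (the JavaScript body below is a placeholder, not the actual function from this incident), nolock is passed as an option of the eval command:

// eval is deprecated, but if you have to use it, nolock avoids taking the global write lock
db.runCommand({
    eval: function() { return db.euph.count(); },  // placeholder function
    nolock: true
})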


Monitoring commands and tools

Monitoring tools:

  • mtools: Awesome set of tools to visualize, aggregate and filter MongoDB log files
  • mongostat: Out-of-the-box tool to gather real-time metrics (see the example after this list)
  • Cloud Manager: Cool monitoring, automation and backup tool for MongoDB
  • iostat / sar / top / dstat: Awesome unix tools
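A minimal mongostat invocation, assuming a mongod listening locally on the default port (adjust host and interval to your deployment):

# Print inserts/queries/updates/deletes, queues, memory and connection counters every 5 seconds
mongostat --host localhost:27017 5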

Useful MongoDB commands:
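The original list is not reproduced here; as a starting point, these standard shell helpers are the usual entry points when investigating performance:

// What is running right now, and for how long
db.currentOp()

// Server-wide counters: connections, queues, cache usage, opcounters, ...
db.serverStatus()

// Size, document count and index details for a given collection
db.euph.stats()
db.euph.getIndexes()

// Current profiling level and slow operation threshold
db.getProfilingStatus()

// Replica set health, if applicable
rs.status()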

 
