MongoDB can run on many servers and with many different configurations, which adds overwhelming complexity when provisioning and projecting the hardware/VMs required for your project.

The following parameters have a strong influence on your sizing:

  • Storage engine: wiredTiger and MMAPv1 are different and change the data format both in memory and on disk. For example, wiredTiger supports compression, which will drastically reduce your storage requirements.
  • Block compression: by default, wiredTiger uses the snappy algorithm to compress its collections on disk; you can choose between zlib, snappy, or none in your deployment (via storage.wiredTiger.collectionConfig.blockCompressor). In general, we observe that:
    • zlib has a compression ratio of 4 to 5 but consumes slightly more CPU (~1%) than snappy
    • snappy has a compression ratio between 3 and 4
    • Note: the compression ratio depends entirely on your data and can vary greatly with its content & cardinality; you can measure your actual ratio with the shell sketch after this list.
  • Index prefix compression: wiredTiger compresses MongoDB indexes both in memory and on disk, which greatly reduces the amount of space required to store indices (~40-60% savings observed here)
  • Oplog size: one thing you should be aware of: MongoDB has its own internal journal located in local.oplog.rs, the equivalent of the redo log in Oracle. By default, this capped collection is sized at 5% of your dbPath volume, with a minimum size of 1GB and a maximum size of 50GB. Note: it is critical to keep at least one or two days of oplog! (the sketch after this list shows how to check your current window)
  • The working set: the working set is the data frequently accessed at a given time. For performance purposes, it is critical that it fits in memory. It is the most complex parameter to estimate for your environment; in general, to compute it, I recommend to:
    • ensure that all indices fit into memory!
    • understand which collections, and how many documents per collection, could be accessed concurrently
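
To ground those numbers in your own deployment, here is a minimal mongo shell sketch that measures the actual compression ratio of a collection and your current oplog window (mycollection is a placeholder for one of your collections, and the oplog check assumes a replica set):

    // Actual compression ratio: uncompressed data size vs. size on disk.
    var s = db.mycollection.stats();
    print("compression ratio  : " + (s.size / s.storageSize).toFixed(2));
    print("totalIndexSize (MB): " + (s.totalIndexSize / 1024 / 1024).toFixed(1));

    // Oplog window: "log length start to end" should cover at least one or two days.
    rs.printReplicationInfo();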


So now the important question: how can we estimate the resources required for our project? Let’s try to approach this question with more questions:

  • How many GB of memory do I need?
  • How many GB of storage do I need?
  • How many CPUs do I need?
  • How many IOs per second do I need?


How many GB of memory do I need?

First things first: to understand how many GB you need, you have to understand what must fit into memory. MongoDB uses its memory for three main purposes:

  • Binaries & the mongod process: this should only take on the order of a hundred MB; let’s say 100MB
  • Collections & indices: straightforward, data needs to be fetched into memory at some point
  • Temporary & miscellaneous information: such as TCP connections, in-memory sorts, etc…

Concerning the indices, the simplest (though not strict) rule is to ensure that they all fit into memory. The easiest way to compute the index size is to push a few MB of documents into a test collection, create the appropriate indices, get the totalIndexSize for this number of documents, and project it to the expected number of documents.

For example: I expect to have at least 2,000,000 documents in two years. With 100,000 documents, the total index size is 20MB; extrapolating this result, the collection will require: 20 * (2,000,000 / 100,000) = 400MB
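
The same extrapolation can be scripted in the mongo shell. A small sketch, assuming a test collection mycollection already loaded with the 100,000 sample documents and carrying the appropriate indices:

    // Project the index size from a representative sample to the expected volume.
    var sampleCount   = db.mycollection.count();          // e.g. 100,000 test documents
    var sampleIdxSize = db.mycollection.totalIndexSize(); // in bytes, e.g. ~20MB
    var expectedCount = 2000000;                          // documents expected in two years
    var projected     = sampleIdxSize * (expectedCount / sampleCount);
    print("projected index size: " + (projected / 1024 / 1024).toFixed(0) + " MB");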

Regarding the number of documents that need to fit into memory, this is trickier… To compute it, you need to understand how many queries will be executed and how many documents they will need to fetch at any given time.

For example: I expect to perform 2,000 queries per second, and the queries will fetch only one document each (because I have efficient indices 😉 ). Thus I will need, approximately: 2,000 * averageBSONSizeOfMyDocument.
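
If you already have representative data, the average BSON document size is reported by the collection statistics (avgObjSize). A quick sketch, again with mycollection as a placeholder:

    // Estimate the memory touched per second by document fetches.
    var queriesPerSecond = 2000;                          // expected query rate
    var avgObjSize = db.mycollection.stats().avgObjSize;  // average BSON size in bytes
    var mbPerSecond = queriesPerSecond * avgObjSize / 1024 / 1024;
    print("documents fetched per second: ~" + mbPerSecond.toFixed(1) + " MB");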

As I don’t intend to write a book today, and as the rest should have only a minor impact, I will not cover the other facets affecting memory. Nonetheless, please consider & be aware of the following points:

  • Every connection in MongoDB uses approximately 1MB
  • Other commands may require temporary memory, e.g. in-memory sorts (please avoid those), aggregations, etc…
  • Data in the wiredTiger cache is uncompressed, but data in the filesystem cache stays compressed; this means that the actual memory requirement is lower than you might expect
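
As a quick sanity check on the first point, serverStatus reports how many connections are currently open; a minimal sketch based on the ~1MB-per-connection approximation above:

    // Each open connection costs roughly 1MB of RAM.
    var conns = db.serverStatus().connections;
    print(conns.current + " open connections, i.e. roughly " + conns.current + " MB of overhead");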

How many GB of storage do I need?

On this question, given the number of unknown parameters (e.g. the compression ratio), the best approach is simply to try with a smaller set of data and project it over the longer term, the same way we computed the index size.

For example: I expect to have at least 2,000,000 documents in two years in a specific collection. With 100,000 documents, the totalIndexSize is 20MB and the storageSize is 10GB; extrapolating, this collection will require: 10 * (2,000,000 / 100,000) = 200GB
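
The corresponding shell sketch, following the same pattern as the index projection (mycollection is still a placeholder):

    // Project the on-disk storage size from a representative sample.
    var s = db.mycollection.stats();
    var expectedCount = 2000000;                          // documents expected in two years
    var projectedGB = s.storageSize * (expectedCount / s.count) / 1024 / 1024 / 1024;
    print("projected storage: ~" + projectedGB.toFixed(0) + " GB");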


How many CPU do I need?

MongoDB does not require a lot of CPU to work properly, even if it has been observed that wiredTiger consumes more resources than MMAPv1. In general, unless you notice that this is not sufficient or you are planning to execute a lot of expensive queries, I usually recommend:

  • For wiredTiger deployments: 8 vCPUs; more could be required depending on the type of queries you are performing, but this should be good enough to handle most use cases
  • For MMAPv1: 4 vCPUs; MMAPv1 doesn’t scale well with the number of vCPUs, and even if you add more, mongod will not be able to use them…


Is this over? Is my deployment safe for the next two years? Those are very good questions, and of course you will not be able to foresee the future and prevent every possibility. But if you did the sizing exercise properly, you should not encounter any major surprises, especially if you included a safety margin in your calculations (e.g. +10%).

Nonetheless, monitoring should be a critical component of your deployment: if you start to reach the limits of your resources, you must make sure you are notified before it becomes an actual issue. If you don’t know how to monitor MongoDB, I highly recommend checking http://cloud.mongodb.com or Ops Manager.