MongoDB – high background flush avg

Tags: linux, locking, mongodb

Similar question: High Global Lock % on MongoDB

Overview

We run a production replica set on MongoDB v2.4.8, on five 4-core, 28 GB RAM VMs with standard Azure data-disk HDDs, running 64-bit CentOS 6. We distribute reads across the secondaries at about 600-700 ops/sec per secondary; CPU usage is around 15% on each secondary and 5-10% on the primary.

We are currently seeing a high global write lock and a high background flush average on the primary. The global write lock sits between 30% and 40% despite only ~200 inserts/updates/deletes per second (see the MMS output below), and the background flush average is between 2 and 15 seconds. This is causing a serious number of slow queries (up to 50 updates/inserts per second taking > 100 ms). We have considered sharding, but feel MongoDB should be performing better than this.
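
For reference, the lock and flush figures above come from MMS; the same counters can be pulled straight from the mongo shell as a quick cross-check:

db.serverStatus().backgroundFlushing    // average_ms is what MMS charts as "background flush avg"
db.serverStatus().globalLock            // current queue and active reader/writer counts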

Testing

This suggests to me that we are having trouble writing to our HDDs, but a simple iostat shows that utilization on sdc (the disk we are writing to) is NOT maxed out, sitting between 20% and 40%:

$ iostat -x 1

The results over 4 seconds:

Linux 2.6.32-279.14.1.el6.openlogic.x86_64 (mongodb3-wus)   05/08/2014  _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.28    0.00    1.82    5.50    0.00   87.40

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.05     0.04    0.06    0.11     3.25     1.23    26.13     0.00   18.07  14.87   0.25
sdc               0.02   216.57    1.70   95.83   216.22  3106.45    34.07     9.27   95.07   4.32  42.11
sdb               0.00    11.35    0.01    0.56     0.05    95.25   169.44     0.01   18.44   0.11   0.01

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.56    0.00    2.05    0.00    0.00   95.38

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00   15.00     0.00   624.00    41.60     0.20   11.80  13.47  20.20
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.07    0.00    3.07    0.26    0.00   93.61

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    3.00   15.00    24.00   352.00    20.89     0.25   15.17  13.44  24.20
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.33    0.00    1.79    0.77    0.00   94.10

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00    11.00    0.00   17.00     0.00   768.00    45.18     0.26   15.18  14.35  24.40
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
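
For context, a rough conversion of those write rates (iostat's wsec/s counts 512-byte sectors, and the first report is the since-boot average rather than a one-second sample):

3106 wsec/s × 512 B ≈ 1.5 MB/s    (since-boot average)
 768 wsec/s × 512 B ≈ 0.4 MB/s    (busiest one-second sample)

So the actual write volume hitting sdc is on the order of a megabyte per second or less.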

I also ran a simple load test using dd:

dd if=/dev/zero of=/dev/sdc1 count=512 bs=1024k

The result of this test shows a write speed of ~840 MB/s:

512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 0.638451 s, 841 MB/s
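
That said, dd-ing /dev/zero sequentially mostly measures cached, streaming throughput. A small-block synchronous variant along these lines (the target path is just a placeholder for a file on the sdc mount) would be closer to the write pattern of journal commits and flushes:

dd if=/dev/zero of=/path/on/sdc/ddtest bs=4k count=1000 oflag=dsync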

The ulimit values for the mongod process:

[mongod #8066 -- limits]
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            10485760             unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             224341               224341               processes 
Max open files            20000                20000                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       224341               224341               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us   
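
(For reference, the same table can be read straight from the running process, pid 8066 in our case:)

cat /proc/8066/limits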

MMS, mongostat, and mongotop for the primary

I have also provided our MMS, mongostat, and mongotop output below:

MMS: [screenshot of MMS output]

Mongostat:

connected to: 127.0.0.1:27019
    insert  query update delete getmore command flushes mapped  vsize    res faults      locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time 
        26     41     95     *0     294   178|0       0  52.4g   107g  25.1g      0  chronicle:5.6%          0      65|2     1|4    45k   136k   486 rs0  PRI   23:15:18 
        96    158    524     *0    1266   783|0       0  52.4g   107g  25.1g      1 chronicle:82.9%          0       0|0     0|0   235k   759k   486 rs0  PRI   23:15:19 
        33     62    109     *0     637   253|0       0  52.4g   107g  25.1g      0      local:7.2%          0       0|0     0|0    78k   208k   486 rs0  PRI   23:15:20 
        58     89    153     *0     920   321|0       0  52.4g   107g  25.1g      0     local:16.1%          0       0|0     0|1   113k   569k   486 rs0  PRI   23:15:21 
        55     95    138     *0     887   322|0       0  52.4g   107g  25.1g      0 chronicle:20.3%          0       0|0     0|0   111k   297k   486 rs0  PRI   23:15:22 
        24     59     81     *0     217   174|0       0  52.4g   107g  25.1g      1         .:88.5%          0      23|0     0|1    46k   141k   486 rs0  PRI   23:15:23 
        51     64    136     *0     760   263|0       0  52.4g   107g  25.1g      0 chronicle:17.1%          0       0|0     0|0    93k   266k   486 rs0  PRI   23:15:24 
        42     60    129     *0     695   288|0       0  52.4g   107g  25.1g      0      local:7.3%          0       0|0     0|0    90k   253k   486 rs0  PRI   23:15:25 
        33     55     99     *0     693   215|0       0  52.4g   107g  25.1g      1      local:3.1%          0       0|0     0|0    76k   455k   486 rs0  PRI   23:15:26 
        45     70     95     *0     763   250|0       0  52.4g   107g  25.1g      1      local:9.0%          0       0|0     0|0    88k   225k   486 rs0  PRI   23:15:27 
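
The cumulative per-database lock timings behind the "locked db" column can also be read from the shell, which may help pin down which namespace is holding the write lock:

db.serverStatus().locks        // per-database timeLockedMicros / timeAcquiringMicros
db.currentOp(true)             // operations currently holding or waiting on locks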

Mongotop:

connected to: 127.0.0.1:27019

                                     ns       total        read       write     2014-05-07T23:09:17
                      chronicle.ledgers        93ms         0ms        93ms
                         local.oplog.rs        47ms        47ms         0ms
                         cliqueme.sites        13ms         0ms        13ms
                    chronicle.analytics         4ms         0ms         4ms
          chronicle_test.system.indexes         0ms         0ms         0ms
       chronicle_test.system.namespaces         0ms         0ms         0ms
            chronicle_test.system.users         0ms         0ms         0ms

                                     ns       total        read       write     2014-05-07T23:09:18
                      chronicle.ledgers       101ms         0ms       101ms
                         local.oplog.rs        66ms        66ms         0ms
                       cliqueme.cliques        19ms         0ms        19ms
                    chronicle.analytics         6ms         0ms         6ms
                         cliqueme.sites         4ms         0ms         4ms
                           local.slaves         1ms         0ms         1ms
                 cliqueme.notifications         0ms         0ms         0ms
                      cliqueme.messages         0ms         0ms         0ms

                                     ns       total        read       write     2014-05-07T23:09:19
                         local.oplog.rs        66ms        66ms         0ms
                      chronicle.ledgers        52ms         0ms        52ms
                    chronicle.analytics        24ms         0ms        24ms
                       cliqueme.cliques         7ms         0ms         7ms
                         cliqueme.sites         4ms         0ms         4ms
                           local.slaves         1ms         0ms         1ms
                 cliqueme.notifications         0ms         0ms         0ms
                      cliqueme.messages         0ms         0ms         0ms

                                     ns       total        read       write     2014-05-07T23:09:20
                      chronicle.ledgers      1842ms         0ms      1842ms
                         cliqueme.sites       885ms         0ms       885ms
                       cliqueme.cliques        70ms         0ms        70ms
                         local.oplog.rs        55ms        55ms         0ms
                    chronicle.analytics         5ms         0ms         5ms
                           local.slaves         1ms         0ms         1ms
                 cliqueme.notifications         0ms         0ms         0ms
                      cliqueme.messages         0ms         0ms         0ms

                                     ns       total        read       write     2014-05-07T23:09:21
                      chronicle.ledgers        84ms         0ms        84ms
                         local.oplog.rs        64ms        64ms         0ms
                         cliqueme.sites        41ms         0ms        41ms
                       cliqueme.cliques        11ms         0ms        11ms
                    chronicle.analytics         4ms         0ms         4ms
          chronicle_test.system.indexes         0ms         0ms         0ms
       chronicle_test.system.namespaces         0ms         0ms         0ms
            chronicle_test.system.users         0ms         0ms         0ms

                                     ns       total        read       write     2014-05-07T23:09:22
                      chronicle.ledgers       276ms         0ms       276ms
                         local.oplog.rs        90ms        90ms         0ms
                       cliqueme.cliques        16ms         0ms        16ms
                    chronicle.analytics         6ms         0ms         6ms
                         cliqueme.sites         4ms         0ms         4ms
                           local.slaves         1ms         0ms         1ms
                 cliqueme.notifications         0ms         0ms         0ms
                      cliqueme.messages         0ms         0ms         0ms

Does anyone have suggestions on how we can optimize this performance? We have heard that some people get up to 2K writes per second on standalone instances. Would switching from HDDs to RAID or SSDs help?

We would like to use sharding as a last resort.

UPDATE:
We have still been unable to solve this issue, but because we needed a solution quickly we have moved to a sharded cluster. We would still like to get to the bottom of the problem, because it continues to affect us in the sharded cluster.

Best Answer

Your mongostat output shows a higher number of updates than inserts. One thing that can cause high write lock percentages is updates that grow the document size and force the document to move within the data file. We ran into this ourselves, but we were working with MongoDB support at the time to figure it out, so I don't remember which metric or stat confirms that this is what is happening. It is likely only an issue if your documents are fairly large. We ended up splitting a sub-array that was constantly being appended to into its own collection, so that we were adding new documents instead of repeatedly modifying an existing one.
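
As a rough sketch of the kind of refactor I mean (collection and field names here are illustrative, not taken from your schema):

// illustrative values only
var ledgerId = ObjectId(), entry = { amount: 5, ts: new Date() };

// before: every update grows the parent document and can eventually force a move on disk
db.ledgers.update({ _id: ledgerId }, { $push: { entries: entry } });

// after: append-only inserts into a side collection, so the parent document keeps a stable size
db.ledger_entries.insert({ ledgerId: ledgerId, entry: entry });

If your version exposes it, the document-move counter under db.serverStatus().metrics.record is one place to check whether documents really are being relocated.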

The usePowerOf2Sizes flag on the collection can also help alleviate this by giving documents room to grow. It is apparently the default on 2.6, but you need to turn it on yourself if you are not on 2.6 yet. Setting it is described here: http://docs.mongodb.org/manual/reference/command/collMod/
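
For example, from the mongo shell on 2.4 (collection name is just an example):

db.runCommand({ collMod: "ledgers", usePowerOf2Sizes: true })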