MongoDB TokuMX – Tuning for Fastest Bulk Reads

mongodb, performance, query-performance, tokumx

What options do I have for tuning MongoDB / TokuMX for reading the results of queries with large result sets? I'm currently on a single server (no sharding) with the data on an SSD drive. My queries are simply {field: value} with an index on field. The collection has ~1B records, of which I'm retrieving ~50M.

Watching top, I see alternating spikes of mongod CPU usage (~200%, i.e., 100% on each of 2 cores) and periods when mongod appears to be waiting on I/O.

What kinds of things can I try to improve query time?

Best Answer

The first thing to look at is db.serverStatus().ft. It contains a number of metrics that can help you figure out where the time is being spent. They are documented here: http://docs.tokutek.com/tokumx/tokumx-server-status.html
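
For example, a quick way to see which counters actually move during a slow scan is to diff two snapshots of the ft section (a minimal mongo-shell sketch; the field names under .ft vary by TokuMX version, so it prints the whole subdocument rather than guessing at specific counters):

    // Snapshot the fractal-tree metrics before and after the slow query.
    var before = db.serverStatus().ft;
    sleep(30000);  // run the slow query from another shell in the meantime
    var after = db.serverStatus().ft;
    printjson({before: before, after: after});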

Usually the way to improve query time is to make sure you have the right index for your query. You might be querying on {field: value} with an index on {field: 1}, but if you don't project in your query, every match still costs a point query back to the primary key index to fetch the full document, which is very inefficient. You may want to add the other fields you need to the index to make it covering, and then project to exactly those fields: e.g., index on {field: 1, other: 1, fields: 1} and make sure you project in your query like db.foo.find({field: "val"}, {field: 1, other: 1, fields: 1}), as sketched below.
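
A minimal shell sketch of that setup, reusing the placeholder names above (one caveat: in stock MongoDB the projection must also exclude _id for the query to be answered entirely from the index; TokuMX secondary indexes store the primary key, so _id may be covered there anyway):

    // Build the covering index (placeholder field names; substitute your own).
    db.foo.ensureIndex({field: 1, other: 1, fields: 1});

    // Project only indexed fields; excluding _id keeps the projection
    // strictly within the index.
    var cursor = db.foo.find(
        {field: "val"},
        {_id: 0, field: 1, other: 1, fields: 1}
    );

    // Verify coverage: on a covered query, explain() reports indexOnly: true.
    printjson(cursor.explain());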

Alternatively, if these are your most common queries, you can use a custom primary key of {field: 1, _id: 1}, or just make the secondary index on {field: 1} a clustering index. Both are sketched below.
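
Both options, sketched with TokuMX's syntax (same placeholder names; note that a custom primary key can only be declared at collection creation and, in TokuMX, must end with _id):

    // Option 1: a clustering secondary index stores the full document
    // alongside the key, so range scans on field never dereference the
    // primary key index.
    db.foo.ensureIndex({field: 1}, {clustering: true});

    // Option 2: declare a custom primary key on a new collection
    // (existing data would have to be reloaded into it).
    db.createCollection("foo_by_field", {primaryKey: {field: 1, _id: 1}});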

If you want more help, it would be great if you could provide your server configuration, the collection's index definitions (db.foo.getIndexes() is fine) and stats (db.foo.stats()), realistic sample queries and the cardinality of field, and the output of db.serverStatus().ft.
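
For convenience, a snippet that bundles those diagnostics in one shot (db.foo is a placeholder for your collection):

    // Collect the requested diagnostics as a single document.
    printjson({
        indexes: db.foo.getIndexes(),
        stats:   db.foo.stats(),
        ft:      db.serverStatus().ft
    });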