MongoDB – Improving MongoDB read throughput for a tiny database

mongodb, performance

I am given to understand that MongoDB will essentially perform as if it is pulling records from memory when the working set is small. I wrote a simple MongoDB test program that inserts a single record into a collection with an indexed primary key plus one other field, then uses findOne to repeatedly read that field by key.

The read throughput I am getting with many threads is just ~14K/s on my 2-core laptop, which is better than, say, MySQL, but this throughput still seems awfully low given that a Java HashMap gives me a read throughput of nearly ~2 million/s. Shouldn't I be getting performance comparable to a completely in-memory map? What else does MongoDB really have to do for a read-only workload with a tiny database? Do I need to change any settings from the MongoDB defaults?
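For reference, the in-JVM baseline I am comparing against is just a tight loop over a HashMap. A minimal sketch of that measurement (class and method names are mine for illustration; the absolute numbers will vary by machine and JIT warm-up) looks like this:

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapReadBaseline {

    // Reads one key from a HashMap in a tight loop for ~200 ms and
    // returns the observed read throughput in operations per second.
    public static long measureOpsPerSecond() {
        Map<String, String> map = new HashMap<>();
        map.put("primaryIDKey", "some value");

        long start = System.nanoTime();
        long ops = 0;
        String sink = "";
        while (System.nanoTime() - start < 200_000_000L) { // ~200 ms
            sink = map.get("primaryIDKey");
            ops++;
        }
        long elapsed = System.nanoTime() - start;

        // Use the read value so the JIT cannot eliminate the loop body.
        if (sink == null) {
            throw new AssertionError("unexpected null value");
        }
        return ops * 1_000_000_000L / elapsed;
    }

    public static void main(String[] args) {
        System.out.println("HashMap reads/s = " + measureOpsPerSecond());
    }
}
```

On my laptop this reports throughput in the millions per second, which is where the ~2 million/s figure above comes from.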

I have just one "document" that has a string primary key and small string field "some value".

Test code I scratched together that gives me 15-20K/s on a 2-core machine. You would need the org.json and MongoDB driver jars to run it.

import java.net.UnknownHostException;
import java.text.DecimalFormat;
import java.util.concurrent.ScheduledThreadPoolExecutor;

import org.json.JSONException;
import org.json.JSONObject;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.DuplicateKeyException;
import com.mongodb.MongoClient;
import com.mongodb.MongoException;
import com.mongodb.util.JSON;

@SuppressWarnings("javadoc")
public class MongoSmallWorkingSetRead {
private static ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(
        8);

private static long initTime = System.currentTimeMillis();
private static int count = 0;

private static synchronized int incrCount() {
    return ++count;
}

private static synchronized int getCount() {
    return count;
}

private static synchronized void reset() {
    count = 0;
    initTime = System.currentTimeMillis();
}

private static void testReadRate(String dbName, String collectionName,
        String primaryID, String fieldKey) throws UnknownHostException,
        JSONException {
    MongoClient mongoClient = new MongoClient("localhost");
    DB db = mongoClient.getDB(dbName);
    String primaryIDKey = "primaryIDKey";
    db.getCollection(collectionName).createIndex(
            new BasicDBObject(primaryIDKey, 1),
            new BasicDBObject("unique", true));

    JSONObject json = new JSONObject();
    json.put(fieldKey, "some value");
    json.put(primaryIDKey, primaryID);

    DBCollection collection = null;
    db.requestStart();
    try {
        db.requestEnsureConnection();

        collection = db.getCollection(collectionName);
        DBObject dbObject = (DBObject) JSON.parse(json.toString());
        try {
            collection.insert(dbObject);
        } catch (DuplicateKeyException e) {
            // throw new RecordExistsException(collectionName, primaryKey);
            // suppress it as it's expected
        } catch (MongoException e) {
            e.printStackTrace();
        }
    } finally {
        db.requestDone();
    }

    db.requestStart();
    db.requestEnsureConnection();

    // test read speed
    try {
        DBObject dbObject = null;
        BasicDBObject query = new BasicDBObject(primaryIDKey, primaryID);
        BasicDBObject projection = new BasicDBObject().append("_id", 0)
                .append(fieldKey, 1);
        int frequency = 10000;
        do {
            dbObject = collection.findOne(query, projection);

            if (incrCount() % frequency == 0) {
                System.out.println(dbObject);
                System.out.println("op/s = "
                        + new DecimalFormat().format(getCount() * 1000.0
                                / (System.currentTimeMillis() - initTime)));
                if (getCount() > frequency * 20) {
                    System.out
                            .println("**********************resetting************************");
                    reset();
                }
            }
        } while (true);
    } catch (Exception e) {
        System.out.println("Lookup failed: " + e);
    } finally {
        // balance the requestStart() above exactly once (it was previously
        // called on every loop iteration without a matching requestStart)
        db.requestDone();
    }
}

public static void main(String[] args) throws Exception {
    if (args.length == 3) {
        for (int i = 0; i < executor.getCorePoolSize(); i++) {
            executor.submit(new Runnable() {
                public void run() {
                    try {
                        testReadRate("dbName", args[0], args[1], args[2]);
                    } catch (UnknownHostException | JSONException e) {
                        e.printStackTrace();
                    }
                }
            });
        }
    } else {
        System.out.println("Usage: "
                + MongoSmallWorkingSetRead.class.getSimpleName()
                + " <collectionName> <primaryIDKey> <fieldKey>");
    }
}
}

db.stats() output:

db.stats()
{
"db" : "node",
"collections" : 0,
"objects" : 0,
"avgObjSize" : 0,
"dataSize" : 0,
"storageSize" : 0,
"numExtents" : 0,
"indexes" : 0,
"indexSize" : 0,
"fileSize" : 0,
"ok" : 1
}

Best Answer

You have a major mistake in your code: MongoClient creates a connection pool, so even in large applications it is usually a singleton. You should have it as a global (static) variable, initialize it once in main, and reuse it in each runnable. That is perfectly fine, since MongoClient is thread-safe.

Another thing to keep in mind is that although the single document surely is in the working set and hence should be in RAM, your application still needs to communicate with mongod. Each query is translated into MongoDB's wire protocol and sent to the server, where it is executed and the matching documents are identified (in this case only one, though that is not necessarily known before execution); the result is then sent back to the client and translated from the wire protocol into Java objects. This is obviously going to be slower than a plain in-JVM access with no match conditions.

Finally, let's do some maths. At roughly 15,000 reads/s:

    1 s / 15,000 reads ≈ 0.067 ms ≈ 67 µs per document read, including the network round trip

which, even without taking the overhead of the unnecessary MongoClients into account, is pretty fast in my book.
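Spelling that arithmetic out as code (a trivial helper of my own, just to make the unit conversion explicit):

```java
public class PerOpLatency {

    // Converts an observed throughput (operations per second) into the
    // average latency per operation in microseconds.
    public static double microsPerOp(double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    public static void main(String[] args) {
        // ~15K reads/s from the question works out to roughly 67 µs per read.
        System.out.println("µs per op at 15K ops/s = " + microsPerOp(15_000));
    }
}
```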