How Does Pymongo Connection Work?

Tags: client, connection-pooling, mongodb, query

I created a small speed test

from misc.Database import Database
import time

db = Database.getDb()


def main():
    test_db = db.test_db.find({})
    return "done"

if __name__ == '__main__':
    start = time.time()
    for i in range(10000):
        main()
    end = time.time()
    print(end - start)

where db is my pymongo client. While monitoring the mongod log, I noticed it opened 2 connections when I ran the test. When I ran Robo3T, it opened 25 connections to mongod. Why isn't a connection opened per request? How many connections are opened each time you query the database?

Best Answer

First, you are using an old method to connect. The newer and better method is MongoClient, which is supported in all recent releases of the officially supported drivers. It's explained in detail here: https://mongodb.github.io/node-mongodb-native/driver-articles/mongoclient.html. Although that link discusses the Node.js driver's implementation, the same concepts apply to pymongo.

Second, you're not iterating on the returned cursor. This means that the find() query was not executed on the server. I presume you see a really fast result in this test, which will not be the case if you're actually getting data from the server.
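The cursor's laziness can be demonstrated without a server at all; a Python generator behaves the same way (pure-Python analogy, no pymongo involved):

```python
# Like a pymongo cursor, a generator does no work until you iterate it.
executed = []

def fake_find():
    executed.append("ran")   # side effect marks actual execution
    yield {"_id": 1}

cursor = fake_find()         # "query" created: nothing has run yet
assert executed == []

docs = list(cursor)          # iterating is what triggers execution
assert executed == ["ran"]
assert docs == [{"_id": 1}]
```

In the same way, `db.test_db.find({})` only builds a cursor object; without iteration, the timing loop mostly measures Python overhead.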

Third, your test runs in a single thread, so only one operation is in flight at a time. Pymongo reuses an existing connection rather than creating a new one: opening a connection is expensive, so drivers avoid it unless it's really necessary. Of the two connections you're seeing, one is likely for monitoring server status and the other for executing your query. The cost of opening new connections is exactly why drivers use a connection pool.
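The reuse behavior can be sketched in a few lines of pure Python (illustrative only, not pymongo's actual pool implementation): a checkout returns an idle connection when one exists and only "opens" a new one when the pool is empty.

```python
import queue

class Pool:
    """Toy connection pool: reuse idle connections, open only on demand."""

    def __init__(self):
        self._free = queue.Queue()
        self._opened = 0          # how many "connections" were ever opened

    def checkout(self):
        try:
            return self._free.get_nowait()   # reuse an idle connection
        except queue.Empty:
            self._opened += 1                # expensive: open a new one
            return f"conn-{self._opened}"

    def checkin(self, conn):
        self._free.put(conn)

pool = Pool()
for _ in range(10000):            # 10,000 "queries", like the speed test
    conn = pool.checkout()
    pool.checkin(conn)

print(pool._opened)               # → 1: one connection served every query
```

This is why a single-threaded loop of 10,000 queries does not open 10,000 connections: each iteration checks the same connection back in before the next one starts.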

Fourth, Robo3T is a GUI, so it naturally needs to open more connections since presumably it requires a lot of information from the server, perhaps asynchronously. This is a very different situation vs. a driver. You can't really compare the two.

Finally, performance testing is a tricky subject and must be done in a very controlled manner. Some things that need planning:

  • How is the server provisioned, how big are the documents, how compressible are they, etc.
  • Can you determine whether a performance bottleneck is due to the server, the network, the driver, the query, or how the language is used? For example, if you iterate the cursor with list(db.test_db.find()), how can you tell whether the database is slow or Python's list() construction is slow?
  • Running the testing code in the same machine as the database server might introduce resource contention, artificially skewing the result.
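One way to separate those costs is to time each layer independently with time.perf_counter. A sketch, where fetch() is a hypothetical stand-in for the driver call (here just a generator, so it runs without a database):

```python
import time

def fetch():
    # Stand-in for db.test_db.find({}): returns a lazy iterable of documents.
    return ({"_id": i} for i in range(1000))

start = time.perf_counter()
cursor = fetch()                     # cost of issuing the "query" only
issue_time = time.perf_counter() - start

start = time.perf_counter()
docs = list(cursor)                  # cost of materializing the results
materialize_time = time.perf_counter() - start

assert len(docs) == 1000
```

Timing the issue step and the materialization step separately makes it obvious when the expense is in building the result list rather than in the query itself; against a real server you would add a third measurement with the client on a different machine to isolate network cost.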