MongoDB – How to automate sharding in the presence of continuous reads

mongodb

I am using pymongo in a sharded MongoDB environment.

I have some fairly write-intensive systems, so I split the data up by time, with one database per hour. This lets the heavy lifting of index updates happen in memory, and gives us more flexibility to make performance-related tradeoffs.

I have a number of processes writing to the latest hour's database, and another handful of processes reading from the last n hours of data. To keep things simple, I would like these processes to remain independent. (I do not want to have to synchronize them).

The hourly databases are sharded on creation, so that reads and writes are distributed across the cluster. Unfortunately, the (independent) read processes fall over when they try to read from the latest db whilst the sharding operation is still incomplete.

As I see it, I have a small handful of options available to me:

  1. I can make the read process tolerant to errors, so that when it encounters a failure due to an in-progress sharding operation, it can carry on regardless; see the sketch after this list. (This seems the neatest and most Pythonic option to me, but runs the risk that the repeated throwing and catching of exceptions might lead to large delays.)

  2. I can make the read process ignore the most recent database until it is at least 10 minutes old, giving the sharding operation some time to complete. I do not really like this option because the time interval is seemingly arbitrary, making it feel less deterministic than I would like. (Perhaps more acceptable, call it option 2b: I could pre-create a few hours' worth of databases and ensure that they are sharded and fully configured well in advance.)

  3. I can make the write process create each hourly database by calling the copyDatabase command on a pre-sharded empty database. This approach presumes that the sharding state will be copied with the rest of the database, and that the in-progress copy operation will not be vulnerable to the same (or similar) errors as the in-progress sharding operation was.

  4. I can attempt to synchronise the reads and the writes by maintaining a record of the most-recently-created database, populated when the database is fully ready to be written to. I do not like this approach because it couples the read & write processes together, and introduces a level of complexity that I think I can avoid.
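As a sketch of option 1 (not tested against this exact setup): the reader catches pymongo's connection and operation errors and simply skips a database that is not ready yet. The `db_name_for_hour` helper and the collection name are hypothetical stand-ins for whatever convention the writers actually use.

    from datetime import datetime, timedelta

    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect, OperationFailure

    client = MongoClient("localhost", 27017)  # assumed mongos address

    def db_name_for_hour(hour):
        # Hypothetical naming scheme: one database per hour.
        return hour.strftime("events_%Y_%m_%d_%H")

    def read_recent_hours(n_hours, collection_name="events"):
        """Read the last n hours of data, skipping any database whose
        sharding setup appears to still be in progress."""
        now = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
        docs = []
        for i in range(n_hours):
            name = db_name_for_hour(now - timedelta(hours=i))
            try:
                docs.extend(client[name][collection_name].find())
            except (AutoReconnect, OperationFailure):
                # Carry on regardless: the newest hour may not be ready yet.
                continue
        return docs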

Are there any options that I have missed?

Right now, option 2b seems the most attractive.

Best Answer

Some thoughts on each of the options:

  1. I would say that writing this code is probably a good idea in any case: if something does go wrong on the performance side of things, you already have the code written to deal with it gracefully, rather than just sending exceptions or timeout messages back to the users.
  2. This sounds like pre-splitting and pre-allocation, which, again, is probably a good idea to do anyway: why wait until just before use to create a database? In fact, this would mirror how MongoDB does data file allocation; it always has one empty data file waiting to be filled before it is needed. However, the pre-allocation it does is for adding data to an existing database; it cannot psychically anticipate the need for a new one, so that initial allocation may be what is slowing you down (a sketch of pre-creating databases yourself follows this list). Also, make sure you are on ext4 or XFS, and not an old kernel, so that allocation is as quick as possible.
  3. I believe that pre-allocation is the better option here. I have not heard of anyone using copyDatabase in this manner, so it would require a lot of testing; pre-allocating databases in advance, though, is quite common.
  4. This does sound a bit painful; I think the options described above are better.
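For the pre-allocation the asker calls option 2b, a small job run well ahead of the writers could create and shard the next few hourly databases. This is a sketch: the `enablesharding` and `shardcollection` admin commands are the standard way to shard from pymongo, but the database names and the shard key here are illustrative, and the key should be chosen to spread writes across the shards.

    from datetime import datetime, timedelta

    from pymongo import MongoClient
    from pymongo.errors import OperationFailure

    client = MongoClient("localhost", 27017)  # assumed mongos address

    def pre_create_sharded_db(name, collection_name="events"):
        # Enable sharding on the database, then shard the collection in it.
        try:
            client.admin.command("enablesharding", name)
            # Illustrative shard key; pick one that distributes your writes.
            client.admin.command("shardcollection",
                                 "%s.%s" % (name, collection_name),
                                 key={"_id": 1})
        except OperationFailure:
            # Already sharded, or sharding is in progress; nothing to do.
            pass

    # Keep the next few hours' databases ready well before they are needed.
    now = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    for i in range(1, 4):
        hour = now + timedelta(hours=i)
        pre_create_sharded_db(hour.strftime("events_%Y_%m_%d_%H"))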

Overall, if you know you are going to need a new database every hour, create multiple databases in advance of their need. I know this would mean extra disk usage, but it should be quite predictable (look into the --smallfiles option if your data set is quite small).