I am using pymongo in a sharded MongoDB environment.
I have some fairly write-intensive systems, so I split the data up by time, with one database per hour, so that the heavy lifting of index updates can happen in memory, and to give us more flexibility to make performance-related trade-offs.
I have a number of processes writing to the latest hour's database, and another handful of processes reading from the last n hours of data. To keep things simple, I would like these processes to remain independent. (I do not want to have to synchronize them).
The hourly databases are sharded on creation, so that reads and writes are distributed across the cluster. Unfortunately, the (independent) read processes fall over when they try to read from the latest db whilst the sharding operation is still incomplete.
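For concreteness, creating and sharding one hourly database from Python might look like the sketch below. The "events" prefix, the "records" collection name and the hashed _id shard key are illustrative assumptions, not details from my setup; the client is a pymongo MongoClient connected to a mongos.

```python
from datetime import datetime

def prepare_hourly_db(client, when, prefix="events"):
    """Enable sharding for one hourly database.

    `client` is assumed to be a pymongo MongoClient connected to a
    mongos, e.g. MongoClient("mongos-host", 27017). The database
    prefix, collection name and hashed _id shard key are illustrative
    assumptions.
    """
    db_name = f"{prefix}_{when.strftime('%Y%m%d%H')}"
    # Both commands run against the admin database on the mongos.
    client.admin.command("enableSharding", db_name)
    client.admin.command("shardCollection", f"{db_name}.records",
                         key={"_id": "hashed"})
    return db_name
```

It is while these two commands (and the resulting chunk setup) are still in flight that the independent readers hit errors.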
As I see it, I have a small handful of options available to me:
1. I can make the read process tolerant to errors, so that when it encounters a failure due to an in-progress sharding operation, it carries on regardless. (This seems the neatest and most Pythonic option to me, but it runs the risk that the throw/catch cycle could introduce large delays.)
2. I can make the read process ignore the most recent database until it is at least 10 minutes old, giving the sharding operation some time to complete. I do not really like this option because of the seemingly arbitrary time interval, which makes it feel less deterministic than I would like. (2b: perhaps more acceptably, I could pre-create a few hours' worth of databases and ensure that they are sharded and fully configured well in advance.)
3. I can make the write process create each hourly database by calling the copyDatabase command on a pre-sharded empty database. This approach presumes that the sharding state is copied along with the rest of the database, and that the in-progress copy operation is not vulnerable to the same (or similar) errors as the in-progress sharding operation.
4. I can attempt to synchronise the reads and the writes by maintaining a record of the most recently created database, populated once the database is fully ready to be written to. I do not like this approach because it couples the read and write processes together and introduces a level of complexity that I think I can avoid.
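The first option (error-tolerant reads) could be sketched as a small retry wrapper around whatever callable issues the query. The retry counts and delays here are illustrative assumptions; in real code the except clause should name pymongo.errors.OperationFailure (and perhaps AutoReconnect) rather than the broad Exception used to keep this sketch dependency-free.

```python
import time

def read_with_retry(fetch, retries=5, delay=0.5, backoff=2.0):
    """Call `fetch` (a callable issuing the pymongo read), retrying
    with exponential backoff when it fails.

    Assumption: failures during in-progress sharding are transient.
    In production, catch pymongo.errors.OperationFailure instead of
    the bare Exception used here for a self-contained sketch.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise          # give up after the final attempt
            time.sleep(delay)
            delay *= backoff   # back off before the next try
```

The wrapper bounds the delay the throw/catch cycle can introduce: at most the sum of the backoff sleeps, rather than an open-ended stall.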
Are there any options that I have missed?
Right now, option 2b seems the most attractive.
Best Answer
Some thoughts on each of the options:
Overall, if you know you are going to need a new database every hour, create multiple databases in advance of their need. I know this would mean extra disk usage, but it should be quite predictable (look into the --smallfiles mongod option if your data set is quite small).
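Pre-creation amounts to deciding, at any moment, which upcoming hourly databases should already exist and be sharded. A minimal sketch, assuming the same hourly naming scheme as above and a three-hour lookahead (both illustrative choices):

```python
from datetime import datetime, timedelta

def upcoming_db_names(now, hours=3, prefix="events"):
    """Names of the next `hours` hourly databases to pre-create.

    The naming scheme and three-hour lookahead are illustrative
    assumptions. Each returned database would then be created and
    sharded (enableSharding / shardCollection) well before any
    writer or reader touches it.
    """
    base = now.replace(minute=0, second=0, microsecond=0)
    return [f"{prefix}_{(base + timedelta(hours=i)).strftime('%Y%m%d%H')}"
            for i in range(1, hours + 1)]
```

A cron-style job could run this every hour, skip names that already exist, and shard the rest, so the sharding operation is always long finished by the time the writers roll over.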