With non-incremental but overlapping imports, I worry that upserts will be very slow and lock resources for hours.
From different sources, I receive 1 json dump per day per source for up to 3 days.
Sometimes, they arrive 1-2 days delayed.
That's why, every day, all of them (if available) are re-imported into the "merged" collection. This is done by upsert to make sure that, for example, yesterday's documents, which have already been imported, processed and sometimes updated, won't be overwritten.
The input data from the different sources is sorted by date, but there is no single unique field.
The merged collection, into which all the data is imported, has 5 indexes, which seems to make the upsert/import even slower.
Each document has a (non-unique) unix timestamp value that has an index, too (not the mongodb date/timestamp, but a number).
It feels as if the sorted data brings no advantage and the upsert searches the entire index, even though every document has a unix timestamp.
Is there a better practice, or at least an option, to speed up imports of this non-incremental but sorted data by taking advantage of the unix_timestamp field/index instead of searching the entire collection for earlier documents?
Best Answer
Since you are using mongoimport to do these upserts without --upsertFields, it will be using the _id index to do the upsert (see the note in the docs). That means that it will be scanning that index, and if you are including the _id field in the json dump that should be fine in terms of that particular search as long as that index is in memory. If you are not including the _id, then that will mean a full scan.
You can alter this behavior by using the aforementioned --upsertFields option, though you will want to make sure those fields you pick are indexed (a compound index of the fields used rather than _id).
The overhead from the other indexes is actually probably just the overhead that occurs when you are updating several indexes as part of a data load.
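To illustrate the --upsertFields point, here is a minimal Python sketch of how the compound index and the upsert lookup line up. The field names in KEY_FIELDS are hypothetical (the question says no single field is unique, so a combination is assumed), and ensure_index assumes you pass in a pymongo Collection.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical key fields: the question states that no single field is unique,
# so this sketch assumes some combination of fields identifies a document.
KEY_FIELDS = ["source", "unix_timestamp", "item_id"]


def index_spec(fields: List[str]) -> List[Tuple[str, int]]:
    """Build an ascending compound index spec as (field, direction) pairs."""
    return [(name, 1) for name in fields]


def upsert_filter(doc: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Build an equality filter on exactly the indexed fields, in index order."""
    return {name: doc[name] for name in fields}


def ensure_index(collection: Any, fields: List[str]) -> str:
    """Create the compound index; `collection` is assumed to be a pymongo Collection."""
    return collection.create_index(index_spec(fields))
```

If the lookup filter uses exactly the fields of the compound index, each upsert should be able to resolve through that index rather than through _id.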
Finally, if you are doing this repeatedly and intend to keep going, I would recommend creating your own tool to do this and having more control over the entire process. While mongoimport is a fine choice for simple tasks, more complex imports are better off handled by a more complex and customizable tool.
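As a rough idea of what such a tool could look like, here is a sketch in Python. It makes several assumptions: the dump has one JSON document per line, the key fields are the hypothetical ones in KEY_FIELDS, and pymongo is available. It uses $setOnInsert so that documents which already exist (and may have been processed or updated) are left untouched, and it sends the upserts in unordered batches.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical key fields; adjust to whatever combination identifies a document.
KEY_FIELDS = ["source", "unix_timestamp", "item_id"]


def parse_dump(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one document per non-empty line (assumes newline-delimited JSON)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def build_upsert(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (filter, update); $setOnInsert only writes when no match exists."""
    flt = {name: doc[name] for name in KEY_FIELDS}
    return flt, {"$setOnInsert": doc}


def batches(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Split an iterable into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_dump(collection: Any, lines: Iterable[str], batch_size: int = 1000) -> int:
    """Upsert a dump into a pymongo Collection; returns how many docs were new."""
    from pymongo import UpdateOne  # imported lazily so the helpers work without pymongo

    inserted = 0
    for batch in batches(parse_dump(lines), batch_size):
        ops = [UpdateOne(flt, upd, upsert=True) for flt, upd in map(build_upsert, batch)]
        # ordered=False lets the server continue past individual failures.
        result = collection.bulk_write(ops, ordered=False)
        inserted += result.upserted_count
    return inserted
```

Usage would be along the lines of opening the dump file and calling import_dump(db.merged, file_handle), where db.merged stands in for the merged collection. The batch size of 1000 is an arbitrary starting point, not a tuned value.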