MongoDB fast upsert

import, mongodb, upsert

For non-incremental but overlapping imports, I worry that upserts will be very slow and lock resources for hours.

From different sources, I receive one JSON dump per day per source, for up to 3 days.
Sometimes they arrive 1-2 days late.
That's why, every day, all of them (if available) are re-imported into the "merged" collection. This is done by upsert to make sure that, e.g., yesterday's documents, which have already been imported, processed, and sometimes updated, won't be overwritten.

The input data from the different sources is sorted by date, but there is no single unique field.

The merged collection, into which all the data is imported, has 5 indexes, which seems to make the upsert/import even slower.

Each document has a (non-unique) unix timestamp value that is indexed, too (not the MongoDB date/timestamp type, but a number).
It feels as if the ordering of the data gives no advantage and the upsert searches the entire index, although a unix timestamp exists in each document.

Is there a better practice, or at least an option, to speed up imports of this non-incremental but sorted data by taking advantage of the unix_timestamp field/index instead of searching the entire collection for earlier documents?

Best Answer

Since you are using mongoimport to do these upserts without --upsertFields, it will use the _id index to do the upsert (see the note in the docs). That means it will be searching that index, and if you include the _id field in the JSON dump, that particular lookup should be fine as long as the index is in memory. If you are not including the _id, then that will mean a full scan.

You can alter this behavior by using the aforementioned --upsertFields option, though you will want to make sure the fields you pick are indexed (a compound index on the fields used, rather than the _id index).
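To make the indexing point concrete, here is a minimal Python sketch of the compound index and the match filter that an upsert on a set of fields implies. The field names `source` and `external_id` are hypothetical (only `unix_timestamp` comes from the question), and `collection` is assumed to be a pymongo `Collection`; treat it as an illustration, not as what mongoimport does internally.

```python
# Hypothetical upsert fields; pick whatever combination identifies a document.
UPSERT_FIELDS = ["source", "unix_timestamp", "external_id"]


def compound_index_spec(upsert_fields):
    """Return a pymongo-style key list, with 1 meaning ascending order."""
    return [(field, 1) for field in upsert_fields]


def upsert_filter(doc, upsert_fields):
    """Build the equality filter an upsert on these fields would match on."""
    missing = [field for field in upsert_fields if field not in doc]
    if missing:
        raise KeyError(f"document lacks upsert fields: {missing}")
    return {field: doc[field] for field in upsert_fields}


def ensure_upsert_index(collection, upsert_fields):
    """Create the compound index; `collection` is assumed to be a pymongo Collection."""
    return collection.create_index(compound_index_spec(upsert_fields))
```

The order of the fields in the index matters: a filter that supplies all of them as equality matches can use the index fully, so each upsert lookup becomes an index seek rather than a scan.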

The overhead from the other indexes is most likely just the normal cost of updating several indexes as part of a data load.

Finally, if you are doing this repeatedly and intend to keep going, I would recommend creating your own tool to do this and having more control over the entire process. While mongoimport is a fine choice for simple tasks, more complex imports are better off handled by a more complex and customizable tool.
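As a rough idea of what such a custom tool could look like, the sketch below reads a dump, batches the documents, and sends unordered bulk upserts through pymongo. Several things here are assumptions: the dump is taken to be one JSON document per line, the key fields `source` and `external_id` are hypothetical, and `$setOnInsert` is used so that documents already present (and possibly processed) are left untouched, which may or may not match your merge rules.

```python
import json
from itertools import islice

# Hypothetical key fields; documents are assumed to carry other fields as well.
UPSERT_FIELDS = ("source", "unix_timestamp", "external_id")


def read_dump(path):
    """Yield one document per non-empty line (assumes a JSON-lines dump)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def batched(docs, size):
    """Yield lists of at most `size` documents."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def to_upsert(doc, fields=UPSERT_FIELDS):
    """Return (filter, update); $setOnInsert only writes when nothing matched."""
    key = {field: doc[field] for field in fields}
    rest = {k: v for k, v in doc.items() if k not in key}
    return key, {"$setOnInsert": rest}


def import_docs(collection, docs, batch_size=1000):
    """Upsert documents in unordered batches; returns the number newly inserted."""
    from pymongo import UpdateOne  # lazy import: the helpers above need no driver

    inserted = 0
    for batch in batched(docs, batch_size):
        ops = [UpdateOne(flt, upd, upsert=True) for flt, upd in map(to_upsert, batch)]
        result = collection.bulk_write(ops, ordered=False)
        inserted += result.upserted_count
    return inserted
```

A call might look like `import_docs(db.merged, read_dump("dump.json"))`, with `db.merged` and the file name as placeholders. Batching keeps round trips low, and `ordered=False` lets the server continue past individual failures instead of stopping at the first one.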