MongoDB merge two collections and drop documents with same value in field

mongodbmongorestore

In my database I have two collections, but some documents were added (possibly) at different times to both collections.

I could use mongodump and then mongorestore for merging. But then I have the same documents that were added to both collections as duplicates in my new collection. mongorestore --drop does not help neither, because the documents not necessarily have the same _id.

How to drop a document when a document with userid exists already?

Best Answer

There might be more elegent solutions for this, but let me answer how I merged a new collection into an original collection without importing the duplicates.

  1. Compare userid fields in both collections and take the difference.
  2. Use mongos $out operator to save documents to a difference collection. This new collection has all documents of the new collection, except of the duplicates.
  3. mongodump the difference collection and mongorestore it into the original collection.
from pymongo import MongoClient
import subprocess

# collection names
db_name = 'db'
col_original = 'col_original'
col_new = 'col_new'
field = 'userid'


# Get ids from collection
def get_ids(db, col):
    docs = db[col].aggregate([
        {'$project':
         {'_id': 0,
          'id': '$' + field}}
    ])
    docs = list(docs)
    ids = [x['id'] for x in docs]
    return ids


# Connect to MongoDB
client = MongoClient('mongodb://localhost')
db = client[db_name]

# Get difference in ids
ids_new = get_ids(db, col_new)
ids_original = get_ids(db, col_original)
ids_diff = list(set(ids_new).difference(ids_original))

# Get all documents with userid that are in col_new, but not in
# col_original. Hence, all duplicates are skipped.
db[col_new].aggregate([
    {'$match': {
        field: {'$in': ids_diff}}},
    {'$out': 'col_diff'}])

# Use mongodump to save col_diff
subprocess.check_output(['mongodump',
                         '-d',
                         db_name,
                         '-c',
                         'col_diff'])

# Merge col_diff into col_original
subprocess.check_output(['mongorestore',
                         '-d',
                         db_name,
                         '-c',
                         col_original,
                         'dump/' + db_name + '/col_diff.bson'])
Related Question