Mongodb – How to define shardKey on GridFS collections to achieve Location/Data Centre affinity (MongoDB)

mongodbmongodb-3.4sharding

I have set up a mongo cluster as follows:

  1. 3 shards, one per global region APAC/EMEA/AMER (in different data centres)
  2. Shard Key is REGION + 'MONOTONICALLY INCREASING PRIMARY KEY'
  3. The "writing" clients for the cluster [typically] connect to the mongos in their region, and will be writing documents with their region's region.
  4. The "reading" clients for the cluster [typically] connect to the mongos in their region, and will be reading documents from across multiple regions (broadcast query).

The system is highly write intensive, with very few reads. Thus, overall I believe the above setup is optimal, since writers get fast "local" writes, while readers get convenience of single query to read across the shard, albeit slightly slow [and just to underline, for reading, I value convenience over speed].

OK so that's the background. But then suddenly I realized: I'm also using GridFS to store files, and I want the files to get stored in same way (i.e. write to local region).

Hopefully in the context of the above, my original question makes sense:

How to define shardKey on GridFS collections (fs.files/fs.chunks) to achieve Location/Data Centre affinity?

This is what I've tried:

I've noticed that I can add meta-data in during the upload operation, such that the fs.FILES collection has the region in as an accessible field that could be used as part of the sharding key for fs.files (i.e. sharding key REGION + "FILE_ID").

However, what about fs.chunks? will they follow around their "parent" fs.file record and get routed to the same shard as the "parent" fs.file?

Thanks in advance!

Best Answer

However, what about fs.chunks? will they follow around their "parent" fs.file record and get routed to the same shard as the "parent" fs.file?

The answer is no!

You could use files.aliases or files.metadata to store that REGION.

How ever maybe best solution (both chunks and files) is NOT to shard your GridFS. Create three different GridFS databases (APAC/EMEA/AMER) and then use command movePrimary to change "primary shard" to point right shard in your cluster. Then change your application so that it selects GridFS database base on the user's location. This, of course, works with all other collections what are REGION specific.