MongoDB: Determining a shard key strategy for a compound index

mongodb mongodb-3.4 sharding

I have a collection with more than 170 million documents, and it is
only going to grow. The collection is not that large, currently
around 70 GB.

The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Typically, a single import loads a huge file (millions of documents)
belonging to one particular AgentId, while the PropertyId (non-numeric,
nullable) is mostly a random unique value.

Currently I have two shards with a shard key based on {_id: hashed}. I
am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether a shard key
can contain a nullable field. If it cannot, the app will make sure
that the PropertyId is always set to a random value.

So I am looking to understand:

  1. How will the data be distributed across shards during insertion,
    and how are the chunk ranges calculated during insertion?
  2. Since the PropertyId is a random value, does the compound key fit the
    definition of a monotonically increasing value?

I am new to MongoDB and want to know whether I am on the right path.

Best Answer

MongoDB supports compound indexes, where a single index structure holds references to multiple fields within a collection’s documents. (According to MongoDB Limits and Thresholds, a compound index can include at most 31 fields.) The following diagram illustrates an example of a compound index on two fields:

[Image: diagram of a compound index on two fields]

How will the data be distributed across shards during insertion, and how are the chunk ranges calculated during insertion?

Starting in version 2.6, MongoDB can use index intersection to fulfill queries. The choice between creating compound indexes that support your queries or relying on index intersection depends on the specifics of your system.

Index intersection does not eliminate the need for creating compound indexes. However, because both the list order (i.e. the order in which the keys are listed in the index) and the sort order (i.e. ascending or descending), matter in compound indexes, a compound index may not support a query condition that does not include the index prefix keys or that specifies a different sort order.

For example, if a collection orders has the following compound index, with the status field listed before the ord_date field:

{ status: 1, ord_date: -1 }

The compound index can support the following queries:

db.orders.find( { status: { $in: ["A", "P" ] } } )
db.orders.find(
   {
     ord_date: { $gt: new Date("2014-02-01") },
     status: {$in:[ "P", "A" ] }
   }
)

But not the following two queries:

db.orders.find( { ord_date: { $gt: new Date("2014-02-01") } } )
db.orders.find( { } ).sort( { ord_date: 1 } )

However, if the collection has two separate indexes:

{ status: 1 }
{ ord_date: -1 }

The two indexes can, either individually or through index intersection, support all four aforementioned queries.
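The prefix rule for the single compound index { status: 1, ord_date: -1 } can be sketched in plain JavaScript. This is only a simplified model, not the actual MongoDB query planner: usesIndexPrefix is a hypothetical helper that checks whether the leading index key appears in the filter, or leads the sort when the filter is empty.

```javascript
// Hypothetical helper: a simplified model of the index-prefix rule.
// It is NOT the MongoDB query planner; it only looks at the leading key.
function usesIndexPrefix(indexSpec, filter, sort) {
  var leadingKey = Object.keys(indexSpec)[0];
  var filterKeys = Object.keys(filter || {});
  var sortKeys = Object.keys(sort || {});

  // A filter on the leading key lets the compound index be used.
  if (filterKeys.indexOf(leadingKey) !== -1) {
    return true;
  }
  // With no filter, the sort must start with the leading key.
  return filterKeys.length === 0 && sortKeys[0] === leadingKey;
}

var index = { status: 1, ord_date: -1 };

// The four queries from the example, in the same order.
var results = [
  usesIndexPrefix(index, { status: { $in: ["A", "P"] } }),
  usesIndexPrefix(index, {
    ord_date: { $gt: new Date("2014-02-01") },
    status: { $in: ["P", "A"] }
  }),
  usesIndexPrefix(index, { ord_date: { $gt: new Date("2014-02-01") } }),
  usesIndexPrefix(index, {}, { ord_date: 1 })
];

console.log(results);
```

The first two entries come out true and the last two false, matching the supported and unsupported queries listed above. To see what the server really chooses, run the query with explain() instead of relying on a model like this.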

Since the PropertyId is a random value, does the compound key fit the definition of a monotonically increasing value?

For the fastest processing, ensure that your indexes fit entirely in RAM so that the system can avoid reading the index from disk.

To check the size of your indexes, use the db.collection.totalIndexSize() helper, which returns data in bytes:

> db.collection.totalIndexSize() 
4294976499

The above example shows an index size of almost 4.3 gigabytes. To ensure this index fits in RAM, you must not only have more than that much RAM available but also must have RAM available for the rest of the working set.

If you have and use multiple collections, you must consider the size of all indexes on all collections. The indexes and the working set must be able to fit in memory at the same time.
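As a rough sketch of that check across all collections, the index sizes can be summed with the shell helpers getCollectionNames(), getCollection() and totalIndexSize(). The function names totalIndexBytes and fitsInRam, and the working-set and RAM figures you pass in, are assumptions made for illustration; the sketch also assumes every name returned is a regular collection rather than a view.

```javascript
// Sum totalIndexSize() over every collection of a database handle.
// In the mongo shell, pass the global `db` object.
function totalIndexBytes(database) {
  return database.getCollectionNames().reduce(function (sum, name) {
    return sum + database.getCollection(name).totalIndexSize();
  }, 0);
}

// Hypothetical check: indexes plus an estimated working set must fit in RAM.
// Both workingSetBytes and ramBytes are figures you have to supply yourself.
function fitsInRam(indexBytes, workingSetBytes, ramBytes) {
  return indexBytes + workingSetBytes <= ramBytes;
}
```

In the shell this could be used as fitsInRam(totalIndexBytes(db), workingSetEstimate, ramBytes), where workingSetEstimate and ramBytes are your own measurements in bytes.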

For further reference, see Indexing Strategies.