Mongodb – Huge index for covered queries (indexOnly: true) or (indexOnly: false) with collection scan

mongodb

I have two questions, I'll give you the facts, then my questions.

The actual index I'm using for one of my queries has the following characteristics :

  • Multikey (contains 4 keys).
  • The average key length is 14 characters.
  • 2 indexes values contains 2 letters (country code and state code).
  • 1 index values contain numerical values (between 0 and 100).
  • The index size is : ~395 MB.

Here's an example of what should be indexed (I don't know how mongodb actually is storing it's indexes, so I'm going to represent it like if it was a collection document):

{
    "geolocation.statecode": "AL",
    "personnu_field": 50,
    "geolocation.countrycode": "US"
    "field1.sufield": "Awesome value"
}

Informations about the collection concerned by the index :

  • The collection is a 6M documents.
  • Fast growing.
  • It's actually a collection of Twitter users with some additional business related fields.
  • field1 (see the example given above) is an array of subdocuments.
  • avgObjSize: 3 KB.
  • totalIndexSize: ~2.9 GB.
  • storageSize: ~19 GB.

I've done an explain to this long query :

db.crawler_users.find(
    {
        "geolocation.statecode": "AL",
        personnu_field: { "$lte": 65, "$gte": 30 },
        "geolocation.countrycode": "US"
    },
    {
        personnu_field:1, _id:0
    }
).hint(
    {
        "geolocation.countrycode" : 1,
        "personnu_field" : -1,
        "geolocation.statecode" : 1,
        "field1.sufield" : 1
    }
).explain()

And here's the result:

{
    "cursor" : "BtreeCursor geolocation.countrycode_1_personnu_field_-1_geolocation.statecode_1_field1.sufield_1",
    "isMultiKey" : true,
    "n" : 216,
    "nscannedObjects" : 788609,
    "nscanned" : 788609,
    "nscannedObjectsAllPlans" : 788609,
    "nscannedAllPlans" : 788609,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 128,
    "nChunkSkips" : 0,
    "millis" : 127451,
    "indexBounds" : {
        "geolocation.countrycode" : [
            [
                "US",
                "US"
            ]
        ],
        "personnu_field" : [
            [
                65,
                -1.7976931348623157e+308
            ]
        ],
        "geolocation.statecode" : [
            [
                    {
                    "$minElement" : 1
                },
                    {
                    "$maxElement" : 1
                }
            ]
        ],
        "search.keyword" : [
            [
                    {
                    "$minElement" : 1
                },
                    {
                    "$maxElement" : 1
                }
            ]
        ]
    },
}

As you can see the query takes much time to be executed (>2 minutes). And it's hitting the collection. even if the chosen fields already exists on the index.

I have two questions somehow related :

  1. Why is indexOnly: false. Isn't it supposed to be a covered index query? (see the explain later)
  2. I need to retrieve some additional fields from the collection (the id and the profile_picture url). Should I add them to the index to avoid hitting the collection, even if I'll never have to query them?

Best Answer

Why is indexOnly: false. Isn't it supposed to be a covered index query? (see the explain later)

I believe this is a result of the isMultiKey : true field in the explain results. Basically, currently indexOnly is never true when isMultiKey is true.

This is a known problem in general with multi key indexes. You can find the relevant bug here:

https://jira.mongodb.org/browse/SERVER-3173

As well as some decent explanation in the linked/dupe bug here:

https://jira.mongodb.org/browse/SERVER-7595

I think you have done some manual munging of the fields here for some reason, but I would guess that search.keywords is the problem here. Try an index without that as the final field and see if that performs better.

I need to retrieve some additional fields from the collection (the id and the profile_picture url). Should I add them to the index to avoid hitting the collection, even if I'll never have to query them?

I'd recommend a separate index for those queries rather than massive single index. If you end up with too many fields in the index you are going to lose most of the benefit by simply having to scan through a massive index instead of a collection. An index that big will also likely have performance issues for updates/writes.