MongoDB – Speeding up MongoDB query { $ne: [] }

mongodb

I'm currently doing a MongoDB aggregation but then have a query looking for all documents where a certain field, whose value is an array, is empty.

obs = db.collection.count({'things.titles': {'$ne': [] }})

To ensure this query uses an index, do I just need to do something like this?:

db.collection.ensureIndex({'things.titles': 1})

I understand this creates a multikey index, but it still takes a very long time (more than an hour) on a collection sized at 4739208 documents.

Best Answer

Your question specifies:

... a query looking for all documents where a certain field, whose value is an array, is empty.

However, your query counts all documents where the array is not empty. Is this intended?

Depending upon the cardinality of the values of this field, this could be what makes the query expensive.

To query for empty arrays, the operation would be:

 db.collection.explain().count( { 'things.titles' : [ ] } )

Note that there are edge cases to this query. For example, a document whose array contains an empty array as an element would also be returned:

  { "_id" : ..., "things.titles" : [ [ ] ] }
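To make the difference between the two filters concrete, here is a minimal in-memory sketch of how an equality match against [] and its $ne negation treat a field value. The function names are hypothetical and this only emulates the matching rules described above; it is not how the server implements them.

```javascript
// In-memory emulation of equality matching against an empty array.
// Illustration only; function names are hypothetical.
function isEmptyArray(value) {
  return Array.isArray(value) && value.length === 0;
}

// Emulates { field: [] }: true when the value is exactly [] or is an
// array that contains [] as one of its elements.
function matchesEqEmptyArray(value) {
  if (isEmptyArray(value)) return true;
  return Array.isArray(value) && value.some(isEmptyArray);
}

// Emulates { field: { $ne: [] } }: the negation of the equality match,
// so a document that lacks the field (undefined here) also matches.
function matchesNeEmptyArray(value) {
  return !matchesEqEmptyArray(value);
}

// Sample field values: empty, nested empty, non-empty, and missing.
const samples = [[], [[]], ["a"], undefined];
const eqResults = samples.map(matchesEqEmptyArray);
const neResults = samples.map(matchesNeEmptyArray);
```

Since $ne is a negation, documents that do not have the field at all are counted by the original query as well, which may or may not be what you want.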

The index usage of the count operation can be determined using the collection.explain() function.

For example:

 db.collection.explain().count({'things.titles': {'$ne': [] }})

The output will provide information showing the index bounds being used (if they exist).
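If you want to check the plan programmatically rather than by eye, a small helper can walk the explain document. This is a sketch under assumptions: it expects the queryPlanner.winningPlan shape with stage / inputStage / inputStages fields, the helper names are hypothetical, and the sample object is hand-written rather than real server output. Newer server versions may nest the plan differently.

```javascript
// Recursively collect the stage names from an explain plan node.
function collectStages(plan, found = []) {
  if (!plan || typeof plan !== "object") return found;
  if (typeof plan.stage === "string") found.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, found);
  if (Array.isArray(plan.inputStages)) {
    plan.inputStages.forEach((child) => collectStages(child, found));
  }
  return found;
}

// True when the winning plan contains an index-based stage and no
// collection scan. Returns false if the expected fields are absent.
function usesIndex(explainOutput) {
  const planner = explainOutput && explainOutput.queryPlanner;
  if (!planner || !planner.winningPlan) return false;
  const stages = collectStages(planner.winningPlan);
  const indexStages = ["IXSCAN", "COUNT_SCAN"];
  return (
    stages.some((s) => indexStages.includes(s)) && !stages.includes("COLLSCAN")
  );
}

// Hand-written sample shaped like explain output (hypothetical values).
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "COUNT",
      inputStage: {
        stage: "FETCH",
        inputStage: { stage: "IXSCAN", indexName: "things.titles_1" },
      },
    },
  },
};

const sampleUsesIndex = usesIndex(sampleExplain);
```

In the shell, the same helper could be fed the result of the explain call above, for example usesIndex(db.collection.explain().count({'things.titles': {'$ne': [] }})).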

Related Solutions

Mongodb – Huge index for covered queries (indexOnly: true) or (indexOnly: false) with collection scan

Why is indexOnly: false? Isn't it supposed to be a covered index query? (see the explain later)

I believe this is a result of the isMultiKey : true field in the explain results. Basically, currently indexOnly is never true when isMultiKey is true.

This is a known problem in general with multi key indexes. You can find the relevant bug here:

https://jira.mongodb.org/browse/SERVER-3173

As well as some decent explanation in the linked/dupe bug here:

https://jira.mongodb.org/browse/SERVER-7595

I think you have done some manual munging of the fields here for some reason, but I would guess that search.keywords is the problem here. Try an index without that as the final field and see if that performs better.

I need to retrieve some additional fields from the collection (the id and the profile_picture url). Should I add them to the index to avoid hitting the collection, even if I'll never have to query them?

I'd recommend a separate index for those queries rather than a massive single index. If you end up with too many fields in the index, you are going to lose most of the benefit by simply having to scan through a massive index instead of a collection. An index that big will also likely have performance issues for updates/writes.

MongoDB – Searching for Array Elements Nested in Documents

Remember, MongoDB has a dynamic schema. So it is perfectly ok to store this document:

{
  "JobNumber" : "50001-01",
  "CustomerId" : "joe",
  "IdentifierNumber" : NumberLong(8812739),
  "TimesPrinted" : 0,
  "Packaging" : {"bundle":1200,"box":120,"pallet":3}
}

and this document

{
  "JobNumber" : "50001-02",
  "CustomerId" : "jane",
  "IdentifierNumber" : NumberLong(8812739),
  "TimesPrinted" : 0,
  "Packaging" : {"sack":200}
}

in the same collection.

Given that, I wouldn't query for the Nth document, but for a given field in the subdocument, for example:

 db.collection.find({"Packaging.bundle":1200})

This runs just fine with MongoDB. Note that field names are case-sensitive, so the path has to match the stored "Packaging" field exactly. The reason the query works across both documents is that if a field isn't present in a document, it is evaluated as null for a query. And null is definitely not equal to 1200.
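The missing-field behaviour can be sketched in plain JavaScript. This is only an in-memory illustration with hypothetical helper names (resolvePath, matchesEquality); it ignores arrays and is not the server's matching code. The sample documents are trimmed versions of the two shown above.

```javascript
// Resolve a dotted path such as "Packaging.bundle" against a plain object.
// Returns null when any segment is missing, mirroring how a missing field
// is treated as null in an equality comparison. Arrays are not handled.
function resolvePath(doc, path) {
  let current = doc;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object" || !(key in current)) {
      return null;
    }
    current = current[key];
  }
  return current;
}

// Equality check on the resolved value.
function matchesEquality(doc, path, expected) {
  return resolvePath(doc, path) === expected;
}

const jobs = [
  { JobNumber: "50001-01", Packaging: { bundle: 1200, box: 120, pallet: 3 } },
  { JobNumber: "50001-02", Packaging: { sack: 200 } },
];

// Only the first job has Packaging.bundle equal to 1200.
const matched = jobs
  .filter((doc) => matchesEquality(doc, "Packaging.bundle", 1200))
  .map((doc) => doc.JobNumber);

// Field names are case-sensitive, so a lowercase path resolves to null
// for both documents and matches nothing.
const lowercaseMatched = jobs.filter((doc) =>
  matchesEquality(doc, "packaging.bundle", 1200)
);
```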

As for the performance: it really depends on how big your collection is and what your queries look like. While the query as shown above may be rather slow in a collection containing hundreds of thousands of documents (or even more) without an index, it can be extremely fast once you have created an index on it, e.g.

    db.collection.ensureIndex({"Packaging.bundle":1,"Packaging.box":1,"Packaging.pallet":1});

Whether you can create an index like this obviously depends on whether you really have arbitrary packaging or simply a variety of packaging options. If the latter is the case, I'd create an index for each of the packaging options, utilizing sparse indices, e.g.

 db.collection.ensureIndex({"Packaging.sack":1},{sparse:true})

This would reduce the index size, as only documents which hold the field "Packaging.sack" would be contained in this index.
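With several packaging options, the per-option sparse indexes could be generated rather than typed out. The sketch below is an assumption-laden example: the helper names are hypothetical, the list of packaging types is taken from the two sample documents, and it calls createIndex(keys, options), the newer name for ensureIndex, on whatever collection object you pass in.

```javascript
// Build one sparse single-field index spec per packaging type.
// The "Packaging." prefix matches the field name in the sample documents.
function buildSparseIndexSpecs(packagingTypes) {
  return packagingTypes.map((type) => ({
    keys: { ["Packaging." + type]: 1 },
    options: { sparse: true },
  }));
}

// Apply the specs to any object exposing createIndex(keys, options),
// such as a shell or driver collection. Returns whatever createIndex returns.
function createSparseIndexes(collection, packagingTypes) {
  return buildSparseIndexSpecs(packagingTypes).map((spec) =>
    collection.createIndex(spec.keys, spec.options)
  );
}

// Packaging types seen in the sample documents (assumed to be the full set).
const specs = buildSparseIndexSpecs(["bundle", "box", "pallet", "sack"]);
```

Keep in mind that every additional index adds write overhead, so this only makes sense for options you actually query on.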

If you really have arbitrary fields in the documents, I wonder how you create a model for it ;)

If you are talking about just some tens of thousands of documents, you might even get satisfying results without an index.
