Mongodb – In MongoDB, $setIsSubset does not make efficient use of index

mongodb

Data:
student        class
s1          ["english"]
s2          ["maths", "geography"]
...
sN          ["history", "english"]

The field "class" is indexed, and values of it are all in array.

Query1:
db.students.explain('executionStats').aggregate([
  { $match: { $expr: { $setIsSubset:  [ "$class", [ "maths" ] ] } } }
])

,

Query2:
db.students.explain('executionStats').aggregate([
  { $match: { "$class": "maths" } }
])

The first one does a COLLSCAN or IXSCAN with indexBounds class : [MinKey, MaxKey], depending on some conditions (e.g. the number of documents in the collection), while the second does a IXSCAN with indexBounds class : ["maths", "maths"].

Given that the array of the "class" field can have multiple values, I need the $setIsSubset semantics here. I found this. However, the double negation does a COLLSCAN instead of using the index as well. The corresponding query is as follows:

Query3:
db.students.explain('executionStats').aggregate([
  { $match: { class: { $not: { $elemMatch: { $nin: ["maths" ] } } } } }
])

It seems like MongoDB only does simple matching during the IXSCAN. Otherwise, it should be able to avoid doing a IXSCAN with [MIN, MAX] which essentially fetches all keys in the index. It can be possibly even slower than the COLLSCAN.

Wondering if there is anyway to achieve the $setIsSubset functionality, which can utilize the index efficiently.

Best Answer

An index is basically a mapping from a field value to a document. In MongoDB each element of the array is listed in the index separately.

Using that example data, the index might look something like:

"geography" => s2
"english" => s1
"english" => sN
"history" => sN
"maths" => s2

This lends itself well to equality and inclusion searches. In a query for{ $match: { "$class": "maths" } }, it would simply start at the first occurrence of "maths", scan through the index entries until it reach a value that is not "maths", and pass the list of matching on to the next part of the query. The same works for an $in search with a separate scanned range for each value being tested.

This can also work for $nin by scanning the ranges of the index that fall between each value in the query.

The query for {$setIsSubset: {"$class", [input]}} is essentially

${and:[
   {class:{$in:[input]}}, 
   {class:{$not:{$nin:[input]}}}
]}

In order to efficiently use an index with the curren structure, the MongoDB would need to scan all of the matching ranges to find documents that match, and scan all non-matching ranges for documents that do not match, sort one or both lists, and then compare the two lists to eliminate documents that have both matching and non-matching values.

I am not sure if that would perform better or worse than just scanning all of the documents.

It may be worth submitting a feature request to see if that can be implemented.

Also node that the following from the $expr documentation:

$expr does not support multikey indexes.

In MongoDB "multikey" means at least one of the fields in one of the indexed documents contains an array.

As a workaround, you might try doing separating that test into multiple aggregation stages, first test the array with $in which will use and index to reduce the number of documents examined, then test the with $setIsSubset will be able examine only the documents that have at least one matching value.

I would be very curious how this query performs compared with the two you have tried already:

db.students.explain('executionStats').aggregate([
  { $match: { class: { $in: [ "maths" ] }} },
  { $match: { $expr: { $setIsSubset:  [ "$class", [ "maths" ] ] } } }
])