Data:
student class
s1 ["english"]
s2 ["maths", "geography"]
...
sN ["history", "english"]
The field "class" is indexed, and values of it are all in array.
Query1:
db.students.explain('executionStats').aggregate([
{ $match: { $expr: { $setIsSubset: [ "$class", [ "maths" ] ] } } }
])
,
Query2:
db.students.explain('executionStats').aggregate([
{ $match: { "$class": "maths" } }
])
The first one does a COLLSCAN or IXSCAN with indexBounds class : [MinKey, MaxKey]
, depending on some conditions (e.g. the number of documents in the collection), while the second does a IXSCAN with indexBounds class : ["maths", "maths"]
.
Given that the array of the "class" field can have multiple values, I need the $setIsSubset semantics here. I found this. However, the double negation does a COLLSCAN instead of using the index as well. The corresponding query is as follows:
Query3:
db.students.explain('executionStats').aggregate([
{ $match: { class: { $not: { $elemMatch: { $nin: ["maths" ] } } } } }
])
It seems like MongoDB only does simple matching during the IXSCAN. Otherwise, it should be able to avoid doing a IXSCAN with [MIN, MAX] which essentially fetches all keys in the index. It can be possibly even slower than the COLLSCAN.
Wondering if there is anyway to achieve the $setIsSubset functionality, which can utilize the index efficiently.
Best Answer
An index is basically a mapping from a field value to a document. In MongoDB each element of the array is listed in the index separately.
Using that example data, the index might look something like:
This lends itself well to equality and inclusion searches. In a query for
{ $match: { "$class": "maths" } }
, it would simply start at the first occurrence of "maths", scan through the index entries until it reach a value that is not "maths", and pass the list of matching on to the next part of the query. The same works for an$in
search with a separate scanned range for each value being tested.This can also work for
$nin
by scanning the ranges of the index that fall between each value in the query.The query for
{$setIsSubset: {"$class", [input]}}
is essentiallyIn order to efficiently use an index with the curren structure, the MongoDB would need to scan all of the matching ranges to find documents that match, and scan all non-matching ranges for documents that do not match, sort one or both lists, and then compare the two lists to eliminate documents that have both matching and non-matching values.
I am not sure if that would perform better or worse than just scanning all of the documents.
It may be worth submitting a feature request to see if that can be implemented.
Also node that the following from the $expr documentation:
In MongoDB "multikey" means at least one of the fields in one of the indexed documents contains an array.
As a workaround, you might try doing separating that test into multiple aggregation stages, first test the array with
$in
which will use and index to reduce the number of documents examined, then test the with$setIsSubset
will be able examine only the documents that have at least one matching value.I would be very curious how this query performs compared with the two you have tried already: