Mongodb – Why aggregate query with lookup is extremely slow

aggregatemongodb

I have a mongodb query that works but takes too long to execute and causes my CPU to spike to 100% while it is executing. It is this query here:

  db.logs.aggregate([
    {
      $lookup:
      {
        from: 'graphs',
        let: { logId : '$_id' },
        as: 'matched_docs',
        pipeline:
          [
            {
              $match: {
                $expr: {
                  $and: [
                    { $eq: ['$$logId', '$lId'] },
                    { $gte: [ '$d', new Date('2020-12-21T00:00:00.000Z') ] },
                    { $lt: [ '$d', new Date('2020-12-23T00:00:00.000Z') ] }
                  ]
                }
              }
            }
          ],
      }
    },
    {
      $match: {
        $expr: {
          $and: [
            { $eq: [ '$matched_docs', [] ] },
            { $gte: [ '$createDate', new Date('2020-12-21T00:00:00.000Z') ] },
            { $lt: [ '$createDate', new Date('2020-12-23T00:00:00.000Z') ] }
          ]
        }
      }
    },
    { $limit: 5 }
  ]);

This query looks for all records in the db.logs collection for which they have not been transformed and loaded into db.graphs. It's analogous to this SQL approach:

WHERE db.logs._id NOT IN (
        SELECT lId FROM db.graphs
        WHERE db.graphs.d >= @startTime
        AND db.graphs.d < @endTime
    )
    AND db.logs.createDate >= @startTime
    AND db.logs.createDate < @endTime
)

The db.logs has over 1 Million records and here are the indexes:

db.logs.getIndexes();
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_"
        },
        {
                "v" : 2,
                "key" : {
                        "createDate" : 1
                },
                "name" : "createDate_1"
        }
]

And db.reportgraphs has fewer than 100 records with indexes on every property/column.

In my attempt to analyze why the mongo query is so slow and CPU intensive, I suffixed my mongo query with a .explain(). But mongo gave me the error saying db.logs.aggregate(...).explain() is not a function. I also tried adding , {$explain: 1} immediately after { $limit: 5} and got an error saying Unrecognized pipeline stage name $explain.

So I guess I have two questions:

  1. Can someone give feedback on why my mongo query is so slow or possible solutions?
  2. Is there a way to see the execution plan of my mongo query so that I can review where the performance bottle necks are?

UPDATE

A possible solution I'm considering is to have a property db.logs.isGraphed:boolean . Then use a simple db.logs.find({isGraphed:false, createDate:{...date filter...}}).limit(5). Wasn't sure if this is the approach most people would have considered in the first place?

Best Answer

It is slow because it is not using an index. For each document in the logs collection, it is doing a full collection scan on the graphs collection.

From the $expr documentation page:

$expr only uses indexes on the from collection for equality matches in a $match stage.