MongoDB – Why aggregate query with lookup is extremely slow

I have a mongodb query that works but takes too long to execute and causes my CPU to spike to 100% while it is executing. It is this query here:

  db.logs.aggregate([
    {
      $lookup:
      {
        from: 'graphs',
        let: { logId : '$_id' },
        as: 'matched_docs',
        pipeline:
          [
            {
              $match: {
                $expr: {
                  $and: [
                    { $eq: ['$$logId', '$lId'] },
                    { $gte: [ '$d', new Date('2020-12-21T00:00:00.000Z') ] },
                    { $lt: [ '$d', new Date('2020-12-23T00:00:00.000Z') ] }
                  ]
                }
              }
            }
          ],
      }
    },
    {
      $match: {
        $expr: {
          $and: [
            { $eq: [ '$matched_docs', [] ] },
            { $gte: [ '$createDate', new Date('2020-12-21T00:00:00.000Z') ] },
            { $lt: [ '$createDate', new Date('2020-12-23T00:00:00.000Z') ] }
          ]
        }
      }
    },
    { $limit: 5 }
  ]);

This query finds all documents in the db.logs collection that have not yet been transformed and loaded into db.graphs. It is analogous to this SQL approach:

SELECT *
FROM db.logs
WHERE db.logs._id NOT IN (
        SELECT lId FROM db.graphs
        WHERE db.graphs.d >= @startTime
        AND db.graphs.d < @endTime
    )
    AND db.logs.createDate >= @startTime
    AND db.logs.createDate < @endTime

The db.logs collection has over 1 million documents, and here are its indexes:

db.logs.getIndexes();
[
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_"
        },
        {
                "v" : 2,
                "key" : {
                        "createDate" : 1
                },
                "name" : "createDate_1"
        }
]

And db.reportgraphs has fewer than 100 records with indexes on every property/column.

To analyze why the query is so slow and CPU-intensive, I appended .explain() to it, but mongo returned an error saying db.logs.aggregate(...).explain() is not a function. I also tried adding , {$explain: 1} immediately after { $limit: 5 } and got an error saying Unrecognized pipeline stage name $explain.

So I guess I have two questions:

Can someone give feedback on why my mongo query is so slow or possible solutions?
Is there a way to see the execution plan of my mongo query so that I can review where the performance bottlenecks are?

UPDATE

A possible solution I'm considering is to add a boolean property db.logs.isGraphed, then use a simple db.logs.find({isGraphed:false, createDate:{...date filter...}}).limit(5). I wasn't sure whether this is the approach most people would have considered in the first place.
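
A minimal sketch of that flag-based idea, assuming the field names above. The helper names ensureFlagIndex, findUngraphed and markGraphed are hypothetical and only for illustration; createIndex, find, limit and updateOne are standard shell collection methods. It also assumes every log document actually carries isGraphed, because a filter on isGraphed:false does not match documents where the field is missing, so existing documents would need a one-time backfill.

```javascript
// Sketch only: `db` is the shell's database handle, and the helper
// names are hypothetical.
function ensureFlagIndex(db) {
  // Compound index: the equality field first, then the range field.
  return db.logs.createIndex({ isGraphed: 1, createDate: 1 });
}

function findUngraphed(db, start, end) {
  // Up to 5 logs in the date window that have not been graphed yet.
  return db.logs
    .find({ isGraphed: false, createDate: { $gte: start, $lt: end } })
    .limit(5);
}

function markGraphed(db, logId) {
  // Call once the log has been transformed and loaded into graphs.
  return db.logs.updateOne({ _id: logId }, { $set: { isGraphed: true } });
}
```

The trade-off is that the flag has to be kept in sync with db.graphs by whatever process does the loading.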

Best Answer

It is slow because it is not using an index. For each document in the logs collection, it is doing a full collection scan on the graphs collection.

From the $expr documentation page:

$expr only uses indexes on the from collection for equality matches in a $match stage.
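
One rewrite worth trying is sketched below; it is a sketch under stated assumptions, not a tested fix. It uses the collection and field names from the question, filters logs on createDate with plain query operators before the $lookup (so the existing createDate_1 index can narrow the input and the lookup runs for fewer documents), and keeps only the equality comparison inside $expr. Whether the index on lId is actually used depends on the server version, so check the plan. buildPipeline and explainPipeline are hypothetical helper names; db.collection.explain(...).aggregate(...) is the shell helper for viewing an aggregation's execution plan.

```javascript
// Sketch only: helper names are hypothetical; collection and field
// names come from the question.
function buildPipeline(start, end) {
  return [
    // Filter logs first so $lookup only runs for documents in the window.
    { $match: { createDate: { $gte: start, $lt: end } } },
    {
      $lookup: {
        from: 'graphs',
        let: { logId: '$_id' },
        as: 'matched_docs',
        pipeline: [
          {
            $match: {
              // Date range as an ordinary query predicate on graphs.d.
              d: { $gte: start, $lt: end },
              // Only the comparison with the let variable needs $expr.
              $expr: { $eq: ['$lId', '$$logId'] }
            }
          }
        ]
      }
    },
    // Keep only logs that have no matching graphs document.
    { $match: { matched_docs: { $size: 0 } } },
    { $limit: 5 }
  ];
}

function explainPipeline(db, pipeline) {
  // 'executionStats' adds runtime figures such as documents examined.
  return db.logs.explain('executionStats').aggregate(pipeline);
}

// In the shell, where `db` is defined:
//   const pipeline = buildPipeline(new Date('2020-12-21T00:00:00.000Z'),
//                                  new Date('2020-12-23T00:00:00.000Z'));
//   explainPipeline(db, pipeline);  // inspect the plan
//   db.logs.aggregate(pipeline);    // run the query
```

Passing { explain: true } as the second argument to db.logs.aggregate(...) is another way to get the plan in the shell.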

Related Solutions

SQL Server – Optimize Slow Running Aggregate of Aggregate Query

As you are repeating this query for multiple months, you will be continually re-aggregating the same rows.

For example, the rows in the first month will always be brought back by the t1.post_date < @report_date criteria, so they will be re-processed for every month.

To avoid this, I'd probably consider working through it iteratively, a month at a time from the start. Depending on the volatility of historic data, I might also consider storing the pre-calculated results in the database rather than re-calculating them each month.

To calculate this at run time you could create a temporary table with the following structure.

CREATE TABLE #balance
  (
     department_id   INT NOT NULL,
     location_id     INT NOT NULL,
     account_id      INT NOT NULL,
     balance_to_date MONEY NOT NULL,
     PRIMARY KEY (department_id, location_id, account_id)
  );

You could also consider adding the following index on your transactions table

ALTER TABLE transactions
  ADD post_date_year_month AS (10000 * YEAR(post_date) + MONTH(post_date))

CREATE INDEX ix
  ON transactions(post_date_year_month, department_id, location_id, account_id)
  INCLUDE (amount)

Then extract a month at a time from transactions and merge into #balance (with a when matched then increment, when not matched insert).

The leading post_date_year_month column means that, as long as you write the query sargably, the extraction of each month can be done efficiently, and the extracted rows for a month will be ordered by department_id, location_id, account_id, making a merge join against #balance possible without a sort.

Whilst that could benefit this particular query, you'd need to assess the utility of this index against your overall workload.

Then calculate the department_id, location_id totals from #balance (this can leverage the PK order to avoid a sort), store those somewhere, and move on to the next month.

(Or possibly instead of #balance you could use a "temporary" permanent table balance and create an indexed view on that to avoid the separate explicit aggregation step and just copy the values straight from that before moving on)

PostgreSQL – Aggregate Query with MIN Function

This can be much simpler still with DISTINCT ON:

SELECT DISTINCT ON (product_id)
       product_id
     , CASE WHEN stock = 0 THEN NULL ELSE warehouse_id END AS warehouse_id
     , stock
     , CASE WHEN stock = 0 THEN NULL ELSE price END AS price
FROM   product_stock
ORDER  BY product_id, (stock = 0), price;

Assuming stock to be NOT NULL.

About DISTINCT ON:

Select first row in each GROUP BY group?

Postgres has a proper boolean type and one can ORDER BY any boolean expression. FALSE sorts before TRUE sorts before NULL. So rows with (stock = 0) sort behind rows with any other value for stock - except NULL, which would sort last.
