PostgreSQL – How to efficiently get “the most recent corresponding row”

greatest-n-per-group | performance | postgresql | query-performance

I have a query pattern that must be very common, but I don't know how to write an efficient query for it. I want to look up the rows of a table that correspond to "the most recent date not after" the rows of another table.

I have a table, "inventory" say, which represents the inventory I hold on a certain day.

date       | good | quantity
------------------------------
2013-08-09 | egg  | 5
2013-08-09 | pear | 7
2013-08-02 | egg  | 1
2013-08-02 | pear | 2

and a table, "price" say, which holds the price of a good on a given day

date       | good | price
--------------------------
2013-08-07 | egg  | 120
2013-08-06 | pear | 200
2013-08-01 | egg  | 110
2013-07-30 | pear | 220

How can I efficiently get the "most recent" price for each row of the inventory table, i.e.

date       | pricing date | good | quantity | price
----------------------------------------------------
2013-08-09 | 2013-08-07   | egg  | 5        | 120
2013-08-09 | 2013-08-06   | pear | 7        | 200
2013-08-02 | 2013-08-01   | egg  | 1        | 110
2013-08-02 | 2013-07-30   | pear | 2        | 220

I know one way of doing this:

select inventory.date, max(price.date) as pricing_date, inventory.good
from inventory, price
where inventory.date >= price.date
and inventory.good = price.good
group by inventory.date, inventory.good

and then join this query back to inventory. For large tables, even the first query on its own (without the join back to inventory) is very slow. However, the same problem is solved quickly if I simply use my programming language to issue one max(price.date) ... where price.date <= date_of_interest ... order by price.date desc limit 1 query for each date_of_interest in the inventory table, so I know there is no computational impediment. I would, however, prefer to solve the whole problem with a single SQL query, because that would allow further SQL processing on the result.
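Spelled out, the full two-step query would look something like this (just a sketch of the approach described above):

select i.date, x.pricing_date, i.good, i.quantity, p.price
from inventory i
join (
   select inventory.date, max(price.date) as pricing_date, inventory.good
   from inventory, price
   where inventory.date >= price.date
   and inventory.good = price.good
   group by inventory.date, inventory.good
) x on x.date = i.date and x.good = i.good
join price p on p.good = x.good and p.date = x.pricing_date;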

Is there a standard way to do this efficiently? It feels like it must come up often and that there should be a way to write a fast query for it.

I'm using Postgres, but an SQL-generic answer would be appreciated.

Best Answer

It very much depends on circumstances and exact requirements.

Simple solution

With DISTINCT ON in Postgres (the column is named the_date here; it's best not to use the reserved word date as an identifier):

SELECT DISTINCT ON (i.good, i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good, i.the_date, p.the_date DESC;

Returned rows are ordered.

Or with NOT EXISTS in standard SQL (works with every RDBMS I know):

SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory  i
LEFT   JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE  NOT EXISTS (
   SELECT 1 FROM price p1
   WHERE  p1.good = p.good
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );

Same result, but in arbitrary sort order, unless you add ORDER BY.
Depending on data distribution, exact requirements and indexes, either one of these may be faster.

With only a few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases, other query techniques are (much) faster. See below.

Solutions with subqueries to compute max/min values are typically slower. Variants with CTEs are generally slower still. (CTEs improved with Postgres 12, which can inline simple ones.)

Plain views (as proposed by another answer) do not help performance at all in Postgres.


Proper solution

Strings and collation

First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.

Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB uses a locale-specific set of rules, like, in my case, de_AT.UTF-8. Find out with:

SHOW lc_collate;

This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse it gets. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":

SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good COLLATE "C", i.the_date, p.the_date DESC;

Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').

Index

If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):

CREATE INDEX price_good_date_desc_collate_c_idx
ON price (good COLLATE "C", the_date DESC);


You can even have multiple indexes with different collations on the same columns, if you also need goods sorted according to another (or the default) collation in other queries.
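In addition to the COLLATE "C" index above, such a second index might look like this (a sketch; the index name is made up):

-- default collation, for other queries that need locale-aware sort order:
CREATE INDEX price_good_date_default_idx ON price (good, the_date DESC);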

Normalize

Redundant strings (names of goods) bloat tables and indexes, which makes everything slower. A proper table layout avoids most of the problem. It could look like this:

CREATE TABLE good (
  good_id serial PRIMARY KEY
, good    text   NOT NULL
);

CREATE TABLE inventory (
  good_id  int  REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int  NOT NULL
, PRIMARY KEY(good_id, the_date)
);

CREATE TABLE price (
  good_id  int     REFERENCES good (good_id)
, the_date date    NOT NULL
, price    numeric NOT NULL
, PRIMARY KEY(good_id, the_date)
);
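If the original single-table layout already holds data, a one-time load into the new schema might look like this sketch (assuming the old tables were first renamed to inventory_old and price_old; those names are made up):

INSERT INTO good (good)
SELECT good FROM inventory_old
UNION
SELECT good FROM price_old;

INSERT INTO inventory (good_id, the_date, quantity)
SELECT g.good_id, o.date, o.quantity
FROM   inventory_old o
JOIN   good g USING (good);

INSERT INTO price (good_id, the_date, price)
SELECT g.good_id, o.date, o.price
FROM   price_old o
JOIN   good g USING (good);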

The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:

CREATE INDEX price_good_id_date_desc_idx ON price (good_id, the_date DESC);

With an integer good_id, collation is no longer a concern for this index.

Since Postgres 9.2, "covering" indexes for index-only scans can help some more, especially if tables hold additional columns that make the table substantially bigger than the index.
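For example, a covering index might look like this (a sketch; the index name is made up; Postgres 11 added the INCLUDE syntax, on older versions append price as a plain trailing index column instead):

-- price is stored in the index, allowing index-only scans for the queries above:
CREATE INDEX price_covering_idx ON price (good_id, the_date DESC) INCLUDE (price);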

The resulting queries are much faster:

DISTINCT ON

SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

NOT EXISTS

SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
AND    NOT EXISTS (
   SELECT 1 FROM price p1
   WHERE  p1.good_id = p.good_id
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );



Faster solutions

If that is still not fast enough, there are faster techniques.

Recursive CTE / JOIN LATERAL / correlated subquery

Especially for data distributions with many prices per good, fetching one price per inventory row directly, via JOIN LATERAL or a correlated subquery with LIMIT 1, can beat the queries above.
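A minimal sketch of the LATERAL variant (Postgres 9.3+), assuming the normalized schema above:

SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN LATERAL (
   SELECT the_date, price
   FROM   price
   WHERE  good_id  = i.good_id
   AND    the_date <= i.the_date
   ORDER  BY the_date DESC
   LIMIT  1
   ) p ON true
ORDER  BY i.good_id, i.the_date;

With the index on (good_id, the_date DESC) in place, each LATERAL lookup is a cheap top-1 index scan.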

Materialized view

If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as a materialized view.

Postgres 9.3+ has built-in support for materialized views. In older versions, you can easily implement a basic version manually.
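A minimal sketch (the view name is made up; the SELECT is the DISTINCT ON query from above):

CREATE MATERIALIZED VIEW inventory_priced AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- rerun after relevant changes to the base tables:
REFRESH MATERIALIZED VIEW inventory_priced;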