PostgreSQL – How to efficiently get “the most recent corresponding row”

greatest-n-per-group | performance | postgresql | query-performance

I have a query pattern that must be very common, but I don't know how to write an efficient query for it. I want to look up the rows of a table that correspond to "the most recent date not after" the rows of another table.

I have a table, "inventory" say, which represents the inventory I hold on a certain day.

date       | good | quantity
------------------------------
2013-08-09 | egg  | 5
2013-08-09 | pear | 7
2013-08-02 | egg  | 1
2013-08-02 | pear | 2

and a table, "price" say, which holds the price of a good on a given day

date       | good | price
--------------------------
2013-08-07 | egg  | 120
2013-08-06 | pear | 200
2013-08-01 | egg  | 110
2013-07-30 | pear | 220

How can I efficiently get the "most recent" price for each row of the inventory table, i.e.

date       | pricing date | good | quantity | price
----------------------------------------------------
2013-08-09 | 2013-08-07   | egg  | 5        | 120
2013-08-09 | 2013-08-06   | pear | 7        | 200
2013-08-02 | 2013-08-01   | egg  | 1        | 110
2013-08-02 | 2013-07-30   | pear | 2        | 220

I know one way of doing this:

select inventory.date, max(price.date) as pricing_date, inventory.good
from inventory, price
where inventory.date >= price.date
and inventory.good = price.good
group by inventory.date, inventory.good

and then join this query back to inventory. For large tables, even the first query on its own (without the join back to inventory) is very slow. However, the same problem is solved quickly if I simply use my programming language to issue one max(price.date) ... where price.date <= date_of_interest ... order by price.date desc limit 1 query for each date_of_interest in the inventory table, so I know there is no computational impediment. I would, however, prefer to solve the whole problem with a single SQL query, because that would allow further SQL processing on the result.
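Spelled out, the full two-step query would look something like this (just a sketch of the approach described above):

select i.date, x.pricing_date, i.good, i.quantity, p.price
from inventory i
join (
   select inventory.date, max(price.date) as pricing_date, inventory.good
   from inventory, price
   where inventory.date >= price.date
   and inventory.good = price.good
   group by inventory.date, inventory.good
) x on x.date = i.date and x.good = i.good
join price p on p.good = x.good and p.date = x.pricing_date;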

Is there a standard way to do this efficiently? It feels like it must come up often and that there should be a way to write a fast query for it.

I'm using Postgres, but an SQL-generic answer would be appreciated.

Best Answer

It very much depends on circumstances and exact requirements.

Simple solution

With DISTINCT ON in Postgres (the column is named the_date here; it's best not to use the reserved word date as an identifier):

SELECT DISTINCT ON (i.good, i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good, i.the_date, p.the_date DESC;

Returned rows are ordered.

Or with NOT EXISTS in standard SQL (works with every RDBMS I know):

SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory  i
LEFT   JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE  NOT EXISTS (
   SELECT 1 FROM price p1
   WHERE  p1.good = p.good
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );

Same result, but in arbitrary sort order, unless you add ORDER BY.
Depending on data distribution, exact requirements and indexes, either one of these may be faster.

With only a few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases, other query techniques are (much) faster. See below.

Solutions with subqueries to compute max/min values are typically slower. Variants with CTEs are generally slower still. (CTEs improved with Postgres 12, which can inline simple ones.)

Plain views (as proposed by another answer) do not help performance at all in Postgres.


Proper solution

Strings and collation

First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.

Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB uses a locale-specific set of rules, like, in my case, de_AT.UTF-8. Find out with:

SHOW lc_collate;

This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse it gets. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":

SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good COLLATE "C", i.the_date, p.the_date DESC;

Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').

Index

If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):

CREATE INDEX price_good_date_desc_collate_c_idx
ON price (good COLLATE "C", the_date DESC);


You can even have multiple indexes with different collations on the same columns, if you also need goods sorted according to another (or the default) collation in other queries.
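In addition to the COLLATE "C" index above, such a second index might look like this (a sketch; the index name is made up):

-- default collation, for other queries that need locale-aware sort order:
CREATE INDEX price_good_date_default_idx ON price (good, the_date DESC);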

Normalize

Redundant strings (names of goods) bloat tables and indexes, which makes everything slower. A proper table layout avoids most of the problem. It could look like this:

CREATE TABLE good (
  good_id serial PRIMARY KEY
, good    text   NOT NULL
);

CREATE TABLE inventory (
  good_id  int  REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int  NOT NULL
, PRIMARY KEY(good_id, the_date)
);

CREATE TABLE price (
  good_id  int     REFERENCES good (good_id)
, the_date date    NOT NULL
, price    numeric NOT NULL
, PRIMARY KEY(good_id, the_date)
);
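If the original single-table layout already holds data, a one-time load into the new schema might look like this sketch (assuming the old tables were first renamed to inventory_old and price_old; those names are made up):

INSERT INTO good (good)
SELECT good FROM inventory_old
UNION
SELECT good FROM price_old;

INSERT INTO inventory (good_id, the_date, quantity)
SELECT g.good_id, o.date, o.quantity
FROM   inventory_old o
JOIN   good g USING (good);

INSERT INTO price (good_id, the_date, price)
SELECT g.good_id, o.date, o.price
FROM   price_old o
JOIN   good g USING (good);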

The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:

CREATE INDEX price_good_id_date_desc_idx ON price (good_id, the_date DESC);

With an integer good_id, collation is no longer a concern for this index.

Since Postgres 9.2, "covering" indexes for index-only scans can help some more, especially if tables hold additional columns that make the table substantially bigger than the index.
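For example, a covering index might look like this (a sketch; the index name is made up; Postgres 11 added the INCLUDE syntax, on older versions append price as a plain trailing index column instead):

-- price is stored in the index, allowing index-only scans for the queries above:
CREATE INDEX price_covering_idx ON price (good_id, the_date DESC) INCLUDE (price);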

The resulting queries are much faster:

DISTINCT ON

SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

NOT EXISTS

SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
AND    NOT EXISTS (
   SELECT 1 FROM price p1
   WHERE  p1.good_id = p.good_id
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );



Faster solutions

If that is still not fast enough, there are faster techniques.

Recursive CTE / JOIN LATERAL / correlated subquery

Especially for data distributions with many prices per good, fetching one price per inventory row directly, via JOIN LATERAL or a correlated subquery with LIMIT 1, can beat the queries above.
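A minimal sketch of the LATERAL variant (Postgres 9.3+), assuming the normalized schema above:

SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN LATERAL (
   SELECT the_date, price
   FROM   price
   WHERE  good_id  = i.good_id
   AND    the_date <= i.the_date
   ORDER  BY the_date DESC
   LIMIT  1
   ) p ON true
ORDER  BY i.good_id, i.the_date;

With the index on (good_id, the_date DESC) in place, each LATERAL lookup is a cheap top-1 index scan.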

Materialized view

If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as a materialized view.

Postgres 9.3+ has built-in support for materialized views. In older versions, you can easily implement a basic version manually.
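A minimal sketch (the view name is made up; the SELECT is the DISTINCT ON query from above):

CREATE MATERIALIZED VIEW inventory_priced AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- rerun after relevant changes to the base tables:
REFRESH MATERIALIZED VIEW inventory_priced;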