I have a query pattern that must be very common, but I don't know how to write an efficient query for it. I want to look up the rows of a table that correspond to "the most recent date not after" the rows of another table.
I have a table, "inventory" say, which represents the inventory I hold on a certain day:
date       | good | quantity
-----------+------+---------
2013-08-09 | egg  |        5
2013-08-09 | pear |        7
2013-08-02 | egg  |        1
2013-08-02 | pear |        2
and a table, "price" say, which holds the price of a good on a given day:
date       | good | price
-----------+------+------
2013-08-07 | egg  |   120
2013-08-06 | pear |   200
2013-08-01 | egg  |   110
2013-07-30 | pear |   220
How can I efficiently get the "most recent" price for each row of the inventory table, i.e.
date       | pricing date | good | quantity | price
-----------+--------------+------+----------+------
2013-08-09 | 2013-08-07   | egg  |        5 |   120
2013-08-09 | 2013-08-06   | pear |        7 |   200
2013-08-02 | 2013-08-01   | egg  |        1 |   110
2013-08-02 | 2013-07-30   | pear |        2 |   220
I know one way of doing this:
```sql
select inventory.date, max(price.date) as pricing_date, inventory.good
from   inventory, price
where  inventory.date >= price.date
and    inventory.good  = price.good
group  by inventory.date, inventory.good
```
and then join this query back to inventory. For large tables, even the first query (without the join back to inventory) is very slow. However, the same problem is quickly solved if I simply use my programming language to issue one `max(price.date) ... where price.date <= date_of_interest ... order by price.date desc limit 1` query for each `date_of_interest` from the inventory table, so I know there is no computational impediment. I would, however, prefer to solve the whole problem with a single SQL query, because that would allow me to do further SQL processing on the result.
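For reference, the per-row query issued from application code looks roughly like this (a sketch; `:good` and `:date_of_interest` are placeholders bound by the client library):

```sql
-- one round trip per inventory row
select date, price
from   price
where  good  = :good
and    date <= :date_of_interest
order  by date desc
limit  1;
```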
Is there a standard way to do this efficiently? It feels like it must come up often and that there should be a way to write a fast query for it.
I'm using Postgres, but an SQL-generic answer would be appreciated.
Best Answer
It very much depends on circumstances and exact requirements.
Simple solution
With `DISTINCT ON` in Postgres. Returned rows are ordered.
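A sketch of such a `DISTINCT ON` query against the tables from the question (the join condition `p.date <= i.date` encodes "most recent date not after"):

```sql
SELECT DISTINCT ON (i.date, i.good)
       i.date, p.date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory i
JOIN   price     p ON p.good = i.good AND p.date <= i.date
ORDER  BY i.date, i.good, p.date DESC;  -- latest price per group wins
```

`DISTINCT ON` keeps the first row per group as defined by the `ORDER BY`, so sorting `p.date DESC` within each `(i.date, i.good)` group picks the most recent applicable price.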
Or with `NOT EXISTS` in standard SQL (works with every RDBMS I know). Same result, but with arbitrary sort order, unless you add `ORDER BY`.
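One way to spell the `NOT EXISTS` variant (a sketch against the tables from the question):

```sql
SELECT i.date, p.date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory i
JOIN   price     p ON p.good = i.good AND p.date <= i.date
WHERE  NOT EXISTS (            -- no later price that is still <= i.date
   SELECT 1
   FROM   price p1
   WHERE  p1.good = p.good
   AND    p1.date <= i.date
   AND    p1.date >  p.date
   );
```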
Depending on data distribution, exact requirements, and indexes, either one of these may be faster.
With only a few rows per good, `DISTINCT ON` is typically faster, and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster still. See below. Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower still. (CTEs improved with Postgres 12.)
Plain views (as proposed by another answer) do not help performance at all in Postgres.
Proper solution
Strings and collation
First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (`text`, `varchar`, ...) is done according to the current `COLLATION`. Typically, your DB uses some local set of rules; in my case: `de_AT.UTF-8` (find out with `SHOW lc_collate;`). This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse. If you do not actually care about collation rules in your output (or the sort order), this can be faster with `COLLATE "C"`. Note the added collation in two places.
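A sketch of the `DISTINCT ON` query with the collation added in both spots where `good` is sorted (the `DISTINCT ON` expression and the matching `ORDER BY` item):

```sql
SELECT DISTINCT ON (i.date, i.good COLLATE "C")
       i.date, p.date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory i
JOIN   price     p ON p.good = i.good AND p.date <= i.date
ORDER  BY i.date, i.good COLLATE "C", p.date DESC;
```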
Twice as fast in my test with 20k rows each and very basic names ('good123').
Index
If your query is supposed to use an index, columns with character data have to use a matching collation (`good` in the example).
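For instance, an index matching a query that sorts `good` with `COLLATE "C"` (the index name is my own invention):

```sql
CREATE INDEX price_good_date_idx ON price (good COLLATE "C", date DESC);
```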
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
Normalize
Redundant strings (the name of each good) bloat tables and indexes, which makes everything slower. A proper table layout avoids most of the problem. It could look like this:
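A sketch of such a layout; table names, column names, and exact types here are my assumptions, not a prescription:

```sql
CREATE TABLE good (
  good_id serial PRIMARY KEY
, good    text NOT NULL UNIQUE        -- the name lives in one place only
);

CREATE TABLE inventory (
  date     date NOT NULL
, good_id  int  NOT NULL REFERENCES good
, quantity int  NOT NULL
, PRIMARY KEY (date, good_id)
);

CREATE TABLE price (
  date    date NOT NULL
, good_id int  NOT NULL REFERENCES good
, price   int  NOT NULL
, PRIMARY KEY (good_id, date)         -- serves most look-ups by good
);
```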
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on `price` with descending sort order on the second column may improve performance. Again, the collation must match your query (see above).
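Assuming a normalized layout with an integer `good_id` column (my naming), such an index could look like this; with an integer key, collation is no longer a concern:

```sql
CREATE INDEX price_good_id_date_idx ON price (good_id, date DESC);
```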
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
The resulting queries, with either `DISTINCT ON` or `NOT EXISTS`, are much faster.
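Sketches of both variants against a normalized layout (assuming an integer `good_id` and a lookup table `good` holding the names, as outlined above):

```sql
-- DISTINCT ON
SELECT DISTINCT ON (i.date, i.good_id)
       i.date, p.date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   price p USING (good_id)
JOIN   good  g USING (good_id)
WHERE  p.date <= i.date
ORDER  BY i.date, i.good_id, p.date DESC;

-- NOT EXISTS
SELECT i.date, p.date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   price p USING (good_id)
JOIN   good  g USING (good_id)
WHERE  p.date <= i.date
AND    NOT EXISTS (
   SELECT 1
   FROM   price p1
   WHERE  p1.good_id = p.good_id
   AND    p1.date <= i.date
   AND    p1.date >  p.date
   );
```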
Faster solutions
If that is still not fast enough, there may be faster solutions.
Recursive CTE / `JOIN LATERAL` / correlated subquery. These shine especially for data distributions with many prices per good.
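A sketch of the `JOIN LATERAL` variant (Postgres 9.3+), against the original tables from the question:

```sql
SELECT i.date, p.date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory i
CROSS  JOIN LATERAL (
   SELECT date, price
   FROM   price
   WHERE  good  = i.good        -- correlated: one top price per row
   AND    date <= i.date
   ORDER  BY date DESC
   LIMIT  1
   ) p;
```

Note that `CROSS JOIN LATERAL` drops inventory rows with no matching price; use `LEFT JOIN LATERAL ... ON true` to keep them.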
Materialized view
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as a materialized view.
Postgres 9.3+ has built-in support for materialized views. In older versions, you can easily implement a basic version yourself.
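A minimal sketch; the view name is my choice, and the inner query is the `DISTINCT ON` variant from above:

```sql
CREATE MATERIALIZED VIEW inventory_priced AS
SELECT DISTINCT ON (i.date, i.good)
       i.date, p.date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory i
JOIN   price     p ON p.good = i.good AND p.date <= i.date
ORDER  BY i.date, i.good, p.date DESC;

-- rerun after new prices or inventory rows arrive:
REFRESH MATERIALIZED VIEW inventory_priced;
```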