It very much depends on circumstances and exact requirements. Consider my comment.
Simple solution
With DISTINCT ON in Postgres:
SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;
Returned rows are sorted.
Or with NOT EXISTS in standard SQL (works with every RDBMS I know):
SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good = p.good
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indices, either one of these may be faster. See:
With only few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster yet. See below.
Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower, yet. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
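For illustration, the kind of max-subquery variant referred to here could look like the following sketch against the same tables (not from the original answer):

```sql
-- Typically slower: compute the max pricing date per (good, inventory date)
-- in a grouped subquery, then join back to price to fetch the price itself.
SELECT i.the_date, sub.pricing_date, i.good, i.quantity, p.price
FROM   inventory i
LEFT   JOIN (
   SELECT i2.good, i2.the_date, max(p2.the_date) AS pricing_date
   FROM   inventory i2
   JOIN   price p2 ON p2.good = i2.good AND p2.the_date <= i2.the_date
   GROUP  BY i2.good, i2.the_date
   ) sub ON sub.good = i.good AND sub.the_date = i.the_date
LEFT   JOIN price p ON p.good = i.good AND p.the_date = sub.pricing_date;
```

It scans price twice (once to find the date, once to fetch the price), which is part of why it tends to lose against DISTINCT ON or NOT EXISTS.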
Proper solution
Strings and collation
First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB would use some local set of rules, like in my case de_AT.UTF-8. Find out with:
SHOW lc_collate;
This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":
SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;
Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
Index
If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):
CREATE INDEX price_good_date_desc_collate_c_idx
ON price (good COLLATE "C", the_date DESC);
Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
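Such a pair of indexes on the same columns might look like this (index names are made up for the sketch):

```sql
-- One index for fast "C"-collated look-ups / sorts ...
CREATE INDEX price_good_c_idx ON price (good COLLATE "C", the_date DESC);
-- ... and one for queries that need the default collation's sort order.
CREATE INDEX price_good_default_idx ON price (good, the_date DESC);
```

The planner picks whichever index matches the collation used in the query.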
Normalize
Redundant strings (the name of each good) bloat tables and indexes, which makes everything slower. A proper table layout avoids most of the problem. It could look like this:
CREATE TABLE good (
good_id serial PRIMARY KEY
, good text NOT NULL
);
CREATE TABLE inventory (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int NOT NULL
, PRIMARY KEY(good_id, the_date)
);
CREATE TABLE price (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, price numeric NOT NULL
, PRIMARY KEY(good_id, the_date)
);
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:
CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);
Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
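For example, an index that also carries the price column can serve the query without visiting the heap at all (the index name is made up; the INCLUDE clause requires Postgres 11 or later, before that you can simply append the column to the index key):

```sql
-- Postgres 11+: keep price in the index as payload, not as part of the key
CREATE INDEX price_good_date_covering_idx
ON price (good_id, the_date DESC) INCLUDE (price);

-- Pre-11 equivalent: append the column to the index key instead
-- CREATE INDEX price_good_date_covering_idx
-- ON price (good_id, the_date DESC, price);
```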
The resulting queries are much faster:
DISTINCT ON
SELECT DISTINCT ON (i.good_id, i.the_date)
i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER BY i.good_id, i.the_date, p.the_date DESC;
NOT EXISTS
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good_id = p.good_id
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Faster solutions
If that still is not fast enough, there may be faster solutions.
Recursive CTE / JOIN LATERAL / correlated subquery
Especially for data distributions with many prices per good.
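A LATERAL variant (Postgres 9.3+) along these lines is a sketch based on the normalized schema above, not the original answer's exact query:

```sql
-- For each inventory row, fetch only the single latest applicable price.
-- With many prices per good, this beats scanning all candidate prices.
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN LATERAL (
   SELECT p1.the_date, p1.price
   FROM   price p1
   WHERE  p1.good_id = i.good_id
   AND    p1.the_date <= i.the_date
   ORDER  BY p1.the_date DESC
   LIMIT  1
   ) p ON true
ORDER  BY g.good, i.the_date;
```

Each lateral look-up is a short descending index scan on price (good_id, the_date DESC), one per inventory row.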
Materialized view
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume, that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as materialized view.
Postgres 9.3+ has built-in support for materialized views. You can easily implement a basic version manually in older versions.
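With built-in materialized views the basic pattern looks like this (the view name is made up; the query is the DISTINCT ON variant from above):

```sql
-- Store the computed result once as a snapshot ...
CREATE MATERIALIZED VIEW inventory_price AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good_id, i.quantity, p.price
FROM   inventory i
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- ... and recompute it whenever prices or inventories actually change:
REFRESH MATERIALIZED VIEW inventory_price;
```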
There are probably many ways to do this. The first that comes to mind is to use window functions:
SELECT
id, postcode
FROM
( SELECT id, postcode,
ROW_NUMBER() OVER (PARTITION BY id
ORDER BY MAX(date_created) DESC
) AS rn
FROM tablename
GROUP BY id, postcode
) AS t
WHERE
rn <= 5
ORDER BY
id, rn ;
If there are ties (say the 5th, 6th and 7th postcode for an id have the same date_created), only one of them will be in the results; the choice is arbitrary. If you want all the tied postcodes in those cases, use RANK() instead of ROW_NUMBER().
Another option is to use the LATERAL syntax. I'm not sure which will be more efficient; it will probably depend on the distribution of values in the two columns (id and postcode), i.e. how many distinct ids there are in the whole table, how many distinct postcodes per id, and how many rows per (id, postcode) combination.
SELECT
t.id, ti.postcode
FROM
( SELECT DISTINCT id
FROM tablename
) AS t
CROSS JOIN LATERAL
( SELECT tt.postcode,
MAX(tt.date_created) AS date_created
FROM tablename AS tt
WHERE tt.id = t.id
GROUP BY tt.postcode
ORDER BY date_created DESC
LIMIT 5
) AS ti
ORDER BY
t.id, ti.date_created DESC;
Adding an index on (id, postcode, date_created) would be a good idea, too - or on (id, postcode, date_created DESC).
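As DDL, that suggestion would be (the index name is made up):

```sql
CREATE INDEX tablename_id_postcode_date_idx
ON tablename (id, postcode, date_created DESC);
```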
Best Answer
I went to my favourite MySQL "tips and tricks" site (the artful software site), followed the common queries link and looked for the Top N per group section. The great thing about this site is that it tells you how to do stuff in MySQL for all versions - well, going back at least to MySQL 5.5 - and if you're still running that, well...
I came up with the following, adapted from that site. For the DDL and DML to create and populate the test tables, I used those provided by @nbk - kudos to him (and +1).
MySQL allows the use of user variables, which are a godsend when you don't have capabilities such as the ROW_NUMBER() window function, which would have made this query trivial. I would strongly urge you to upgrade to version 8; it has many other goodies - CTEs, CHECK constraints... Anyway, I'll demonstrate the steps, partly to explain them to you, and partly to explain them to myself! :-)
So, we have the items ('A', 'B') ordered by date DESC (most recent first) with the price. Note that 12 lines of that query could be replaced by one ROW_NUMBER() function line! So, now we wrap that in a query, pulling out those results whose my_rank value is <= 2 - which gives us the two most recent dates!
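A user-variable ranking of the kind described can be sketched like this (the table name prices and the exact column names are assumptions, not from the original):

```sql
-- Pre-8.0 MySQL: emulate ROW_NUMBER() with user variables.
-- Assumed table: prices (item, date, price).
SELECT item, date, price
FROM  ( SELECT p.item, p.date, p.price,
               @rn   := IF(@prev = p.item, @rn + 1, 1) AS my_rank,
               @prev := p.item
        FROM prices p
        CROSS JOIN (SELECT @rn := 0, @prev := NULL) vars
        ORDER BY p.item, p.date DESC
      ) ranked
WHERE my_rank <= 2
ORDER BY item, date DESC;
```

The inner derived table numbers rows per item in descending date order; the outer query keeps the two most recent per item. (Relying on user-variable evaluation order is a documented grey area - another reason to move to version 8 and plain ROW_NUMBER().)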
I would suggest that you spend some time browsing the artful software site!