It very much depends on circumstances and exact requirements. Consider my comment.
Simple solution
With DISTINCT ON in Postgres:
SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;
Returned rows are ordered.
Or with NOT EXISTS in standard SQL (works with every RDBMS I know):
SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good = p.good
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indices, either one of these may be faster.
With only few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster yet. See below.
Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower still. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
db<>fiddle here
Old sqlfiddle
Proper solution
Strings and collation
First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB would use some local set of rules, like in my case: de_AT.UTF-8. Find out with:
SHOW lc_collate;
This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":
SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;
Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
Index
If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):
CREATE INDEX price_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);
Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
Normalize
Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:
CREATE TABLE good (
good_id serial PRIMARY KEY
, good text NOT NULL
);
CREATE TABLE inventory (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int NOT NULL
, PRIMARY KEY(good_id, the_date)
);
CREATE TABLE price (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, price numeric NOT NULL
, PRIMARY KEY(good_id, the_date));
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:
CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);
Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
These resulting queries are much faster:
DISTINCT ON
SELECT DISTINCT ON (i.the_date)
i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER BY i.the_date, p.the_date DESC;
NOT EXISTS
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good_id = p.good_id
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
db<>fiddle here
Old sqlfiddle
Faster solutions
If that still is not fast enough, there may be faster solutions.
Recursive CTE / JOIN LATERAL / correlated subquery
Especially for data distributions with many prices per good.
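For illustration, a sketch of the LATERAL variant against the normalized schema above (requires Postgres 9.3 or later for LATERAL; the query text is mine, not taken from the original):

```sql
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN LATERAL (
   -- latest price on or before the inventory date
   SELECT the_date, price
   FROM   price
   WHERE  good_id = i.good_id
   AND    the_date <= i.the_date
   ORDER  BY the_date DESC
   LIMIT  1
   ) p ON true
ORDER  BY i.the_date;
```

The LIMIT 1 subquery can descend the index on price (good_id, the_date DESC) once per inventory row, which pays off when there are many prices per good.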
Materialized view
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as a materialized view.
Postgres 9.3+ has automated support for materialized views. You can easily implement a basic version in older versions.
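A minimal sketch for Postgres 9.3+, assuming the normalized schema above (the view name is mine):

```sql
-- snapshot of the "latest price per inventory row" result
CREATE MATERIALIZED VIEW inventory_priced AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.good_id, i.the_date, p.the_date AS pricing_date, i.quantity, p.price
FROM   inventory i
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- refresh after prices / inventories have changed
REFRESH MATERIALIZED VIEW inventory_priced;
```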
Best Answer
I am assuming data type text for the relevant columns.
"Simple" solution
Key elements:
DISTINCT ON is a Postgres extension of the SQL standard DISTINCT. Find a detailed explanation for the used query technique in this related answer on SO.
ORDER BY p.code DESC picks the longest match, because '1234' sorts after '123' (in ascending order).
Simple SQL Fiddle.
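The query itself did not survive in this extract; a sketch of the kind of query described, with assumed table and column names (num.number, code.code - names are mine):

```sql
SELECT DISTINCT ON (n.number)
       n.number, c.code
FROM   num n
LEFT   JOIN code c ON right(n.number, -1) LIKE (c.code || '%')
                  -- right(number, -1): skip the leading noise character
ORDER  BY n.number, c.code DESC;  -- '1234' sorts after '123': longest match wins
```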
Without an index, the query would run for a very long time (I didn't wait to see it finish). To make this fast, you need index support. The trigram indexes you mentioned, supplied by the additional module pg_trgm, are a good candidate. You have to choose between GIN and GiST index. The first character of the numbers is just noise and can be excluded from the index, making it a functional index in addition.
In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):
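A sketch of such an index, assuming the same table names as above:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- provides gin_trgm_ops

-- functional: index the number without its leading noise character
CREATE INDEX num_trgm_gin_idx ON num
USING  gin (right(number, -1) gin_trgm_ops);
```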
Advanced dbfiddle here.
All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes.
Much faster yet
Failed attempt with text_pattern_ops
Once we ignore the distracting first noise character, it comes down to a basic left-anchored pattern match. Therefore I tried a functional B-tree index with the operator class text_pattern_ops (assuming column type text).
This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:
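A sketch of that index and a direct query that can use it (assumed names as before):

```sql
-- functional B-tree index for left-anchored LIKE patterns
CREATE INDEX num_pattern_idx ON num (right(number, -1) text_pattern_ops);

-- direct query with a single search term
SELECT * FROM num WHERE right(number, -1) LIKE '123%';
```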
However, the query planner will not consider this index for joining two tables. I have seen this limitation before. I don't have a meaningful explanation for this, yet.
Partial / functional B-tree indexes
The alternative is to use equality checks on partial strings with partial indexes. These can be used in a JOIN.
Since we typically only have a limited number of different lengths for prefixes, we can build a solution similar to the one presented here with partial indexes.
Say we have prefixes ranging from 1 to 5 characters. Create a number of partial indexes, one for every distinct prefix length:
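The index definitions are missing from this extract; they might look like this, one partial index per code length (index names are mine; the functional part sits on the numbers side):

```sql
CREATE INDEX code_len1_idx ON code (code) WHERE length(code) = 1;
CREATE INDEX code_len2_idx ON code (code) WHERE length(code) = 2;
CREATE INDEX code_len3_idx ON code (code) WHERE length(code) = 3;
CREATE INDEX code_len4_idx ON code (code) WHERE length(code) = 4;
CREATE INDEX code_len5_idx ON code (code) WHERE length(code) = 5;
```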
Since these are partial indexes, all of them together are barely larger than a single complete index.
Add matching indexes for numbers (taking the leading noise character into account):
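A sketch of what these could look like (names are mine):

```sql
-- substr(number, 2, k): skip the leading noise character, take k characters
CREATE INDEX num_sub1_idx ON num (substr(number, 2, 1)) WHERE length(number) >= 2;
CREATE INDEX num_sub2_idx ON num (substr(number, 2, 2)) WHERE length(number) >= 3;
CREATE INDEX num_sub3_idx ON num (substr(number, 2, 3)) WHERE length(number) >= 4;
CREATE INDEX num_sub4_idx ON num (substr(number, 2, 4)) WHERE length(number) >= 5;
CREATE INDEX num_sub5_idx ON num (substr(number, 2, 5)) WHERE length(number) >= 6;
```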
While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.
If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...
If numbers are never shorter than n characters, drop the redundant WHERE clauses from some or all of the indexes, and also drop the corresponding WHERE clause from all following queries.
Recursive CTE
With all the setup so far, I was hoping for a very elegant solution with a recursive CTE:
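The query did not survive in this extract; a sketch of the idea, working down from the longest prefix length (assumed names as before):

```sql
WITH RECURSIVE t AS (
   SELECT n.number, c.code, 5 AS len   -- start with the longest prefix length
   FROM   num n
   LEFT   JOIN code c ON substr(n.number, 2, 5) = c.code
                     AND length(c.code) = 5

   UNION ALL
   SELECT t.number, c.code, t.len - 1  -- retry unmatched numbers, one char shorter
   FROM   t
   LEFT   JOIN code c ON substr(t.number, 2, t.len - 1) = c.code
                     AND length(c.code) = t.len - 1
   WHERE  t.code IS NULL
   AND    t.len > 1
   )
SELECT number, code
FROM   t
WHERE  code IS NOT NULL;
```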
However, while this query isn't bad - it performs about as well as the simple version with a trigram GIN index - it doesn't deliver what I was aiming for. The recursive term is planned only once, so it cannot use the best indexes. Only the non-recursive term can.
UNION ALL
Since we are dealing with a small number of recursions, we can just spell them out iteratively. This allows optimized plans for each of them. (We lose the recursive exclusion of already successful numbers, though, so there is still some room for improvement, especially for a wider range of prefix lengths):
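A sketch of the spelled-out form, one branch per prefix length (assumed names as before):

```sql
SELECT DISTINCT ON (number)
       number, code
FROM  (
   SELECT n.number, c.code
   FROM   num n JOIN code c ON c.code = substr(n.number, 2, 5) AND length(c.code) = 5
   UNION ALL
   SELECT n.number, c.code
   FROM   num n JOIN code c ON c.code = substr(n.number, 2, 4) AND length(c.code) = 4
   UNION ALL
   SELECT n.number, c.code
   FROM   num n JOIN code c ON c.code = substr(n.number, 2, 3) AND length(c.code) = 3
   UNION ALL
   SELECT n.number, c.code
   FROM   num n JOIN code c ON c.code = substr(n.number, 2, 2) AND length(c.code) = 2
   UNION ALL
   SELECT n.number, c.code
   FROM   num n JOIN code c ON c.code = substr(n.number, 2, 1) AND length(c.code) = 1
   ) sub
ORDER  BY number, length(code) DESC;  -- longest match per number wins
```

Each branch has a fixed length, so the planner can pick the matching partial index for it.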
A breakthrough, finally!
SQL function
Wrapping this into an SQL function removes the query planning overhead for repeated use:
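The function body is missing from this extract; a sketch of such a function for a single number (the function name is mine):

```sql
CREATE OR REPLACE FUNCTION f_longest_match_sql(text)
  RETURNS text
  LANGUAGE sql STABLE AS
$func$
SELECT code
FROM  (
   SELECT code FROM code WHERE code = substr($1, 2, 5) AND length(code) = 5
   UNION ALL
   SELECT code FROM code WHERE code = substr($1, 2, 4) AND length(code) = 4
   UNION ALL
   SELECT code FROM code WHERE code = substr($1, 2, 3) AND length(code) = 3
   UNION ALL
   SELECT code FROM code WHERE code = substr($1, 2, 2) AND length(code) = 2
   UNION ALL
   SELECT code FROM code WHERE code = substr($1, 2, 1) AND length(code) = 1
   ) sub
ORDER  BY length(code) DESC  -- longest match wins
LIMIT  1
$func$;
```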
Call:
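Assuming a function like f_longest_match_sql(text) wrapping the query above (the name is mine), the call could look like:

```sql
SELECT n.number, f_longest_match_sql(n.number) AS code
FROM   num n;
```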
PL/pgSQL function with dynamic SQL
This plpgsql function is much like the recursive CTE above, but dynamic SQL with EXECUTE forces the query to be re-planned for every iteration. Now it makes use of all the tailored indexes.
Additionally, this works for any range of prefix lengths. The function takes two parameters for the range, but I prepared it with DEFAULT values, so it works without explicit parameters, too.
The final step cannot be wrapped into the same function easily. Either just call it like this:
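The function itself is missing from this extract; a sketch of how it could look, looping from the longest to the shortest prefix length (function and parameter names are mine):

```sql
CREATE OR REPLACE FUNCTION f_longest_match_dyn(_num text
                                             , _min int DEFAULT 1
                                             , _max int DEFAULT 5)
  RETURNS text
  LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _code text;
BEGIN
   FOR _len IN REVERSE _max .. _min LOOP   -- longest prefix first
      -- EXECUTE re-plans the query per length,
      -- so each iteration can use its tailored partial index
      EXECUTE format(
         'SELECT code FROM code
          WHERE  code = substr($1, 2, %1$s) AND length(code) = %1$s'
       , _len)
      INTO  _code
      USING _num;

      IF _code IS NOT NULL THEN
         RETURN _code;                     -- first hit is the longest match
      END IF;
   END LOOP;
   RETURN NULL;                            -- no code matches
END
$func$;
```

A direct call per number then looks like: SELECT n.number, f_longest_match_dyn(n.number) FROM num n;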
Or use another SQL function as wrapper:
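A sketch of such a wrapper, assuming the hypothetical plpgsql function f_longest_match_dyn(text) from above (all names are mine):

```sql
CREATE OR REPLACE FUNCTION f_longest_match_all()
  RETURNS TABLE (number text, code text)
  LANGUAGE sql STABLE AS
$func$
SELECT n.number, f_longest_match_dyn(n.number)
FROM   num n
$func$;
```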
Call:
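With a wrapper named like the hypothetical f_longest_match_all() above:

```sql
SELECT * FROM f_longest_match_all();
```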
A bit slower due to the required planning overhead, but more versatile than plain SQL and shorter for longer prefix ranges.