I have a table as below:

SELECT [EffectiveDate]
      ,[Rate]
      ,[ModUser]
FROM [Vision]

There are many rows per ModUser name. I need to provide a ModUser name and get that user's latest rates.
How do I do that?
Tags: greatest-n-per-group, sql-server
It very much depends on circumstances and exact requirements.
With DISTINCT ON in Postgres:
SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;
Returned rows are ordered.
Or with NOT EXISTS in standard SQL (works with every RDBMS I know):
SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good = p.good
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indexes, either one of these may be faster.
With only a few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases, other query techniques are (much) faster still. See below.
Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower still. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB uses some local set of rules, like in my case: de_AT.UTF-8. Find out with:
SHOW lc_collate;
This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":
SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;
Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):
CREATE INDEX price_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);
Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:
CREATE TABLE good (
good_id serial PRIMARY KEY
, good text NOT NULL
);
CREATE TABLE inventory (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int NOT NULL
, PRIMARY KEY(good_id, the_date)
);
CREATE TABLE price (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, price numeric NOT NULL
, PRIMARY KEY(good_id, the_date));
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:
CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);
Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
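Since Postgres 11 you can also append payload columns with INCLUDE to build a true covering index. A minimal sketch, assuming the normalized price table above (the index name is made up):

```sql
-- Index-only scans can then serve good_id / the_date look-ups
-- and return price without touching the heap.
CREATE INDEX price_good_date_covering_idx
ON price (good_id, the_date DESC) INCLUDE (price);
```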
These resulting queries are much faster:
DISTINCT ON
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;
NOT EXISTS
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good_id = p.good_id
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
If that still is not fast enough, there may be faster solutions.

JOIN LATERAL / correlated subquery
Especially for data distributions with many prices per good:
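A LATERAL version for the normalized schema might look like this (a sketch, assuming the good / inventory / price tables above and Postgres 9.3+):

```sql
-- For each inventory row, fetch only the single latest price
-- at or before the inventory date.
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN LATERAL (
   SELECT p.the_date, p.price
   FROM   price p
   WHERE  p.good_id = i.good_id
   AND    p.the_date <= i.the_date
   ORDER  BY p.the_date DESC
   LIMIT  1
   ) p ON true;
```

With the multicolumn index on price suggested above, each LATERAL subquery is a single descending index probe.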
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as a materialized view.
Postgres 9.3+ has automated support for materialized views. You can easily implement a basic version in older versions.
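A minimal sketch, assuming Postgres 9.3+ and the normalized schema above (the view name is made up):

```sql
-- Snapshot of the latest price per inventory row.
CREATE MATERIALIZED VIEW inventory_priced AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.good_id, i.the_date, i.quantity, p.price
FROM   inventory i
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- Recompute the snapshot whenever the underlying data changes:
REFRESH MATERIALIZED VIEW inventory_priced;
```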
If your aim is to have queries with maximum efficiency, none of the above queries is really the best. Not always, at least.
Efficiency depends on many different things, like the specific DBMS, the specific version (different versions have different improvements on the optimizer and the available syntax), the type of columns, the indexes available, the size of the tables and distribution of values, the hardware the server is running, the configuration settings etc.
You should always test various ways of writing the queries, on your tables, with the sizes and distributions you expect to have in production, with your hardware and configuration settings, to decide which rewritings of the queries should be kept.
This specific kind of query is often called greatest-n-per-group (there is even a tag for it!) and under certain assumptions, one of the many ways to write it is often quite efficient in both MySQL and PostgreSQL. It uses a LATERAL join in Postgres, which is available in versions 9.3+ (in SQL Server lingo, CROSS/OUTER APPLY), and a simulation of this join in MySQL.
The assumptions are that the number of authors (the attribute we group by) is small compared to the number of posts (the table where we apply the group by). It's also best if there is an index or a table to find all the distinct author_id values and an additional index on the posts table for the group by.
This solution to the greatest-n-per-group problem also matches your request about ties, as it always returns one result per group. If you want to be precise about which one (of the tied rows) will be returned, the ORDER BY in the subquery can be modified (to ORDER BY pi.date DESC, pi.id DESC or ORDER BY pi.date DESC, a.name, for example).
Query in PostgreSQL:
SELECT p.*
FROM authors AS a
, LATERAL
( SELECT pi.*
FROM posts AS pi
WHERE pi.author_id = a.author_id
ORDER BY pi.date DESC
LIMIT 1
) AS p ;
Query in MySQL:
SELECT p.*
FROM authors AS a
JOIN posts AS p
ON p.id =
( SELECT pi.id
FROM posts AS pi
WHERE pi.author_id = a.author_id
ORDER BY pi.date DESC
LIMIT 1
) ;
The useful index is on posts (author_id, date, id) for MySQL, and on posts (author_id, date DESC) for Postgres.
Needless to say, before using any of the above, they should be tested in your environment and cross-tested against the many other versions/rewritings of the query. In Postgres, for example, the DISTINCT ON syntax can be used in versions older than 9.3. The resulting query is more compact than the LATERAL version and might be more efficient under different data distributions. Query:
SELECT DISTINCT ON (author_id) p.*
FROM posts AS p
ORDER BY p.author_id,
p.date DESC ;
Best Answer
There are many ways to do this. Here are some of them:
- common table expression with row_number() version
- cross apply version
- top with ties version
- inner join version
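Sketches of each version in T-SQL, against the [Vision] table and columns from the question (each returns one latest-rate row per ModUser; ties and tie-breaking behavior differ slightly between them):

```sql
-- 1) Common table expression with ROW_NUMBER():
WITH ranked AS (
    SELECT [EffectiveDate], [Rate], [ModUser],
           ROW_NUMBER() OVER (PARTITION BY [ModUser]
                              ORDER BY [EffectiveDate] DESC) AS rn
    FROM [Vision]
)
SELECT [EffectiveDate], [Rate], [ModUser]
FROM ranked
WHERE rn = 1;

-- 2) CROSS APPLY: one TOP (1) probe per distinct user
SELECT x.[EffectiveDate], x.[Rate], u.[ModUser]
FROM (SELECT DISTINCT [ModUser] FROM [Vision]) AS u
CROSS APPLY (
    SELECT TOP (1) v.[EffectiveDate], v.[Rate]
    FROM [Vision] v
    WHERE v.[ModUser] = u.[ModUser]
    ORDER BY v.[EffectiveDate] DESC
) AS x;

-- 3) TOP (1) WITH TIES: keeps every row that ranks first
SELECT TOP (1) WITH TIES [EffectiveDate], [Rate], [ModUser]
FROM [Vision]
ORDER BY ROW_NUMBER() OVER (PARTITION BY [ModUser]
                            ORDER BY [EffectiveDate] DESC);

-- 4) INNER JOIN on the per-user maximum date:
SELECT v.[EffectiveDate], v.[Rate], v.[ModUser]
FROM [Vision] v
INNER JOIN (
    SELECT [ModUser], MAX([EffectiveDate]) AS max_date
    FROM [Vision]
    GROUP BY [ModUser]
) m ON m.[ModUser] = v.[ModUser]
   AND m.max_date = v.[EffectiveDate];
```

An index on [Vision] ([ModUser], [EffectiveDate] DESC) helps all four; test them against your own data to see which plan wins.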