I have a table as below:

SELECT [EffectiveDate]
      ,[Rate]
      ,[ModUser]
FROM [Vision]

There are many rows per ModUser name. I need to provide a ModUser name and get that user's latest rates.
How do I do that?
Tags: greatest-n-per-group, sql-server
It very much depends on circumstances and exact requirements.
With DISTINCT ON in Postgres:
SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;
Returned rows are ordered.
Or with NOT EXISTS in standard SQL (works with every RDBMS I know):
SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good = p.good
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indexes, either one of these may be faster.
With only a few rows per good, DISTINCT ON is typically faster, and you get a sorted result on top of it. But for certain cases, other query techniques are (much) faster still. See below.
Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower still. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
First of all, your table layout is sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (text, varchar, ...) is done according to the current COLLATION. Typically, your DB uses some local set of rules, like in my case: de_AT.UTF-8. Find out with:
SHOW lc_collate;
This makes sorting and index look-ups slower. The longer your strings (names of goods), the worse. If you do not actually care about collation rules in your output (or the sort order), this can be faster with COLLATE "C":
SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;
Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):
CREATE INDEX price_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);
Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:
CREATE TABLE good (
good_id serial PRIMARY KEY
, good text NOT NULL
);
CREATE TABLE inventory (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int NOT NULL
, PRIMARY KEY(good_id, the_date)
);
CREATE TABLE price (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, price numeric NOT NULL
, PRIMARY KEY(good_id, the_date));
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:
CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);
Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
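Since Postgres 11 you can also append payload columns with INCLUDE to build a true covering index. A minimal sketch, assuming the normalized price table above (the index name is made up):

```sql
-- Index-only scans can then serve good_id / the_date look-ups
-- and return price without touching the heap.
CREATE INDEX price_good_date_covering_idx
ON price (good_id, the_date DESC) INCLUDE (price);
```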
These resulting queries are much faster:
DISTINCT ON
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;
NOT EXISTS
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT 1 FROM price p1
WHERE p1.good_id = p.good_id
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);
If that still is not fast enough, there may be faster solutions.

JOIN LATERAL / correlated subquery
Especially for data distributions with many prices per good:
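A LATERAL version for the normalized schema might look like this (a sketch, assuming the good / inventory / price tables above and Postgres 9.3+):

```sql
-- For each inventory row, fetch only the single latest price
-- at or before the inventory date.
SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory i
JOIN   good g USING (good_id)
LEFT   JOIN LATERAL (
   SELECT p.the_date, p.price
   FROM   price p
   WHERE  p.good_id = i.good_id
   AND    p.the_date <= i.the_date
   ORDER  BY p.the_date DESC
   LIMIT  1
   ) p ON true;
```

With the multicolumn index on price suggested above, each LATERAL subquery is a single descending index probe.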
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as a materialized view.
Postgres 9.3+ has automated support for materialized views. You can easily implement a basic version in older versions.
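A minimal sketch, assuming Postgres 9.3+ and the normalized schema above (the view name is made up):

```sql
-- Snapshot of the latest price per inventory row.
CREATE MATERIALIZED VIEW inventory_priced AS
SELECT DISTINCT ON (i.good_id, i.the_date)
       i.good_id, i.the_date, i.quantity, p.price
FROM   inventory i
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.good_id, i.the_date, p.the_date DESC;

-- Recompute the snapshot whenever the underlying data changes:
REFRESH MATERIALIZED VIEW inventory_priced;
```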
If your aim is to have queries with maximum efficiency, none of the above queries is really the best. Not always, at least.
Efficiency depends on many different things, like the specific DBMS, the specific version (different versions have different improvements on the optimizer and the available syntax), the type of columns, the indexes available, the size of the tables and distribution of values, the hardware the server is running, the configuration settings etc.
You should always test various ways of writing the queries, on your tables, with the sizes and distributions you expect to have in production, with your hardware and configuration settings, to decide which rewritings of the queries should be kept.
This specific kind of query is often called greatest-n-per-group (there is even a tag for it!) and under certain assumptions, one of the many ways to write it is often quite efficient in both MySQL and PostgreSQL. It uses a LATERAL join in Postgres, which is available in versions 9.3+ (in SQL Server lingo, CROSS/OUTER APPLY), and a simulation of this join in MySQL.
The assumptions are that the number of authors (the attribute we group by) is small compared to the number of posts (the table where we apply the group by). It's also best if there is an index or a table to find all the distinct author_id values and an additional index on the posts table for the group by.
This solution to the greatest-n-per-group problem also matches your request about ties, as it always returns one result per group. If you want to be precise about which one (of the tied rows) will be returned, the ORDER BY in the subquery can be modified (to ORDER BY pi.date DESC, pi.id DESC or ORDER BY pi.date DESC, a.name, for example).
Query in PostgreSQL:
SELECT p.*
FROM authors AS a
, LATERAL
( SELECT pi.*
FROM posts AS pi
WHERE pi.author_id = a.author_id
ORDER BY pi.date DESC
LIMIT 1
) AS p ;
Query in MySQL:
SELECT p.*
FROM authors AS a
JOIN posts AS p
ON p.id =
( SELECT pi.id
FROM posts AS pi
WHERE pi.author_id = a.author_id
ORDER BY pi.date DESC
LIMIT 1
) ;
The useful index is on posts (author_id, date, id) for MySQL, and on posts (author_id, date DESC) for Postgres.
Needless to say, before using any of the above, they should be tested in your environment and cross-tested against the many other versions/rewritings of the query. In Postgres, for example, the DISTINCT ON syntax can be used in versions older than 9.3. The resulting query is more compact than the LATERAL version and might be more efficient under different data distributions. Query:
SELECT DISTINCT ON (author_id) p.*
FROM posts AS p
ORDER BY p.author_id,
p.date DESC ;
Best Answer
There are many ways to do this. Here are some of them:
- common table expression with row_number() version
- cross apply version
- top with ties version
- inner join version
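Sketches of each version in T-SQL, against the [Vision] table and columns from the question (each returns one latest-rate row per ModUser; ties and tie-breaking behavior differ slightly between them):

```sql
-- 1) Common table expression with ROW_NUMBER():
WITH ranked AS (
    SELECT [EffectiveDate], [Rate], [ModUser],
           ROW_NUMBER() OVER (PARTITION BY [ModUser]
                              ORDER BY [EffectiveDate] DESC) AS rn
    FROM [Vision]
)
SELECT [EffectiveDate], [Rate], [ModUser]
FROM ranked
WHERE rn = 1;

-- 2) CROSS APPLY: one TOP (1) probe per distinct user
SELECT x.[EffectiveDate], x.[Rate], u.[ModUser]
FROM (SELECT DISTINCT [ModUser] FROM [Vision]) AS u
CROSS APPLY (
    SELECT TOP (1) v.[EffectiveDate], v.[Rate]
    FROM [Vision] v
    WHERE v.[ModUser] = u.[ModUser]
    ORDER BY v.[EffectiveDate] DESC
) AS x;

-- 3) TOP (1) WITH TIES: keeps every row that ranks first
SELECT TOP (1) WITH TIES [EffectiveDate], [Rate], [ModUser]
FROM [Vision]
ORDER BY ROW_NUMBER() OVER (PARTITION BY [ModUser]
                            ORDER BY [EffectiveDate] DESC);

-- 4) INNER JOIN on the per-user maximum date:
SELECT v.[EffectiveDate], v.[Rate], v.[ModUser]
FROM [Vision] v
INNER JOIN (
    SELECT [ModUser], MAX([EffectiveDate]) AS max_date
    FROM [Vision]
    GROUP BY [ModUser]
) m ON m.[ModUser] = v.[ModUser]
   AND m.max_date = v.[EffectiveDate];
```

An index on [Vision] ([ModUser], [EffectiveDate] DESC) helps all four; test them against your own data to see which plan wins.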