Postgresql – Efficient partial DISTINCT ON

distinctgroup byjoin;performancepostgresqlquery-performance

I would like to get the most efficient query to give me all the rows that are distinct on one field (ID) but are identical on another field (validity).

Let's set a concrete example: you have a table of meds, each having an ID, some values and a validity date. Of course, in the ideal world each single ID is an individual meds; unfortunately, in reality this is not the case as you might have two variants of the same medicine, with the same ID, some different values in the middle of the table and a validity date again the same. In this real-world example, you want to retrieve all the lines in which the medicine is "still valid" (for whatever definition of still, out of the scope of my question).

Consider the following table:

# ID # VALUE # VALIDITY
# 1  # foo   # 2018/01/10
# 2  # bar   # 2018/01/03
# 2  # flo   # 2018/01/10
# 2  # duv   # 2018/01/10
# 3  # pas   # 2018/01/10
# 4  # cip   # 2018/01/08
# 4  # mao   # 2018/01/10

I'd like the query to return the following lines:

# ID # VALUE # VALIDITY
# 1  # foo   # 2018/01/10
# 2  # flo   # 2018/01/10
# 2  # duv   # 2018/01/10
# 3  # pas   # 2018/01/10
# 4  # mao   # 2018/01/10

notice that (2, bar) and (4, cip) are both "outdated" by (2, flo; 2, duv; 4, cip) respectively.

You can imagine that this is a rather trivial exercise that you can solve using an INNER JOIN:

SELECT * FROM TABLE T
INNER JOIN (SELECT ID, MAX(VALIDITY) FROM T GROUP BY ID) AS TT
ON T.ID = TT.ID AND T.VALIDITY = TT.VALIDITY

However, I noticed that this query becomes quite expensive (essentially there are two full scans, plus an additional group by.
I am not sure whether there is a less expensive way to do this (I tried with DISTINCT ON but that will kill the two entries with ID = 2).
Possibly, recommend also indexes that will work on improve this use case.

Thank you so much!

Best Answer

The same result als ypercube, but I find a simple max windowing function easier to understand than dense ranks and such.

select mm.id, mm.value, mm.validity from (
   select m.*, max(m.validity) over (partition by m.id) maxval
      from meds m ) mm
   where mm.validity = mm.maxval;

see http://sqlfiddle.com/#!17/e478c/4 for the results.