Postgresql – Efficient partial DISTINCT ON

Tags: distinct, group by, join, performance, postgresql, query-performance

I am looking for the most efficient query that returns all the rows that are distinct on one field (ID) but identical on another field (validity), that is, for each ID, every row carrying that ID's latest validity date.

Here is a concrete example: you have a table of meds, each with an ID, some values and a validity date. In an ideal world each ID identifies a single medicine; in reality you may have two variants of the same medicine that share the same ID and the same validity date but differ in some of the other values. In this real-world example, you want to retrieve all the lines in which the medicine is "still valid" (for whatever definition of "still", which is outside the scope of my question).

Consider the following table:

# ID # VALUE # VALIDITY
# 1  # foo   # 2018/01/10
# 2  # bar   # 2018/01/03
# 2  # flo   # 2018/01/10
# 2  # duv   # 2018/01/10
# 3  # pas   # 2018/01/10
# 4  # cip   # 2018/01/08
# 4  # mao   # 2018/01/10

I'd like the query to return the following lines:

# ID # VALUE # VALIDITY
# 1  # foo   # 2018/01/10
# 2  # flo   # 2018/01/10
# 2  # duv   # 2018/01/10
# 3  # pas   # 2018/01/10
# 4  # mao   # 2018/01/10

Notice that (2, bar) and (4, cip) are "outdated" by (2, flo), (2, duv) and (4, mao) respectively.

You can imagine that this is a rather trivial exercise that you can solve using an INNER JOIN:

-- Keep only the rows whose validity equals the latest validity for their ID
SELECT T.*
FROM meds T
INNER JOIN (SELECT ID, MAX(VALIDITY) AS VALIDITY FROM meds GROUP BY ID) AS TT
ON T.ID = TT.ID AND T.VALIDITY = TT.VALIDITY

However, I noticed that this query becomes quite expensive (essentially there are two full scans, plus an additional GROUP BY).
I am not sure whether there is a less expensive way to do this (I tried DISTINCT ON, but that collapses the two entries with ID = 2 into one).
If possible, please also recommend indexes that would improve this use case.

Thank you so much!

Best Answer

This gives the same result as ypercube's answer, but I find a simple max window function easier to understand than dense ranks and such.

select mm.id, mm.value, mm.validity from (
   select m.*, max(m.validity) over (partition by m.id) maxval
      from meds m ) mm
   where mm.validity = mm.maxval;

see http://sqlfiddle.com/#!17/e478c/4 for the results.
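
To check that the window-function query and the join from the question agree on the sample data, here is a minimal sketch that loads the rows into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for PostgreSQL here (an assumption made to keep the example self-contained; window functions need SQLite 3.25 or later), the dates are stored as ISO strings so that MAX() compares them correctly, and the helper names make_db and latest_rows are made up for illustration.

```python
import sqlite3

# Sample rows from the question; dates as ISO strings so MAX() orders them correctly.
ROWS = [
    (1, "foo", "2018-01-10"),
    (2, "bar", "2018-01-03"),
    (2, "flo", "2018-01-10"),
    (2, "duv", "2018-01-10"),
    (3, "pas", "2018-01-10"),
    (4, "cip", "2018-01-08"),
    (4, "mao", "2018-01-10"),
]

# Window-function variant from the answer (ORDER BY added for a stable comparison).
WINDOW_SQL = """
    SELECT mm.id, mm.value, mm.validity
    FROM (SELECT m.*, MAX(m.validity) OVER (PARTITION BY m.id) AS maxval
          FROM meds m) mm
    WHERE mm.validity = mm.maxval
    ORDER BY mm.id, mm.value
"""

# Join variant from the question, with the aggregate aliased so it can be joined on.
JOIN_SQL = """
    SELECT t.id, t.value, t.validity
    FROM meds t
    JOIN (SELECT id, MAX(validity) AS validity FROM meds GROUP BY id) tt
      ON t.id = tt.id AND t.validity = tt.validity
    ORDER BY t.id, t.value
"""


def make_db():
    """Create an in-memory database holding the sample meds table (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meds (id INTEGER, value TEXT, validity TEXT)")
    conn.executemany("INSERT INTO meds VALUES (?, ?, ?)", ROWS)
    return conn


def latest_rows(sql):
    """Run one of the queries above against a fresh copy of the sample data."""
    conn = make_db()
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_rows(WINDOW_SQL):
        print(row)
```

Both queries keep the two rows for ID 2 that share the latest validity date, which is exactly what a plain DISTINCT ON (id) would not do. The sketch says nothing about relative speed on PostgreSQL; that still has to be measured with EXPLAIN ANALYZE on real data.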

Addition by @Erwin

(As per request in comment)

In principle, (default) b-tree indexes can be scanned forward and backward at the same speed. But sorting can make a difference in multi-column indexes where you combine the sort order of multiple columns. The query starts with:

SELECT  DISTINCT ON (songs.rank, songs.shared_id)

In combination with ORDER BY rank DESC, this effectively dictates that the result be ordered by rank DESC, shared_id, after the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT can be applied.
I have my doubts whether the DISTINCT clause is actually useful. But while it is there, the optimal index for the query should be (just as @ypercube suspected):

CREATE INDEX songs_special_idx
ON songs (processed, working, rank DESC, shared_id);

Looks like one of the rare cases where explicit ordering of index columns would benefit the query. There is an excellent explanation in the chapter Indexes and ORDER BY of the manual.

If the WHERE condition is stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster still:

CREATE INDEX songs_special_idx
ON songs (rank DESC, shared_id)
WHERE processed AND working;
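
As a rough illustration of the partial index above, the following sketch creates it in an in-memory SQLite database via Python's sqlite3 module and runs a query with the matching WHERE and ORDER BY. SQLite stands in for PostgreSQL (an assumption; SQLite has no DISTINCT ON, so that clause is left out), the column types and the five rows are invented for the example, and nothing here shows whether a planner actually picks the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs ("
    "shared_id INTEGER, rank INTEGER, processed INTEGER, working INTEGER)"
)

# Made-up rows: (shared_id, rank, processed, working)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [
        (10, 5, 1, 1),
        (11, 9, 1, 1),
        (12, 9, 1, 0),  # not working: outside the partial index and the query
        (13, 7, 0, 1),  # not processed: outside as well
        (14, 9, 1, 1),
    ],
)

# Partial multi-column index: only rows with processed AND working are indexed,
# stored in the same order the query asks for.
conn.execute(
    "CREATE INDEX songs_special_idx "
    "ON songs (rank DESC, shared_id) "
    "WHERE processed AND working"
)

top = conn.execute(
    "SELECT rank, shared_id FROM songs "
    "WHERE processed AND working "
    "ORDER BY rank DESC, shared_id "
    "LIMIT 2"
).fetchall()

index_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
]

if __name__ == "__main__":
    print(top)
    print(index_names)
```

The point of the design is that the index predicate and the query predicate match, and the index column order matches the ORDER BY, so an index scan can stop as soon as LIMIT is satisfied. On PostgreSQL, EXPLAIN would be the way to confirm that this happens.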

Sql-server – Finding distinct rows across two tables: Full Outer Join more efficient than Union

The semantics of the two queries are not the same - UNION removes duplicates, whereas the FULL OUTER JOIN will not:

DECLARE @T1 AS table (id bigint NULL, val integer NULL);
DECLARE @T2 AS table (id bigint NULL, val integer NULL);

INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);

SELECT COALESCE(t1.id, t2.id) AS id, COALESCE(t1.val, t2.val) AS val
FROM @t1 t1
FULL OUTER JOIN @t2 t2
    ON t2.id = t1.id
    AND t2.val = t1.val;

SELECT t1.id, t1.val
FROM @t1 t1
UNION 
SELECT t2.id, t2.val
FROM @t2 t2;

Output:

╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║  1 ║   1 ║
║  1 ║   1 ║
║  1 ║   1 ║
║  1 ║   1 ║
╚════╩═════╝

╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║  1 ║   1 ║
╚════╩═════╝
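
The same difference in semantics can be reproduced outside SQL Server. The sketch below uses Python's sqlite3 module as a stand-in (an assumption, chosen only to keep the example self-contained). Because SQLite versions before 3.39 have no FULL OUTER JOIN, the join is emulated here as a LEFT JOIN plus the unmatched rows of the second table; the table names t1 and t2 mirror the table variables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (id INTEGER, val INTEGER);
    CREATE TABLE t2 (id INTEGER, val INTEGER);
    INSERT INTO t1 VALUES (1, 1), (1, 1);
    INSERT INTO t2 VALUES (1, 1), (1, 1);
    """
)

# Emulated full outer join: every t1 row with its matches,
# then the t2 rows that found no match in t1.
foj_rows = conn.execute(
    """
    SELECT t1.id, t1.val
    FROM t1
    LEFT JOIN t2 ON t2.id = t1.id AND t2.val = t1.val
    UNION ALL
    SELECT t2.id, t2.val
    FROM t2
    LEFT JOIN t1 ON t1.id = t2.id AND t1.val = t2.val
    WHERE t1.id IS NULL
    """
).fetchall()

# UNION removes duplicates across and within both inputs.
union_rows = conn.execute(
    """
    SELECT id, val FROM t1
    UNION
    SELECT id, val FROM t2
    """
).fetchall()

if __name__ == "__main__":
    print(len(foj_rows), "rows from the join")
    print(len(union_rows), "row(s) from UNION")
```

Each of the two rows in t1 matches both rows in t2, so the join produces four rows, while UNION collapses everything to a single row. The two forms only agree when neither input contains duplicates.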

That said, the optimizer does not know many FOJN tricks, so it is always possible that there is a better way to express the query than the natural UNION. Only commonly-useful and always-correct transformations are implemented.

Note that with a unique constraint only on the larger table, the optimizer chooses a Hash Union, avoiding the expensive duplicate removal on the probe input that makes it choose Concat Union All in the question example:

ALTER TABLE #t2 
ADD CONSTRAINT UQ2 
UNIQUE CLUSTERED (id);

SELECT COUNT(*), AVG(x.id), AVG(x.val)
FROM (
    SELECT t1.id, t1.val
    FROM #t1 t1
    UNION
    SELECT t2.id, t2.val
    FROM #t2 t2
) AS x;

The FOJN rewrite may well be a useful one in cases where you know there cannot be duplicates within each input set, but this condition is not enforced with a unique constraint or index (particularly on the large input).

If such a uniqueness guarantee does exist, and yet the optimizer does not select a Hash Union, you might try an OPTION (HASH UNION) hint, to see how it compares.
