Postgresql – SELECT DISTINCT on multiple columns

distinctperformancepostgresqlpostgresql-9.4postgresql-performance

Supposing we have a table with four columns (a,b,c,d) of the same data type.

Is it possible to select all distinct values within the data in the columns and return them as a single column or do I have to create a function to achieve this?

Best Answer

Update: Tested all 5 queries in SQLfiddle with 100K rows (and 2 separate cases, one with few (25) distinct values and another with lots (around 25K values).

A very simple query would be to use UNION DISTINCT. I think it would be most efficient if there is a separate index on each of the four columns It would be efficient with a separate index on each of the four columns, if Postgres had implemented Loose Index Scan optimization, which it hasn't. So this query will not be efficient as it requires 4 scans of the table (and no index is used):

-- Query 1. (334 ms, 368ms) 
SELECT a AS abcd FROM tablename 
UNION                           -- means UNION DISTINCT
SELECT b FROM tablename 
UNION 
SELECT c FROM tablename 
UNION 
SELECT d FROM tablename ;

Another would be to first UNION ALL and then use DISTINCT. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:

-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
  ( SELECT a FROM tablename 
    UNION ALL 
    SELECT b FROM tablename 
    UNION ALL
    SELECT c FROM tablename 
    UNION ALL
    SELECT d FROM tablename 
  ) AS x ;

The other answers have provided with more options using array functions or the LATERAL syntax. Jack's query (187 ms, 261 ms) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms). Both of them do one sequential scan of the table and do not use any index.

Actually Jack's query results are a bit better than shown above (if we remove the order by) and can be further improved by removing the 4 internal distinct and leaving only the external one.


Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the WITH RECURSIVE hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:

-- Query 3.  (2 ms, 368ms)
WITH RECURSIVE 
    da AS (
       SELECT min(a) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(a) FROM observations
               WHERE  a > s.n)
       FROM   da AS s  WHERE s.n IS NOT NULL  ),
    db AS (
       SELECT min(b) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(b) FROM observations
               WHERE  b > s.n)
       FROM   db AS s  WHERE s.n IS NOT NULL  ),
   dc AS (
       SELECT min(c) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(c) FROM observations
               WHERE  c > s.n)
       FROM   dc AS s  WHERE s.n IS NOT NULL  ),
   dd AS (
       SELECT min(d) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(d) FROM observations
               WHERE  d > s.n)
       FROM   db AS s  WHERE s.n IS NOT NULL  )
SELECT n 
FROM 
( TABLE da  UNION 
  TABLE db  UNION 
  TABLE dc  UNION 
  TABLE dd
) AS x 
WHERE n IS NOT NULL ;

SQLfiddle


To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.


Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:

-- Query 1b.  (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations 
UNION 
SELECT DISTINCT b FROM observations 
UNION 
SELECT DISTINCT c FROM observations 
UNION 
SELECT DISTINCT d FROM observations ;

and Jack's improved:

-- Query 4b.  (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
                        array_agg(b)||
                        array_agg(c)||
                        array_agg(d) )
from t ;