Supposing we have a table with four columns (a,b,c,d)
of the same data type.
Is it possible to select all distinct values within the data in the columns and return them as a single column or do I have to create a function to achieve this?
distinctperformancepostgresqlpostgresql-9.4postgresql-performance
Supposing we have a table with four columns (a,b,c,d)
of the same data type.
Is it possible to select all distinct values within the data in the columns and return them as a single column or do I have to create a function to achieve this?
Best Answer
Update: Tested all 5 queries in SQLfiddle with 100K rows (and 2 separate cases, one with few (25) distinct values and another with lots (around 25K values).
A very simple query would be to use
UNION DISTINCT
.I think it would be most efficient if there is a separate index on each of the four columnsIt would be efficient with a separate index on each of the four columns, if Postgres had implemented Loose Index Scan optimization, which it hasn't. So this query will not be efficient as it requires 4 scans of the table (and no index is used):Another would be to first
UNION ALL
and then useDISTINCT
. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:The other answers have provided with more options using array functions or the
LATERAL
syntax. Jack's query (187 ms, 261 ms
) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms
). Both of them do one sequential scan of the table and do not use any index.Actually Jack's query results are a bit better than shown above (if we remove the
order by
) and can be further improved by removing the 4 internaldistinct
and leaving only the external one.Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the
WITH RECURSIVE
hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:SQLfiddle
To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.
Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:
and Jack's improved: