Postgresql – Select distinct multiple columns with one result column

postgresql

I have a table like the following:

CREATE TABLE aschema.atable
 (
  id       BIGSERIAL,
  col_a    aschema.anenum,
  col_b    aschema.anenum,
  col_c    aschema.anenum
)

and I would like to do something like

SELECT DISTINCT unnest(array_agg(col_a, col_b, col_c)) as anenum_value
  FROM aschema.atable
  WHERE anenum_value IS NOT NULL

but I do not know if it makes sense.
Maybe with an index over each of the three column I can simply select independently the three columns like that:

SELECT DISTINCT value FROM (
SELECT col_a as value
  FROM aschema.atable  WHERE col_a IS NOT NULL
UNION
SELECT col_b as value
  FROM aschema.atable  WHERE col_b IS NOT NULL
UNION
SELECT col_c as value
  FROM aschema.atable  WHERE col_c IS NOT NULL
)

But I am not sure about the performance.

Actually I have rows like

id  col_a  col_b  col_c
1   ABD    CDE    XYZ
2   CDE    null   null
3   ABD    null   null
3   FGH    LMN   null

And I expect as a result

ABC
ABD
CDE
FGH
LMN
XYZ

Any recommendation and good example?

Best Answer

There another several solutions:

Using lateral join and values:

select distinct
  x
from
  aschema.atable cross join lateral
    (values(col_a),(col_b),(col_c)) as t(x)
where
  x is not null
order by
  x;

Using arrays:

select distinct
  x
from
  (select unnest(array[col_a,col_b,col_c]) from aschema.atable) as     t(x)
where
  x is not null
order by
  x;

Normalize data structure:

CREATE TABLE aschema.atable
(
  id         bigserial,      -- Just an ID
  trinity_id bigint not null -- It is ID from your original table
  what       int check(what in (1,2,3)), -- 1 - a, 2 - b, 3 - c 
  value      aschema.anenum,
  unique (trinity_id,what),  
  unique (trinity_id,value)  -- To be sure that each trinity have distinct values
);

And for now your data could be:

id  trinity_id what value
1   1          1    ABD
2   1          2    CDE
3   1          3    XYZ
4   2          1    CDE
5   3          1    ABD
6   4          1    FGH
7   4          2    LMN

Consistent rows

The important question which does not seem to be on your radar yet:
From each set of rows for the same seriesName, do you want the columns of one row, or just any values from multiple rows (which may or may not go together)?

Your answer does the latter, you combine the maximum dbid with the maximum retreivaltime, which may come from a different row.

To get consistent rows, use DISTINCT ON and wrap it in a subquery to order the result differently:

SELECT * FROM (
   SELECT DISTINCT ON (seriesName)
          dbid, seriesName, retreivaltime
   FROM   FileItems
   WHERE  sourceSite = 'mk' 
   ORDER  BY seriesName, retreivaltime DESC NULLS LAST  -- latest retreivaltime
   ) sub
ORDER BY retreivaltime DESC NULLS LAST
LIMIT  100;

Details for DISTINCT ON:

Select first row in each GROUP BY group?

Aside: should probably be retrievalTime, or better yet: retrieval_time. Unquoted mixed case identifiers are a common source of confusion in Postgres.

Better Performance with rCTE

Since we are dealing with a big table here, we'd need a query that can use an index, which is not the case for the above query (except for WHERE sourceSite = 'mk')

On closer inspection, your problem seems to be a special case of a loose index scan. Postgres does not support loose index scans natively, but it can be emulated with a recursive CTE. There is a code example for the simple case in the Postgres Wiki.

Related answer on SO with more advanced solutions, explanation, fiddle:

Optimize GROUP BY query to retrieve latest record per user

Your case is more complex, though. But I think I found a variant to make it work for you. Building on this index (without WHERE sourceSite = 'mk')

CREATE INDEX mi_special_full_idx ON MangaItems
(retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)

Or (with WHERE sourceSite = 'mk')

CREATE INDEX mi_special_granulated_idx ON MangaItems
(sourceSite, retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)

The first index can be used for both queries, but is not fully efficient with the additional WHERE condition. The second index is of very limited use for the first query. Since you have both variants of the query consider creating both indexes.

I added dbid at the end to allow Index Only scans.

This query with a recursive CTE makes use of the index. I tested with Postgres 9.3 and it works for me: no sequential scan, all index-only scans:

WITH RECURSIVE cte AS (
   (
   SELECT dbid, seriesName, retreivaltime, 1 AS rn, ARRAY[seriesName] AS arr
   FROM   MangaItems
   WHERE  sourceSite = 'mk'
   ORDER  BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT i.dbid, i.seriesName, i.retreivaltime, c.rn + 1, c.arr || i.seriesName
   FROM   cte c
   ,      LATERAL (
      SELECT dbid, seriesName, retreivaltime
      FROM   MangaItems
      WHERE (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName)
      AND    sourceSite = 'mk'  -- repeat condition!
      AND    seriesName <> ALL(c.arr)
      ORDER  BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
      LIMIT  1
      ) i
   WHERE  c.rn < 101
   )
SELECT dbid
FROM   cte
ORDER  BY rn;

You need to include seriesName in ORDER BY, since retreivaltime is not unique. "Almost" unique is still not unique.

Explain

The non-recursive query starts with the latest row.
The recursive query adds the next-latest row with a seriesName that's not in the list, yet etc., until we have 100 rows.
Essential parts are the JOIN condition (b.retreivaltime, b.seriesName) < (c.retreivaltime, c.seriesName) and the ORDER BY clause ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST. Both match the sort order of the index, which allows for the magic to happen.
Collecting seriesName in an array to rule out duplicates. The cost for b.seriesName <> ALL(c.foo_arr) grows progressively with the number of rows, but for just 100 rows it is still cheap.
Just returning dbid as clarified in the comments.

Alternative with partial indexes:

We have been dealing with similar problems before. Here is a highly optimized complete solution based on partial indexes and a looping function:

Can spatial index help a "range - order by - limit" query

Probably the fastest way (except for a materialized view) if done right. But more complex.

Materialized View

Since you do not have a lot of write operations and they are not performance-critical as stated in the comments (should be in the question), save the top n pre-computed rows in a materialized view and refresh it after relevant changes to the underlying table. Base your performance-critical queries on the materialized view instead.

Could just be a "thin" mv of the latest 1000 dbid or so. In the query, join to the original table. For instance, if content is sometimes updated, but the top n rows can remain unchanged.
Or a "fat" mv with whole rows to return. Faster, yet. Needs to be refreshed more often, obviously.

Details in the manual here and here.

Postgresql – SELECT DISTINCT on multiple columns

Update: Tested all 5 queries in SQLfiddle with 100K rows (and 2 separate cases, one with few (25) distinct values and another with lots (around 25K values).

A very simple query would be to use UNION DISTINCT. ~~I think it would be most efficient if there is a separate index on each of the four columns~~ It would be efficient with a separate index on each of the four columns, if Postgres had implemented Loose Index Scan optimization, which it hasn't. So this query will not be efficient as it requires 4 scans of the table (and no index is used):

-- Query 1. (334 ms, 368ms) 
SELECT a AS abcd FROM tablename 
UNION                           -- means UNION DISTINCT
SELECT b FROM tablename 
UNION 
SELECT c FROM tablename 
UNION 
SELECT d FROM tablename ;

Another would be to first UNION ALL and then use DISTINCT. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:

-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
  ( SELECT a FROM tablename 
    UNION ALL 
    SELECT b FROM tablename 
    UNION ALL
    SELECT c FROM tablename 
    UNION ALL
    SELECT d FROM tablename 
  ) AS x ;

The other answers have provided with more options using array functions or the LATERAL syntax. Jack's query (187 ms, 261 ms) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms). Both of them do one sequential scan of the table and do not use any index.

Actually Jack's query results are a bit better than shown above (if we remove the order by) and can be further improved by removing the 4 internal distinct and leaving only the external one.

Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the WITH RECURSIVE hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:

-- Query 3.  (2 ms, 368ms)
WITH RECURSIVE 
    da AS (
       SELECT min(a) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(a) FROM observations
               WHERE  a > s.n)
       FROM   da AS s  WHERE s.n IS NOT NULL  ),
    db AS (
       SELECT min(b) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(b) FROM observations
               WHERE  b > s.n)
       FROM   db AS s  WHERE s.n IS NOT NULL  ),
   dc AS (
       SELECT min(c) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(c) FROM observations
               WHERE  c > s.n)
       FROM   dc AS s  WHERE s.n IS NOT NULL  ),
   dd AS (
       SELECT min(d) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(d) FROM observations
               WHERE  d > s.n)
       FROM   db AS s  WHERE s.n IS NOT NULL  )
SELECT n 
FROM 
( TABLE da  UNION 
  TABLE db  UNION 
  TABLE dc  UNION 
  TABLE dd
) AS x 
WHERE n IS NOT NULL ;

SQLfiddle

To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.

Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:

-- Query 1b.  (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations 
UNION 
SELECT DISTINCT b FROM observations 
UNION 
SELECT DISTINCT c FROM observations 
UNION 
SELECT DISTINCT d FROM observations ;

and Jack's improved:

-- Query 4b.  (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
                        array_agg(b)||
                        array_agg(c)||
                        array_agg(d) )
from t ;