Postgresql – How to remove duplicates from any array and preserve ordering in PostgreSQL

arrayduplicationorder-bypostgresql

I'm looking for a way to eliminate duplicates from a PostgreSQL array
while preserving the ordering of its elements. What I currently
have are the following functions:

create function array_unique( anyarray ) 
  returns anyarray immutable strict language sql as $$
  select array( select distinct unnest( $1 ) ); $$;

create function array_unique_sorted( anyarray ) 
  returns anyarray immutable strict language sql as $$
  select array( select distinct unnest( $1 ) order by 1 ); $$;

/* ### TAINT there ought to be a simpler, declarative solution */
create function array_unique_stable( text[] )
  returns text[] immutable strict parallel safe language plpgsql as $$
  declare
    R         text[] = '{}';
    ¶element  text;
  begin
    foreach ¶element in array $1 loop
      if not array[ ¶element ] && R then
        R :=  R || array[ ¶element ];
        end if;
      end loop;
    return R; end; $$;

In the above, array_unique takes an array of any type and returns a copy
with all duplicates removed; their relative ordering is arbitrary.
array_unique_sorted is like array_unique, but the elements are sorted
relative to each other; this is sometimes useful as all arrays with the
same set of of distinct elements will compare equal after being
normalized by this function.

array_unique_stable already does what I'm looking for: given an array
(which in this exampe must be a text[] array), it scans elements from
left to right; whenever it encounters a previously unseen element, it adds
that one to the result. Thus, only the first occurrence of each value
is kept.

However, the implementation has some drawbacks: First, there seems to be no
way to write it so it accepts the pseudo-type anyarray.

Second, while
the first two functions are written in SQL, they may presumably be inlined,
array_unique_stable is written in PL/pgSQL, and so it cannot be inlined.

Third, it bugs me that I couldn't come up with a solution in pure SQL…

Best Answer

This can indeed be done using pure SQL:

create function array_unique_stable(p_input anyarray)
  returns anyarray immutable strict parallel safe 
  language sql
as 
$$
select array_agg(t order by x)
from (
  select distinct on (t) t,x
  from unnest(p_input) with ordinality as p(t,x)
  order by t,x
) t2;
$$

The unnest(p_input) with ordinality will return the original index of the element in the array which is then used to aggregate them back in the outer query.

select array_unique_stable(array['a','x','x','b']) as text_array, 
       array_unique_stable(array[10,1,1,5,8,8]) as int_array

returns

text_array | int_array 
-----------+-----------
{a,x,b}    | {10,1,5,8}

and

Related Solutions

Postgresql – Reuse SELECT query by adding results in array

It's a textbook case for a CTE, like @Daniel commented.
The example can be simplified some more. And you need to be aware of how LIMIT works in a UNION query.

CREATE OR REPLACE FUNCTION featured_products(valid_to_in      timestamp
                                           , taxonomy_id_in   integer
                                           , product_limit_in integer)
  RETURNS SETOF integer AS
$func$
BEGIN
   RETURN QUERY
   WITH featured AS (SELECT supplier_id FROM products LIMIT 2)
   SELECT supplier_id
   FROM   featured

   UNION ALL
   (
   SELECT p.supplier_id
   FROM   products p
   LEFT   JOIN featured f USING (supplier_id)
   WHERE  f.supplier_id IS NULL
   LIMIT  product_limit_in
   )  -- parens required - or not?
END
$func$  LANGUAGE plpgsql VOLATILE;

LIMIT can only be applied once in a UNION (ALL) query, unless you enclose the leg of the query in parentheses. You may or may not want to add parentheses.
- The way I have it, a maximum of product_limit_in rows are returned in addition to the "featured" rows from the CTE.
- If you remove the parentheses you get a maximum of product_limit_in rows total - meaning that even "featured" products may be discarded.
  Related: Optimize a query on two big tables
Either way, don't ORDER BY the outer (combined) result before you LIMIT, if you can avoid it. Postgres can optimize the query very efficiently and just stop evaluating once enough rows have been returned (possibly fetching tuples from the top of a matching index). That would not be possible any more, which can make a huge difference in performance.
Using LEFT JOIN / NOT NULL to exclude featured rows from the second SELECT, which is probably faster than NOT IN and does not carry "surprises" when dealing with NULL values or empty results.
- Select rows which are not present in other table
In Postgres (as opposed to some other RDBMS), you can refer to p.supplier_id and f.supplier_id after joining with USING (supplier_id).

And yes, the CTE is only evaluated once:

A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries.

Bold emphasis mine.

Postgresql: Pattern match against array elements

You can accomplish what you're asking for by doing something like this.

I created a table and data to help better illustrate what I'm doing.

CREATE TABLE logs (id serial NOT NULL PRIMARY KEY, lognum int, flightnum int);

INSERT INTO logs (lognum, flightnum) VALUES (1,6);
INSERT INTO logs (lognum, flightnum) VALUES (1,7);
INSERT INTO logs (lognum, flightnum) VALUES (1,8);
INSERT INTO logs (lognum, flightnum) VALUES (2,80);
INSERT INTO logs (lognum, flightnum) VALUES (3,12);
INSERT INTO logs (lognum, flightnum) VALUES (4,8008);

postgres@[local]:5432:postgres:=# SELECT * FROM logs;
 id | lognum | flightnum 
----+--------+-----------
 13 |      1 |         6
 14 |      1 |         7
 15 |      1 |         8
 16 |      2 |        80
 17 |      3 |        12
 18 |      4 |      8008
(6 rows)

Time: 0.188 ms
postgres@[local]:5432:postgres:=#

Note that 1, 2, and 4 have flights with 8, 80, and 8008.

Now, using this query, find the lognums with 8 as a flightnum. This takes your original query with an array_agg, wraps another query around it to generates subscripts for each member in the array, which can be arbitrarily large. Finally, an outer query wraps that, which uses the generated subscript to allow you to do a comparison against each of the members of the flightnums array to see if they are LIKE '8%'.

SELECT lognum FROM ( 
                      SELECT lognum, 
                      flightnums, 
                      generate_subscripts(flightnums, 1) AS s 
                      FROM ( 
                             SELECT lognum, 
                                    array_agg(flightnum) AS flightnums 
                                    FROM logs 
                                         GROUP BY lognum
                           ) AS t1
                   ) AS t2 
            WHERE flightnums[s]::text LIKE '8%' ORDER BY lognum;

Which gives you the following output

postgres@[local]:5432:postgres:=# SELECT lognum FROM (SELECT lognum, flightnums, generate_subscripts(flightnums, 1) AS s FROM ( SELECT lognum, array_agg(flightnum) AS flightnums FROM logs GROUP BY lognum) AS t1) AS t2 WHERE flightnums[s]::text LIKE '8%' ORDER BY lognum;
 lognum 
--------
      1
      2
      4
(3 rows)

Time: 0.338 ms
postgres@[local]:5432:postgres:=#

As would be expected from the data above.

For further array manipulation needs, I would recommend reading up on their chapter in the PostgreSQL documentation arrays

Best Answer

Related Solutions

Postgresql – Reuse SELECT query by adding results in array

Postgresql: Pattern match against array elements

Related Question