PostgreSQL – Efficiently Merge and Remove Duplicates from Arrays

Tags: aggregate, array, postgresql, postgresql-9.3

I have two tables, left2 and right2. Both tables will be large (1-10M rows).

CREATE TABLE left2(id INTEGER, t1 INTEGER, d INTEGER);
ALTER TABLE left2 ADD PRIMARY KEY (id,t1);

CREATE TABLE right2( t1 INTEGER, d INTEGER, arr INTEGER[] );
ALTER TABLE right2 ADD PRIMARY KEY(t1,d);
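
Some minimal sample data for illustration (values made up; the real tables are much bigger):

INSERT INTO left2  VALUES (1, 1, 10), (2, 1, 11);
INSERT INTO right2 VALUES (1, 5, '{1,2,3}'), (1, 6, '{2,3,4}');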

I will perform this type of query:

SELECT l.d + r.d,
       UNIQ(array_agg_mult(r.arr))
FROM left2 l,
     right2 r
WHERE l.t1 = r.t1
GROUP BY l.d + r.d
ORDER BY l.d + r.d;

Where for aggregation of arrays I use the function:

CREATE AGGREGATE array_agg_mult(anyarray) (
  SFUNC    = array_cat,
  STYPE    = anyarray,
  INITCOND = '{}'
);

After concatenating the arrays, I use the uniq() function of the intarray module. Is there a more efficient way of doing this? Is there any kind of index on the arr field that could speed up the merging (with duplicate removal)? Can the aggregate function remove duplicates directly? The original arrays can be considered sorted (and free of duplicates), if that helps.

The SQL Fiddle is here:

Best Answer

Correct results?

First off: correctness. You want to produce an array of unique elements? Your current query does not do that. The function uniq() from the intarray module only promises to:

remove adjacent duplicates
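
A quick demonstration of the difference (assuming the intarray extension is installed):

SELECT uniq('{1,2,1,2}'::int[]);        -- {1,2,1,2} - no adjacent duplicates to remove
SELECT uniq(sort('{1,2,1,2}'::int[]));  -- {1,2}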

As instructed in the manual, you would need:

SELECT l.d + r.d, uniq(sort(array_agg_mult(r.arr)))
FROM   ...

This also gives you sorted arrays - assuming you want that; you did not clarify.

I see you have sort() in your fiddle, so this may just be a typo in your question.

Postgres 9.5 or later

Either way, since Postgres 9.5 the capabilities of my array_agg_mult() are built into array_agg() out of the box, and it is much faster, too.

There have also been other performance improvements for array handling.
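
A minimal sketch of the built-in behavior (Postgres 9.5 or later; input arrays must all share the same dimensions):

SELECT array_agg(arr)           -- collects arrays into one array of a higher dimension
FROM  (VALUES ('{1,2}'::int[]), ('{3,4}'::int[])) v(arr);
-- result: {{1,2},{3,4}}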

Query

The main purpose of array_agg_mult() is to aggregate multi-dimensional arrays, but you only produce 1-dimensional arrays anyway. So I would at least try this alternative query:

SELECT l.d + r.d AS d_sum, array_agg(DISTINCT elem) AS result_arr
FROM   left2  l
JOIN   right2 r USING (t1)
     , unnest(r.arr) elem
GROUP  BY 1
ORDER  BY 1;

Which also addresses your question:

Can the aggregate function remove duplicates directly?

Yes, it can, with DISTINCT. But that's typically not faster than uniq(), which is optimized for integer arrays, while DISTINCT is generic and works for all qualifying data types.

It doesn't require the intarray module. However, the result is not necessarily sorted: Postgres uses varying algorithms for DISTINCT, and big sets are typically hashed, which leaves the result unsorted unless you add an explicit ORDER BY. If you need sorted arrays, you can add ORDER BY to the aggregate function directly:

array_agg(DISTINCT elem ORDER BY elem)

But that's typically slower than feeding pre-sorted data to array_agg() (one big sort versus many small sorts). So I would sort in a subquery and then aggregate:

SELECT d_sum, uniq(array_agg(elem)) AS result_arr
FROM  (
   SELECT l.d + r.d AS d_sum, elem
   FROM   left2  l
   JOIN   right2 r USING (t1)
        , unnest(r.arr) elem
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;
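
With the made-up sample data from the question, this should return something like:

 d_sum | result_arr
-------+------------
    15 | {1,2,3}
    16 | {1,2,3,4}
    17 | {2,3,4}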

This was the fastest variant in my cursory test on Postgres 9.4.

SQL Fiddle based on the one you provided.

Index

I don't see much potential for any index here. The only option would be:

CREATE INDEX ON right2 (t1, arr);

This only makes sense if you get index-only scans out of it, which will happen if the underlying table right2 is substantially wider than just these two columns and your setup qualifies for index-only scans. Details in the Postgres Wiki.
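
To verify that you actually get index-only scans (a sketch - right2_t1_arr_idx is the name Postgres generates by default, and your plan will vary):

VACUUM right2;                     -- refresh the visibility map first
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1, arr FROM right2 WHERE t1 = 1;
-- look for: Index Only Scan using right2_t1_arr_idx on right2 ...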