I have two tables, `left2` and `right2`. Both tables will be large (1-10M rows).

```sql
CREATE TABLE left2 (id INTEGER, t1 INTEGER, d INTEGER);
ALTER TABLE left2 ADD PRIMARY KEY (id, t1);

CREATE TABLE right2 (t1 INTEGER, d INTEGER, arr INTEGER[]);
ALTER TABLE right2 ADD PRIMARY KEY (t1, d);
```
I will perform this type of query:

```sql
SELECT l.d + r.d,
       uniq(sort(array_agg_mult(r.arr)))
FROM   left2  l,
       right2 r
WHERE  l.t1 = r.t1
GROUP  BY l.d + r.d
ORDER  BY l.d + r.d;
```
Where for the aggregation of arrays I use the function:

```sql
CREATE AGGREGATE array_agg_mult(anyarray) (
    SFUNC    = array_cat,
    STYPE    = anyarray,
    INITCOND = '{}'
);
```
After concatenating the arrays, I apply the `uniq` function from the `intarray` module. Is there a more efficient way of doing this? Is there any index on the `arr` field that would speed up the merging (with duplicate removal)? Can the aggregate function remove duplicates directly? The original arrays may be considered sorted (and their elements unique), if that helps.
The SQL Fiddle is here:
Best Answer
Correct results?
First off: correctness. You want to produce an array of unique elements? Your current query does not do that. The function `uniq()` from the `intarray` module only promises to remove *adjacent* duplicates, so, as instructed in the manual, you would need to sort the array before applying it. That also gives you sorted arrays - assuming you want that; you did not clarify.
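Following the `uniq(sort(a))` pattern from the intarray manual, the aggregate expression would presumably become:

```sql
uniq(sort(array_agg_mult(r.arr)))  -- sort first so uniq removes *all* duplicates
```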
I see you have `sort()` in your fiddle, so this may just be a typo in your question.

Postgres 9.5 or later
Either way, since Postgres 9.5, `array_agg()` has the capabilities of your `array_agg_mult()` built in out of the box, and it is much faster, too. There have also been other performance improvements for array handling.
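For illustration (a minimal sketch, assuming Postgres 9.5+, where `array_agg()` accepts array input; the input arrays must have matching dimensions):

```sql
SELECT array_agg(arr)   -- stacks 1-D arrays into one 2-D array since 9.5
FROM  (VALUES ('{1,2}'::int[]), ('{3,4}'::int[])) t(arr);
-- result: {{1,2},{3,4}}
```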
Query
The main purpose of `array_agg_mult()` is to aggregate multi-dimensional arrays, but you only produce 1-dimensional arrays anyway. So I would at least try an alternative query that unnests the arrays and aggregates distinct elements instead. That also addresses your question:
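One way to write such an alternative (a sketch under your schema; `elem`, `d_sum` and `result_arr` are illustrative names):

```sql
SELECT l.d + r.d AS d_sum
     , array_agg(DISTINCT elem) AS result_arr
FROM   left2  l
JOIN   right2 r ON l.t1 = r.t1
     , unnest(r.arr) AS elem   -- LATERAL is implicit for set-returning functions
GROUP  BY 1
ORDER  BY 1;
```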
Yes, it can, with `DISTINCT`. But that's not faster than `uniq()`, which has been optimized for integer arrays, while `DISTINCT` is generic for all qualifying data types.

This also doesn't require the `intarray` module. However, the result is not necessarily sorted. Postgres uses varying algorithms for `DISTINCT`. Big sets are typically hashed, which leaves the result unsorted unless you add an explicit `ORDER BY`. If you need sorted arrays, you could add `ORDER BY` to the aggregate function directly:
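For instance (sketch; assuming the unnested element column is named `elem`):

```sql
array_agg(DISTINCT elem ORDER BY elem)  -- sorts inside the aggregate
```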
But that's typically slower than feeding pre-sorted data to `array_agg()` (one big sort versus many small sorts). So I would sort in a subquery and aggregate afterwards. This was the fastest variant in my cursory test on Postgres 9.4.
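A sketch of that subquery variant (assuming your table and column names; `sub`, `elem`, `d_sum` and `result_arr` are illustrative):

```sql
SELECT d_sum, array_agg(DISTINCT elem) AS result_arr
FROM  (
   SELECT l.d + r.d AS d_sum, unnest(r.arr) AS elem
   FROM   left2  l
   JOIN   right2 r ON l.t1 = r.t1
   ORDER  BY 1, 2        -- one big sort instead of many small sorts
   ) sub
GROUP  BY 1
ORDER  BY 1;
```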
SQL Fiddle based on the one you provided.
Index
I don't see much potential for any index here. The only option would be a multicolumn index covering `(t1, arr)`.
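Something like this (the index name is a hypothetical choice; it covers the join column and the aggregated column):

```sql
CREATE INDEX right2_t1_arr_idx ON right2 (t1, arr);  -- hypothetical name
```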
This only makes sense if you get index-only scans out of it - which will happen if the underlying table `right2` is substantially wider than just these two columns and your setup qualifies for index-only scans. Details in the Postgres Wiki.