Postgresql – Group and count array elements using intarray

aggregatearraypostgresql

I am working on a Postgres 9.4 project with the intarray extension enabled. We have a table that looks like this:

items
-------------------------------------
id    name                  tag_ids  
--------------------------------------
1     a car                 {1,4}
2     a room to rent        {1}
3     a boat                {1,2,4,11}
4     a wine                {2}
5     emily                 {3}

I'd like to group the tag ids if possible. Like get a count of all the elements that have a tag_id of '{1,2,4,11}'

tag_id  count
1       3
2       2
4       2
11      1

Is this possible? I would think an intersection like this:

select * from items where tag_ids && '{1,2,4,11}'

But I need to group by the array elements inside the intersection result. If I group by tag_ids, it is just the unique value.

How would I do it?

Best Answer

Keep the basic query you already have to cheaply identify rows that have any relevant array elements using an index.

Then unnest only the intersection (tag_ids & '{1,2,4,11}') in a LATERAL join. Finally, aggregate:

SELECT tag_id, count(*) AS ct
FROM   items i
     , unnest(tag_ids & '{1,2,4,11}'::int[]) tag_id
WHERE  tag_ids && '{1,2,4,11}'::int[]
GROUP  BY tag_id
ORDER  BY count(*) DESC;

Once again, the intersection operator & of the intarray module is instrumental.

Previous related answer:

Order result by count of common array elements

Without intarray

If you don't have intarray installed, or for any other array type, we need another join instead:

SELECT tag_id, count(*) AS ct
FROM  (
   SELECT *
   FROM   items
   WHERE  tag_ids && '{1,2,4,11}'::int[]
   ) i, unnest(tag_ids) t(tag_id)
JOIN unnest('{1,2,4,11}'::int[]) x(tag_id) USING (tag_id)
GROUP BY tag_id
ORDER BY count(*) DESC;

Subtle difference: the intersection operator folds duplicates to produce distinct elements, while this query does not. Only matters if there can be duplicate array elements ...

SQL Fiddle _{demonstrating both and with an additional row that has a mix of matching and not matching elements to demonstrate the need of the intersection or additional join to eliminate unwanted elements.}

Related Solutions

Postgresql: Pattern match against array elements

You can accomplish what you're asking for by doing something like this.

I created a table and data to help better illustrate what I'm doing.

CREATE TABLE logs (id serial NOT NULL PRIMARY KEY, lognum int, flightnum int);

INSERT INTO logs (lognum, flightnum) VALUES (1,6);
INSERT INTO logs (lognum, flightnum) VALUES (1,7);
INSERT INTO logs (lognum, flightnum) VALUES (1,8);
INSERT INTO logs (lognum, flightnum) VALUES (2,80);
INSERT INTO logs (lognum, flightnum) VALUES (3,12);
INSERT INTO logs (lognum, flightnum) VALUES (4,8008);

postgres@[local]:5432:postgres:=# SELECT * FROM logs;
 id | lognum | flightnum 
----+--------+-----------
 13 |      1 |         6
 14 |      1 |         7
 15 |      1 |         8
 16 |      2 |        80
 17 |      3 |        12
 18 |      4 |      8008
(6 rows)

Time: 0.188 ms
postgres@[local]:5432:postgres:=#

Note that 1, 2, and 4 have flights with 8, 80, and 8008.

Now, using this query, find the lognums with 8 as a flightnum. This takes your original query with an array_agg, wraps another query around it to generates subscripts for each member in the array, which can be arbitrarily large. Finally, an outer query wraps that, which uses the generated subscript to allow you to do a comparison against each of the members of the flightnums array to see if they are LIKE '8%'.

SELECT lognum FROM ( 
                      SELECT lognum, 
                      flightnums, 
                      generate_subscripts(flightnums, 1) AS s 
                      FROM ( 
                             SELECT lognum, 
                                    array_agg(flightnum) AS flightnums 
                                    FROM logs 
                                         GROUP BY lognum
                           ) AS t1
                   ) AS t2 
            WHERE flightnums[s]::text LIKE '8%' ORDER BY lognum;

Which gives you the following output

postgres@[local]:5432:postgres:=# SELECT lognum FROM (SELECT lognum, flightnums, generate_subscripts(flightnums, 1) AS s FROM ( SELECT lognum, array_agg(flightnum) AS flightnums FROM logs GROUP BY lognum) AS t1) AS t2 WHERE flightnums[s]::text LIKE '8%' ORDER BY lognum;
 lognum 
--------
      1
      2
      4
(3 rows)

Time: 0.338 ms
postgres@[local]:5432:postgres:=#

As would be expected from the data above.

For further array manipulation needs, I would recommend reading up on their chapter in the PostgreSQL documentation arrays

Postgresql – Preserve order of array elements after join

The problem with your query is the join condition id = ANY(ancestors). Not only does it not preserve original order, it also eliminates duplicate elements in the array. (An id could match 10 elements in ancestors, it would still be picked once only.) Not sure if the logic of your query would allow duplicate elements, but if it does I am pretty sure you want to preserve all instances - you want to keep "original order" after all.

Assuming current Postgres 9.4+ for lack of information, I suggest a completely different approach:

SELECT n.entity_id, p.ancestors
FROM   tree t
JOIN   nodes n ON n.id = t.node_id
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.ancestors) WITH ORDINALITY a(id, ord)
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS ancestors
   ) p ON true;

You query only works as intended if nodes.id is defined as primary key and nodes.entity_id is unique as well. Information is missing in the question.

Normally, this simpler query without explicit ORDER BY works as well, but there are no guarantees (Postgres 9.3+)...

SELECT n.entity_id, p.ancestors
FROM   tree t
JOIN   nodes n ON n.id = t.node_id
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.ancestors) id
      JOIN   entity.nodes p USING (id)
      ) AS ancestors
   ) p ON true;

You can make this safe as well. Detailed explanation:

PostgreSQL unnest() with element number

SQL Fiddle demo for Postgres 9.3.

Opional optimization

You join to entity.nodes twice - to substitute for node_id and ancestors alike. An alternative would be to fold both into one array or one set and join only once. Might be faster, but you have to test.
For these alternatives we need the ORDER BY in any case:

Add node_id to the ancestors array before we unnest ...

SELECT p.arr[1] AS entity_id, p.arr[2:2147483647] AS ancestors
FROM   tree t
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.node_id || t.ancestors) WITH ORDINALITY a(id, ord)
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS arr
   ) p ON true;

Or add node_id to the unnested elements of ancestors before we join ...

SELECT p.arr[1] AS entity_id, p.arr[2:2147483647] AS ancestors
FROM   tree t
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM  (
         SELECT t.node_id AS id, 0 AS ord
         UNION ALL
         SELECT * FROM unnest(t.ancestors) WITH ORDINALITY
         ) x
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS arr
   ) p ON true;

You did not show our CTE, this might be optimized further ...