PostgreSQL – Checking for multiple identical values in a PostgreSQL array

array, postgresql

I have a simple table in PostgreSQL:

CREATE TABLE data (id integer, values integer[]);
INSERT INTO data VALUES (1, '{1,2,3,4,5}');
INSERT INTO data VALUES (2, '{1,1,2,3,4,5}');
INSERT INTO data VALUES (3, '{1,1,2,1,3,4,5}');

My basic query for values is something like:

SELECT id FROM data WHERE values @> ARRAY[1,2];

I am trying to select rows with multiple copies of the same value, e.g.

SELECT id FROM data WHERE values @> ARRAY[1,1,3];

Because containment is checked element by element and duplicates are ignored, the query above matches all 3 rows. I would like it to match only IDs 2 and 3, i.e. the rows with at least two copies of 1 in the 'values' array. Similarly,

SELECT id FROM data WHERE values @> ARRAY[1,1,1,2];

should match only ID 3.

Any pointers on how to proceed, or which functions to look into?

Thanks.

Best Answer

Note: this only works if data.id is the PRIMARY KEY, because the HAVING clause references values without listing it in GROUP BY.
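
A minimal sketch of adding the key the note asks for, assuming id is unique and contains no NULLs (otherwise the statement fails). With the key in place, PostgreSQL treats values as functionally dependent on id, so the column may appear in HAVING without being listed in GROUP BY.

```sql
-- Assumes data.id is unique and non-null; the statement fails otherwise.
ALTER TABLE data ADD PRIMARY KEY (id);
```

Alternatively, adding data.values to the GROUP BY list should let the query run without a key.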

SELECT
    data.id
FROM 
    data, 
    LATERAL (SELECT DISTINCT unnest(values)) no_duplicates
GROUP BY 
    data.id 
HAVING
    array_length(values, 1) > COUNT(no_duplicates)

Here's an SQL Fiddle.

This works by converting your array into a recordset/table (which I've called "no_duplicates") using unnest(), and removing duplicates using DISTINCT:

LATERAL (SELECT DISTINCT unnest(values)) no_duplicates

Then I GROUP BY the original data table's ID and compare the row count of the new, filtered recordset with the length of the original, unfiltered array. If the original array is longer, duplicates were removed, so that row should be selected:

array_length(values, 1) > COUNT(no_duplicates)
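
The query above selects every row that contains any duplicate at all. For the stricter requirement in the question (each searched value must occur at least as often as it appears in the search array), one possible approach, which is not part of the original answer, is to compare per-value counts. The aliases d, want, w, h and needed are made up for illustration.

```sql
-- Sketch: keep a row only if no searched value occurs fewer times
-- in the row's array than it does in the search array.
SELECT d.id
FROM   data d
WHERE  NOT EXISTS (
   SELECT 1
   FROM  (
      -- how many copies of each value the search array asks for
      SELECT w.v, count(*) AS needed
      FROM   unnest(ARRAY[1,1,3]) AS w(v)
      GROUP  BY w.v
      ) want
   WHERE  want.needed > (
      -- how many copies of that value the row actually has
      SELECT count(*)
      FROM   unnest(d."values") AS h(v)
      WHERE  h.v = want.v
      )
   )
ORDER  BY d.id;
-- For the three sample rows in the question: ids 2 and 3
```

This does not need the primary key, since there is no GROUP BY on data. It scans every row, so it will probably not benefit from an array index on its own; combining it with an indexed containment test on the same search array might help narrow the candidates first.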

Related Solutions

PostgreSQL – Select into Specific Array Positions with array_agg()

Your answer basically gets the job done:

SELECT b.id, array_agg(b.stock) AS stock
FROM  (
   SELECT i.id, COALESCE(i_s.stock, 0) AS stock
   FROM   item i
   CROSS  JOIN unnest('{1,2}'::int[]) n
   LEFT   JOIN item_stock i_s ON i.id = i_s.item_id AND n.n = i_s.shop_id
   ORDER  BY i.id, n.n
   ) b
GROUP  BY b.id;

Two notable changes:

Order is not guaranteed without ORDER BY in the subquery or as an additional clause inside array_agg() (typically more expensive). And that's the core element of your question.
unnest('{1,2}'::int[]) instead of generate_series(1,2), as the requested shop IDs will hardly be sequential all the time.
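
For comparison, the variant with the ORDER BY clause inside the aggregate might look like this. It is a sketch against the same item and item_stock tables as above, with no subquery needed.

```sql
-- ORDER BY inside array_agg() fixes the element order per group.
SELECT i.id,
       array_agg(COALESCE(i_s.stock, 0) ORDER BY n.n) AS stock
FROM   item i
CROSS  JOIN unnest('{1,2}'::int[]) n
LEFT   JOIN item_stock i_s ON i_s.item_id = i.id
                          AND i_s.shop_id = n.n
GROUP  BY i.id;
```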

I also moved the set-returning function from the SELECT list to a separate table expression attached with CROSS JOIN. That is the standard SQL form, but it's just a matter of clarity and taste, not a necessity, at least in Postgres 10 or later. See:

What is the expected behaviour for multiple set-returning functions in SELECT clause?

Doing the same with CROSS JOIN LATERAL and an ARRAY constructor might be a bit faster, as we don't need the outer GROUP BY and the ARRAY constructor is typically faster, too:

SELECT i.id, s.stock
FROM   item i
CROSS  JOIN LATERAL (
   SELECT ARRAY(
      SELECT COALESCE(i_s.stock, 0)
      FROM   unnest('{1,2}'::int[]) n
      LEFT   JOIN item_stock i_s ON i_s.shop_id = n.n
                                AND i_s.item_id = i.id
      ORDER  BY n.n
      ) AS stock
   ) s;

And if you have more than just the two shops, a nested crosstab() should provide top performance:

SELECT i.id, COALESCE(stock, '{0,0}') AS stock
FROM   item i
LEFT   JOIN (
   SELECT id, ARRAY[COALESCE(shop1, 0), COALESCE(shop2, 0)] AS stock
   FROM   crosstab(
     $$SELECT item_id, shop_id, stock
       FROM   item_stock
       WHERE  shop_id = ANY ('{1,2}'::int[])
       ORDER  BY 1,2$$

     , $$SELECT unnest('{1,2}'::int[])$$
      ) AS ct (id int, shop1 int, shop2 int)
   ) i_s USING (id);

Needs to be adapted in more places to cater for different shop IDs.
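
As an illustration of the places that need adapting, here is a sketch for three shops with the hypothetical IDs 1, 2 and 5. It assumes the tablefunc extension, which provides crosstab(), can be installed in the database. The ID list appears twice, and the fallback array, the ARRAY constructor and the column definition list all grow by one entry.

```sql
-- crosstab() is provided by the additional module tablefunc.
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT i.id, COALESCE(stock, '{0,0,0}') AS stock          -- one 0 per shop
FROM   item i
LEFT   JOIN (
   SELECT id, ARRAY[COALESCE(shop1, 0), COALESCE(shop2, 0), COALESCE(shop5, 0)] AS stock
   FROM   crosstab(
     $$SELECT item_id, shop_id, stock
       FROM   item_stock
       WHERE  shop_id = ANY ('{1,2,5}'::int[])
       ORDER  BY 1,2$$
     , $$SELECT unnest('{1,2,5}'::int[])$$
      ) AS ct (id int, shop1 int, shop2 int, shop5 int)   -- one column per shop
   ) i_s USING (id);
```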

PostgreSQL Crosstab Query

db<>fiddle here

Index

Make sure you have at least an index on item_stock (shop_id, item_id) - typically provided by a PRIMARY KEY on those columns. For the crosstab query, it also matters that shop_id comes first. See:

Is a composite index also good for queries on the first field?

Adding stock as another index expression may allow faster index-only scans. In Postgres 11 or later, consider adding an INCLUDE clause to the PK like so:

PRIMARY KEY (shop_id, item_id) INCLUDE (stock)

But only if you need it a lot, as it makes the index a bit bigger and possibly more susceptible to bloat from updates.
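
A sketch of where that clause would sit. The column types and the index name item_stock_covering_idx are assumptions for illustration; the actual table definition is not shown in the question.

```sql
-- Postgres 11 or later. Column types are assumed.
CREATE TABLE item_stock (
   item_id int NOT NULL,
   shop_id int NOT NULL,
   stock   int NOT NULL,
   PRIMARY KEY (shop_id, item_id) INCLUDE (stock)
);

-- For a table that already exists, a separate covering index is an
-- alternative to redefining the primary key (hypothetical index name).
CREATE UNIQUE INDEX item_stock_covering_idx
   ON item_stock (shop_id, item_id) INCLUDE (stock);
```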

Postgresql – How does postgres store array values

The internal representation of larger attributes is sometimes compressed. This is done by TOAST (The Oversized-Attribute Storage Technique), the mechanism PostgreSQL uses for large field values. Values are considered for compression once a row exceeds roughly 2 kB (about 2000 bytes).

pg_column_size() does not return a logical length but the size (in bytes) of the actual internal representation of the column or variable. It is documented.
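
A small sketch to observe the difference, using a hypothetical temporary table toast_demo. The stored column should report fewer bytes than the same array built on the fly, because only the stored value has gone through TOAST compression. Exact numbers depend on the server, so none are given here.

```sql
-- Hypothetical demo table; 10000 zeros compress very well.
CREATE TEMP TABLE toast_demo (arr int[]);
INSERT INTO toast_demo SELECT array_fill(0, ARRAY[10000]);

SELECT cardinality(arr)                            AS elements,
       pg_column_size(arr)                         AS stored_bytes,
       pg_column_size(array_fill(0, ARRAY[10000])) AS in_memory_bytes
FROM   toast_demo;
```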

PostgreSQL stores array values in a custom, internal, binary format. A command-line example follows. Details also here.

filip=# CREATE TABLE a(x text, a text[][]);
CREATE TABLE
filip=# insert into a select 'MARK', '{{ENE,DUE},{LIKE,FAKE}}';
INSERT 0 1
filip=# insert into a select 'MARK', '{{ENE,DUE},{LIKE,FAKE}}';
INSERT 0 1
filip=# checkpoint ;
CHECKPOINT
filip=# SELECT pg_relation_filepath('a');
 pg_relation_filepath 
----------------------
 base/16385/16576
(1 row)

filip@szary:~$ sudo hexdump -C ~postgres/9.5/main/base/16385/16576
00000000  00 00 00 00 f0 99 b6 02  00 00 00 00 20 00 40 1f  |............ .@.|
00000010  00 20 04 20 00 00 00 00  a0 9f b4 00 40 9f b4 00  |. . ........@...|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001f40  c8 07 08 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001f50  02 00 02 00 02 08 18 00  0b 4d 41 52 4b 7b 02 00  |.........MARK{..|
00001f60  00 00 00 00 00 00 19 00  00 00 02 00 00 00 02 00  |................|
00001f70  00 00 01 00 00 00 01 00  00 00 1c 00 00 00 45 4e  |..............EN|
00001f80  45 00 1c 00 00 00 44 55  45 00 20 00 00 00 4c 49  |E.....DUE. ...LI|
00001f90  4b 45 20 00 00 00 46 41  4b 45 00 00 00 00 00 00  |KE ...FAKE......|
00001fa0  c7 07 08 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001fb0  01 00 02 00 02 08 18 00  0b 4d 41 52 4b 7b 02 00  |.........MARK{..|
00001fc0  00 00 00 00 00 00 19 00  00 00 02 00 00 00 02 00  |................|
00001fd0  00 00 01 00 00 00 01 00  00 00 1c 00 00 00 45 4e  |..............EN|
00001fe0  45 00 1c 00 00 00 44 55  45 00 20 00 00 00 4c 49  |E.....DUE. ...LI|
00001ff0  4b 45 20 00 00 00 46 41  4b 45 00 00 00 00 00 00  |KE ...FAKE......|
00002000
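
Parts of the dump can be cross-checked from SQL. In each tuple, the bytes after MARK look like a one-byte length header (7b) followed by the array header: the number of dimensions (02 00 00 00), a data offset that is zero when there are no NULLs (00 00 00 00), the element type OID (19 00 00 00), then two dimension sizes and two lower bounds, and finally the four text elements, each with its own length word. This reading is an interpretation of the dump, not something stated in the answer.

```sql
-- Run against the table a created above.
SELECT array_ndims(a)       AS ndims,     -- 2 for the sample rows
       array_dims(a)        AS dims,      -- [1:2][1:2]
       pg_column_size(a)    AS bytes,     -- on-disk size of the array value
       'text'::regtype::oid AS text_oid   -- 25, which is 0x19 in the dump
FROM   a;
```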

Answer written by @filiprem, extending the basic information provided by @a-horse-with-no-name.
