Order Results by Count of Common Array Elements in PostgreSQL

array postgresql

Using Postgres 9.4, I have an array of integers like user_ids_who_like. Given a second array of users (like user_ids_i_am_following), I want to find rows where the two arrays intersect and sort by the size of that intersection.

Something like:

select * 
from items 
where [there is an intersection between 
       user_ids_who_like with user_ids_i_am_following] 
order by intersection(user_ids_who_like).count

Is grouping and ordering by an array intersection possible?

Example data:

items
name          | user_ids_who_like
'birds'       | '{1,3,5,8}'
'planes'      | '{2,3,4,11}'
'spaceships'  | '{3,4,6}'

For a given user_ids_who_i_follow = [3,4,11], can I do something like:

select * from items
where <user_ids_who_like intersects with user_ids_who_i_follow>
order by <count of that intersection>

Desired result:

name          | user_ids_who_like  | count
'planes'      |  '{2,3,4,11}'      | 3
'spaceships'  |  '{3,4,6}'         | 2
'birds'       |  '{1,3,5,8}'       | 1

One possibility seems to be something like this:

select id, user_ids_who_like, (user_ids_who_like & '{514, 515}'::int[]) as jt  
from queryables 
where user_ids_who_like && '{514, 515}' 
order by icount(user_ids_who_like & '{514, 515}'::int[]) desc;

But I can't tell if this style (using the intarray extension rather than native array functions and operators) is outdated; any feedback from more sophisticated users here? It's not clear to me how to do the intersection of two arrays using the native array functions and operators.

Best Answer

With tools of the basic Postgres installation only, you might unnest() and count in a LATERAL subquery:

SELECT i.name, i.user_ids_who_like, x.ct
FROM   items i
     , LATERAL (
   SELECT count(*) AS ct
   FROM   unnest(i.user_ids_who_like) uid
   WHERE  uid = ANY('{3,4,11}'::int[])
   ) x
ORDER  BY x.ct DESC;  -- add PK as tiebreaker for stable sort order

We don't need a LEFT JOIN to preserve rows without match because count() always returns a row - 0 for "no match".
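To try this against the example data from the question, here is a minimal sketch. The temporary table definition (name as primary key) and the tiebreaker on name are assumptions for illustration, not part of the original schema.

```sql
-- Assumed test setup mirroring the example data in the question.
CREATE TEMP TABLE items (
   name              text PRIMARY KEY,  -- assumption: name is unique
   user_ids_who_like int[]
);

INSERT INTO items VALUES
   ('birds',      '{1,3,5,8}')
 , ('planes',     '{2,3,4,11}')
 , ('spaceships', '{3,4,6}');

-- Same LATERAL count as above, with the PK as tiebreaker.
SELECT i.name, i.user_ids_who_like, x.ct
FROM   items i
CROSS  JOIN LATERAL (
   SELECT count(*) AS ct
   FROM   unnest(i.user_ids_who_like) uid
   WHERE  uid = ANY('{3,4,11}'::int[])
   ) x
ORDER  BY x.ct DESC, i.name;
-- planes (3), spaceships (2), birds (1)
```

A row whose array shares no element with the input would still be returned, with ct = 0.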

`intarray`

Assuming integer arrays without NULL values or duplicates, the intersection operator & of the intarray module would be much simpler:

SELECT name, user_ids_who_like
     , array_length(user_ids_who_like & '{3,4,11}', 1) AS ct
FROM   items
ORDER  BY 3 DESC NULLS LAST;

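I added NULLS LAST to sort empty arrays last: array_length() returns NULL for an empty array, and NULL values sort first in descending order by default. This follows the reminder from your later question: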

How to get 0 as array_length() result when there are no elements

Install intarray once per database for this.
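A minimal sketch of the installation, followed by a quick check of the intersection operator. It assumes the contrib modules are available on the server and that your role is allowed to create extensions.

```sql
-- Run once per database.
CREATE EXTENSION IF NOT EXISTS intarray;

-- Quick check: intersection of two integer arrays.
SELECT '{1,3,5,8}'::int[] & '{3,4,11}'::int[] AS common;
-- {3}
```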

Use the overlap operator && in the WHERE clause to rule out rows without any overlap:

SELECT ...
FROM   ...
WHERE user_ids_who_like && '{3,4,11}'
ORDER  BY ...
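Filled in for the example table, the combined query might look like the sketch below. It assumes intarray is installed and that the arrays contain no NULL elements; the tiebreaker on name is an assumption.

```sql
SELECT name, user_ids_who_like
     , array_length(user_ids_who_like & '{3,4,11}', 1) AS ct
FROM   items
WHERE  user_ids_who_like && '{3,4,11}'  -- only rows with at least one common element
ORDER  BY ct DESC, name;                -- name as assumed tiebreaker
```

Since the WHERE clause already excludes rows with an empty intersection, array_length() should not return NULL here and NULLS LAST is no longer needed.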

Why? Per documentation:

intarray provides index support for the &&, @>, <@, and @@ operators, as well as regular array equality.

The same applies to the standard array operators in a similar fashion: a GIN index can support them. Details:

Can PostgreSQL index array columns?
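As a sketch, a GIN index on the array column could be created in one of two ways. The index names are made up for illustration, and which variant the planner can use depends on whether && resolves to the core operator or the intarray operator, so verify with EXPLAIN on your own data.

```sql
-- Option 1: default GIN operator class for arrays (no extension required).
CREATE INDEX items_likes_gin_idx
ON     items USING gin (user_ids_who_like);

-- Option 2: operator class shipped with intarray (integer arrays only).
CREATE INDEX items_likes_gin_int_idx
ON     items USING gin (user_ids_who_like gin__int_ops);
```

As far as I know, once intarray is installed an expression like int[] && int[] resolves to the intarray operator, which matches gin__int_ops rather than the default operator class, so option 2 is the likely candidate in that setup.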

Alternatively and more radically, a normalized schema with a separate table instead of the array column user_ids_who_like would occupy more disk space, but offer simple solutions with plain btree indexes for these problems.
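A minimal sketch of such a normalized design follows. The table name item_likes, its columns, and an integer key items.id are all hypothetical; the question only shows name and user_ids_who_like.

```sql
-- One row per (item, user) pair instead of an array column.
CREATE TABLE item_likes (
   item_id int NOT NULL,  -- would reference a hypothetical items.id
   user_id int NOT NULL,
   PRIMARY KEY (item_id, user_id)  -- also rules out duplicate likes
);

-- Plain btree index to find likes by user.
CREATE INDEX item_likes_user_id_idx ON item_likes (user_id, item_id);

-- Count common users per item, most matches first.
SELECT l.item_id, count(*) AS ct
FROM   item_likes l
WHERE  l.user_id = ANY ('{3,4,11}'::int[])
GROUP  BY l.item_id
ORDER  BY ct DESC, l.item_id;
```

Items without any matching user simply produce no row, so no extra filter is needed.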

Related Solutions

MySQL – Dealing with data stored as arrays in a MySQL DB

Just as I was about to post a comment that this looked like no serialization format I've ever seen, I had a lucky flash of insight. (What programming language would make it altogether too easy to stash an array in a string like that? Ah...)

I think you may find that what you're looking at is the output of the serialize() function in PHP... the companion function is unserialize().

I'm not aware of a mechanism for deserializing this natively in MySQL (but then again, I wasn't aware of the whole thing a few minutes ago)... but if I had to do it natively in MySQL, I would go to the source code of common_schema for ideas and helper functions like get_num_tokens(), which returns a count of the number of tokens found in a given string of delimited text. There is genuine outside-the-box genius lurking in common_schema.

Untwizzling scalar values into rows is not something easily done in MySQL, but prettify_message() in common_schema provides an example of how it is technically possible... by splitting the strings and writing what you find into a new table, it seems possible and maybe even borderline practical if you're not dealing with a massive data set. Figure that part out and you could theoretically even build a trigger to keep your more-properly-structured table synchronized whenever one of those serialized columns is updated.

Or you could write something in PHP to read from the database, deserialize the arrays, iterate them, and stuff a table, but doing it natively in MySQL would be much more fun and at least offers the potential to keep another table always-current when you need it without an external script.

PostgreSQL – Preserve Order of Array Elements After Join

The problem with your query is the join condition id = ANY(ancestors). Not only does it not preserve original order, it also eliminates duplicate elements in the array. (An id could match 10 elements in ancestors, it would still be picked once only.) Not sure if the logic of your query would allow duplicate elements, but if it does I am pretty sure you want to preserve all instances - you want to keep "original order" after all.
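The difference can be seen in a small self-contained sketch. The two-row nodes CTE and the literal array are made-up stand-ins, not the tables from the question.

```sql
-- Join via = ANY(): each matching id appears once, in no guaranteed order.
WITH nodes(id) AS (VALUES (1), (2))
SELECT n.id
FROM   nodes n
WHERE  n.id = ANY ('{2,1,2}'::int[]);
-- two rows: 1 and 2

-- Join via unnest() WITH ORDINALITY: duplicates and array order are kept.
WITH nodes(id) AS (VALUES (1), (2))
SELECT n.id, a.ord
FROM   unnest('{2,1,2}'::int[]) WITH ORDINALITY a(id, ord)
JOIN   nodes n USING (id)
ORDER  BY a.ord;
-- three rows in array order: (2,1), (1,2), (2,3)
```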

Assuming current Postgres 9.4+ for lack of information, I suggest a completely different approach:

SELECT n.entity_id, p.ancestors
FROM   tree t
JOIN   nodes n ON n.id = t.node_id
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.ancestors) WITH ORDINALITY a(id, ord)
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS ancestors
   ) p ON true;

Your query only works as intended if nodes.id is defined as primary key and nodes.entity_id is unique as well. Information is missing in the question.

Normally, this simpler query without explicit ORDER BY works as well, but there are no guarantees (Postgres 9.3+)...

SELECT n.entity_id, p.ancestors
FROM   tree t
JOIN   nodes n ON n.id = t.node_id
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.ancestors) id
      JOIN   entity.nodes p USING (id)
      ) AS ancestors
   ) p ON true;

You can make this safe as well. Detailed explanation:

PostgreSQL unnest() with element number

SQL Fiddle demo for Postgres 9.3.

Optional optimization

You join to entity.nodes twice - to substitute for node_id and ancestors alike. An alternative would be to fold both into one array or one set and join only once. Might be faster, but you have to test.
For these alternatives we need the ORDER BY in any case:

Add node_id to the ancestors array before we unnest ...

SELECT p.arr[1] AS entity_id, p.arr[2:2147483647] AS ancestors
FROM   tree t
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.node_id || t.ancestors) WITH ORDINALITY a(id, ord)
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS arr
   ) p ON true;

Or add node_id to the unnested elements of ancestors before we join ...

SELECT p.arr[1] AS entity_id, p.arr[2:2147483647] AS ancestors
FROM   tree t
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM  (
         SELECT t.node_id AS id, 0 AS ord
         UNION ALL
         SELECT * FROM unnest(t.ancestors) WITH ORDINALITY
         ) x
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS arr
   ) p ON true;

You did not show your CTE; this might be optimized further ...