PostgreSQL Array Sorting – How to Preserve Original Order of Elements in Unnested Array

array, parse, postgresql, sorting

Given the string:

'I think that PostgreSQL is nifty'

I would like to operate on the individual words found within that string. Essentially, I have a separate dictionary table from which I can get word details, and I would like to join an unnested array of that string to this dictionary.

So far I have:

select t.word, d.meaning, d.partofspeech
from unnest(string_to_array('I think that PostgreSQL is nifty',' ')) as t(word)
join dictionary d
on t.word = d.wordname;

This accomplishes the fundamentals of what I was hoping to do, but it does not preserve the original word order.

Related question:
PostgreSQL unnest() with element number

Best Answer

`WITH ORDINALITY` in Postgres 9.4 or later

The new feature simplifies this class of problems. The above query can now simply be:

SELECT *
FROM   regexp_split_to_table('I think Postgres is nifty', ' ') WITH ORDINALITY x(word, rn);

Or, applied to a table:

SELECT *
FROM   tbl t, regexp_split_to_table(t.my_column, ' ') WITH ORDINALITY x(word, rn);

Details:

PostgreSQL unnest() with element number

About the implicit LATERAL join:

What is the difference between LATERAL and a subquery in PostgreSQL?
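Applied to the question, a minimal sketch might look like this. It assumes the dictionary table and its columns (wordname, meaning, partofspeech) exactly as named in the question, and it needs Postgres 9.4 or later:

```sql
-- Split the string, number the words in their original position,
-- then join to the dictionary and sort by that position.
SELECT x.rn, x.word, d.meaning, d.partofspeech
FROM   unnest(string_to_array('I think that PostgreSQL is nifty', ' '))
          WITH ORDINALITY AS x(word, rn)
JOIN   dictionary d ON d.wordname = x.word
ORDER  BY x.rn;
```

The explicit ORDER BY x.rn is what returns the rows in word order; the ordinality column only records the position.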

Postgres 9.3 or older - and more general explanation

For a single string

You can apply the window function row_number() to remember the order of elements. However, with the usual row_number() OVER (ORDER BY col) you get numbers according to the sort order, not the original position in the string.

You could simply omit ORDER BY to get the position "as is":

SELECT *, row_number() OVER () AS rn
FROM   regexp_split_to_table('I think Postgres is nifty', ' ') AS x(word);

Performance of regexp_split_to_table() degrades with long strings. unnest(string_to_array(...)) scales better:

SELECT *, row_number() OVER () AS rn
FROM   unnest(string_to_array('I think Postgres is nifty', ' ')) AS x(word);

However, while this normally works and I have never seen it break in simple queries, Postgres asserts nothing as to the order of rows without an explicit ORDER BY.

To guarantee ordinal numbers of elements in the original string, use generate_subscripts() (improved with comment by @deszo):

SELECT arr[rn] AS word, rn
FROM   (
   SELECT *, generate_subscripts(arr, 1) AS rn
   FROM   string_to_array('I think Postgres is nifty', ' ') AS x(arr)
   ) y;

For a table of strings

Add PARTITION BY id to the OVER clause ...

Demo table:

CREATE TEMP TABLE strings(string text);
INSERT INTO strings VALUES
  ('I think Postgres is nifty')
 ,('And it keeps getting better');

I use ctid as an ad-hoc substitute for a primary key. If you have one (or any unique column), use that instead.

SELECT *, row_number() OVER (PARTITION BY ctid) AS rn
FROM  (
   SELECT ctid, unnest(string_to_array(string, ' ')) AS word
   FROM   strings
   ) x;

This works without any distinct ID:

SELECT arr[rn] AS word, rn
FROM  (
   SELECT *, generate_subscripts(arr, 1) AS rn
   FROM  (
      SELECT string_to_array(string, ' ') AS arr
      FROM   strings
      ) x
   ) y;

SQL Fiddle.

Answer to question

SELECT z.arr, z.rn, z.word, d.meaning   -- , partofspeech -- ?
FROM  (
   SELECT *, arr[rn] AS word
   FROM  (
      SELECT *, generate_subscripts(arr, 1) AS rn
      FROM  (
         SELECT string_to_array(string, ' ') AS arr
         FROM   strings
         ) x
      ) y
   ) z
JOIN   dictionary d ON d.wordname = z.word
ORDER  BY z.arr, z.rn;
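On Postgres 9.4 or later, the same join can probably be written more compactly with WITH ORDINALITY. A sketch, assuming the strings demo table above and the dictionary table from the question. The LEFT JOIN is an assumption on my part, to keep words that have no dictionary entry (the plain JOIN above drops them):

```sql
-- One row per word, numbered by its position within each string.
SELECT s.string, x.rn, x.word, d.meaning
FROM   strings s
CROSS  JOIN LATERAL unnest(string_to_array(s.string, ' '))
          WITH ORDINALITY AS x(word, rn)
LEFT   JOIN dictionary d ON d.wordname = x.word   -- meaning is NULL if the word is unknown
ORDER  BY s.string, x.rn;
```

If strings can repeat in the table, sort by a unique column instead of s.string to keep the words of each row together.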

Related Solutions

PostgreSQL – Preserve Order of Array Elements After Join

The problem with your query is the join condition id = ANY(ancestors). Not only does it not preserve original order, it also eliminates duplicate elements in the array. (An id could match 10 elements in ancestors, it would still be picked once only.) Not sure if the logic of your query would allow duplicate elements, but if it does I am pretty sure you want to preserve all instances - you want to keep "original order" after all.

Assuming current Postgres 9.4+ for lack of information, I suggest a completely different approach:

SELECT n.entity_id, p.ancestors
FROM   tree t
JOIN   nodes n ON n.id = t.node_id
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.ancestors) WITH ORDINALITY a(id, ord)
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS ancestors
   ) p ON true;

Your query only works as intended if nodes.id is defined as primary key and nodes.entity_id is unique as well. Information is missing in the question.

Normally, this simpler query without explicit ORDER BY works as well, but there are no guarantees (Postgres 9.3+)...

SELECT n.entity_id, p.ancestors
FROM   tree t
JOIN   nodes n ON n.id = t.node_id
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.ancestors) id
      JOIN   entity.nodes p USING (id)
      ) AS ancestors
   ) p ON true;

You can make this safe as well. Detailed explanation:

PostgreSQL unnest() with element number

SQL Fiddle demo for Postgres 9.3.

Optional optimization

You join to entity.nodes twice - to substitute for node_id and ancestors alike. An alternative would be to fold both into one array or one set and join only once. Might be faster, but you have to test.
For these alternatives we need the ORDER BY in any case:

Add node_id to the ancestors array before we unnest ...

SELECT p.arr[1] AS entity_id, p.arr[2:2147483647] AS ancestors
FROM   tree t
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM   unnest(t.node_id || t.ancestors) WITH ORDINALITY a(id, ord)
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS arr
   ) p ON true;

Or add node_id to the unnested elements of ancestors before we join ...

SELECT p.arr[1] AS entity_id, p.arr[2:2147483647] AS ancestors
FROM   tree t
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT p.entity_id
      FROM  (
         SELECT t.node_id AS id, 0 AS ord
         UNION ALL
         SELECT * FROM unnest(t.ancestors) WITH ORDINALITY
         ) x
      JOIN   entity.nodes p USING (id)
      ORDER  BY ord
      ) AS arr
   ) p ON true;

You did not show your CTE, this might be optimized further ...

Order Results by Count of Common Array Elements in PostgreSQL

With tools of the basic Postgres installation only, you might unnest() and count in a LATERAL subquery:

SELECT i.name, i.user_ids_who_like, x.ct
FROM   items i
     , LATERAL (
   SELECT count(*) AS ct
   FROM   unnest(i.user_ids_who_like) uid
   WHERE  uid = ANY('{3,4,11}'::int[])
   ) x
ORDER  BY x.ct DESC;  -- add PK as tiebreaker for stable sort order

We don't need a LEFT JOIN to preserve rows without match because count() always returns a row - 0 for "no match".

`intarray`

Assuming integer arrays without NULL values or duplicates, the intersection operator & of the intarray module would be much simpler:

SELECT name, user_ids_who_like
     , array_length(user_ids_who_like & '{3,4,11}', 1) AS ct
FROM   items
ORDER  BY 3 DESC NULLS LAST;

I added NULLS LAST to sort empty arrays last - after the reminder from your later question:

How to get 0 as array_length() result when there are no elements

Install intarray once per database for this.
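For reference, installing the module typically comes down to one statement, run in each database that should use it (this needs sufficient privileges and the contrib package installed on the server):

```sql
-- Install the additional module intarray in the current database.
CREATE EXTENSION IF NOT EXISTS intarray;
```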

Use the overlap operator && in the WHERE clause to rule out rows without any overlap:

SELECT ...
FROM   ...
WHERE user_ids_who_like && '{3,4,11}'
ORDER  BY ...

Why? Per documentation:

intarray provides index support for the &&, @>, <@, and @@ operators, as well as regular array equality.

Applies to standard array operators in a similar fashion. Details:

Can PostgreSQL index array columns?
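A rough sketch of such indexes, assuming the items table from the queries above. The index names are made up for illustration, and you would normally create only one of the two, depending on which operators your queries use:

```sql
-- Standard array operators (&&, @>, <@): GIN index with the default operator class.
CREATE INDEX items_likes_gin_idx
ON     items USING gin (user_ids_who_like);

-- With intarray installed: GIN index using the operator class provided by the module.
CREATE INDEX items_likes_intarray_idx
ON     items USING gin (user_ids_who_like gin__int_ops);
```

Test with EXPLAIN to see which index your query actually uses.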

Alternatively and more radically, a normalized schema with a separate table instead of the array column user_ids_who_like would occupy more disk space, but offer simple solutions with plain btree indexes for these problems.
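A sketch of what that normalized layout could look like. The table item_likes, its columns, and the primary key items.item_id are hypothetical names, since the question does not show the schema:

```sql
-- One row per (item, user) pair instead of an array column.
CREATE TABLE item_likes (
   item_id int NOT NULL,   -- assumed to reference items.item_id
   user_id int NOT NULL,
   PRIMARY KEY (item_id, user_id)
);

-- Plain btree index to find likes by user.
CREATE INDEX item_likes_user_id_idx ON item_likes (user_id);

-- Count matching users per item; items without a match get 0.
SELECT i.name, count(l.user_id) AS ct
FROM   items i
LEFT   JOIN item_likes l ON l.item_id = i.item_id
                        AND l.user_id IN (3, 4, 11)
GROUP  BY i.item_id, i.name
ORDER  BY ct DESC;
```

The LEFT JOIN with the filter in the join condition keeps items that nobody in the list likes; count(l.user_id) ignores the NULLs produced for those rows.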