PostgreSQL Array Sorting – How to Preserve Original Order of Elements in Unnested Array

arrayparsepostgresqlsorting

Given the string:

'I think that PostgreSQL is nifty'

I would like to operate on the individual words found within that string. Essentially, I have a separate from which I can get word details and would like to join an unnested array of that string on this dictionary.

So far I have:

select word, meaning, partofspeech
from unnest(string_to_array('I think that PostgreSQL is nifty',' ')) as word
from table t
join dictionary d
on t.word = d.wordname;

This accomplishes the fundamentals of what I was hoping to do, but it does not preserve the original word order.

Related question:
PostgreSQL unnest() with element number

Best Answer

WITH ORDINALITY in Postgres 9.4 or later

The new feature simplifies this class of problems. The above query can now simply be:

SELECT *
FROM   regexp_split_to_table('I think Postgres is nifty', ' ') WITH ORDINALITY x(word, rn);

Or, applied to a table:

SELECT *
FROM   tbl t, regexp_split_to_table(t.my_column, ' ') WITH ORDINALITY x(word, rn);

Details:

About the implicit LATERAL join:

Postgres 9.3 or older - and more general explanation

For a single string

You can apply the window function row_number() to remember the order of elements. However, with the usual row_number() OVER (ORDER BY col) you get numbers according to the sort order, not the original position in the string.

You could simply omit ORDER BY to get the position "as is":

SELECT *, row_number() OVER () AS rn
FROM   regexp_split_to_table('I think Postgres is nifty', ' ') AS x(word);

Performance of regexp_split_to_table() degrades with long strings. unnest(string_to_array(...)) scales better:

SELECT *, row_number() OVER () AS rn
FROM   unnest(string_to_array('I think Postgres is nifty', ' ')) AS x(word);

However, while this normally works and I have never seen it break in simple queries, Postgres asserts nothing as to the order of rows without an explicit ORDER BY.

To guarantee ordinal numbers of elements in the original string, use generate_subscript() (improved with comment by @deszo):

SELECT arr[rn] AS word, rn
FROM   (
   SELECT *, generate_subscripts(arr, 1) AS rn
   FROM   string_to_array('I think Postgres is nifty', ' ') AS x(arr)
   ) y;

For a table of strings

Add PARTITION BY id to the OVER clause ...

Demo table:

CREATE TEMP TABLE strings(string text);
INSERT INTO strings VALUES
  ('I think Postgres is nifty')
 ,('And it keeps getting better');

I use ctid as ad-hoc substitute for a primary key. If you have one (or any unique column) use that instead.

SELECT *, row_number() OVER (PARTITION BY ctid) AS rn
FROM  (
   SELECT ctid, unnest(string_to_array(string, ' ')) AS word
   FROM   strings
   ) x;

This works without any distinct ID:

SELECT arr[rn] AS word, rn
FROM  (
   SELECT *, generate_subscripts(arr, 1) AS rn
   FROM  (
      SELECT string_to_array(string, ' ') AS arr
      FROM   strings
      ) x
   ) y;

SQL Fiddle.

Answer to question

SELECT z.arr, z.rn, z.word, d.meaning   -- , partofspeech -- ?
FROM  (
   SELECT *, arr[rn] AS word
   FROM  (
      SELECT *, generate_subscripts(arr, 1) AS rn
      FROM  (
         SELECT string_to_array(string, ' ') AS arr
         FROM   strings
         ) x
      ) y
   ) z
JOIN   dictionary d ON d.wordname = z.word
ORDER  BY z.arr, z.rn;