Compare Words in String Without Considering Positions in PostgreSQL

postgresqlpostgresql-9.6string manipulation

In Postgres 9.6 I want to test whether two strings like these are considered the same:

'this is a test number 01', 'number this is 01 a test'

So I have created this function:

CREATE OR REPLACE FUNCTION sort_text(a text) RETURNS text AS $$
    declare t1 text;
    BEGIN
      select(array_to_string (
        array(
          select * from unnest(string_to_array(a, ' ')) order by 1), ' ')) into t1;
      RETURN t1;
    END;

$$ LANGUAGE plpgsql;

select (sort_text('this is a test number 01') = sort_text('number this is 01 a test'));

which actually looks to be working correctly.

I was wondering, is there any better way to do this?


'this and this' and 'and this' are considered to be different.

All the strings are already stripped out (spaces and punctuation) and duplication is not a problem. String length 50 characters as max estimation.

Best Answer

I suggest a single SELECT in a plain SQL function:

CREATE OR REPLACE FUNCTION strings_equivalent(a text, b text)
  RETURNS bool AS
$func$
   SELECT a1 = b1
   FROM  (
      SELECT string_agg(w, ' ') AS a1
      FROM  (
         SELECT w
         FROM   unnest(string_to_array(a, ' ')) w
         ORDER  BY w
         ) a1
      ) a2
   , (
      SELECT string_agg(w, ' ') AS b1
      FROM  (
         SELECT w
         FROM   unnest(string_to_array(b, ' ')) w
         ORDER  BY w
         ) b1
      ) b2
   WHERE  length(a) = length(b)
   UNION ALL
   SELECT FALSE
   LIMIT 1;  -- for clarity, not needed
$func$ LANGUAGE sql IMMUTABLE;

Call:

SELECT strings_equivalent('this is a test number 01', 'number this is 01 a test');

This is assuming:

  • Duplicate words count like any other words
  • Separator is a single space
  • No leading or trailing spaces

The function returns NULL for any empty string or NULL input.

The UNION ALL construct is a shortcut to return FALSE immediately if the input strings don't have the same length and avoid more expensive processing. Related:

LIMIT 1 is not needed because the function only returns the first column of the first row anyway and ignores the rest if there are more rows.

IMMUTABLE (since the result never changes for the same input) helps performance with repeated evaluation and allows indexes on functional expressions.

You could use regexp_split_to_table(a, ' ') instead of unnest(string_to_array(a, ' ')), but regular expression functions are typically more expensive. (You can cover more sophisticated separator characters with the regex, though, like '\s+' for any white space). Related:


BTW, your simple function sort_text() looks good. But use string_agg() instead of array_to_string(ARRAY(...)) in a simple SQL function. No variable assignment needed:

CREATE OR REPLACE FUNCTION sort_text(a text)
  RETURNS text AS
$$
   SELECT string_agg(w, ' ')   
   FROM  (
      SELECT w
      FROM   unnest(string_to_array(a, ' ')) w
      ORDER  BY 1
      ) sub
$$ LANGUAGE sql IMMUTABLE;