Postgresql – Longest matching suffix while using aggregate functions

aggregatelimitspostgresqlxml

Background

Looking to find the longest matching string suffix.

Setup

Consider the following fiddle:

CREATE TABLE noun
    ("label" varchar(10))
;

INSERT INTO noun
    ("label")
VALUES
    ('bar'),
    ('blue bar'),
    ('red bar'),
    ('green bar'),
    ('purple bar'),
    ('handlebar')
;

CREATE TABLE noun_inflection
    ("label_singular" varchar(9), "label_plural" varchar(9))
;

INSERT INTO noun_inflection
    ("label_singular", "label_plural")
VALUES
    ('bar', 'bars'),
    ('handlebar', 'handlebar')
;

And the following query:

select * from noun n, noun_inflection ni
where
  n.label = 'handlebar' and
  n.label ilike '%'||ni.label_singular;

This returns two rows:

LABEL       | LABEL_SINGULAR | LABEL_PLURAL
------------+----------------+-------------
handlebar   | bar            | bars
handlebar   | handlebar      | handlebar

The first row is correct, but not desired. For this particular purpose, the Levenshtein distance can be used to eliminate the duplicate:

select * from noun n, noun_inflection ni
where
  n.label = 'handlebar' and
  n.label ilike '%'||ni.label_singular
order by
  levenshtein( n.label, ni.label_singular )
limit 1;

This re-orders the rows based on the label similarity. In this example, "handlebar" exactly matches "handlebar", and has a distance of 0. Adding the limit 1 restricts the query to a single row.

Problem

The setup works, except PostgreSQL 9.1 does not respect LIMIT modifiers in aggregate functions. That is, the following doesn't work:

SELECT
  xmlagg( xmlement( ... ) ORDER BY levenshtein( ... ) LIMIT 1 )
FROM
  noun n, noun_inflection ni

The problem remains. The word 'handlebar' matches '%bar' and '%handlebar', and so this returns two rows, which, in turn, injects two xmlelements into the resulting XML document when only one is expected.

Update #1

To clarify:

select
  xmlagg(
    xmlelement(
      name "noun",
      trim( TRAILING label_singular FROM n.label ) || ni.label_plural
    )
  )
from
  noun n, noun_inflection ni
where
  n.label = 'handlebar' and
  n.label ilike '%'||ni.label_singular;

That should return a single 'handlebar' XML element. Currently, it returns 'handlebars' and 'handlebar':

{ "Value": "<noun>handlebars</noun><noun>handlebar</noun>", "Type": "xml" }

The desired output is:

{ "Value": "<noun>handlebar</noun>", "Type": "xml" }

Update #2

Even though the following code solves the handlebar/handlebars problem, it prevents multiple different nouns from being returned:

select
  xmlagg(
    xmlelement(
      name "noun",
      trim( TRAILING label_singular FROM n.label ) || ni.label_plural
    )
  )
from
  noun n, noun_inflection ni
where
  n.label = 'handlebar' and
  n.label ilike '%'||ni.label_singular
group by n.label, ni.label_singular
order by levenshtein( n.label, ni.label_singular )
limit 1

Update #3

This looks like it will require a stored function. Something along the lines of:

  SELECT
    trim( TRAILING label_singular FROM p_noun ) || ni.label_plural
  FROM
    noun_inflection ni
  WHERE
    p_noun ILIKE '%'||ni.label_singular
  ORDER BY
    levenshtein( p_noun, ni.label_singular )
  LIMIT 1;

Question

How would you match and return only the longest substring?

Best Answer

What is wrong with the (maybe too obvious?):

select * from noun n, noun_inflection ni
where
  n.label = 'handlebar' and
  n.label ilike '%'||ni.label_singular
order by
  char_length(ni.label_singular) DESC
limit 1;