Postgresql – Algorithm for finding the longest prefix

pattern matchingperformancepostgresqlpostgresql-9.1query-performance

I have two tables.

First one is a table with prefixes

code name price
343  ek1   10
3435 nt     4
3432 ek2    2

Second is call records with phone numbers

number        time
834353212     10
834321242     20
834312345     30

I need write a script which find longest prefix from prefixes for each record, and write all this data to third table, like this:

 number        code   ....
 834353212     3435
 834321242     3432
 834312345     343

For number 834353212 we must trim '8', and then find the longest code from prefix table, its 3435.
We must always drop first '8' and prefix must be in the beginning.

I solved this task long time ago, with very bad way.
Its was terrible perl script which do a lot of queries for each record.
This script:

Take a number from calls table, do substring from length(number) to
1 => $prefix in the loop
Do the query : select count(*) from prefixes where code like '$prefix'
If count>0 then take first prefixes and write into table

First problem is query counts – it's call_records * length(number).
Second problem is LIKE expressions. I am afraid those are slow.

I tried to solve the second problem by:

CREATE EXTENSION pg_trgm;
CREATE INDEX prefix_idx ON prefix USING gist (code gist_trgm_ops);

That speeds up each query, but did not solve problem in general.

I have 20k prefixes and 170k numbers now, and my old solution is bad. Looks like I need some new solution without loops.

Only one query for each call record or something like this.

Best Answer

I am assuming data type text for the relevant columns.

CREATE TABLE prefix (code text, name text, price int);
CREATE TABLE num (number text, time int);

"Simple" Solution

SELECT DISTINCT ON (1)
       n.number, p.code
FROM   num n
JOIN   prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER  BY n.number, p.code DESC;

Key elements:

DISTINCT ON is a Postgres extension of the SQL standard DISTINCT. Find a detailed explanation for the used query technique in this related answer on SO.
ORDER BY p.code DESC picks the longest match, because '1234' sorts after '123' (in ascending order).

Simple SQL Fiddle.

Without index, the query would run for a very long time (didn't wait to see it finish). To make this fast, you need index support. The trigram indexes you mentioned, supplied by the additional module pg_trgm are a good candidate. You have to choose between GIN and GiST index. The first character of the numbers is just noise and can be excluded from the index, making it a functional index in addition.
In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):

CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);

Advanced dbfiddle here.

All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes:

Total runtime: 1719.552 ms (trigram GiST)
Total runtime: 912.329 ms (trigram GIN)

Much faster yet

Failed attempt with `text_pattern_ops`

Once we ignore the distracting first noise character, it comes down to basic left anchored pattern match. Therefore I tried a functional B-tree index with the operator class text_pattern_ops (assuming column type text).

CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);

This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:

SELECT * FROM num WHERE right(number, -1) LIKE '2345%'

Total runtime: 3.816 ms (trgm_gin_idx)
Total runtime: 0.147 ms (text_pattern_idx)

However, the query planner will not consider this index for joining two tables. I have seen this limitation before. I don't have a meaningful explanation for this, yet.

Partial / functional B-tree indexes

The alternative it to use equality checks on partial strings with partial indexes. This can be used in a JOIN.

Since we typically only have a limited number of different lengths for prefixes, we can build a solution similar to the one presented here with partial indexes.

Say, we have prefixes ranging from 1 to 5 characters. Create a number of partial functional indexes, one for every distinct prefix length:

CREATE INDEX prefix_code_idx5 ON prefix(code) WHERE length(code) = 5;
CREATE INDEX prefix_code_idx4 ON prefix(code) WHERE length(code) = 4;
CREATE INDEX prefix_code_idx3 ON prefix(code) WHERE length(code) = 3;
CREATE INDEX prefix_code_idx2 ON prefix(code) WHERE length(code) = 2;
CREATE INDEX prefix_code_idx1 ON prefix(code) WHERE length(code) = 1;

Since these are partial indexes, all of them together are barely larger than a single complete index.

Add matching indexes for numbers (taking the leading noise character into account):

CREATE INDEX num_number_idx5 ON num(substring(number, 2, 5)) WHERE length(number) >= 6;
CREATE INDEX num_number_idx4 ON num(substring(number, 2, 4)) WHERE length(number) >= 5;
CREATE INDEX num_number_idx3 ON num(substring(number, 2, 3)) WHERE length(number) >= 4;
CREATE INDEX num_number_idx2 ON num(substring(number, 2, 2)) WHERE length(number) >= 3;
CREATE INDEX num_number_idx1 ON num(substring(number, 2, 1)) WHERE length(number) >= 2;

While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.

If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...

If numbers are never shorter then n characters, drop redundant WHERE clauses from some or all, and also drop the corresponding WHERE clause from all following queries.

Recursive CTE

With all the setup so far I was hoping for very elegant solution with a recursive CTE:

WITH RECURSIVE cte AS (
   SELECT n.number, p.code, 4 AS len
   FROM   num n
   LEFT    JOIN prefix p
            ON  substring(number, 2, 5) = p.code
            AND length(n.number) >= 6  -- incl. noise character
            AND length(p.code) = 5

   UNION ALL 
   SELECT c.number, p.code, len - 1
   FROM    cte c
   LEFT   JOIN prefix p
            ON  substring(number, 2, c.len) = p.code
            AND length(c.number) >= c.len+1  -- incl. noise character
            AND length(p.code) = c.len
   WHERE    c.len > 0
   AND    c.code IS NULL
   )
SELECT number, code
FROM   cte
WHERE  code IS NOT NULL;

Total runtime: 1045.115 ms

However, while this query isn't bad - it performs about as good as the simple version with a trigram GIN index - it doesn't deliver what I was aiming for. The recursive term is planned once only, so it can't use the best indexes. Only the non-recursive term can.

UNION ALL

Since we are dealing with a small number of recursions we can just spell them out iteratively. This allows optimized plans for each of them. (We lose the recursive exclusion of already successful numbers, though. So there is still some room for improvement, especially for a wider range of prefix lengths)):

SELECT DISTINCT ON (1) number, code
FROM  (
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 5) = p.code
            AND length(n.number) >= 6  -- incl. noise character
            AND length(p.code) = 5
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 4) = p.code
            AND length(n.number) >= 5
            AND length(p.code) = 4
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 3) = p.code
            AND length(n.number) >= 4
            AND length(p.code) = 3
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 2) = p.code
            AND length(n.number) >= 3
            AND length(p.code) = 2
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 1) = p.code
            AND length(n.number) >= 2
            AND length(p.code) = 1
   ) x
ORDER BY number, code DESC;

Total runtime: 57.578 ms (!!)

A breakthrough, finally!

SQL function

Wrapping this into an SQL function removes the query planning overhead for repeated use:

CREATE OR REPLACE FUNCTION f_longest_prefix()
  RETURNS TABLE (number text, code text) LANGUAGE sql AS
$func$
SELECT DISTINCT ON (1) number, code
FROM  (
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 5) = p.code
            AND length(n.number) >= 6  -- incl. noise character
            AND length(p.code) = 5
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 4) = p.code
            AND length(n.number) >= 5
            AND length(p.code) = 4
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 3) = p.code
            AND length(n.number) >= 4
            AND length(p.code) = 3
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 2) = p.code
            AND length(n.number) >= 3
            AND length(p.code) = 2
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 1) = p.code
            AND length(n.number) >= 2
            AND length(p.code) = 1
   ) x
ORDER BY number, code DESC
$func$;

Call:

SELECT * FROM f_longest_prefix_sql();

Total runtime: 17.138 ms (!!!)

PL/pgSQL function with dynamic SQL

This plpgsql function is much like the recursive CTE above, but the dynamic SQL with EXECUTE forces the query to be re-planned for every iteration. Now it makes use of all the tailored indexes.

Additionally this works for any range of prefix lengths. The function takes two parameters for the range, but I prepared it with DEFAULT values, so it works without explicit parameters, too:

CREATE OR REPLACE FUNCTION f_longest_prefix2(_min int = 1, _max int = 5)
  RETURNS TABLE (number text, code text) LANGUAGE plpgsql AS
$func$
BEGIN
FOR i IN REVERSE _max .. _min LOOP  -- longer matches first
   RETURN QUERY EXECUTE '
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(n.number, 2, $1) = p.code
            AND length(n.number) >= $1+1  -- incl. noise character
            AND length(p.code) = $1'
   USING i;
END LOOP;
END
$func$;

The final step cannot be wrapped into the one function easily. Either just call it like this:

SELECT DISTINCT ON (1)
       number, code
FROM   f_longest_prefix_prefix2() x
ORDER  BY number, code DESC;

Total runtime: 27.413 ms

Or use another SQL function as wrapper:

CREATE OR REPLACE FUNCTION f_longest_prefix3(_min int = 1, _max int = 5)
  RETURNS TABLE (number text, code text) LANGUAGE sql AS
$func$
SELECT DISTINCT ON (1)
       number, code
FROM   f_longest_prefix_prefix2($1, $2) x
ORDER  BY number, code DESC
$func$;

Call:

SELECT * FROM f_longest_prefix3();

Total runtime: 37.622 ms

A bit slower due to required planning overhead. But more versatile than SQL and shorter for longer prefixes.

Simple solution

With DISTINCT ON in Postgres:

SELECT DISTINCT ON (i.good, i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good, i.the_date, p.the_date DESC;

Returned rows are ordered. See:

Select first row in each GROUP BY group?

Or with NOT EXISTS in standard SQL (works with every RDBMS I know):

SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM   inventory  i
LEFT   JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE  NOT EXISTS (
   SELECT FROM price p1
   WHERE  p1.good = p.good
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );

Same result, but with arbitrary sort order - unless you add ORDER BY.
Depending on data distribution, exact requirements and indices, either one of these may be faster. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

With only few rows per good, DISTINCT ON is typically faster and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster, yet. See below.

Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower, yet. (CTEs improved with Postgres 12.)

Plain views (like proposed by another answer) do not help performance at all in Postgres.

db<>fiddle here
_{Old sqlfiddle}

Proper solution

Strings and collation

First of all, your table layout is a sub-optimal. It may seem trivial, but normalizing your schema can go a long way.

Sorting by character types (text, varchar, ...) is done according to current COLLATION. Typically, your DB would use some local set of rules, like in my case: de_AT.UTF-8. Find out with:

SHOW lc_collate;

This makes sorting and index look-ups slower. The longer your strings (names of goods) the worse. If you do not actually care for collation rules in your output (or the sort order), this can be faster with COLLATE "C":

SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
       i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM   inventory  i
LEFT   JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER  BY i.good COLLATE "C", i.the_date, p.the_date DESC;

Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').

Index

If your query is supposed to use an index, columns with character data have to use a matching collation (good in the example):

CREATE INDEX inventory_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);

Read the last two chapters of the related answer I linked above.

You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.

Normalize

Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:

CREATE TABLE good (
  good_id serial PRIMARY KEY
, good    text   NOT NULL
);

CREATE TABLE inventory (
  good_id  int  REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int  NOT NULL
, PRIMARY KEY(good_id, the_date)
);

CREATE TABLE price (
  good_id  int     REFERENCES good (good_id)
, the_date date    NOT NULL
, price    numeric NOT NULL
, PRIMARY KEY(good_id, the_date));

The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on price with descending order on the second column may improve performance:

CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);

Again, the collation must match your query (see above).

Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.

These resulting queries are much faster:

DISTINCT ON

SELECT DISTINCT ON (i.the_date)
       i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER  BY i.the_date, p.the_date DESC;

NOT EXISTS

SELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM   inventory  i
JOIN   good       g USING (good_id)
LEFT   JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
AND    NOT EXISTS (
   SELECT 1 FROM price p1
   WHERE  p1.good_id = p.good_id
   AND    p1.the_date <= i.the_date
   AND    p1.the_date >  p.the_date
   );

db<>fiddle here
_{OLD sqliddle}

Faster solutions

If that still is not fast enough, there may be faster solutions.

Recursive CTE / `JOIN LATERAL` / correlated subquery

Especially for data distributions with many prices per good:

Optimize GROUP BY query to retrieve latest record per user

Materialized view

If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume, that prices and inventories for past dates rarely change. Compute the result once and store a snapshot as materialized view.

Postgres 9.3+ has automated support for materialized views. You can easily implement a basic version in older versions.

Postgresql – Longest matching suffix while using aggregate functions

What is wrong with the (maybe too obvious?):

select * from noun n, noun_inflection ni
where
  n.label = 'handlebar' and
  n.label ilike '%'||ni.label_singular
order by
  char_length(ni.label_singular) DESC
limit 1;