Postgresql – Postgres, levenshtein distance and nested queries

Tags: explain, postgresql, subquery

I have a DB table with about 3,000,000 rows, shaped like this:

| ID | T4 | T16 | T64 | TL |

where ID is an INTEGER (primary key) and T4, T16, T64, TL are text fields. Each one is longer than the previous, i.e. LENGTH(T4) < LENGTH(T16) < LENGTH(T64) < LENGTH(TL).

I want to compare one row of this table against all other rows with a self join, based on the levenshtein distance (fuzzystrmatch extension) between the TL fields of the compared rows, and get the best-k results, i.e., those with the smallest levenshtein distance.
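For reference, `levenshtein_less_equal(source, target, max_d)` from fuzzystrmatch is the accelerated variant for when only small distances matter: per the documentation it returns the correct distance when it is at most `max_d`, and otherwise may stop early and return some value greater than `max_d`. A minimal illustration (the string pair is just an example):

```sql
-- Requires the fuzzystrmatch extension:
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Actual distance is returned when it lies within the cap:
SELECT levenshtein('kitten', 'sitting');                -- 3
SELECT levenshtein_less_equal('kitten', 'sitting', 5);  -- 3

-- When the real distance exceeds max_d, the function can stop
-- early and return some value greater than max_d, which is what
-- makes it cheaper than plain levenshtein():
SELECT levenshtein_less_equal('kitten', 'sitting', 2);  -- some value > 2
```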

SELECT n1.*, n2.*
FROM  (SELECT * FROM dbtable WHERE id = 110) n1,
      dbtable n2
ORDER BY levenshtein_less_equal(n1.tl, n2.tl, LENGTH(n1.tl)/4)
LIMIT 100;

Of course this would be extremely slow because it will have to do 3,000,000 levenshtein calculations.

In order to accelerate queries, I created the artificial "hashes" (fields T4, T16, T64). So I first check the levenshtein distance between the T4 fields. If that distance is within some limit, I check the distance between the T16 fields, and so on. This way, each step needs fewer levenshtein calculations than the naive approach.

The nested query is like that:

SELECT n5.*, levenshtein_less_equal(n5.n1_tl, n5.n2_tl, LENGTH(n5.n1_tl)/4)
FROM (
    SELECT n4.*
    FROM (
        SELECT n3.*
        FROM (
            SELECT n1.id  AS n1_id,
                   n1.t4  AS n1_t4,
                   n1.t16 AS n1_t16,
                   n1.t64 AS n1_t64,
                   n1.tl  AS n1_tl,
                   n2.id  AS n2_id,
                   n2.t4  AS n2_t4,
                   n2.t16 AS n2_t16,
                   n2.t64 AS n2_t64,
                   n2.tl  AS n2_tl
            FROM  (SELECT id, t4, t16, t64, tl FROM dbTable WHERE id = 110) AS n1,
                  dbTable AS n2
            WHERE n1.id <> n2.id
            AND   levenshtein_less_equal(n1.t4, n2.t4, LENGTH(n1.t4)) <= LENGTH(n1.t4)/2
        ) AS n3
        WHERE levenshtein_less_equal(n3.n1_t16, n3.n2_t16, LENGTH(n3.n1_t16)/2) < LENGTH(n3.n1_t16)/2
    ) AS n4
    WHERE levenshtein_less_equal(n4.n1_t64, n4.n2_t64, LENGTH(n4.n1_t64)/2) < LENGTH(n4.n1_t64)/2
) AS n5
ORDER BY levenshtein_less_equal(n5.n1_tl, n5.n2_tl, LENGTH(n5.n1_tl)/4)
LIMIT 100;

This query should be much faster than the naive one, since it only needs to compute the final levenshtein distance over 112,851 rows of TL. But:

The EXPLAIN PLAN:

Limit  (cost=872281.90..872282.15 rows=100 width=1168)
  ->  Sort  (cost=872281.90..872564.03 rows=112851 width=1168)
        Sort Key: (levenshtein_less_equal(dbTable.tl, n2.tl, (length(dbTable.tl) / 4)))
        ->  Nested Loop  (cost=0.00..867968.81 rows=112851 width=1168)
              Join Filter: ((dbTable.id <> n2.id) AND (levenshtein_less_equal(dbTable.t4, n2.t4, length(dbTable.t4)) <= (length(dbTable.t4) / 2)) AND (levenshtein_less_equal(dbTable.t16, n2.t16, (length(dbTable.t16) / 2)) < (length(dbTable.t16) / 2)) AND (levenshtein_less_equal(dbTable.t64, n2.t64, (length(dbTable.t64) / 2)) < (length(dbTable.t64) / 2)))
              ->  Index Scan using dbTable_pkey on dbTable  (cost=0.00..8.60 rows=1 width=584)
                    Index Cond: (id = 110)
              ->  Seq Scan on dbTable n2  (cost=0.00..699529.82 rows=3046982 width=584) 

As you can see, the problem is that the query optimizer collapses the levenshtein calculations on T4, T16 and T64 into a single join filter. Without the collapsing it would probably be faster, because far fewer levenshtein distances would need to be computed on T64. I tried SET from_collapse_limit = 1; in the same session to avoid the collapsing, but nothing changed. Is there a way to enforce this hierarchical evaluation in a Postgres query? Keep in mind that none of T4…TL are indexed, because I do not think there is an index that can accelerate levenshtein distances. Any suggestions to further improve performance?
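One common way to enforce staged evaluation is to put an optimization fence between the stages. Before PostgreSQL 12 every CTE is such a fence; from version 12 on you can request it explicitly with `AS MATERIALIZED` (an `OFFSET 0` inside a subquery has a similar fencing effect). A sketch against the schema above, not a tested solution:

```sql
-- MATERIALIZED forces each stage to be computed before the next one
-- sees it (the keyword requires PostgreSQL 12+; older versions
-- materialize every CTE unconditionally).
WITH ref AS MATERIALIZED (
    SELECT id, t4, t16, t64, tl FROM dbtable WHERE id = 110
), s4 AS MATERIALIZED (   -- stage 1: cheap filter on T4
    SELECT n2.*
    FROM   ref, dbtable n2
    WHERE  n2.id <> ref.id
    AND    levenshtein_less_equal(ref.t4, n2.t4, LENGTH(ref.t4)) <= LENGTH(ref.t4)/2
), s16 AS MATERIALIZED (  -- stage 2: filter the survivors on T16
    SELECT s4.*
    FROM   ref, s4
    WHERE  levenshtein_less_equal(ref.t16, s4.t16, LENGTH(ref.t16)/2) < LENGTH(ref.t16)/2
), s64 AS MATERIALIZED (  -- stage 3: filter on T64
    SELECT s16.*
    FROM   ref, s16
    WHERE  levenshtein_less_equal(ref.t64, s16.t64, LENGTH(ref.t64)/2) < LENGTH(ref.t64)/2
)
SELECT s64.*, levenshtein_less_equal(ref.tl, s64.tl, LENGTH(ref.tl)/4) AS dist
FROM   ref, s64
ORDER  BY dist
LIMIT  100;
```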

Best Answer

I suggest using a trigram index, provided by the additional module pg_trgm. Combine that with the length of the string to get a valid pre-selection.

  1. Drop the columns T4, T16 and T64 (faster in a single statement), then run VACUUM FULL or CLUSTER to reclaim the space.

  2. Install the additional module pg_trgm.

  3. Create a GiST index on tl and length(tl).

There are a few details to observe; see related answers on dba.SE involving pg_trgm.
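The steps above might be sketched like this. Note the assumptions: the index needs btree_gist so the integer `length(tl)` column can share the GiST index, the index name is made up, and both the length band of ±10 characters and the default similarity threshold are placeholders to tune:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- GiST opclass for the integer column

-- Multicolumn GiST index: trigram similarity on tl plus its length.
CREATE INDEX dbtable_tl_trgm_idx ON dbtable
USING  gist (tl gist_trgm_ops, length(tl));

-- Pre-select candidates with the index, then rank the few survivors
-- by levenshtein distance.  The % operator applies
-- pg_trgm.similarity_threshold (default 0.3); the length band of
-- +/- 10 characters is an arbitrary assumption.
SELECT n2.*, levenshtein_less_equal(n1.tl, n2.tl, LENGTH(n1.tl)/4) AS dist
FROM   (SELECT id, tl FROM dbtable WHERE id = 110) n1
JOIN   dbtable n2 ON n2.tl % n1.tl
WHERE  n2.id <> n1.id
AND    length(n2.tl) BETWEEN length(n1.tl) - 10 AND length(n1.tl) + 10
ORDER  BY dist
LIMIT  100;
```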