Postgresql – Postgres pg_trgm JOIN multiple columns with large tables (~50 million rows)

Tags: full-text-search; index; join; postgresql

I'm trying to run a query on Postgres, using the pg_trgm extension, that fuzzy-matches across multiple columns in two tables. I've let the query run for an hour or more and it hasn't finished. I can get results after a few minutes by setting a LIMIT of 10 or 100, but anything beyond that takes forever and doesn't finish. The query is as follows:

SELECT * FROM schema1.X a
JOIN schema2.Y b
  ON (
    b.name % a.track_title
    AND b.artist_display_name % a.artist
    AND b.collection_display_name % a.release_title
  )

I've indexed each of the individual columns here on both tables with GIN, and I've added a composite GIN index on each table for the columns in the same order as the query (track_title, artist, release_title on schema1.X; name, artist_display_name, collection_display_name on schema2.Y).

EDIT: I should also mention that schema1.X contains about 50,000 rows and schema2.Y contains about 50 million rows. All three fields being compared in the JOIN are of type TEXT.

The EXPLAIN ANALYZE output of this query is the following (run with a LIMIT of 10 rows so I could get it to finish in a reasonable amount of time):

  ->  Nested Loop  (cost=43.18..2460373.14 rows=2696 width=71) (actual time=4922.361..8640.818 rows=10 loops=1)
        ->  Seq Scan on schema1.X a  (cost=0.00..1261.91 rows=52091 width=48) (actual time=0.007..0.010 rows=2 loops=1)
        ->  Bitmap Heap Scan on schema2.Y b  (cost=43.18..47.20 rows=1 width=71) (actual time=4300.048..4320.303 rows=5 loops=2)
              Recheck Cond: (((name)::text % (a.track_title)::text) AND ((artist_display_name)::text % (a.artist)::text) AND ((collection_display_name)::text % (a.release_title)::text))
              Rows Removed by Index Recheck: 1390
              Heap Blocks: exact=2715
              ->  Bitmap Index Scan on idx_gin_name_artist_collection  (cost=0.00..43.18 rows=1 width=0) (actual time=4297.353..4297.353 rows=1414 loops=2)
                    Index Cond: (((name)::text % (a.track_title)::text) AND ((artist_display_name)::text % (a.artist)::text) AND ((collection_display_name)::text % (a.release_title)::text))
Planning time: 0.316 ms
Execution time: 8641.760 ms


It appears to be using my composite index from one table (the Bitmap Index Scan on idx_gin_name_artist_collection) but not from the other.

In case it's important, I used this syntax to create the GIN indexes (filling in my table and column names of course):

CREATE INDEX  "table_idx" ON "schema"."table"
  USING gin(field_name COLLATE "default" "public"."gin_trgm_ops")
  WITH (FASTUPDATE = YES);

Am I approaching this the wrong way? Does pg_trgm work within a JOIN clause? Most of the examples I can find are using the % operator in the WHERE clause. Also, should I expect this to run in a reasonable amount of time (minutes) on tables this large, even with indexes? Thanks in advance for any help.

EDIT

Here is the EXPLAIN plan when I remove the LIMIT from the query:

Nested Loop  (cost=43.18..2460373.14 rows=2696 width=71)
  ->  Seq Scan on schema1.X a  (cost=0.00..1261.91 rows=52091 width=48)
  ->  Bitmap Heap Scan on schema2.Y b  (cost=43.18..47.20 rows=1 width=71)
        Recheck Cond: (((name)::text % (a.track_title)::text) AND ((artist_display_name)::text % (a.artist)::text) AND ((collection_display_name)::text % (a.release_title)::text))
        ->  Bitmap Index Scan on idx_gin_song_name_artist_collection  (cost=0.00..43.18 rows=1 width=0)
              Index Cond: (((name)::text % (a.track_title)::text) AND ((artist_display_name)::text % (a.artist)::text) AND ((collection_display_name)::text % (a.release_title)::text))

Follow-up questions:

  • Could I create a separate column of all three field concat'd or hashed for each table, and then index that column? Or is that what the composite GIN index is already doing?

  • Does casting play a role in slowing the query down? I notice in the EXPLAIN plan that the index is casting each field as ::text even though they are all varchar(256).

  • Alternative solutions: Would a columnar data store like Redshift be any more efficient at this? Does Redshift even have the ability to install extensions like pg_trgm? Are other fuzzy search tools better at this kind of multi-column searching (e.g. ElasticSearch)?
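For concreteness, the concatenated-column idea from the first bullet would look something like the following (the index name and delimiter are purely illustrative, and note this changes semantics: similarity is computed over the whole combined string rather than per field):

```sql
-- Hypothetical sketch: index one concatenated expression instead of three
-- columns. A delimiter reduces accidental trigram overlap across field
-- boundaries.
CREATE INDEX idx_gin_y_concat ON schema2.Y
  USING gin ((name || ' | ' || artist_display_name || ' | ' || collection_display_name)
             gin_trgm_ops);

-- The query must then compare the exact same expression so the index applies:
SELECT *
FROM schema1.X a
JOIN schema2.Y b
  ON (b.name || ' | ' || b.artist_display_name || ' | ' || b.collection_display_name)
     % (a.track_title || ' | ' || a.artist || ' | ' || a.release_title);
```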

Best Answer

This type of join cannot effectively use an index-to-index join; one of the tables is going to be seq scanned. You might want to try reversing which table gets seq scanned, for example by dropping the GIN index on the larger table so it can't be used, or by appending the empty string to each column of the large table, as in ON ((b.name||'') % a.track_title AND .... This will probably only help if you are limited by IO, not CPU.
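A sketch of that rewrite (the same query as in the question, with the large table's columns wrapped so its index becomes unusable and the planner probes the smaller table's indexes instead):

```sql
-- Appending '' to schema2.Y's columns makes those expressions unindexable,
-- so the planner seq scans Y once and, since % is commutative, can use the
-- per-column trigram indexes on the ~50k-row schema1.X for each probe.
SELECT *
FROM schema1.X a
JOIN schema2.Y b
  ON  (b.name || '') % a.track_title
  AND (b.artist_display_name || '') % a.artist
  AND (b.collection_display_name || '') % a.release_title;
```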

Also, should I expect this to run in a reasonable amount of time (minutes) on tables this large, even with indexes?

No, I don't think so. There might be some things you can do to improve performance, but you are fundamentally asking the database to do a massive amount of work and it will take a long time to do it.

Make sure you are using the latest version of pg_trgm, which is 1.3. This can help a lot for LIKE queries, but don't expect miracles for % queries.
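For example, you can check the installed extension version and upgrade it in place:

```sql
-- Check which pg_trgm version is installed in this database:
SELECT extversion FROM pg_extension WHERE extname = 'pg_trgm';

-- Upgrade the extension objects to 1.3 if an older version is installed:
ALTER EXTENSION pg_trgm UPDATE TO '1.3';
```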

Increase the value of pg_trgm.similarity_threshold as much as you can while still obtaining the number of results you want. The default value of 0.3 is really very low.
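For example (0.6 is just an illustrative starting point; tune it until you still get the matches you need):

```sql
-- Raise the threshold the % operator uses, from the default 0.3:
SET pg_trgm.similarity_threshold = 0.6;

-- Confirm the current setting:
SHOW pg_trgm.similarity_threshold;
```

A higher threshold lets the GIN index reject far more candidate rows up front, which directly shrinks the bitmap index scan work visible in your EXPLAIN plan.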

It is not clear that the multi-column index is actually going to be helpful. Try dropping it and see how it does with the individual column indexes.

Also, just be patient. This looks to me like a data-cleaning exercise, not something you need to run interactively. Maintain a materialized view, or a join table updated by triggers, to record the matches if you need to be able to retrieve them interactively.
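A minimal sketch of the materialized-view approach, assuming each table has an id primary key (the view name and key columns are hypothetical):

```sql
-- Precompute the expensive fuzzy join once; serve interactive lookups
-- from the result instead of re-running the trigram comparisons.
CREATE MATERIALIZED VIEW schema1.x_y_matches AS
SELECT a.id AS x_id, b.id AS y_id
FROM schema1.X a
JOIN schema2.Y b
  ON  b.name % a.track_title
  AND b.artist_display_name % a.artist
  AND b.collection_display_name % a.release_title;

-- Re-run the join on whatever schedule the data allows:
REFRESH MATERIALIZED VIEW schema1.x_y_matches;
```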