PostgreSQL – Slow query times for similarity searches with pg_trgm indexes

performance | postgresql | postgresql-9.6 | query-performance

We added two pg_trgm indices to a table, to enable fuzzy searching by either email address or name, as we need to find users by name, or email addresses which have been misspelled during signup (e.g. "@gmail.con"). ANALYZE was run after index creation.

However, doing a ranked search on either of these indexes is very slow in the vast majority of cases. That is, with an increased statement timeout a query might return in 60 seconds, on rare occasions in as little as 15 seconds, but usually it times out.

pg_trgm.similarity_threshold is at its default value of 0.3, but bumping it up to 0.8 didn't seem to make a difference.
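For reference, this is how I've been adjusting the threshold per session (note that the setting only affects the `%` filter; the `<->` distance ordering ignores it, which may be why raising it changed nothing here):

```sql
-- Session-level change; the % operator now requires similarity >= 0.8.
SET pg_trgm.similarity_threshold = 0.8;

-- Sanity check of an actual similarity score for a typo'd address:
SELECT similarity('chris@example.com', 'chris@example.con');
```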

This particular table has over 25 million rows, and is constantly queried, updated, and inserted into (the mean time for each is under 2ms). The setup is PostgreSQL 9.6.6 running on an RDS db.m4.large instance with general purpose SSD storage, and more-or-less default parameters. The pg_trgm extension is version 1.3.

Queries:

  • SELECT *
    FROM users
    WHERE email % 'chris@example.com'
    ORDER BY email <-> 'chris@example.com' LIMIT 10;
    
  • SELECT *
    FROM users
    WHERE (first_name || ' ' || last_name) % 'chris orr'
    ORDER BY (first_name || ' ' || last_name) <-> 'chris orr' LIMIT 10;
    

These queries do not need to be run very often (dozens of times a day), but they should be based on the current table state, and ideally return within around 10 seconds.


Schema:

=> \d+ users
                                          Table "public.users"
          Column   |            Type             | Collation | Nullable | Default | Storage  
-------------------+-----------------------------+-----------+----------+---------+----------
 id                | uuid                        |           | not null |         | plain    
 email             | citext                      |           | not null |         | extended 
 email_is_verified | boolean                     |           | not null |         | plain    
 first_name        | text                        |           | not null |         | extended 
 last_name         | text                        |           | not null |         | extended 
 created_at        | timestamp without time zone |           |          | now()   | plain    
 updated_at        | timestamp without time zone |           |          | now()   | plain    
 …                 | boolean                     |           | not null | false   | plain    
 …                 | character varying(60)       |           |          |         | extended 
 …                 | character varying(6)        |           |          |         | extended 
 …                 | character varying(6)        |           |          |         | extended 
 …                 | boolean                     |           |          |         | plain    
Indexes:
  "users_pkey" PRIMARY KEY, btree (id)
  "users_email_key" UNIQUE, btree (email)
  "users_search_email_idx" gist (email gist_trgm_ops)
  "users_search_name_idx" gist (((first_name || ' '::text) || last_name) gist_trgm_ops)
  "users_updated_at_idx" btree (updated_at)
Triggers:
  update_users BEFORE UPDATE ON users FOR EACH ROW EXECUTE PROCEDURE update_modified_column()
Options: autovacuum_analyze_scale_factor=0.01, autovacuum_vacuum_scale_factor=0.05

(I'm aware that we should probably also add unaccent() to users_search_name_idx and the name query…)


Explains:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE (first_name || ' ' || last_name) % 'chris orr' ORDER BY (first_name || ' ' || last_name) <-> 'chris orr' LIMIT 10;

Limit  (cost=0.42..40.28 rows=10 width=152) (actual time=58671.973..58676.193 rows=10 loops=1)
  Buffers: shared hit=66227 read=231821
  ->  Index Scan using users_search_name_idx on users  (cost=0.42..100264.13 rows=25153 width=152) (actual time=58671.970..58676.180 rows=10 loops=1)
        Index Cond: (((first_name || ' '::text) || last_name) % 'chris orr'::text)
        Order By: (((first_name || ' '::text) || last_name) <-> 'chris orr'::text)
        Buffers: shared hit=66227 read=231821
Planning time: 0.125 ms
Execution time: 58676.265 ms

The email search is more likely to time out than the name search, but that's presumably because the email addresses are so similar (e.g. a lot of @gmail.com addresses).

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email % 'chris@example.com' ORDER BY email <-> 'chris@example.com' LIMIT 10;

Limit  (cost=0.42..40.43 rows=10 width=152) (actual time=58851.719..62181.128 rows=10 loops=1)
  Buffers: shared hit=83 read=428918
  ->  Index Scan using users_search_email_idx on users  (cost=0.42..100646.36 rows=25153 width=152) (actual time=58851.716..62181.113 rows=10 loops=1)
        Index Cond: ((email)::text % 'chris@example.com'::text)
        Order By: ((email)::text <-> 'chris@example.com'::text)
        Buffers: shared hit=83 read=428918
Planning time: 0.100 ms
Execution time: 62181.186 ms
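For scale, nearly all of the buffers in the email plan came from outside shared_buffers. A quick back-of-the-envelope check (assuming the default 8 kB block size) of how much data that scan pulled in:

```python
# Rough I/O volume implied by the EXPLAIN (ANALYZE, BUFFERS) output above.
block_size = 8 * 1024    # default PostgreSQL block size in bytes
shared_hit = 83          # buffers found in shared_buffers
shared_read = 428918     # buffers read from the OS page cache / disk

read_gib = shared_read * block_size / 1024**3
print(f"read from outside shared_buffers: {read_gib:.2f} GiB")
# → read from outside shared_buffers: 3.27 GiB
```

Over 3 GiB of mostly random reads on general purpose SSD storage would plausibly account for a runtime of this magnitude.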

What could be a reason for the slow query times? Something to do with the number of buffers being read? I couldn't find much information about optimising this particular kind of query, and the queries are very similar to those in the pg_trgm documentation anyway.

Is this something that we could optimize, or implement better in Postgres, or would looking to something like Elasticsearch be a better fit for this particular use case?

Best Answer

You might be able to get better performance with gin_trgm_ops rather than gist_trgm_ops. Which is better is hard to predict: it is sensitive to the distribution of text patterns and lengths in your data and in your query terms, so you pretty much have to try it and see how it works for you. One difference is that the GIN method is quite sensitive to pg_trgm.similarity_threshold, unlike the GiST method. It will also depend on which version of pg_trgm you have; if you started with an older version of PostgreSQL and moved forward with pg_upgrade, you might not have the latest version of the extension.

The planner is no better at predicting which index type is superior than we are. So to test it, you can't just create both; you have to drop the other one to force the planner to use the one you want.
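A sketch of such a test, using the index names from the schema above (the index name with the `_gin_` infix is my invention):

```sql
BEGIN;
-- Drop the GiST index so the planner has to consider the GIN one.
DROP INDEX users_search_email_idx;
CREATE INDEX users_search_email_gin_idx
    ON users USING gin (email gin_trgm_ops);

-- SET LOCAL so the change is confined to this transaction;
-- GIN similarity filtering is sensitive to this setting.
SET LOCAL pg_trgm.similarity_threshold = 0.5;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   users
WHERE  email % 'chris@example.com'
ORDER  BY email <-> 'chris@example.com'
LIMIT  10;
ROLLBACK;
```

Two caveats: DROP INDEX and a plain CREATE INDEX both take heavy locks, so on a constantly written table you would want CREATE INDEX CONCURRENTLY outside a transaction instead. And on 9.6, a GIN trigram index cannot serve the `<->` ordering directly; it only accelerates the `%` filter, with the top-10 sort happening afterwards, which is often still a large win when the filter is selective.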

In the specific case of the email column, you might do better to split addresses into username and domain, and then query for a similar username with an exact domain, and vice versa. Then the extreme prevalence of the major cloud email providers is less likely to pollute the indexes with trigrams that add little information.
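A minimal sketch of that approach using expression indexes (index names are hypothetical; the `::text` casts are needed because `split_part` takes text, not citext):

```sql
-- Trigram index on the local part only, plus a plain index on the domain.
CREATE INDEX users_email_local_trgm_idx
    ON users USING gist (split_part(email::text, '@', 1) gist_trgm_ops);
CREATE INDEX users_email_domain_idx
    ON users (split_part(email::text, '@', 2));

-- Similar username, exact domain:
SELECT *
FROM   users
WHERE  split_part(email::text, '@', 1) % 'chris'
AND    split_part(email::text, '@', 2) = 'example.com'
ORDER  BY split_part(email::text, '@', 1) <-> 'chris'
LIMIT  10;
```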

Finally, what is the use case for this? Knowing why you need to run these queries could lead to better suggestions. In particular, why would you need to do a similarity search on emails once they have been verified as deliverable and going to the correct person? Perhaps you could build a partial index on only the subset of emails which have not been verified yet.
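If that assumption holds, the partial index would look something like this (index name hypothetical; the WHERE clause in the query must match the index predicate for the planner to use it):

```sql
-- Trigram index restricted to unverified addresses only.
CREATE INDEX users_unverified_email_trgm_idx
    ON users USING gist (email gist_trgm_ops)
    WHERE NOT email_is_verified;

SELECT *
FROM   users
WHERE  NOT email_is_verified
AND    email % 'chris@example.com'
ORDER  BY email <-> 'chris@example.com'
LIMIT  10;
```

A partial index over only the unverified fraction of 25 million rows would be far smaller, and correspondingly far cheaper to scan.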