Postgresql – Inconsistent query execution times using trigram indexes

postgresql

I have a table residences

I installed trgm: create extension pg_trgm;

After adding trigram indexes to both grp and name columns using:

CREATE INDEX residences_name_trgm ON residences USING GIN (name gin_trgm_ops);
CREATE INDEX residences_grp_trgm ON residences USING GIN (grp gin_trgm_ops);

And checking performance of a simple query… it appears that the total runtime for a query on the name column runs ~100x faster. Why is this?

EXPLAIN ANALYZE SELECT * FROM residences WHERE name ILIKE '%Sutton%';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on residences  (cost=36.02..47.90 rows=3 width=1872) (actual time=0.390..0.720 rows=21 loops=1)
   Recheck Cond: ((name)::text ~~ '%Sutton%'::text)
   ->  Bitmap Index Scan on residences_name_trgm_gin  (cost=0.00..36.02 rows=3 width=0) (actual time=0.354..0.354 rows=21 loops=1)
         Index Cond: ((name)::text ~~ '%Sutton%'::text)
 Total runtime: 0.814 ms
(5 rows)

and

EXPLAIN ANALYZE SELECT * FROM residences WHERE grp ILIKE '%Sutton%';
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on residences  (cost=97.93..8879.41 rows=5927 width=1872) (actual time=5.516..115.634 rows=5968 loops=1)
   Recheck Cond: ((grp)::text ~~ '%Sutton%'::text)
   ->  Bitmap Index Scan on residences_grp_trgm_gin  (cost=0.00..96.45 rows=5927 width=0) (actual time=4.366..4.366 rows=5968 loops=1)
         Index Cond: ((grp)::text ~~ '%Sutton%'::text)
 Total runtime: 119.779 ms
(5 rows)

For reference, both name and grp columns are CHARACTER VARYING(50).

Futhermore… counting distinct values in each column yields:

SELECT COUNT(DISTINCT grp) FROM residences;
421

SELECT COUNT(DISTINCT name) FROM residences
7750

That is, the name column actually has ~20x more distinct values, but returns values much faster.

More information on this technique can be found on depesz's article WAITING FOR 9.1 – FASTER LIKE/ILIKE and another similar tutorial on the PalominoDB blog.

Best Answer

That is, the name column actually has ~20x more distinct values, but returns values much faster.

Exactly so. An index on name is more selective than an index on grp, so index use is more effective.

This is confirmed by the fact that your first query, on the name field, the bitmap index scan onresidences_grp_trgm_gin gets only 21 results as shown by actual( ... rows=21 ...) in the explain analyze output. On grp you get 5968 results.

In a bitmap index scan every heap read and re-check has a significant cost, so it makes perfect sense that finding 21 rows is going to be a lot faster than finding 5968.

BTW, please always show select version() and your server configuration, as well as anything else that's relevant from the wiki's guide to reporting problems.