PostgreSQL – Using ILIKE with unaccent and only a right-end wildcard

index | index-tuning | performance | postgresql | postgresql-9.4 | postgresql-performance

I use PostgreSQL 9.4 and have a big table named foo. I want to search it, but I get long execution times if the search text is very short (e.g. "v") or very long (e.g. "This is a search example with gin on table foo%"). In these cases my index is ignored. Here is my search query:

EXPLAIN (ANALYZE, TIMING)
SELECT  "foo".* FROM "foo" WHERE "foo"."locale" = 'de'
AND f_unaccent(foo.name) ILIKE f_unaccent('v%')
AND foo.configuration->'bar' @> '{"is":["a"]}'
LIMIT 100;

This is my index:

CREATE INDEX index_foo_on_name_de_gin ON foo USING gin(f_unaccent(name) gin_trgm_ops) WHERE locale = 'de';

Why is the index ignored, and why does the planner use a seq scan and/or bitmap heap scan? What other index can I add to solve this problem?

Why does it do a recheck?

Recheck Cond: ((f_unaccent((name)::text) ~~* 'v%'::text) AND ((locale)::text = 'de'::text))

Function f_unaccent:

CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text AS
$func$
SELECT unaccent('unaccent', $1)
$func$ LANGUAGE sql IMMUTABLE SET search_path = public, pg_temp;

Query Plan:

 Limit  (cost=24412.85..67568.91 rows=100 width=301) (actual time=21838.473..21838.473 rows=0 loops=1)
   Buffers: shared hit=1 read=749976
   ->  Bitmap Heap Scan on foo  (cost=24412.85..4595502.73 rows=10592 width=301) (actual time=21838.470..21838.470 rows=0 loops=1)
         Recheck Cond: ((f_unaccent((name)::text) ~~* 'v%'::text) AND ((locale)::text = 'de'::text))
         Rows Removed by Index Recheck: 5416739
         Filter: ((configuration -> 'bar'::text) @> '{"is": ["a"]}'::jsonb)
         Rows Removed by Filter: 2196
         Heap Blocks: exact=749172
         Buffers: shared hit=1 read=749976
         ->  Bitmap Index Scan on index_foo_on_name_de_gin  (cost=0.00..24410.20 rows=10591544 width=0) (actual time=641.532..641.532 rows=5418935 loops=1)
               Index Cond: (f_unaccent((name)::text) ~~* 'v%'::text)
               Buffers: shared hit=1 read=804
 Planning time: 0.767 ms
 Execution time: 21838.549 ms

Table definition:

    Column     |            Type             |                          Modifiers                           | Storage  | Stats target | Description 
---------------+-----------------------------+--------------------------------------------------------------+----------+--------------+-------------
 id            | integer                     | not null default nextval('foo_id_seq'::regclass)             | plain    |              | 
 locale        | character varying           | not null                                                     | extended |              | 
 name          | character varying           | not null                                                     | extended |              | 
 configuration | jsonb                       | not null default '{}'::jsonb                                 | extended |              | 

"index_foo_on_configuration" gin (configuration)
"index_foo_on_name_de_gin" gin (f_unaccent(name::text) gin_trgm_ops) WHERE locale::text = 'de'::text

Without the foo.configuration filter the query is very fast (1.021 ms), but I need this filter. Here is the query without it:

EXPLAIN (ANALYZE, BUFFERS)
SELECT  "foo".* FROM "foo" WHERE "foo"."locale" = 'de'
AND f_unaccent(foo.name) ILIKE f_unaccent('v%')
LIMIT 100;

RESULTS WITH CHANGES

  • Updated the f_unaccent function
  • Added btree index: CREATE INDEX index_foo_on_name_de ON foo (f_unaccent(name) text_pattern_ops) WHERE locale = 'de';
  • Added gin index on configuration: CREATE INDEX index_foo_on_configuration ON foo USING gin(configuration jsonb_path_ops);
  • Removed the old index

A) Query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT  "foo".* FROM "foo" WHERE "foo"."locale" = 'de' 
AND f_unaccent(foo.name) ILIKE f_unaccent('v%')
AND foo.configuration->'bar' @> '{"0":["s"]}' 
LIMIT 100;

A) Query Plan:

 Limit  (cost=0.00..121248.83 rows=100 width=301) (actual time=16319.267..16319.267 rows=0 loops=1)
   Buffers: shared hit=262079 read=1449294
   ->  Seq Scan on foo  (cost=0.00..12842675.96 rows=10592 width=301) (actual time=16319.261..16319.261 rows=0 loops=1)
         Filter: (((locale)::text = 'de'::text) AND ((configuration -> 'bar'::text) @> '{"is": ["a"]}'::jsonb) AND (f_unaccent((name)::text) ~~* 'v%'::text))
         Rows Removed by Filter: 41227048
         Buffers: shared hit=262079 read=1449294
 Planning time: 0.765 ms
 Execution time: 16319.313 ms (and sometimes more!)

B) Query without configuration:

EXPLAIN (ANALYZE, BUFFERS)
SELECT  "foo".* FROM "foo" WHERE "foo"."locale" = 'de' 
AND f_unaccent(foo.name) ILIKE f_unaccent('v%') LIMIT 100;

B) Query Plan:

 Limit  (cost=0.00..119.31 rows=100 width=301) (actual time=0.227..2.912 rows=100 loops=1)
   Buffers: shared read=31
   ->  Seq Scan on foo  (cost=0.00..12636540.72 rows=10591544 width=301) (actual time=0.221..2.864 rows=100 loops=1)
         Filter: (((locale)::text = 'de'::text) AND (f_unaccent((name)::text) ~~* 'v%'::text))
         Rows Removed by Filter: 691
         Buffers: shared read=31
 Planning time: 0.501 ms
 Execution time: 2.985 ms

C) Query without configuration and limit:

EXPLAIN (ANALYZE, BUFFERS)
SELECT  "foo".* FROM "foo" WHERE "foo"."locale" = 'de' 
AND f_unaccent(foo.name) ILIKE f_unaccent('v%');

C) Query Plan:

 Bitmap Heap Scan on foo  (cost=346203.46..4864616.26 rows=10591544 width=301) (actual time=23526.443..30050.008 rows=2196 loops=1)
   Recheck Cond: ((locale)::text = 'de'::text)
   Rows Removed by Index Recheck: 14094842
   Filter: (f_unaccent((name)::text) ~~* 'v%'::text)
   Rows Removed by Filter: 10781095
   Heap Blocks: exact=572873 lossy=847868
   Buffers: shared read=1494015
   ->  Bitmap Index Scan on index_foo_on_name_de  (cost=0.00..343555.58 rows=10592603 width=0) (actual time=1788.454..1788.454 rows=10783291 loops=1)
         Buffers: shared read=73274
 Planning time: 0.528 ms
 Execution time: 30050.168 ms

Best Answer

1. f_unaccent()

Seems like you are using my function as defined here:

Note the update I just made. This is better:

CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$  LANGUAGE sql IMMUTABLE;

Detailed explanation over there.
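As a quick sanity check (assuming the unaccent extension is installed in schema public), the function should strip diacritics like this:

```sql
-- Hypothetical smoke test for f_unaccent():
SELECT f_unaccent('Äpfel');         -- → 'Apfel'
SELECT f_unaccent('Crème brûlée');  -- → 'Creme brulee'
```

Since the function is now truly IMMUTABLE (no dependence on the session's search_path), it is safe to use in index expressions.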

2. Recheck

Why does it do a recheck?

The "Recheck Cond:" line is always in the EXPLAIN output for bitmap index scans. Not to worry. Detailed explanation:

3. Index and query plan

Why is the index ignored

That's a misunderstanding. Your index is obviously not ignored. If Postgres expects to find enough rows that some data pages of the main relation would have to be visited more than once (obviously the case with rows=10591544), it switches from an index scan to a bitmap index scan, which is followed by a "Bitmap Heap Scan" to fetch the actual tuples. Details:

What makes this query really expensive is a combination of multiple unfortunate factors:

  1. Neither the index (Buffers: shared hit=1 read=804) nor the table (Buffers: shared hit=1 read=749976) was cached. If you repeat the query right away, it will be much faster, since all of it is cached by then. This is the worst case possible.

  2. The search pattern f_unaccent('v%') - or just 'v%' - is a very bad case for a trigram index: not very selective, yet still selective enough for the planner to prefer it over a sequential scan. A text_pattern_ops index would be much faster for this. See below.
    More selective patterns (longer strings) would also be much faster.

  3. You had LIMIT 100, so Postgres started out optimistically, hoping to find 100 rows quickly. But the query returns 0 rows (rows=0), which means Postgres had to walk through all candidate rows unsuccessfully. Another worst-case scenario. Your second predicate is to blame here:

    AND foo.configuration->'bar' @> '{"is":["a"]}'
    

    Postgres has only very limited statistics for jsonb columns. It has no idea how selective that condition is going to be. If you have many queries on configuration->'bar', you could improve the situation drastically with another expression index ...

    Possibly even a multicolumn index.
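For example (a sketch, not part of the answer above; the index name is made up), an expression index targeting just the 'bar' key would let the containment test use index support:

```sql
-- Hypothetical expression index on the 'bar' key only.
-- jsonb_path_ops supports the @> operator and is typically smaller
-- and faster than the default jsonb_ops for pure containment tests.
CREATE INDEX index_foo_on_configuration_bar ON foo
USING gin ((configuration -> 'bar') jsonb_path_ops);
```

The predicate foo.configuration->'bar' @> '{"is":["a"]}' matches this indexed expression exactly, so Postgres can use the index to narrow down candidate rows before applying the pattern condition.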

4. text_pattern_ops

For left-anchored patterns only ("right end wildcard"), you can make do without trigram indexes. But a plain btree index won't do if your DB uses any locale other than the "C" locale (which is effectively "no locale"). In that case you need special operator classes that ignore the locale. Like:

CREATE INDEX index_foo_name_pattern_ops_de ON foo (f_unaccent(name) text_pattern_ops)
WHERE locale = 'de';

Details:
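One caveat (my own sketch here; index name and lower() rewrite are assumptions, not from the answer above): ILIKE itself cannot use a plain btree index. A common workaround is to fold case into the index expression and query with LIKE against the same expression:

```sql
-- Hypothetical case-folded variant of the text_pattern_ops index:
CREATE INDEX index_foo_name_lower_pattern_de ON foo
  (lower(f_unaccent(name)) text_pattern_ops)
WHERE locale = 'de';

-- The query must be rewritten to match the indexed expression exactly:
SELECT *
FROM foo
WHERE locale = 'de'
  AND lower(f_unaccent(name)) LIKE lower(f_unaccent('v%'));
```

With a left-anchored LIKE pattern, the planner can rewrite the condition into an index range scan on the btree, which avoids the trigram bitmap scan entirely for short prefixes like 'v%'.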