Postgresql – Postgres LIKE query using a GiST index is just as slow as a full scan

gist-indexperformancepostgresqlpostgresql-9.5query-performance

What I have is a very simple database that stores paths, extensions and names of files from UNC shares. For testing, I inserted about 1,5 mio rows and the below query uses a GiST index as you see, but still it takes 5 seconds to return. Expected would be a few (like 100) milliseconds.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE 'somevalue' 

enter image description here

When using %% in the query, it takes the not that long, even when using sequential scan (?!)

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE '%a%' 

enter image description here

I also have the same setup for the name (filename) column, when executing a similar query on that one, it only takes half of the time, even when using %%:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  name LIKE '%a%' 

enter image description here

What I already tried cannot be written here in short words. Whatever I do, it gets slow starting from about 1 mio rows. As there is basically never anything deleted, of course vacuuming and reindexing does not help at all.
I cannot really use any other type of search than LIKE %% and a GIN or GiST index because I need to be able to find any character in the columns of interest, not only "words for a specific human language".

Is my expectation that this should work in around 100 milliseconds even for many million more rows that wrong?

Further information

DB create script

Re-try without any text or other index at all, 1.7 mio unique entries

EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Seq Scan on residentfiles  (cost=0.00..78162.76 rows=33225 width=232) (actual time=0.076..3195.965 rows=45301 loops=1)
  Filter: ((name)::text ~~ '%12345%'::text)
  Rows Removed by Filter: 1604780+
Planning time: 0.596 ms
Execution time: 3318.595 ms

Try with Triagram gin index:

CREATE INDEX IF NOT EXISTS  name_gin_idx ON residentfiles USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS  parentpath_gin_idx ON residentfiles USING gin (parentpath gin_trgm_ops);
CREATE INDEX IF NOT EXISTS  ext_gin_idx ON residentfiles USING gin (extension gin_trgm_ops);
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'

Aggregate  (cost=53717.59..53717.60 rows=1 width=0) (actual time=1694.223..1694.224 rows=1 loops=1)
  ->  Bitmap Heap Scan on residentfiles  (cost=341.89..53631.82 rows=34308 width=0) (actual time=72.010..1615.007 rows=46532 loops=1)
        Recheck Cond: ((name)::text ~~ '%12345%'::text)
        Rows Removed by Index Recheck: 111
        Heap Blocks: exact=46372
        ->  Bitmap Index Scan on name_gin_idx  (cost=0.00..333.31 rows=34308 width=0) (actual time=52.287..52.287 rows=46643 loops=1)
              Index Cond: ((name)::text ~~ '%12345%'::text)
Planning time: 10.881 ms
Execution time: 1694.755 ms

Try with varchar_pattern:

CREATE INDEX idx_varchar_pattern_parentpath ON residentfiles (parentpath varchar_pattern_ops);
CREATE INDEX idx_varchar_pattern_name ON residentfiles (name varchar_pattern_ops);
CREATE INDEX idx_varchar_pattern_extension ON residentfiles (extension varchar_pattern_ops);
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'

Aggregate  (cost=89718.74..89718.75 rows=1 width=0) (actual time=1995.206..1995.207 rows=1 loops=1)
  ->  Seq Scan on residentfiles  (cost=0.00..89574.98 rows=57507 width=0) (actual time=0.060..1913.114 rows=52232 loops=1)
        Filter: ((name)::text ~~ '%12345%'::text)
        Rows Removed by Filter: 1852103
Planning time: 8.280 ms
Execution time: 1995.255 ms

Best Answer

In my opinion if you do not share how you did the test, it's very hard to give you an answer. Let see an example of what I mean. Sorry for I used a postgres 11 but the conclusions are the same:

This is a new db, there is nothing running against the instance:

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEX

Inserting 500000 rows using a very simple loop.

test=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
 count
--------
 500000
(1 row)

test=# select * from test_trgmidx limit 20;
col1        |    col2
------------+------------
 ABCD1EFGH  | abcd1efgh
 ABCD2EFGH  | abcd2efgh
 ABCD3EFGH  | abcd3efgh
 ABCD4EFGH  | abcd4efgh
 ABCD5EFGH  | abcd5efgh
 ABCD6EFGH  | abcd6efgh
 ABCD7EFGH  | abcd7efgh
 ABCD8EFGH  | abcd8efgh
 ABCD9EFGH  | abcd9efgh
 ABCD10EFGH | abcd10efgh
 ABCD11EFGH | abcd11efgh
 ABCD12EFGH | abcd12efgh
 ABCD13EFGH | abcd13efgh
 ABCD14EFGH | abcd14efgh
 ABCD15EFGH | abcd15efgh
 ABCD16EFGH | abcd16efgh
 ABCD17EFGH | abcd17efgh
 ABCD18EFGH | abcd18efgh
 ABCD19EFGH | abcd19efgh
 ABCD20EFGH | abcd20efgh
(20 rows)

Now I restart the instance to have a clean buffer cache, then I run the explain for the first select twice, to see how cache is "disturbing" our results:

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30   rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
  Output: col1, col2
  Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
  Buffers: shared hit=19 read=237
 Planning Time: 0.303 ms
 Execution Time: 4.934 ms
(6 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
 Output: col1, col2
 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.072 ms
Execution Time: 2.317 ms
(6 rows)

It's clear that the first run need to retrive rows from disk (read=237) while the second one need only to access buffer cache (shared hit=232, NO reads). Now let's do the same for the second select, restart the instance and run the explain twice:

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared read=3676
 Planning Time: 0.719 ms
 Execution Time: 94.942 ms
(7 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=3676
 Planning Time: 0.081 ms
 Execution Time: 65.878 ms
(7 rows)

As you can see, now the reads come from disk the first run and from buffer the second one. Things are even more complicated because of OS buffers. Is is possible to clean OS cache and re-run everithing having back different results:

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      929920     4105056       93960     2879628     6748994
Swap:       4063228           0     4063228

# echo 3 > /proc/sys/vm/drop_caches

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      802204     6846392       93960      266008     6951156
Swap:       4063228           0     4063228
#

Look at column buff/cache as it drop down from 2879628 to 266008. Now run the explain again (twice for any select):

postgres=# \c test
You are now connected to database "test" as user "postgres".
test=#  explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=130.858..140.403 rows=1 loops=1)
   Output: col1, col2
   Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
   Buffers: shared hit=19 read=237
 Planning Time: 38.448 ms
 Execution Time: 140.466 ms
(6 rows)

test=#  explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=4.386..4.759 rows=1 loops=1)
   Output: col1, col2
   Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
   Buffers: shared hit=232
 Planning Time: 0.115 ms
 Execution Time: 4.787 ms
   (6 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=9.214..161.243 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=1 read=3675
 Planning Time: 0.090 ms
 Execution Time: 165.354 ms
(7 rows)

test=#  explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                          QUERY PLAN
       ----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.013..62.043 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=3676
 Planning Time: 0.078 ms
 Execution Time: 66.101 ms
   (7 rows)

You can check how different are now the statistics, first run read form disk, second one from buffer cache. All these lines only to say that in my opinion and for my experience, it is almost impossible to have a clear understanding about what is happening in you environment without all the information around how you did any single explain. And even having all the informations, it is sometimes very difficult to find the answer because of all the variables playing a role. My 2 cents