Postgresql – Postgressql text search: pg_trgm index with order by going very slowly

database-designlikepostgresqlpostgresql-9.5postgresql-extensions

I have a really slow query(~30 seconds) on a large text search DB that I could really use some help with.

I'm searching for ordered text, in a long file path string. So the path could be: folder1/cat 123/dog 234

I'm using ilike '%search_term_1% %search_term_2%' to do searches. So cat dog would return the result above.

I have a table with two columns:

create table file (
  id bigserial primary key,
  path varchar(2048) not null,
  peers integer);

I want to do a text search on path, and order by peers desc nulls last

I've created the following two indices to speed up the queries:

create index idx_file_path_tri on file using gin (path gin_trgm_ops);
create index idx_file_peers on file(peers nulls last);

Here's the explain anaylze results:

explain analyze select * from file where path ilike '%cat%' order by peers desc nulls last limit 15;

It looks like I need a compound index with the gin, but it won't let me build one based on peers desc nulls last

Limit  (cost=7512.09..7512.13 rows=15 width=126) (actual time=342729.147..342729.153 rows=15 loops=1)                      
   ->  Sort  (cost=7512.09..7516.48 rows=1753 width=126) (actual time=342729.144..342729.145 rows=15 loops=1)               
         Sort Key: peers DESC NULLS LAST                                                                                    
         Sort Method: top-N heapsort  Memory: 27kB                                                                          
         ->  Bitmap Heap Scan on file  (cost=809.59..7469.09 rows=1753 width=126) (actual time=2143.088..342610.395 rows=219
580 loops=1)                                                                                                                
               Recheck Cond: ((path)::text ~~* '%cat%'::text)                                                               
               Heap Blocks: exact=38190                                                                                     
               ->  Bitmap Index Scan on idx_file_path_tri  (cost=0.00..809.15 rows=1753 width=0) (actual time=2108.286..2108
.286 rows=223590 loops=1)                                                                                                   
                     Index Cond: ((path)::text ~~* '%cat%'::text)                                                           
 Planning time: 0.328 ms                                                                                                    
 Execution time: 342729.330 ms                                                                                              
(11 rows)

Note: the queries go extremely quickly without the order by.

Best Answer

The core performance problem is the selectivity estimation for the predicate path ilike '%cat%', which is off by a factor of 130:

(cost=0.00..809.15 rows=1753 width=0) (actual time=2108.286..2108 .286 rows=223590 loops=1)

Probably leading to a sub-optimal query plan.

In this particular case, Postgres identifies a quarter million rows, just to filter the top 15. A giant waste of time. Depending on missing information, some other query plan (some other query) will be much more efficient. Here is a related question with some generic techniques in the answers that apply in similar fashion to your case:

Selectivity estimation for pattern matching is hard, but there have been major improvements in Postgres 9.6. Detailed explanation in this closely related answer:

Your best course of action may be to install a current version of Postgres.

Of course, there may be all kinds of additional problems, your question does not reveal details.

Or you may be using the wrong tool altogether, you did not define what you are trying to search exactly.


Concerning:

It looks like I need a compound index with the gin, but it won't let me build one based on peers desc nulls last ...

Again: essential details like your CREATE INDEX command and the resulting error msg are missing. My educated guess: You need to install the additional module btree_gin first. You probably know that by now, since you just commented on this related answer explaining as much:

But don't bother, it probably won't help with your query.