I'm trying to speed up a query that combines three ILIKE conditions with OR (returning the overall count plus 10 entries):
SELECT *, count(*) OVER() as filtered_count FROM "users"
WHERE (
(f_unaccent("users"."first_name") ILIKE f_unaccent('%foo%') OR
f_unaccent("users"."last_name") ILIKE f_unaccent('%foo%')) OR
f_unaccent("users"."club_or_hometown") ILIKE f_unaccent('%foo%')
) LIMIT 10 OFFSET 0
This works reasonably fast after adding GIN indexes for all queried attributes (shown here only for first_name):
CREATE INDEX users_first_name_gin
ON users
USING gin
(f_unaccent(first_name::text) COLLATE pg_catalog."default" gin_trgm_ops);
If I add an additional ORDER BY clause, however, e.g. ORDER BY users.first_name ASC, PostgreSQL does not use the GIN index but the plain B-tree index on first_name and then filters the result. This takes significantly longer in my application. How can I adapt the query/indexes to keep using the GIN indexes even for ordered queries?
Edit: I'm using PostgreSQL 9.4.
Explain of unordered query:
"Limit (cost=125.98..139.61 rows=10 width=58) (actual time=17.828..17.833 rows=10 loops=1)"
" -> WindowAgg (cost=125.98..2972.72 rows=2088 width=58) (actual time=17.826..17.831 rows=10 loops=1)"
" -> Bitmap Heap Scan on users (cost=125.98..2946.62 rows=2088 width=58) (actual time=0.915..16.816 rows=1755 loops=1)"
" Recheck Cond: ((f_unaccent((first_name)::text) ~~* '%foo%'::text) OR (f_unaccent((last_name)::text) ~~* '%foo%'::text) OR (f_unaccent((club_or_hometown)::text) ~~* '%foo%'::text))"
" Heap Blocks: exact=891"
" -> BitmapOr (cost=125.98..125.98 rows=2088 width=0) (actual time=0.742..0.742 rows=0 loops=1)"
" -> Bitmap Index Scan on users_first_name_gin (cost=0.00..51.80 rows=2074 width=0) (actual time=0.600..0.600 rows=1735 loops=1)"
" Index Cond: (f_unaccent((first_name)::text) ~~* '%foo%'::text)"
" -> Bitmap Index Scan on users_last_name_gin (cost=0.00..36.31 rows=8 width=0) (actual time=0.069..0.069 rows=20 loops=1)"
" Index Cond: (f_unaccent((last_name)::text) ~~* '%foo%'::text)"
" -> Bitmap Index Scan on users_club_or_hometown_gin (cost=0.00..36.29 rows=6 width=0) (actual time=0.072..0.072 rows=0 loops=1)"
" Index Cond: (f_unaccent((club_or_hometown)::text) ~~* '%foo%'::text)"
"Planning time: 0.791 ms"
"Execution time: 17.909 ms"
Explain of ordered query:
"Limit (cost=0.42..404.22 rows=10 width=58) (actual time=2363.902..2363.908 rows=10 loops=1)"
" -> WindowAgg (cost=0.42..84314.74 rows=2088 width=58) (actual time=2363.900..2363.904 rows=10 loops=1)"
" -> Index Scan using index_users_on_first_name on users (cost=0.42..84288.64 rows=2088 width=58) (actual time=132.873..2362.996 rows=1755 loops=1)"
" Filter: ((f_unaccent((first_name)::text) ~~* '%foo%'::text) OR (f_unaccent((last_name)::text) ~~* '%foo%'::text) OR (f_unaccent((club_or_hometown)::text) ~~* '%foo%'::text))"
" Rows Removed by Filter: 99646"
"Planning time: 0.937 ms"
"Execution time: 2363.989 ms"
The indexes from \d users:
"users_pkey" PRIMARY KEY, btree (id)
"index_users_on_club_or_hometown" btree (club_or_hometown)
"index_users_on_first_name" btree (first_name)
"index_users_on_last_name" btree (last_name)
"users_club_or_hometown_gin" gin (f_unaccent(club_or_hometown::text) gin_trgm_ops)
"users_first_name_gin" gin (f_unaccent(first_name::text) gin_trgm_ops)
"users_last_name_gin" gin (f_unaccent(last_name::text) gin_trgm_ops)
Edit 2: When disabling index scans with set enable_indexscan = off;, Postgres uses the correct indexes again:
"Limit (cost=3273.20..3273.22 rows=10 width=58) (actual time=32.231..32.231 rows=10 loops=1)"
" -> Sort (cost=3273.20..3279.30 rows=2442 width=58) (actual time=32.229..32.229 rows=10 loops=1)"
" Sort Key: first_name"
" Sort Method: top-N heapsort Memory: 26kB"
" -> WindowAgg (cost=128.90..3220.43 rows=2442 width=58) (actual time=29.982..30.735 rows=2655 loops=1)"
" -> Bitmap Heap Scan on users (cost=128.90..3189.90 rows=2442 width=58) (actual time=1.323..28.260 rows=2655 loops=1)"
" Recheck Cond: ((f_unaccent((first_name)::text) ~~* '%foo%'::text) OR (f_unaccent((last_name)::text) ~~* '%foo%'::text) OR (f_unaccent((club_or_hometown)::text) ~~* '%foo%'::text))"
" Heap Blocks: exact=1057"
" -> BitmapOr (cost=128.90..128.90 rows=2443 width=0) (actual time=1.099..1.099 rows=0 loops=1)"
" -> Bitmap Index Scan on users_first_name_gin (cost=0.00..54.46 rows=2428 width=0) (actual time=0.961..0.961 rows=2647 loops=1)"
" Index Cond: (f_unaccent((first_name)::text) ~~* '%foo%'::text)"
" -> Bitmap Index Scan on users_last_name_gin (cost=0.00..36.31 rows=8 width=0) (actual time=0.066..0.066 rows=7 loops=1)"
" Index Cond: (f_unaccent((last_name)::text) ~~* '%foo%'::text)"
" -> Bitmap Index Scan on users_club_or_hometown_gin (cost=0.00..36.29 rows=6 width=0) (actual time=0.071..0.071 rows=1 loops=1)"
" Index Cond: (f_unaccent((club_or_hometown)::text) ~~* '%foo%'::text)"
"Planning time: 0.803 ms"
"Execution time: 32.292 ms"
Best Answer
Your added comment is on the right track already: if 'foo' is a very common string, Postgres expects it to be faster to read the B-tree index sequentially and just skip the rows that do not match. The estimate is also based on cost settings and the expected selectivity of the predicates. There are multiple entry points for skewed estimates.
Selectivity of predicates and combinations thereof
Your most important problem seems to be here: Postgres expects 209 times as many rows as are actually returned. explain.depesz.com can help audit a query plan: http://explain.depesz.com/s/53E
You may be able to get more accurate estimates by increasing the statistics target for the indexed expressions. Postgres not only gathers statistics for table columns, it does the same for indexed expressions (not for plain index columns, for which statistics are available already). You can check the current settings in the system catalogs: basic row counts and per-column settings are in pg_class and pg_attribute.
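For instance, a sketch to inspect the auto-generated expression column of one of the GIN indexes (attstattarget shows -1 while the default applies):

```sql
-- An index is visible in pg_attribute like a table.
SELECT attname, attstattarget
FROM   pg_attribute
WHERE  attrelid = 'users_first_name_gin'::regclass;
```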
Indexes are treated as special tables internally, which should make it less surprising that you can use ALTER TABLE on an index. Use this to increase the sample size for calculating statistics for the index column, then run ANALYZE users to update the statistics. You can't provide explicit column names for index entries; names are chosen automatically, and you can look them up with the query on pg_attribute above. The column name f_unaccent is derived from the name of the function used in the index expression. The default statistics target is 100; the allowed range is 0 to 10000. For very big tables, 100 is often not enough to get reasonable estimates. Set it to 1000 (for example) for the index expression to get better estimates.
Workaround
As dezso commented, you can get the alternative query plan by encapsulating the original query in a CTE (which acts as an optimization barrier in Postgres) and applying ORDER BY and LIMIT in the outer SELECT:
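A sketch of the rewritten query, using the predicates from the question:

```sql
WITH cte AS (
   SELECT *, count(*) OVER () AS filtered_count
   FROM   users
   WHERE  f_unaccent(first_name)       ILIKE f_unaccent('%foo%')
      OR  f_unaccent(last_name)        ILIKE f_unaccent('%foo%')
      OR  f_unaccent(club_or_hometown) ILIKE f_unaccent('%foo%')
   )
SELECT *
FROM   cte
ORDER  BY first_name  -- ORDER BY and LIMIT only in the outer SELECT
LIMIT  10;
```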
Alternative index
Since you commented that you maintain one GIN index per column: a single index for all three would be cheaper. GIN indexes can be multicolumn indexes. Per the documentation, a multicolumn GIN index can be used with query conditions that involve any subset of the index's columns, and search effectiveness is the same regardless of which column(s) the conditions use. So:
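A sketch of the multicolumn variant (hypothetical index name), keeping the f_unaccent() expressions from the existing indexes:

```sql
CREATE INDEX users_search_gin ON users USING gin (
   f_unaccent(first_name::text)       gin_trgm_ops
 , f_unaccent(last_name::text)        gin_trgm_ops
 , f_unaccent(club_or_hometown::text) gin_trgm_ops);
```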
This would reduce the triple overhead per row to a single index entry and should be faster overall. Or, faster yet, concatenate all three columns into a single string and index that. I am adding a space as separator to avoid false positives across column boundaries; use any separator character that is not going to show up in the search expression:
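A sketch of such an index (hypothetical name). Note that concat_ws() is only STABLE, not IMMUTABLE, so it is not allowed in an index expression; the sketch uses COALESCE and || instead to be NULL-safe:

```sql
CREATE INDEX users_full_text_gin ON users USING gin (
   f_unaccent(coalesce(first_name::text, '') || ' ' ||
              coalesce(last_name::text, '') || ' ' ||
              coalesce(club_or_hometown::text, '')) gin_trgm_ops);
```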
If all columns are NOT NULL, you can use plain concatenation instead. Be sure to use the same expression in your query:
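A sketch of the matching query, assuming the index expression concatenates the three columns with COALESCE and a space separator:

```sql
SELECT *, count(*) OVER () AS filtered_count
FROM   users
WHERE  f_unaccent(coalesce(first_name::text, '') || ' ' ||
                  coalesce(last_name::text, '') || ' ' ||
                  coalesce(club_or_hometown::text, '')) ILIKE f_unaccent('%foo%')
ORDER  BY first_name
LIMIT  10;
```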
Set STATISTICS to 1000 or more, as demonstrated above, and ANALYZE before you test again. Be sure to run the query multiple times to compare warm cache to warm cache. Besides the smaller index and faster computation, the main benefit for your case could be that a single predicate is less susceptible to errors in the cost estimation; combining multiple predicates compounds the estimation errors.
Force query plan
If all else fails, you can force your preferred query plan, as I suggested in the comment, by disabling index scans temporarily. Remember that this is an evil hack that may backfire if the underlying data distribution changes or when you upgrade to the next Postgres version:
Use SET LOCAL to confine the effect to a single transaction, and wrap the whole thing in an explicit transaction.
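A sketch of the transaction-scoped version:

```sql
BEGIN;
SET LOCAL enable_indexscan = off;   -- effect ends at COMMIT / ROLLBACK

SELECT *, count(*) OVER () AS filtered_count
FROM   users
WHERE  f_unaccent(first_name)       ILIKE f_unaccent('%foo%')
   OR  f_unaccent(last_name)        ILIKE f_unaccent('%foo%')
   OR  f_unaccent(club_or_hometown) ILIKE f_unaccent('%foo%')
ORDER  BY first_name
LIMIT  10;

COMMIT;
```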