Given this table posts_lists:
            Table "public.posts_lists"
   Column   |         Type          | Collation | Nullable | Default
------------+-----------------------+-----------+----------+---------
 id         | character varying(20) |           | not null |
 user_id    | character varying(20) |           |          |
 tags       | jsonb                 |           |          |
 score      | integer               |           |          |
 created_at | integer               |           |          |
Indexes:
    "tmp_posts_lists_pkey1" PRIMARY KEY, btree (id)
    "tmp_posts_lists_idx_create_at1532588309" btree (created_at)
    "tmp_posts_lists_idx_score_desc1532588309" btree (score_rank(score, id::text) DESC)
    "tmp_posts_lists_idx_tags1532588309" gin (jsonb_array_lower(tags))
    "tmp_posts_lists_idx_user_id1532588309" btree (user_id)
Getting a list by tag is fast:
EXPLAIN ANALYSE
SELECT * FROM posts_lists
WHERE jsonb_array_lower(tags) ? lower('Qui');
Bitmap Heap Scan on posts_lists  (cost=1397.50..33991.24 rows=10000 width=56) (actual time=0.110..0.132 rows=2 loops=1)
  Recheck Cond: (jsonb_array_lower(tags) ? 'qui'::text)
  Heap Blocks: exact=2
  ->  Bitmap Index Scan on tmp_posts_lists_idx_tags1532588309  (cost=0.00..1395.00 rows=10000 width=0) (actual time=0.010..0.010 rows=2 loops=1)
        Index Cond: (jsonb_array_lower(tags) ? 'qui'::text)
Planning time: 0.297 ms
Execution time: 0.157 ms
Getting a list ordered by score, limit 100 – also fast:
EXPLAIN ANALYSE
SELECT *
FROM posts_lists
ORDER BY score_rank(score, id) DESC
LIMIT 100;
Limit  (cost=0.56..12.03 rows=100 width=88) (actual time=0.074..0.559 rows=100 loops=1)
  ->  Index Scan using tmp_posts_lists_idx_score_desc1532588309 on posts_lists  (cost=0.56..1146999.15 rows=10000473 width=88) (actual time=0.072..0.535 rows=100 loops=1)
Planning time: 0.586 ms
Execution time: 0.714 ms
But combining the above two queries is very slow:
EXPLAIN ANALYSE
SELECT * FROM posts_lists
WHERE jsonb_array_lower(tags) ? lower('Qui')
ORDER BY score_rank(score, id) DESC
LIMIT 100;
Limit  (cost=0.56..33724.60 rows=100 width=88) (actual time=2696.965..493476.142 rows=2 loops=1)
  ->  Index Scan using tmp_posts_lists_idx_score_desc1532588309 on posts_lists  (cost=0.56..3372404.39 rows=10000 width=88) (actual time=2696.964..493476.139 rows=2 loops=1)
        Filter: (jsonb_array_lower(tags) ? 'qui'::text)
        Rows Removed by Filter: 9999998
Planning time: 0.426 ms
Execution time: 493476.190 ms
Why? How to improve the efficiency of the query?
Definition of the two functions used above:
create or replace function score_rank(score integer, id text)
returns text as $$
select case when score < 0
then '0' || lpad((100000000 + score) :: text, 8, '0') || id
else '1' || lpad(score :: text, 8, '0') || id
end
$$
language sql immutable;
create or replace function jsonb_array_lower(arr jsonb)
returns jsonb as $$
SELECT jsonb_agg(lower(elem))
FROM jsonb_array_elements_text(arr) elem
$$
language sql immutable;
Best Answer
Sorting and paging
Your function score_rank() produces a text from an integer score and the appended PK id. That's not helpful for sorting. Replace it completely; I suspect you do not need it at all. Instead, use the two columns score and id directly for sorting.

Replace your index tmp_posts_lists_idx_score_desc1532588309 with a smaller, faster, cheaper-to-maintain, more versatile index on (score DESC, id DESC).

You can also base pagination on this multicolumn index efficiently with row value comparison.
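A sketch of both suggestions, using the table and index names from the question (the new index name and the :last_score / :last_id placeholders are made up for illustration):

```sql
-- Replace the expression index with a plain multicolumn index:
DROP INDEX tmp_posts_lists_idx_score_desc1532588309;
CREATE INDEX posts_lists_score_id_idx ON posts_lists (score DESC, id DESC);

-- First page:
SELECT * FROM posts_lists
ORDER BY score DESC, id DESC
LIMIT  100;

-- Next page: keyset pagination with a row value comparison,
-- passing score and id of the last row from the previous page:
SELECT * FROM posts_lists
WHERE  (score, id) < (:last_score, :last_id)
ORDER  BY score DESC, id DESC
LIMIT  100;
```

The row value comparison (score, id) < (...) can use the same multicolumn index, so each page starts where the previous one left off instead of skipping over an ever-growing OFFSET.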
You later mentioned a new function concatenating strings with base256 etc. All that smart trickery is not going to increase performance. Sorting on an integer is faster than sorting on strings in Postgres. Using integer (or bigint) instead of varchar(20) would actually help in multiple ways.

Statistics and query plan (a.k.a.: Why?)
The main issue is the lack of statistics for values nested in the jsonb column. Postgres consequently sometimes misjudges the selectivity of the predicate jsonb_array_lower(tags) ? lower('Qui') and chooses a bad query plan. In your example with LIMIT 2, the logic of the query planner can be illustrated like this - let's call it "Plan 1":

    Only two rows with the highest scores? Let's scan the index posts_lists_idx_score_desc starting with the highest scores. With any luck we'll have the result in no time!

It's a reasonable plan for most cases with at least moderately common tags. But the tag 'qui' turns out to be very rare, and with low scores, too. The worst case. Postgres ends up scanning close to 4 million rows, just to keep 2. A colossal waste of time.
If the query planner had any idea how rare that tag actually is, it would start with the other index posts_lists_idx_tags, like we see in your second example with LIMIT 100 - let's call it "Plan 2":

    Find matching rows, then sort by score and take the top N.
Plan 1 is more favorable the smaller the LIMIT and the more frequent the tag. (And if qualifying rows happen to sort on top.) Plan 2 is more favorable the bigger the LIMIT and the less frequent the tag.

Postgres currently has no statistics about nested values in document types like jsonb, and no combined frequencies at all.

Update: "Combined statistics" became possible with CREATE STATISTICS in Postgres 10.

Whatever else you do, be sure to run the latest version of Postgres. The planner gets smarter with every release.
Alternatives
1. One idea might be to use a Postgres array (text[]) and array operators instead of the jsonb column, to get some statistics for the most common elements: see the columns most_common_elems, most_common_elem_freqs, and elem_count_histogram in the system view pg_stats. That helps Postgres generate better query plans for some constellations, but it's no silver bullet. For starters, only the most common elements are stored; Postgres still doesn't know about the rarest elements.
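A sketch of that idea, assuming a new text[] column tags_arr (the column and index names are made up) that replaces or mirrors the jsonb column:

```sql
-- Add an array column and fill it with lower-cased tags:
ALTER TABLE posts_lists ADD COLUMN tags_arr text[];

UPDATE posts_lists
SET    tags_arr = ARRAY(SELECT lower(elem)
                        FROM   jsonb_array_elements_text(tags) elem);

-- GIN index on the array column:
CREATE INDEX posts_lists_tags_arr_idx ON posts_lists USING gin (tags_arr);

-- Array containment can use the GIN index, and ANALYZE now
-- gathers element statistics in pg_stats:
SELECT * FROM posts_lists
WHERE  tags_arr @> ARRAY[lower('Qui')]
ORDER  BY score DESC, id DESC
LIMIT  100;
```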
2. Normalize your db design and move tags to a separate 1:n table with a single tag per row. That increases the disk footprint because of the added row overhead per tag. (But changing tags becomes much cheaper, with less table bloat.) If your tags are stable, consider a full n:m relationship between posts_lists and a new tags table. That's a bit smaller for lots of common tags, too. And it's the "clean" way: you get more detailed statistics and should see fewer bad query plans.

3. Since Postgres 10 there is a variant of to_tsvector() that processes json(b) values, so it's simple to create a text search index and work with text search operators now.
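A sketch of the index and a matching query, assuming Postgres 10 or later (the index name is made up):

```sql
-- Full text search index over all string values in the jsonb array:
CREATE INDEX posts_lists_tags_fts_idx ON posts_lists
USING gin (to_tsvector('simple', tags));

-- Query with a text search operator; the predicate must repeat
-- the indexed expression:
SELECT * FROM posts_lists
WHERE  to_tsvector('simple', tags) @@ 'qui'::tsquery
ORDER  BY score DESC, id DESC
LIMIT  100;
```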
Be sure to use the simple dictionary. You don't want stemming, which is built into most other dictionaries. Text search functions produce lower-case output; it all works case-insensitively by design. No need for processing like in your original function jsonb_array_lower().

4. While sticking with a jsonb index, try the more specialized jsonb_path_ops operator class.
The manual has details on jsonb_path_ops. But I do not expect much for your particular case.
5. Use a regime of "granulated" indexes, combined with a procedural solution. See:
db<>fiddle here - with a number of tests ...