Description
PostgreSQL 9.6 on Linux. The `tags_tmp` table is ~30 GB (10 million rows); `tags` is a `text[]` and has only 6 distinct values.
```
tags_tmp(id int, tags text[], maker_date timestamp, value text)

 id | tags    | maker_date | value
----+---------+------------+-------------
  1 | {a,b,c} | 2016-11-09 | This is test
  2 | {a}     | 2016-11-08 | This is test
  3 | {b,c}   | 2016-11-07 | This is test
  4 | {c}     | 2016-11-06 | This is test
  5 | {d}     | 2016-11-05 | This is test
```
I need to retrieve data with a filter on `tags` as well as `ORDER BY maker_date DESC`. Can I create an index on both the `tags` and `maker_date DESC` columns? If not, could you suggest other ideas?
Query example
```sql
select id, tags, maker_date, value
from tags_tmp
where tags && array['a','b']
order by maker_date desc
limit 5 offset 0;
```
SQL code:
```sql
create index idx1 on tags_tmp using gin (tags);
create index idx2 on tags_tmp using btree (maker_date desc);

explain (analyse on, costs on, verbose)
select id, tags, maker_date, value
from tags_tmp
where tags && array['funny','inspiration']
order by maker_date desc
limit 5 offset 0;
```
Explain result:
```
Limit  (cost=233469.63..233469.65 rows=5 width=116) (actual time=801.482..801.483 rows=5 loops=1)
  Output: id, tags, maker_date, value
  ->  Sort  (cost=233469.63..234714.22 rows=497833 width=116) (actual time=801.481..801.481 rows=5 loops=1)
        Output: id, tags, maker_date, value
        Sort Key: tags_tmp.maker_date DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Bitmap Heap Scan on public.tags_tmp  (cost=6486.58..225200.81 rows=497833 width=116) (actual time=212.982..696.650 rows=366392 loops=1)
              Output: id, tags, maker_date, value
              Recheck Cond: (tags_tmp.tags && '{funny,inspiration}'::text[])
              Heap Blocks: exact=120034
              ->  Bitmap Index Scan on idx1  (cost=0.00..6362.12 rows=497882 width=0) (actual time=171.742..171.742 rows=722612 loops=1)
                    Index Cond: (tags_tmp.tags && '{funny,inspiration}'::text[])
Planning time: 0.185 ms
Execution time: 802.128 ms
```
More information
I tested using a partial index for only one tag; of course, it's faster. But I have many tags, for example: `create index idx_tmp on tags_tmp using btree (maker_date desc) where tags && array['tag1'] or tags && array['tag2'] or ... or tags && array['tag6']`. And I tested `tags && array['tag1']` against `'tag1' = any(tags)`; performance is the same.
- `text[]` has only 6 values: `a, b, c, d, e, f`. For example: `tags={a,b,c}`, `tags={a}`, `tags={a,c}`, `tags={a,b,c,d,e,f}`, `tags={b,f}` and so on. But it cannot have values `g`-`z`, `A`-`Z`, etc.
- `create table tags_tmp(id int primary key not null, tags text[] not null, maker_date timestamp not null, value text)`
- In terms of distinct array values: rows whose `tags` contain `a` make up 20% of the table (`where 'a' = any(tags)`), `b` = 20% (`where 'b' = any(tags)`), `c` = 20%, `d` = 20%, `e` = 10%, `f` = 10%.
- In addition, `(tags, maker_date)` is not unique.
- This table is not read-only.
- The sort is on a timestamp, but my example shows dates, sorry about that.
Current situation: `tags = 'a' or tags = 'b' or tags = 'c'` and more.

(1) With a GIN index, or converting `text[]` to `int[]`, or converting `text[]` to `int`, and more, it will use a bitmap index scan for multiple tags.

Finally, after testing, I decided to use the old solution: change the `OR` into many `UNION` clauses, where each `UNION` branch limits the number of rows. Of course, I will create a partial index for each tag value, and I can combine this with (1) above. In terms of `OFFSET`, it will use one or more conditions in the `WHERE` clause instead.
Example
```sql
EXPLAIN (ANALYSE ON, COSTS ON, VERBOSE)
SELECT rs.*
FROM (
  (SELECT tags, id, maker_date
   FROM   tags_tmp
   WHERE  'a' = any(tags)
   AND    maker_date <= '2016-03-28 05:43:57.779528'::TIMESTAMP
   ORDER  BY maker_date DESC LIMIT 5)
  UNION
  (SELECT tags, id, maker_date
   FROM   tags_tmp
   WHERE  'b' = any(tags)
   AND    maker_date <= '2016-03-28 05:43:57.779528'::TIMESTAMP
   ORDER  BY maker_date DESC LIMIT 5)
  UNION
  (SELECT tags, id, maker_date
   FROM   tags_tmp
   WHERE  'c' = any(tags)
   AND    maker_date <= '2016-03-28 05:43:57.779528'::TIMESTAMP
   ORDER  BY maker_date DESC LIMIT 5)
) rs
ORDER BY rs.maker_date DESC LIMIT 5;
```
Best Answer
General considerations
Index optimization always depends on the complete picture: table size, row size, cardinalities, value frequencies, selectivity of typical queries, Postgres version, typical access patterns, etc.
Your case is particularly difficult for two reasons: you combine a filter on an array (`WHERE`) with `ORDER BY` on a timestamp. A filter on an array is most efficient with a GIN or GiST index, but neither index type produces sorted output. The manual notes that of the currently supported index types, only B-tree can produce sorted output; the other index types return matching rows in an unspecified, implementation-dependent order.
You can create a multicolumn GIN index on `(tags, maker_date)` or even more columns (the order of index columns is irrelevant for GIN indexes), but you need the additional module `btree_gin` installed. And it's not going to help for the `ORDER BY` component of your problem.
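The installation instructions referenced in the original were omitted here; they boil down to installing the contrib module and creating the index, as in this sketch (index name illustrative):

```sql
-- btree_gin is a standard contrib module shipped with PostgreSQL
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- multicolumn GIN index; column order does not matter for GIN
CREATE INDEX tags_tmp_gin_idx ON tags_tmp USING gin (tags, maker_date);
```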
Note that `OFFSET m LIMIT n` is typically almost as expensive as `LIMIT m+n`.

Solution for added specifications
You clarified: only 6 distinct tags possible. That's crucial.
Your table is big and your table definition leaves room for improvement. Size matters for big tables. Your numbers (30 GB, 10 million rows) suggest a big average row size of ~3 KB. Either you have more columns than you show, or the table is bloated and needs a `VACUUM FULL` run (or similar), or your `value` column is big and TOASTed, which would make my improvements even more effective, since the main relation is cut down to half its size or less. The order of columns is relevant because of alignment padding.
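The rewritten table definition was omitted from the post; a minimal sketch matching the description (column order chosen to avoid alignment padding, `tags` replaced by an `integer` bitmap):

```sql
CREATE TABLE tags_tmp (
   id         int PRIMARY KEY
 , tags       int NOT NULL        -- bitmap with room for up to 31 distinct tags
 , maker_date timestamp NOT NULL
 , value      text
);
```

With `int` + `int` packing into 8 bytes before the 8-byte timestamp, no padding is wasted between columns.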
More importantly, this: `tags int`. Why?

Arrays have a sizeable overhead of 24 bytes (similar to a row), plus the actual items. So a `text[]` with 1-6 items like you demonstrate (`'funny'`, `'inspiration'`, ...) occupies ~56 bytes on average. And 6 distinct values can be represented by only 6 bits of information (assuming the sort order of the array is irrelevant). We could compress even more, but I chose the convenient `integer` type (4 bytes), which provides space for up to 31 distinct tags. That leaves room for later additions without changes to the table schema.

Your tags map to bits in a bitmap, `'a'` being the least significant bit (to the right). So the tag array `'{a,d,f}'` maps to `41` (1 + 8 + 32). You can use any arbitrary strings instead of 'a'-'f', it doesn't matter.
. You can use any arbitrary strings instead of 'a'-'f', doesn't matter.To encapsulate the logic I suggest two auxiliary functions, easily expandable:
tags -> integer:
integer -> tags:
Basics here:
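The function bodies were omitted from the post; below is a minimal sketch under the stated assumptions (a fixed tag list `{a,b,c,d,e,f}`; the names `tags2int` and `int2tags` are illustrative):

```sql
-- text[] -> integer bitmap ('a' = least significant bit)
CREATE OR REPLACE FUNCTION tags2int(text[])
  RETURNS int
  LANGUAGE sql IMMUTABLE AS
$func$
SELECT COALESCE(bit_or(1 << (array_position('{a,b,c,d,e,f}'::text[], t) - 1)), 0)
FROM   unnest($1) AS t
$func$;

-- integer bitmap -> text[]
CREATE OR REPLACE FUNCTION int2tags(int)
  RETURNS text[]
  LANGUAGE sql IMMUTABLE AS
$func$
SELECT array_agg(tag ORDER BY ord)
FROM   unnest('{a,b,c,d,e,f}'::text[]) WITH ORDINALITY AS u(tag, ord)
WHERE  $1 & (1 << (ord::int - 1)) <> 0
$func$;
```

With these, `tags2int('{a,d,f}')` yields `41` (1 + 8 + 32) and `int2tags(41)` yields `{a,d,f}`. Both `array_position()` and `WITH ORDINALITY` are available in Postgres 9.6.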
For convenience, you can add a view to display tags as text array like you had it:
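The view definition was omitted; a sketch, assuming a decoder function `int2tags(int)` (illustrative name) that turns the bitmap back into a `text[]`:

```sql
CREATE VIEW tags_tmp_v AS
SELECT id, int2tags(tags) AS tags, maker_date, value
FROM   tags_tmp;
```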
Now your basic query can be:
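The query itself was omitted; a sketch, assuming the integer bitmap column where bit 1 = 'a' and bit 2 = 'b', so the mask for 'a' OR 'b' is 3:

```sql
SELECT id, tags, maker_date, value
FROM   tags_tmp
WHERE  tags & 3 <> 0          -- any of bits 'a' (1) or 'b' (2) set
ORDER  BY maker_date DESC
LIMIT  5;
```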
Using the binary AND operator `&`. There are more operators to manipulate the column; `get_bit()` and `set_bit()` from the binary string operators are also convenient.

The above query should be faster already, for the smaller size and cheaper operators alone, but nothing revolutionary yet. To make it fast we need indexes, and the above cannot use an index yet.
One partial index for every tag:
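The index definitions were omitted; a sketch with one partial btree index per tag bit (index names illustrative):

```sql
CREATE INDEX tags_tmp_tag_a_idx ON tags_tmp (maker_date DESC) WHERE tags & 1 <> 0;  -- 'a'
CREATE INDEX tags_tmp_tag_b_idx ON tags_tmp (maker_date DESC) WHERE tags & 2 <> 0;  -- 'b'
-- ... one per tag, up to:
CREATE INDEX tags_tmp_tag_f_idx ON tags_tmp (maker_date DESC) WHERE tags & 32 <> 0; -- 'f'
```

Each index only contains rows carrying its tag, and delivers them pre-sorted by `maker_date DESC`.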
This query is equivalent to the above, but can utilize the indexes:
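The omitted query presumably spelled out the bit tests as an `OR` chain, so the planner can match each branch against a partial index predicate; a sketch:

```sql
SELECT id, tags, maker_date, value
FROM   tags_tmp
WHERE  tags & 1 <> 0 OR tags & 2 <> 0   -- same predicates as the partial indexes
ORDER  BY maker_date DESC
LIMIT  5;
```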
Postgres can combine several bitmap index scans in a `BitmapOr` step very efficiently, as demonstrated in this SQL Fiddle.

You could add another index condition limiting the indexes to `maker_date` greater than some constant timestamp (and repeat the verbatim condition in queries) to cut their size down massively.
Other related answers:
How to speed up ORDER BY sorting when using GIN index in PostgreSQL?
Can spatial index help a "range - order by - limit" query
Or just 6 boolean columns ...

Simply 6 `boolean` columns might be an even better choice. There are some pros & cons to either solution ...
You might define the flags `NOT NULL`, depending on your complete use case.

Consider:
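The table definition was omitted; a sketch of the 6-flag layout (column names `tag_a` .. `tag_f` are illustrative):

```sql
CREATE TABLE tags_tmp (
   id         int PRIMARY KEY
 , maker_date timestamp NOT NULL
 , tag_a      bool NOT NULL DEFAULT false
 , tag_b      bool NOT NULL DEFAULT false
 , tag_c      bool NOT NULL DEFAULT false
 , tag_d      bool NOT NULL DEFAULT false
 , tag_e      bool NOT NULL DEFAULT false
 , tag_f      bool NOT NULL DEFAULT false
 , value      text
);
```

Six `boolean` columns occupy only 6 bytes per row, and Postgres keeps per-column statistics for each flag.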
Partial indexes simply:
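A sketch, assuming the boolean flag columns just described (illustrative names):

```sql
CREATE INDEX tags_tmp_flag_a_idx ON tags_tmp (maker_date DESC) WHERE tag_a;
CREATE INDEX tags_tmp_flag_b_idx ON tags_tmp (maker_date DESC) WHERE tag_b;
-- ... one per flag
```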
Etc.
Alternative for your special case

Thinking some more: since all of your few tags are so common, and combining multiple tags with OR is even less selective, it will be fastest to have only a btree index on `maker_date DESC`. Postgres can traverse the index and filter qualifying rows on tags. This works in combination with separate boolean columns instead of the array or the encoded integer, because Postgres has more useful column statistics for separate columns.

And then:
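The omitted index and query could look like this (a sketch, again assuming the separate-boolean-columns variant with illustrative names):

```sql
CREATE INDEX tags_tmp_maker_date_idx ON tags_tmp (maker_date DESC);

SELECT id, maker_date, value
FROM   tags_tmp
WHERE  tag_a OR tag_b            -- filtered cheaply while walking the index
ORDER  BY maker_date DESC
LIMIT  5;
```

Since ~20% of rows match each common tag, the index scan finds 5 qualifying rows after inspecting only a handful of entries from the top.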
Paging

You need an unambiguous sort order for result sets to make paging work. I did not bother in this answer, it's too long already. Typically you would add more columns to `ORDER BY` to break ties (e.g. the unique `id` as a last sort key).