PostgreSQL (big data): search by tags and order by timestamp

Tags: index, postgresql, postgresql-11, postgresql-performance

We need to add a search feature on a large table (200M+ rows):

item_id | tags                          | created_at          | ...
-------------------------------------------------------------------
1       | ['tag1', 'bar2']              | 2020-01-06 12:43:32 |
2       | ['example5', 'tag9', 'foo2']  | 2020-01-10 10:40:00 |
3       | ['test1', 'tag5']             | 2020-01-11 12:43:32 |
...
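For concreteness, a minimal schema matching the sample above might look like this (column types are assumptions inferred from the sample data and the `::varchar[]` cast in the query below):

```sql
-- Illustrative schema; types are inferred, not from the original post
CREATE TABLE items (
    item_id    bigint PRIMARY KEY,
    tags       varchar[],               -- array of tag strings
    created_at timestamp NOT NULL
);
```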

The queries would be similar to this one:

SELECT * FROM items 
WHERE tags @> ARRAY['t2', 't5']::varchar[]
ORDER BY created_at DESC
LIMIT 100;

Basically, it's like searching logs by tags and ordering them by timestamp. It seems like a common scenario.

What index should we use? Have you ever tested something similar in production?

  • Option 1: create a GIN index on tags. The problem is that the search may return millions of rows, and to apply the ORDER BY / LIMIT the database has to read created_at from the table on disk for every one of those rows, i.e. millions of heap reads.
  • Option 2: install the btree_gin extension and create a composite GIN index on created_at and tags. The problem is the same as above: I believe PostgreSQL cannot use the index for ordering, because it is a GIN index rather than a B-tree, and GIN does not return entries in sorted order.
  • Option 3: create a B-tree index on created_at and tags. PostgreSQL would need to scan the whole index, since B-tree does not support the array containment operator @>. I also fear that, because of the SELECT *, PostgreSQL will not use an index-only scan, again resulting in millions of heap reads (most of them useless, since only 100 rows are actually needed).
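For reference, the three candidate indexes described above could be declared like this (index names are illustrative; option 2 requires the btree_gin extension):

```sql
-- Option 1: GIN index on the array column
CREATE INDEX items_tags_gin ON items USING gin (tags);

-- Option 2: composite GIN index via btree_gin
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX items_created_tags_gin ON items USING gin (created_at, tags);

-- Option 3: plain composite B-tree index
CREATE INDEX items_created_tags_btree ON items (created_at, tags);
```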

Best Answer

There are two approaches:

  1. create an index on the array:

    CREATE INDEX ON items USING gin (tags);
    

    That allows the database to find the matching rows quickly, but it then has to fetch all of them and perform a top-N sort to keep the 100 newest.

  2. create a B-tree index on created_at:

    CREATE INDEX ON items (created_at);
    

    That will allow the database to avoid the sort by scanning the index in descending created_at order, but it has to read and discard every row that doesn't match the tag condition until it has found 100 matches.

Unfortunately, the two strategies are mutually exclusive, and which one is faster depends on the data: if the tag combination is rare, the GIN index tends to win; if it matches a large fraction of the table, the B-tree index does. You'll have to experiment.
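One way to run that experiment is to build both indexes on a copy of the data and compare the actual plans and buffer usage for a representative query, for example:

```sql
-- Shows which index the planner picks, actual row counts, and I/O
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM items
WHERE tags @> ARRAY['t2', 't5']::varchar[]
ORDER BY created_at DESC
LIMIT 100;
```

Running this with both rare and common tag combinations should reveal where the crossover point lies for your data.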