Postgresql – Improving indexing of jsonb data when the json structure is well defined

indexindex-tuningjsonpostgresqlview

I have a simple table records that contains a primary key (id) and a jsonb field (data). The structure of the jsonb is consistent across all rows. I have created a view (metadata) for this table which extracts a particular value category, which I can then run queries against.

CREATE VIEW metadata AS
SELECT id as id,
       data -> 'some_key' -> 'some_array' ->> 0 as category
FROM records;

(the data structure is not controlled by me)

I can use this VIEW to perform queries like:

SELECT category
FROM metadata
WHERE category = 'category_0';

and it behaves as expected.

I was experiencing slow performance out of the above query (and similar equality queries), so I added an index as follows:

CREATE INDEX metadata_category_idx ON records ((data -> 'some_key' -> 'some_array ->> 0));

This has not improved my performance as I would have expected when I query for a particular category using string equality. It should be noted that for "categories" with a small number of rows, EXPLAIN ANALYZE tells me its using the index, however for "categories" with a large number of rows, it falls back to sequential searching. There are approximately 350k rows in my table. I've tried each type of index (i.e. GIN, GIST) with no measured benefits.

I intend to strictly use string equality when querying this view, and I have little control over the structure of the json object in data.

What would be the best way to construct my index such that the above query can run more efficiently?

Best Answer

If category were a regular column and not an expression, PostgreSQL would choose an index-only scan, which would be much faster if the table has been vacuumed recently. But index-only scans are not supported on expression indexes like that, see the documentation:

In principle, index-only scans can be used with expression indexes. For example, given an index on f(x) where x is a table column, it should be possible to execute

SELECT f(x) FROM tab WHERE f(x) < 1;

as an index-only scan; and this is very attractive if f() is an expensive-to-compute function. However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index. In this example, x is not needed except in the context f(x), but the planner does not notice that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently worthwhile, this can be worked around by adding x as an included column, for example

CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);

An additional caveat, if the goal is to avoid recalculating f(x), is that the planner won't necessarily match uses of f(x) that aren't in indexable WHERE clauses to the index column. It will usually get this right in simple queries such as shown above, but not in queries that involve joins. These deficiencies may be remedied in future versions of PostgreSQL.

So you would jave to include the JSON column in the index to get an index-only scan. This may be impossible, if the size of the JSON is too big. Otherwise, it may still improve the speed of the query, even though such a large index will we much less efficient.

Essentially, you are suffering from a bad data model. Given that the structure of the JSONs is the same for all rows, that should certainly have been designed using normal table columns and avoiding jsonb. Then the problem would not exist.