Postgresql – Which is more efficient for searches on JSON data in Postgres: GIN or multiple indexed columns

indexjsonpostgresqlpostgresql-9.6

For example, say I have a table with a medium number of rows (~100,000 or so) that has a jsonb column with the following example data in one of the rows:

{"name":"Bob", "favoriteColor":"red", "someOtherObject": {"somethingElse": true}}

Is it more efficient to:

Use a GIN index on the jsonb column, then use Postgres' built-in JSON query functions to pull out data
Create several columns to represent things that can be searched for i.e. one column for name, one column for favoriteColor, etc; build b-tree indices on those columns and then run a select document from table where name = 'Bob'
Some other solution?

Keep in mind the solution needs to efficiently support like queries to be able to search for values that start with a given input string.

Best Answer

The advantage of JSON is versatility: you can add any keys without changing the table definition. And maybe convenience, if your application can read and write JSON directly.

Separate columns beat a combined json or jsonb column in every performance aspect and in several other aspects, too: More sophisticated type system, the full range of functionality (check, unique, foreign key constraints, default values, etc.), the table is smaller, indexes are smaller, queries are faster.

For prefix matching on text columns you might use a text_pattern_ops index:

Why would you index text_pattern_ops on a text column?

Or, more generally, a trigram index supporting any LIKE patterns:

While you stick with JSON (jsonb in particular), there are also different indexing strategies. GIN or Btree is not the only decision to make. Partial indexes, expression indexes, different operator classes (in particular: jsonb_path_ops) Related:

How to get particular object from jsonb array in PostgreSQL?

Figure out type dynamically

This is the more interesting part of your question:

the type of age in the JSON document is number anyway, so why can't PostgreSQL figure out that by itself?

SQL is a strictly typed language, it does not allow the same expression to evaluate to integer in one row and to text in the next. But since you are only interested in the boolean result of the test, you can get around this restriction with a CASE expression that forks depending on the result of jsonb_typeof():

SELECT data->'name'
FROM   persons
WHERE  CASE jsonb_typeof(data->'age')
        WHEN 'number'  THEN (data->>'age')::numeric > '25' -- treated as numeric
        WHEN 'string'  THEN data->>'age' > 'age_level_3'   -- treated as text
        WHEN 'boolean' THEN (data->>'age')::bool           -- use boolean directly (example)
        ELSE FALSE                                         -- remaining: array, object, null
       END;

An untyped string literal to the right of the > operator is coerced to the respective type of the value to the left automatically. If you put a typed value there, the type has to match or you have to cast it explicitly - unless there is adequate implicit cast registered in the system.

If you know that all numeric values are actually integer, you can also:

... (data->>'age')::int > 25 ...

Postgresql – Inconsistent statistics on jsonb column with btree index

Currently (version 9.6), Postgres does not have any statistics about the internals of document types like json, jsonb, xml or hstore. (There has been discussion whether and how to change that.) Instead, the Postgres query planner uses constant default frequency estimates (like you observed).

However, there are separate statistics for functional indexes like your idx_test_btree. The manual has this tip for you:

Tip: Although per-column tweaking of ANALYZE frequency might not be very productive, you might find it worthwhile to do per-column adjustment of the level of detail of the statistics collected by ANALYZE. Columns that are heavily used in WHERE clauses and have highly irregular data distributions might require a finer-grain data histogram than other columns. See ALTER TABLE SET STATISTICS, or change the database-wide default using the default_statistics_target configuration parameter.

Also, by default there is limited information available about the selectivity of functions. However, if you create an expression index that uses a function call, useful statistics will be gathered about the function, which can greatly improve query plans that use the expression index.

The volume of statistics gathered depends on general setting of default_statistics_target, which can be overruled with a per-column setting. The setting for the column automatically covers depending indexes.

The default setting of 100 is conservative. For your test with 1M rows, if data distribution is uneven, it may help to increase it substantially. Checking on this once more I found you can actually tweak the statistics target per index column with ALTER INDEX, which is currently not documented. See related discussion on pgsql-docs.

ALTER TABLE idx_test_btree ALTER int4 SET STATISTICS 2000;  -- max 10000, default 100

Default names for index columns are not exactly intuitive, but you can look it up with:

SELECT attname FROM pg_attribute WHERE attrelid = 'idx_test_btree'::regclass

Should result in the type name int4 as index column name for your case.

The best setting for STATISTICS depends on several factors: data distribution, data type, update frequency, characteristics of typical queries, ...

Internally, this sets the value of pg_attribute.attstattarget, and the exact meaning of this is (per documentation):

For scalar data types, attstattarget is both the target number of "most common values" to collect, and the target number of histogram bins to create.

Then run ANALYZE if you don't want to wait for autovacuum to kick in:

ANALYZE test_data;

You must ANALYZE the table, since you cannot ANALYZE indexes directly. Check with (before and after if you want to verify the effect):

SELECT * FROM pg_statistic WHERE starelid = 'idx_test_btree'::regclass;

Try your query again ...

Best Answer

Related Solutions

Postgresql – Querying JSONB in PostgreSQL

Figure out type dynamically

Postgresql – Inconsistent statistics on jsonb column with btree index

Related Question