Postgresql – Does index occupy disk memory for rows for which index expression is NULL

postgresqlpostgresql-9.4

Let's assume that I have following pg_trgm index on mydata JSONB column:

CREATE TABLE mytable (mydata JSONB);
CREATE INDEX trgm_index_1 ON mytable
USING gin ((mydata#>>'{myfield}') gin_trgm_ops);

And let's assume that I will insert following data into mytable:

500k rows with mydata = { "myfield": "some_text" }
500k rows with mydata = { "another_field": "some_text" }

So 1M rows in total, but half of them contains value that is used in index expression (mydata#>>'{myfield}'). Does it mean that index will use ~50% less memory on disk than it would use if all of rows contained myfield json field?

Best Answer

Let's simply try it!

I've created the table from your question, then did the following:

INSERT INTO mytable 
SELECT json_object(ARRAY['myfield', i::text || 'bla'])::jsonb 
FROM generate_series (1, 100000) t(i);

\di+ trgm_index_1

 Schema │     Name     │ Type  │ Owner  │  Table  │ Size  │ Description 
────────┼──────────────┼───────┼────────┼─────────┼───────┼─────────────
 test   │ trgm_index_1 │ index │ avaczi │ mytable │ 19 MB │

So far, we have only rows where myfield is present. Let's now add some other rows, too:

INSERT INTO mytable 
SELECT json_object(ARRAY['other_field', i::text || 'bla'])::jsonb 
FROM generate_series (100001, 200000) t(i);

After this, the reported index size has changed to 22 MB, meaning that the new rows are also accounted for. If you want to exclude them, you have to create a partial index:

CREATE INDEX trgm_index_partial ON mytable
USING gin ((mydata#>>'{myfield}') gin_trgm_ops)
WHERE mydata ? 'myfield';

The size of this one is initially the same (19 MB) as the non-partial index, but after adding the second group of rows, it does not change:

\di+ trgm_index_*

 Schema │        Name        │ Type  │ Owner  │  Table  │ Size  │ Description 
────────┼────────────────────┼───────┼────────┼─────────┼───────┼─────────────
 test   │ trgm_index_1       │ index │ avaczi │ mytable │ 22 MB │ 
 test   │ trgm_index_partial │ index │ avaczi │ mytable │ 19 MB │

Assessment

In your last query, the bitmap index scan looking for 'hat' produces 307 hits.
Postgres then runs a bitmap heap scan to filter merchants similar enough ( similarity(...) > 0.2), producing 12 rows. Your test is with 30K rows, so your real life query will produce around 300 times as many hits, 90k / 3.5k for the test case at hand. An additional index on merchant will help.

Advice

I suggest you create an additional trigram index for the similarity search. Be sure to read the chapter in the manual about trigram index support. We need the additional module pg_trgminstalled (like you obviously have).

For your first request:

How can I search for a query like 'WALMART BAGS' which will first return me product BAG with merchant WALMART and then BAGS from other merchants.

I suggest this query using the similarity operator %:

-- SELECT set_limit(0.2)  -- Adjust similarity operator only if needed

SELECT *
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('bag')
AND    merchant % 'walmart'
ORDER  BY merchant <-> 'walmart'
--    LIMIT  n; -- possibly limit to top n results

Again, you can choose between GiST and GIN, but this time GiST carries a decisive advantage:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

Therefore, I suggest this index:

CREATE INDEX prod_merchant_trgm_idx ON products USING gist (merchant gist_trgm_ops);

As for your second request:

Can I have both GIN and GIST index working for me?

Yes, you can. It would hardly make sense to have both types for the same (combination of) column(s), but Postgres can combine GiST and GIN indices very well in the same query. I quote the excellent manual yet again, on Combining Multiple Indexes:

To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. Finally, the actual table rows are visited and returned. The table rows are visited in physical order, because that is how the bitmap is laid out; this means that any ordering of the original indexes is lost, and so a separate sort step will be needed if the query has an ORDER BY clause. For this reason, and because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well.

PostgreSQL – JSONB with Indexing vs Hstore

Relational databases are designed around joins, and optimized to do them well.

Unless you have a good reason not to use a normalized design, use a normalised design.

jsonb and things like hstore are good for when you can't use a normalized data model, such as when the data model changes rapidly and is user defined.

If you can model it relationally, model it relationally. If you can't, consider json etc. If you're choosing between json/jsonb/hstore, generally choose jsonb unless you have a reason not to.

That's what I said in my blog post, which addresses just this topic. Please read the whole post. The paragraph you quoted points out that if you're choosing a dynamic structure you should choose jsonb over hstore, but the rest of the blog post is about why you should usually prefer to model relationally if you can.

So. Model the main structured part relationally. If the tables are really wide with lots of columns, this might be a sign that further normalization is required. Do not be afraid of joins. Learn to love joins. Joining many small tables will often be faster than querying and maintaining big denormalized tables. Denormalize only if you need to for specific cases, and preferably via materialized views ... but don't do it until you know you need to and have an actual concrete problem to solve.

For user-contributed data that's freeform and unstructured, use jsonb. It should perform as well as hstore, but it's more flexible and easier to work with.

One relevant thing to understand: GiST and GIN indexes like those used on jsonb are generally much less efficient than a plain b-tree index. They're more flexible, but a b-tree index on a normal column will almost always be much, much faster.

Best Answer

Related Solutions

PostgreSQL Optimization – FTS and Trigram-Similarity Query Optimization

Assessment

Advice

PostgreSQL – JSONB with Indexing vs Hstore

Related Question