PostgreSQL – Integer Arrays and Index for Equality

btreegist-indexpostgresql

I have a huge list of integer arrays (300,000,000 records) stored in Postgres 9.2 DB. I want to efficiently search these records for an exact match (equality only). I have heard of the intarray module and the corresponding gist-gin indexes. I would like to ask the following questions:

Does PostgreSQL uses a hash function for checking equality of integer arrays or does it perform a brute-force algorithm comparing one-by-one the elements of the array?
If PostgreSQL uses a hash function, is there some PostgreSQL function code to actually get the hash function result for a particular array?
Which index will be best for such a task? B-tree, or the gist – gin indexes provided by the intarray module? The dataset will be static, i.e, once all records are inserted there will no more insertions. So, building index / updating index time is not important to me.

Best Answer

1) as you already have discovered, you can't use b-tree as the index size is bigger than the page size

2) given:

As a rule of thumb, a GIN index is faster to search than a GiST index, but slower to build or update; so GIN is better suited for static data and GiST for often-updated data.

You would have to use GIN. And no, GIN doesn't use hash functions nor a brute-force algorithm. It's a reverse index:

A GIN index stores a set of (key, posting list) pairs, where a posting list is a set of row IDs in which the key occurs. The same row ID can appear in multiple posting lists, since an item can contain more than one key. Each key value is stored only once, so a GIN index is very compact for cases where the same key appears many times.

Internally, a GIN index contains a B-tree index constructed over keys, where each key is an element of one or more indexed items (a member of an array, for example)

Assessment

In your last query, the bitmap index scan looking for 'hat' produces 307 hits.
Postgres then runs a bitmap heap scan to filter merchants similar enough ( similarity(...) > 0.2), producing 12 rows. Your test is with 30K rows, so your real life query will produce around 300 times as many hits, 90k / 3.5k for the test case at hand. An additional index on merchant will help.

Advice

I suggest you create an additional trigram index for the similarity search. Be sure to read the chapter in the manual about trigram index support. We need the additional module pg_trgminstalled (like you obviously have).

For your first request:

How can I search for a query like 'WALMART BAGS' which will first return me product BAG with merchant WALMART and then BAGS from other merchants.

I suggest this query using the similarity operator %:

-- SELECT set_limit(0.2)  -- Adjust similarity operator only if needed

SELECT *
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('bag')
AND    merchant % 'walmart'
ORDER  BY merchant <-> 'walmart'
--    LIMIT  n; -- possibly limit to top n results

Again, you can choose between GiST and GIN, but this time GiST carries a decisive advantage:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

Therefore, I suggest this index:

CREATE INDEX prod_merchant_trgm_idx ON products USING gist (merchant gist_trgm_ops);

As for your second request:

Can I have both GIN and GIST index working for me?

Yes, you can. It would hardly make sense to have both types for the same (combination of) column(s), but Postgres can combine GiST and GIN indices very well in the same query. I quote the excellent manual yet again, on Combining Multiple Indexes:

To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. Finally, the actual table rows are visited and returned. The table rows are visited in physical order, because that is how the bitmap is laid out; this means that any ordering of the original indexes is lost, and so a separate sort step will be needed if the query has an ORDER BY clause. For this reason, and because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well.

PostgreSQL – UPDATE CASE Statement Using Excessive Space

Size of the table is 53Gb without indexes. Obviously you have indexes, one of them a GIN index, which is typically multiple times the size of a plain table storing only the indexed column. I expect you get around 180 GB for

SELECT pg_total_relation_size(osmm.topographicarea);

Details on measuring sizes:

Measure the size of a PostgreSQL table row

As has been commented, due to the MVCC model of Postgres, an UPDATE in place effectively writes a new row version of every changed row (and also for affected indexes), which roughly duplicates the size of the table in your operation.

If you don't have to worry about concurrent access, you could go a different route:

Optimizing bulk update performance in PostgreSQL

60 GB (size of table, rounded up) will be too much for a temp table in memory. It would spill to disk, which voids the intended effect. You could just create a new regular table, drop the old one and rename the new - if you don't have concurrent access or depending objects.

To reclaim space you need

VACUUM FULL osmm.topographicarea;

While being at it, make that:

VACUUM FULL ANALYZE osmm.topographicarea;

Or run CLUSTER or use pg_repack or pg_squeeze if you cannot afford an exclusive lock on the big table.

Details:

Optimize `UPDATE`

Indexes are not helping for this UPDATE, on the contrary. Since you are updating every row and all involved columns are in the same row, indexes are of no use whatsoever here. They still have to be kept up to date at all times, though. It will be much cheaper to delete all indexes and add them back later - especially the comparatively expensive GIN index. And more ... Details:

How to speed up a slow UPDATE query

Also, you are updating every row unconditionally. If some of the rows already have the right value in fcode, it would be cheaper not to touch those at all. Details (last chapter):

How do I (or can I) SELECT DISTINCT on multiple columns?

But it sounds like you are adding a new column. In this case there is nothing to gain here.

Of course, all the usual advice for performance optimization applies as well.

Best Answer

Related Solutions

PostgreSQL Optimization – FTS and Trigram-Similarity Query Optimization

Assessment

Advice

PostgreSQL – UPDATE CASE Statement Using Excessive Space

Optimize UPDATE

Related Question

Optimize `UPDATE`