PostgreSQL – Fastest Way to Negate @>

postgresql

I have a text[] column in Postgres 9.5
I am planning on running a lot of queries with @> against it
I have a GIN index on this column

Normally you need to use the correct Postgres operator so that an index is consulted for your query. There is no operator that is the "opposite" of @> as far as I know.

I ran EXPLAIN ANALYZE on a query for this table and it takes 8 times as long to run WHERE NOT column @> array['thing'] as it does to run WHERE column @> array['thing']. The planner reports that it does a filter for WHERE NOT instead of a bitmap index scan on my index.

What is the correct way to negate @> so that my index is consulted?

I'm willing to create new indexes for this column, including functional indexes if that's the answer.

Here are depesz links:

https://explain.depesz.com/s/uXOT
https://explain.depesz.com/s/nc7Q
https://explain.depesz.com/s/wC9m
https://explain.depesz.com/s/Eat

Best Answer

I don't think there is any highly effective way to index generic versions of that expression. If you find a way to do so, I think you will be famous (well, famous within database circles, anyway).

If it is always exactly NOT column @> array['thing'] where 'thing' does not change, you could build special-purpose indexes, like

create index on foo (("column" @> array['thing']));

(or a partial index version of that)

Once 9.6 is released, you will be able to do the seq scan in parallel, and so get a boost that way.

Related Solutions

Postgresql – Optimizing queries on a range of timestamps (one column)

Two things that are very odd here:

The query select 300k rows from a table with 1M+ rows. For 30 % (or anything over 5 % - depends on row size and other factors) it doesn't typically pay to use an index at all. We should see a sequential scan.

The exception would be index-only scans, which I don't see here. The multicolumn index @Craig suggested would be the best option if you get index-only scans out of it. With lots of updates like you mentioned, this may not work out, in which case you are better off without the additional columns - and just the index you already have. You might be able to make it work for you with more aggressive autovacuum settings for the table. You can adjust parameters for individual tables.
While Postgres is going to use the index I would certainly expect to see a bitmap index scan for that many rows, not a plain index scan, which is typically the better choice for a low percentage of rows. As soon as Postgres expects multiple hits per data page (judging from its statistics on the table) it will typically switch to a bitmap index scan.
- Keep PostgreSQL from sometimes choosing a bad query plan

Judging from that I would suspect that your cost settings are inadequate (and possibly the table statistics, too). You may have set random_page_cost and / or cpu_index_tuple_cost too low, relative to seq_page_cost. Follow the links and read the manual.

Would also fit with the observation that cold cache is large factor, as we worked out in comments. Either you are accessing (parts of) tables that nobody has touched in a long time or you are running on a test system where the cache isn't populated (yet)?
Else you just don't have enough RAM available to cache most of the relevant data in your DB. Consequently, random access is much more expensive than sequential access when data resides in cache. Depending on the actual situation you may have to adjust to get better query plans.

One other factor has to be mentioned for slow response on the first read only: Hint bits. Read details in the Postgres Wiki and this related question:

Why does a SELECT statement dirty cache buffers in Postgres?

Or the table is extremely bloated, in which case an index scan would make sense and I would refer back to CLUSTER / pg_repack in my previous answer that you quoted. (Or just VACUUM FULL) And investigate your VACUUM settings. Those are important with many inserts and updates every day.

Depending on UPDATE patterns also consider a FILLFACTOR below 100. If you mostly update only newly added rows, set the lower FILLFACTER after compacting your table, so that only new pages keep some wiggle room for updates.

Schema

campaign_id is 99%+ NULL and dt_updated is 0% NULL.

Adjust the sequence of columns slightly, to save 8 bytes per row (in the 99% of cases where campaign_id is NULL):

CREATE TABLE traffic (
    uuid_self       uuid not null REFERENCES ... ,
    uuid_partner    uuid not null REFERENCES ... ,
    id              serial PRIMARY KEY,
    impressions     integer NOT NULL DEFAULT 1,
    clicks          integer NOT NULL DEFAULT 0,
    campaign_id     integer,
    dt_created      DATE DEFAULT CURRENT_DATE NOT NULL,
    dt_updated      TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
);

Detailed explanation and links to more:

Configuring PostgreSQL for read performance

To measure:

Measure the size of a PostgreSQL table row

PostgreSQL GIN index not used when ts_query language is fetched from a column

I suggest a solution with partial expression indexes:

CREATE TABLE search (
   search_id serial PRIMARY KEY
 , language  regconfig NOT NULL  -- order of columns matters a bit
 , content   text NOT NULL
   --  *no* redundant fulltext tsvector
);

No redundant fulltext column - makes the table smaller, which helps overall performance.
Create one partial expression index for every relevant language:

CREATE INDEX search_fulltext_dutch ON search USING GIN(to_tsvector('dutch', content))
WHERE language = 'dutch'::regconfig;
CREATE INDEX search_fulltext_english ON search USING GIN(to_tsvector('english', content))
WHERE language = 'english'::regconfig;
-- more?

All of the partial indexes together are only about as big as your total index.

Then match index conditions in the query:

SELECT * FROM search  -- does not return useless column fulltext now
WHERE  language = 'dutch'::regconfig  -- match partial index condition
AND    to_tsvector('dutch', content) @@ to_tsquery('dutch', 'vliegen')

UNION ALL
SELECT * FROM search
WHERE  language = 'english'::regconfig
AND    to_tsvector('english', content) @@ to_tsquery('english', 'vliegen');

-- more?

You get bitmap index or index scans this way.
Another index on language may be useful for other purposes, this query does not need it.

Best Answer

Related Solutions

Postgresql – Optimizing queries on a range of timestamps (one column)

Schema

PostgreSQL GIN index not used when ts_query language is fetched from a column

Related Question