Too bad
If you cannot change the query at all, that's too bad. You won't get a good solution.
If the table were not schema-qualified in the query (run.frames_stat), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path (optionally just in sessions where this is desirable) - for hugely superior performance.
Here's a recipe for such a technique:
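One possible shape of such a recipe - a minimal sketch, assuming the query references plain frames_stat without the run. prefix; the schema name mv and the column list are assumptions:

CREATE SCHEMA mv;

CREATE MATERIALIZED VIEW mv.frames_stat AS
SELECT run_id
     , max(frame)  AS frame     -- keep original column names so the unchanged query still works
     , max("time") AS "time"
FROM   run.frames_stat
GROUP  BY run_id;

-- Only in sessions that should read from the MV instead of the big table:
SET search_path = mv, run, public;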
@Joishi's idea with a RULE would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.
Better query / indexes
If you could change the query, you should try to emulate a loose index scan:
This is even more efficient when based on a separate table with one row per relevant run_id - let's call it run_tbl. Create it if you don't have it yet!
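In case run_tbl does not exist yet, a minimal sketch built from the distinct run_id values in the big table:

CREATE TABLE run_tbl AS
SELECT DISTINCT run_id
FROM   run.frames_stat;

ALTER TABLE run_tbl ADD PRIMARY KEY (run_id);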
Implemented with correlated subqueries:
SELECT run_id
     , (SELECT frame
        FROM   run.frames_stat
        WHERE  run_id = r.run_id
        ORDER  BY frame DESC NULLS LAST
        LIMIT  1) AS max_frame
     , (SELECT "time"
        FROM   run.frames_stat
        WHERE  run_id = r.run_id
        ORDER  BY "time" DESC NULLS LAST
        LIMIT  1) AS max_time
FROM   run_tbl r;
Create two multicolumn indexes with matching sort order for lightning-fast performance:
CREATE INDEX fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE INDEX fun_time_idx  ON run.frames_stat (run_id, "time" DESC NULLS LAST);
NULLS LAST is only necessary if there can be null values. But it won't hurt either way.
With only 280 distinct run_id, this will be very fast.
MATERIALIZED VIEW
Or, based on these key pieces of information:
The "frames_stat" table has 42 million rows
rows=280 -- number of returned rows = distinct run_id
The table is unchanging (no inserts/deletes)
Use a MATERIALIZED VIEW; it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.
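A minimal sketch, assuming the goal is the maximum frame and time per run_id; the MV name frames_stat_agg is an assumption:

CREATE MATERIALIZED VIEW run.frames_stat_agg AS
SELECT run_id
     , max(frame)  AS max_frame
     , max("time") AS max_time
FROM   run.frames_stat
GROUP  BY run_id;

-- The adapted query then reads from the tiny MV:
SELECT run_id, max_frame, max_time FROM run.frames_stat_agg;
-- If the base table ever does change: REFRESH MATERIALIZED VIEW run.frames_stat_agg;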
Aside: never use reserved words like time (reserved in standard SQL) as identifiers.
Two things that are very odd here:
The query selects 300k rows from a table with 1M+ rows. For 30 % (or anything over 5 % - depends on row size and other factors) it doesn't typically pay to use an index at all. We should see a sequential scan.
The exception would be index-only scans, which I don't see here. The multicolumn index @Craig suggested would be the best option if you get index-only scans out of it. With lots of updates like you mentioned, this may not work out, in which case you are better off without the additional columns - and just the index you already have. You might be able to make it work for you with more aggressive autovacuum settings for the table. You can adjust parameters for individual tables.
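A sketch of more aggressive per-table autovacuum settings; the table name big_tbl and the values are assumptions to adapt to your workload:

ALTER TABLE big_tbl SET (
   autovacuum_vacuum_scale_factor  = 0.02   -- vacuum after ~2 % dead rows instead of the default 20 %
 , autovacuum_analyze_scale_factor = 0.01   -- analyze after ~1 % changed rows instead of the default 10 %
);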
While Postgres is going to use the index I would certainly expect to see a bitmap index scan for that many rows, not a plain index scan, which is typically the better choice for a low percentage of rows. As soon as Postgres expects multiple hits per data page (judging from its statistics on the table) it will typically switch to a bitmap index scan.
Judging from that I would suspect that your cost settings are inadequate (and possibly the table statistics, too). You may have set random_page_cost and / or cpu_index_tuple_cost too low, relative to seq_page_cost. Follow the links and read the manual.
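To inspect the current settings and test a different value in the current session only (the value shown is just the Postgres default, not a recommendation):

SELECT name, setting
FROM   pg_settings
WHERE  name IN ('seq_page_cost', 'random_page_cost', 'cpu_index_tuple_cost');

SET random_page_cost = 4;   -- session-only; re-run EXPLAIN (ANALYZE) to compare plans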
That would also fit with the observation that a cold cache is a large factor, as we worked out in the comments. Either you are accessing (parts of) tables that nobody has touched in a long time, or you are running on a test system where the cache isn't populated (yet).
Else you just don't have enough RAM available to cache most of the relevant data in your DB. Consequently, random access is much more expensive than it would be if the data resided in cache. Depending on the actual situation you may have to adjust cost settings to get better query plans.
One other factor has to be mentioned for slow response on the first read only: hint bits. Read details in the Postgres Wiki and the related question linked there.
Or the table is extremely bloated, in which case an index scan would make sense and I would refer back to CLUSTER / pg_repack in my previous answer that you quoted. (Or just VACUUM FULL.)
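Sketches of both options; the table and index names are assumptions, and both commands take an exclusive lock while they rewrite the table:

VACUUM FULL big_tbl;                     -- rewrite the table compactly
CLUSTER big_tbl USING big_tbl_some_idx;  -- or rewrite it ordered by that index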
And investigate your VACUUM settings. Those are important with many inserts and updates every day.
Depending on UPDATE patterns, also consider a FILLFACTOR below 100. If you mostly update only newly added rows, set the lower FILLFACTOR after compacting your table, so that only new pages keep some wiggle room for updates.
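A sketch following that order of operations; the table name and the value 90 are assumptions:

VACUUM FULL big_tbl;                        -- compact existing rows at full density first
ALTER TABLE big_tbl SET (fillfactor = 90);  -- pages written from now on keep ~10 % free for updates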
Schema
campaign_id is 99%+ NULL and dt_updated is 0% NULL.
Adjust the sequence of columns slightly, to save 8 bytes per row (in the 99% of cases where campaign_id is NULL):
CREATE TABLE traffic (
   uuid_self    uuid NOT NULL REFERENCES ... ,
   uuid_partner uuid NOT NULL REFERENCES ... ,
   id           serial PRIMARY KEY,
   impressions  integer NOT NULL DEFAULT 1,
   clicks       integer NOT NULL DEFAULT 0,
   campaign_id  integer,
   dt_created   date DEFAULT CURRENT_DATE NOT NULL,
   dt_updated   timestamptz DEFAULT CURRENT_TIMESTAMP
);
Detailed explanation and links to more:
To measure:
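For instance, pg_column_size() on the whole row shows the per-row footprint, where the saved alignment padding becomes visible (a sketch, assuming the table above):

SELECT pg_column_size(t) AS row_bytes
FROM   traffic t
LIMIT  10;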
Best Answer
No, it's not necessary. Statistics on the main table are not affected by indexes at all.
One thing to observe: expression indexes introduce new "derived" columns, and Postgres gathers separate statistics for those. Removing such indexes also removes the special statistics - which may lead to different query plans, beyond the fact that the index is gone.
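A sketch with assumed names, showing where those extra statistics live - after ANALYZE they appear in pg_stats under the index name:

CREATE INDEX big_tbl_lower_name_idx ON big_tbl (lower(name));
ANALYZE big_tbl;

SELECT tablename, attname, n_distinct
FROM   pg_stats
WHERE  tablename = 'big_tbl_lower_name_idx';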
Since you mention indexes on boolean columns: those hardly ever make sense and it's a good idea to remove them. A boolean column typically only makes sense as the condition in a partial index. There is an example in the manual.
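A sketch of that idea with assumed names: use the boolean as the predicate of a partial index on a more useful column, instead of indexing the boolean itself:

CREATE INDEX orders_open_idx ON orders (created_at)
WHERE  NOT processed;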