This is possible for PostgreSQL 8.4 or later. To create a multi-column GIN index, you need to install the additional module btree_gin. Simple types like integer or date are usually better off with a (default) B-tree index, which is why btree_gin is not part of a standard PostgreSQL installation. But for a case like this a multi-column GIN index is fastest, so you need the additional operator classes for the plain types.
In PostgreSQL 9.1 or later you can install it with CREATE EXTENSION:
CREATE EXTENSION btree_gin;
In older versions, run this as a privileged system user (like postgres):
psql -d dbname -f SHAREDIR/contrib/btree_gin.sql
For a PostgreSQL 8.4 installation on Debian, this might be:
psql -d mydb -f /usr/share/postgresql/8.4/contrib/btree_gin.sql
Then, given this table:
CREATE TEMP TABLE tbl (age int, names text[], thedate date);
... you can create this multi-column GIN index:
CREATE INDEX tbl_gin_idx ON tbl USING GIN (names, age, thedate);
... which can be used by a query like:
SELECT * FROM tbl
WHERE age = 26
AND '{json}'::text[] <@ names
ORDER BY thedate;
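To verify that the planner actually picks the GIN index, you can inspect the plan. The exact plan depends on table size and current statistics, so this is just how one would check, not a guaranteed result (the BUFFERS option requires 9.0 or later; use plain EXPLAIN ANALYZE in 8.4):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl
WHERE  age = 26
AND    '{json}'::text[] <@ names
ORDER  BY thedate;
```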
Note that a GIN index carries a non-trivial cost for write operations. But that's hardly worth mentioning compared to what a SELECT on 10 million rows without an index would do to you.
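If that write overhead ever becomes a concern, one knob worth knowing about is GIN's fastupdate storage parameter (available since 8.4, on by default), which buffers new entries in a pending list to make writes cheaper at the cost of some search overhead. A sketch, assuming the index name from above:

```sql
-- fastupdate = on (default): writes go to a pending list first, searches
-- must scan that list too. Turning it off makes every write pay the full
-- GIN maintenance cost immediately, keeping searches lean:
ALTER INDEX tbl_gin_idx SET (fastupdate = off);
```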
You have a lot of indexes, and my first impulse would be to check whether all of them are actually used and remove the unused ones.
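One common way to spot candidates is the statistics collector. A sketch, assuming statistics have been gathered over a representative workload:

```sql
-- Indexes never used since statistics were last reset.
-- Caution: idx_scan = 0 only means "not scanned on this node since the
-- last stats reset"; check replicas and reset times before dropping.
SELECT s.schemaname
     , s.relname       AS tablename
     , s.indexrelname  AS indexname
     , pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM   pg_stat_user_indexes s
JOIN   pg_index i ON i.indexrelid = s.indexrelid
WHERE  s.idx_scan = 0
AND    NOT i.indisunique   -- unique indexes enforce constraints; keep them
ORDER  BY pg_relation_size(s.indexrelid) DESC;
```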
The optimal index for your query is not there yet, though: a multicolumn index on (scat_group_code, revenue). Since this is a materialized view, it's reasonable to assume a read-only situation, which is perfect for index-only scans.
Also, since the table (many columns, some potentially big text columns) is much wider than the index (two columns, fewer rows for the partial variant), an index-only scan is particularly efficient here. Size matters.
If this particular query is very common / important, I would go one step further and make it a partial, multicolumn index:
CREATE INDEX ON v_premises_filter (scat_group_code, revenue)
WHERE area IS NOT NULL
AND revenue IS NOT NULL;
I added the second condition revenue IS NOT NULL because min() and max() ignore NULL values. I tested with Postgres 9.6. In older versions you may have to add the logically redundant predicate AND revenue IS NOT NULL to your query to make the query planner understand that the index is applicable.
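The original query is not shown here, but given the min()/max() aggregates it presumably resembles something like the following (the filter value 'S' and the aggregate shape are hypothetical; only the extra predicate in the last line is the point):

```sql
SELECT min(revenue) AS min_revenue
     , max(revenue) AS max_revenue
FROM   v_premises_filter
WHERE  scat_group_code = 'S'   -- hypothetical value for illustration
AND    area    IS NOT NULL
AND    revenue IS NOT NULL;    -- logically redundant, helps older planners
```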
The index your query used in your tests (v_premises_filter_revenue_idx) is on (revenue), which is pretty inefficient for a predicate on scat_group_code. Adding that column as the first index item makes a big difference, because Postgres can now cheaply grab only the share of the index with matching scat_group_code. Where your 2nd query had to filter 1631527 rows (Rows Removed by Filter: 1631527), it now has to filter far fewer rows with area IS NULL OR revenue IS NULL for the simple index, and none at all for the partial index.
Also, since irrelevant rows are excluded from the partial index, it is substantially smaller, which contributes to performance as well.
You should see two index-only scans and a much faster query in either case now.
Why the fluctuation?
Your table is big (1909175 rows) and data distribution is obviously irregular. The default statistics target of 100 (default_statistics_target) is probably too small to gather enough detail. Postgres decides on a bad query plan for your 2nd query based on insufficient information. It might help to increase the STATISTICS setting for scat_group_code, revenue and area.
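A sketch of raising the per-column targets, assuming the materialized view name from above; 1000 is just an example value, and statistics must be re-collected afterwards for the change to take effect:

```sql
ALTER MATERIALIZED VIEW v_premises_filter
  ALTER COLUMN scat_group_code SET STATISTICS 1000;
ALTER MATERIALIZED VIEW v_premises_filter
  ALTER COLUMN revenue SET STATISTICS 1000;
ALTER MATERIALIZED VIEW v_premises_filter
  ALTER COLUMN area SET STATISTICS 1000;

ANALYZE v_premises_filter;  -- gather new, more detailed statistics
```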
But you won't need this with the suggested new index any more, because an index-only scan on the new index is always the best plan.
Best Answer
Reindexing is not dangerous and cannot harm data consistency. However, if you have time-critical writes, you may lose data if the table is locked and the DML is aborted.

Reindexing should not take a lot of time, but it usually involves reading the whole table, sorting the index fields and writing a new index. Given the time for COUNT(*), it will likely take five minutes or more.

It is unlikely this is an indexing problem.
COUNT(*) should use a table scan, in which case no index is read. I would expect you have an I/O problem of some sort. Try using COUNT(1) or COUNT(pk_field), which may use the index.

If you are running on a Unix or Linux platform, you may want to monitor disk activity with sar. You might also have a failing disk, which can cut I/O rates dramatically. Tables with large objects can also increase I/O significantly to construct the records for COUNT(*).