Postgresql – Postgres ORDER BY…LIMIT breaks after bulk insert or update in large table

postgresql

The largest table in our DB (Postgres 9.5) has almost 2 billion rows, and its integer primary key is about to hit the 32-bit limit. I have begun converting the PK (id) from int to bigint as follows:

ALTER TABLE bigtable ADD COLUMN id_bigint BIGINT NULL;
ALTER TABLE bigtable ALTER COLUMN id_bigint DROP DEFAULT;

Then, the following script runs to copy all values from id to id_bigint in chunks (we run it with a batch_count of 1 million rows):

from django.db import connection  # Django's default database connection


def copy_bigtable_id(model, table_name, batch_count, start_index=0):
    latest_id = model.objects.latest('id').id
    counter = start_index

    if latest_id < counter:
        print("Latest id {} in {} is less than start index {}. Nothing to copy. Exiting.".format(
            latest_id, table_name, start_index))
        return

    print("Starting migration of 'id' to 'id_bigint' on {} from index {} to {}".format(
        table_name, start_index, latest_id))
    print("Copy in batches of {} rows".format(batch_count))

    while counter <= latest_id:
        with connection.cursor() as cursor:
            # Lock only this batch of rows, copy id into id_bigint, then commit
            # so the row locks are released before the next batch starts.
            cursor.execute('BEGIN;')
            select = "SELECT * FROM {} WHERE id >= %s AND id < %s FOR UPDATE".format(
                table_name
            )
            update = "UPDATE {} SET id_bigint=id WHERE id >= %s AND id < %s".format(
                table_name
            )
            cursor.execute(select, [counter, counter + batch_count])
            cursor.execute(update, [counter, counter + batch_count])
            cursor.execute('COMMIT;')
        counter = counter + batch_count
        # Re-read the latest id so rows inserted while the script runs are also covered.
        latest_id = model.objects.latest('id').id
        print("copied {0:.2f}% of ids in {1}. counter={2}".format(
            min(100, 100.0 * counter / latest_id), table_name, counter)
        )
    print('ALL DONE! Last id in {} = {}'.format(table_name, latest_id))

Once the script has finished, id is replaced by id_bigint as follows:

ALTER TABLE bigtable DROP CONSTRAINT bigtable_pkey;
DROP SEQUENCE bigtable_id_seq CASCADE;
ALTER TABLE bigtable RENAME COLUMN id TO id_old;
ALTER TABLE bigtable RENAME COLUMN id_bigint TO id;
ALTER TABLE bigtable ADD PRIMARY KEY (id);
CREATE SEQUENCE bigtable_id_seq START WITH <latest_id>;
ALTER TABLE bigtable ALTER COLUMN id SET DEFAULT NEXTVAL('bigtable_id_seq');
ALTER TABLE bigtable ALTER COLUMN id_old DROP NOT NULL;
ALTER TABLE bigtable DROP COLUMN id_old;
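
As a sanity check after the swap (a sketch only, not part of the original migration; the OWNED BY step in particular is our own assumption), we can confirm the new sequence sits above the current maximum id and tie the sequence's lifetime to the column:

SELECT MAX(id) FROM bigtable;                          -- should be below the sequence's next value
SELECT last_value FROM bigtable_id_seq;                -- current position of the new sequence
ALTER SEQUENCE bigtable_id_seq OWNED BY bigtable.id;   -- optional: drop the sequence together with the column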

This approach was chosen after a plain ALTER TABLE on the existing id column proved unworkable: the table is so large, and ALTER TABLE holds an exclusive lock for the duration, that it would block all access to the table. We wanted to minimize downtime.

Our script has now been running for over 10 hours and has covered half of the table. It is not blocking other SELECT queries on the table, since its locks (a ROW EXCLUSIVE table lock plus row locks on the batch being copied) do not conflict with SELECTs.

However, after 50%+ of the rows had been copied, normal SELECT queries on the table started taking an extremely long time. It turned out the indexes on the table were not being used.

This is the SELECT query which broke:

SELECT "bigtable"."id", "bigtable"."latitude", "bigtable"."longitude", 
"bigtable"."timestamp", "bigtable"."speed", "bigtable"."accuracy", 
"bigtable"."trip_id", "bigtable"."step_id", "bigtable"."revised" 
FROM "bigtable" 
WHERE "bigtable"."step_id" = 5907395 
ORDER BY "bigtable"."timestamp" ASC LIMIT 1;

This is the EXPLAIN result after the mass updates performed by the script above:

Limit  (cost=0.58..44711.92 rows=1 width=49)
  ->  Index Scan using bigtable_timestamp on bigtable  (cost=0.58..2826516601.39 rows=63217 width=49)
        Filter: (step_id = 5907395)

Notice that there is no Sort node: instead of using the step_id index, the planner walks bigtable_timestamp (which already provides the ORDER BY ordering) and filters on step_id as it goes.

This is the EXPLAIN of the same query on my localhost replica of the database, with a much smaller dataset but the exact same schema. This is the plan I'd expect, and it returns results in milliseconds:

Limit  (cost=157.96..157.97 rows=1 width=53)
  ->  Sort  (cost=157.96..158.30 rows=136 width=53)
        Sort Key: "timestamp"
        ->  Index Scan using bigtable_step_id on bigtable  (cost=0.43..157.28 rows=136 width=53)
              Index Cond: (step_id = 5907395)

bigtable has the following indexes:

CREATE UNIQUE INDEX bigtable_pkey ON bigtable(id int8_ops);
CREATE INDEX bigtable_step_id ON bigtable(step_id int4_ops);
CREATE INDEX bigtable_timestamp ON bigtable(timestamp int4_ops);
CREATE INDEX bigtable_trip_id ON bigtable(trip_id int4_ops);

The "broken" query plan in the first EXPLAIN takes hours to complete. I dug a bit more and found this post, which mentions that the Postgres query planner sometimes gets confused by ORDER BY ... LIMIT 1. Is this the case I'm running into?
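
For reference, the workaround those discussions usually suggest for this pattern is to keep the ORDER BY semantically identical while preventing it from matching the timestamp index, so the planner has to go through bigtable_step_id and sort (a sketch only, assuming timestamp really is an integer column as the int4_ops index suggests):

SELECT "bigtable"."id"
FROM "bigtable"
WHERE "bigtable"."step_id" = 5907395
ORDER BY "bigtable"."timestamp" + 0 ASC   -- "+ 0" stops bigtable_timestamp from providing the sort order
LIMIT 1;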

Rows are inserted regularly into bigtable with ever-increasing step_id and timestamp. This query used to take a dozen milliseconds at worst, before it quickly deteriorated to over 30 minutes. This happened, again, after we began populating bigtable with id_bigint. Could this be the result of index bloat?
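
For reference, index bloat can be measured directly if the pgstattuple contrib extension is available (a sketch only; pgstatindex reads the whole index, so it is not free at this size):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatindex('bigtable_step_id');   -- avg_leaf_density / leaf_fragmentation hint at bloat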

Are the updates we are performing on that table screwing up the indexes, or the PostgreSQL planner's ability to use those indexes?

What happened is precisely what we wanted to avoid: downtime. The app is running but effectively not serving clients, because each query on that large table, which used to take milliseconds, now takes over 1 hour. I don't have EXPLAIN query plans from before, but the queries took milliseconds and I'm pretty sure index scans were used. I tried to run EXPLAIN on the same query now but gave up after 15 min. Looking at the running queries on the live instance, similar queries had been "active" for longer than 30 min. So there has been a fundamental change in the way the query planner runs those queries, no longer using the indexes.

SELECT relname, last_vacuum, last_autovacuum, n_dead_tup FROM pg_stat_user_tables shows a huge number of dead tuples on this table (1331381938) and that no vacuum or autovacuum has ever completed on it. We have autovacuum enabled and autovacuum is constantly running on this bigtable. At this point it has been running for over 20 hours and has not completed.
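
This is roughly how we are watching that autovacuum worker (just an observation query against pg_stat_activity; autovacuum workers report their query as "autovacuum: ..."):

SELECT pid, now() - xact_start AS running_for, state, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';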

What is the fastest way to resolve this situation?

  • Should I run VACUUM on this table? If so, should I combine it with ANALYZE to rebuild the query planner statistics?
  • Should I run VACUUM FULL?
  • Should I run REINDEX on this table?

Given the sheer size of this table, is there a way to speed up VACUUM or ANALYZE?

Best Answer

Then, a script running in a loop copies the values from id to id_bigint using SELECT FOR UPDATE + UPDATE in chunks of 1 million rows.

Could you show us exactly how that was done? Details matter here. How fast are these chunks going by?

Are the updates we are performing on that table screwing up the indexes, or the PostgreSQL planner's ability to use those indexes?

It shouldn't "screw them up", other than by possibly bloating them, or clearing the all-visible flag on so many pages that an index-only scan doesn't seem worthwhile anymore. Do you have before and after query plans you can show us? Did the before involve index-only scans, or just regular index scans?
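
If you want a rough idea of how much of the visibility map those updates have cleared, something along these lines will show it (relpages and relallvisible are estimates maintained in pg_class, so the percentage is only approximate):

SELECT relname, relpages, relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'bigtable';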

We have autovacuum enabled and autovacuum is constantly running on this bigtable.

Given the sheer size of this table, is there a way to speed up VACUUM or ANALYZE?

Autovacuum has a default level of IO throttling which makes it pretty slow. (This default was reduced by a factor of 10 between v11 and v12, so it is much faster on recent versions; on 9.5 you still have the slow default.) A manual vacuum, by contrast, is unthrottled by default, so if you start a manual VACUUM it should run much faster than autovacuum does.
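
If you want to see the throttling in effect, the relevant settings look like this (standard GUC names; the defaults in the comments are for pre-v12 releases such as yours):

SHOW autovacuum_vacuum_cost_delay;   -- 20ms by default before v12 (2ms from v12 on)
SHOW autovacuum_vacuum_cost_limit;   -- -1 by default, i.e. fall back to vacuum_cost_limit (200)
SHOW vacuum_cost_delay;              -- 0 by default, which is why a manual VACUUM is unthrottled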

The autovac should automatically cancel itself after it detects the manual VACUUM is trying to run. There are situations where it won't, but they probably don't apply to you.

This might consume all of the IO capacity on your system, but if you are currently down anyway that should hardly matter.

I would not expect ANALYZE to make much difference, as the only column whose distribution is changing is not yet being used. I guess the correlation values might be changing in important ways, but I doubt that can explain what you are seeing.
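
If you are curious about that, the correlation the planner sees is visible in pg_stats (purely an inspection query, not a fix):

SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'bigtable'
  AND attname IN ('step_id', 'timestamp');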

Should I run VACUUM FULL?

No. Regular manual vacuum should be enough.
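
Concretely, something as plain as this should do (VERBOSE is optional; it just makes the server report what it is working on as it goes):

VACUUM VERBOSE bigtable;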

Should I run REINDEX on this table?

I see no reason to think that would help more than trivially. And it will certainly take time and resources away from things that possibly could.