PostgreSQL – Can I reduce the restore time with PostgreSQL 9.2?

pg-restore postgresql

I need help reducing the restore time for a large dump (400 GB).

We have a database that takes weeks to restore because index generation in PostgreSQL is single-threaded. I need help finding a way to restore it faster, perhaps by disabling index generation at load time and adding the indexes back later, or with some better trick.

sample pg_dump command:

pg_dump --compress=0 -bo -F c --lock-wait-timeout=1500 -h $HOST -p $PORT $DBNAME | lbzip2 > $DB-$TIMESTAMP.bz2

sample pg_restore command:

pg_restore -Ov -j 2 -h $HOST -p $PORT --dbname=$DBNAME $RESTOREFILE

The -j option does not help us, since it speeds up the part that doesn't take long (restoring the data) but not the part that does (index generation). The goal is to streamline the process, either by building the indexes separately on an already working database or by speeding up index generation, on PG 9.2.

I'd like a clear procedure for removing index generation from the restore process and doing it afterwards, so that use of the DB is not blocked.

Is it possible to do that? What is your opinion?

Best Answer

pg_restore has an option to run the time-consuming parts of the restore, such as the index rebuild process, with multiple "jobs".

From the pg_restore documentation for PostgreSQL 9.2:

-j number-of-jobs --jobs=number-of-jobs

Run the most time-consuming parts of pg_restore — those which load data, create indexes, or create constraints — using multiple concurrent jobs. This option can dramatically reduce the time to restore a large database to a server running on a multiprocessor machine.

Each job is one process or one thread, depending on the operating system, and uses a separate connection to the server.

The optimal value for this option depends on the hardware setup of the server, of the client, and of the network. Factors include the number of CPU cores and the disk setup. A good place to start is the number of CPU cores on the server, but values larger than that can also lead to faster restore times in many cases. Of course, values that are too high will lead to decreased performance because of thrashing.

Only the custom archive format is supported with this option. The input file must be a regular file (not, for example, a pipe). This option is ignored when emitting a script rather than connecting directly to a database server. Also, multiple jobs cannot be used together with the option --single-transaction.
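As a rough sketch of how the options quoted above might be applied to the dump from the question: because -j needs a regular file, the bzip2-compressed archive is first decompressed to disk, and the restore is split with --section (available in pg_restore since 9.2) so that tables and data load first and the post-data section, which holds the indexes and constraints, runs afterwards with several jobs. The function names, the DRY_RUN switch, the file names and the job counts are assumptions for illustration; HOST, PORT and DBNAME are the variables from the question. This has not been tested against a 400 GB dump.

```shell
#!/bin/sh
# Sketch only. Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Decompress the archive to a regular file, since -j cannot read from a pipe.
unpack_dump() {
    src=$1   # compressed archive, e.g. mydb.bz2 (hypothetical file name)
    dst=$2   # regular file to write, e.g. mydb.dump (hypothetical file name)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "lbzip2 -dc $src > $dst"
    else
        lbzip2 -dc "$src" > "$dst"
    fi
}

# Stage 1: table definitions and table data only, no indexes or constraints.
# $1 = archive file, $2 = number of parallel jobs
restore_tables() {
    run pg_restore -O -v -j "$2" --section=pre-data --section=data \
        -h "$HOST" -p "$PORT" --dbname="$DBNAME" "$1"
}

# Stage 2: indexes, constraints and triggers, built by several jobs at once.
# $1 = archive file, $2 = number of parallel jobs
restore_indexes() {
    run pg_restore -O -v -j "$2" --section=post-data \
        -h "$HOST" -p "$PORT" --dbname="$DBNAME" "$1"
}
```

A possible sequence would be unpack_dump mydb.bz2 mydb.dump, then restore_tables mydb.dump 4, then restore_indexes mydb.dump 8. After the first stage the tables are populated, but queries will be slow and primary-key and foreign-key constraints are not yet enforced until the second stage completes.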

Related Solutions

PostgreSQL pg_dump – What Data Gets Backed Up on a Live Server?

The dump contains the data as it was when the command started, for the entire database. According to the manpage:

It makes consistent backups even if the database is being used concurrently. pg_dump does not block other users accessing the database (readers or writers).

and in SQL Dump:

Dumps created by pg_dump are internally consistent, meaning, the dump represents a snapshot of the database at the time pg_dump began running

Dumping in parallel (--jobs) may be problematic with changing data, but only when the target server is an older version:

For a consistent backup, the database server needs to support synchronized snapshots, a feature that was introduced in PostgreSQL 9.2

I don't think the output format makes any difference in the rescue operation. Note that for a parallel dump, directory is the only possible format.
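For illustration, a parallel dump in directory format might look like the sketch below. Note that pg_dump only gained the -j option in 9.3, so this assumes a 9.3 or later pg_dump client; the function names, the DRY_RUN switch, the output directory and the job count are hypothetical, and HOST, PORT and DBNAME are placeholders for the connection settings.

```shell
#!/bin/sh
# Sketch only. Set DRY_RUN=1 to print the command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Dump into a directory archive using several concurrent jobs.
# $1 = output directory (must not exist yet), $2 = number of parallel jobs
dump_parallel() {
    run pg_dump -F d -j "$2" -f "$1" -h "$HOST" -p "$PORT" "$DBNAME"
}
```

Calling dump_parallel /backups/mydb.dir 4 would write one file per table into that directory, and the result can be restored with pg_restore, including with its own -j option.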

PostgreSQL – pg_restore Error: Schema ‘public’ Already Exists

The error is harmless but to get rid of it, I think you need to break this restore into two commands, as in:

dropdb -U postgres mydb && \
 pg_restore --create --dbname=postgres --username=postgres pg_backup.dump

The --clean option in pg_restore doesn't look like much but actually raises non-trivial problems.

For versions up to 9.1

The combination of --create and --clean in pg_restore options used to be an error in older PG versions (up to 9.1). There is indeed some contradiction between (quoting the 9.1 manpage):

--clean Clean (drop) database objects before recreating them

and

--create Create the database before restoring into it.

Because what's the point of cleaning inside a brand-new database?

Starting from version 9.2

The combination is now accepted and the doc says this (quoting the 9.3 manpage):

--clean Clean (drop) database objects before recreating them. (This might generate some harmless error messages, if any objects were not present in the destination database.)

--create Create the database before restoring into it. If --clean is also specified, drop and recreate the target database before connecting to it.

Now having both together leads to this kind of sequence during your restore:

DROP DATABASE mydb;
...
CREATE DATABASE mydb WITH TEMPLATE = template0... [other options]
...
CREATE SCHEMA public;
...
CREATE TABLE...

There is no DROP for each individual object, only a DROP DATABASE at the beginning. Without --create it would be the opposite: a DROP for each object and no DROP DATABASE.

Anyway this sequence raises the error of public schema already existing because creating mydb from template0 has imported it already (which is normal, it's the point of a template database).

I'm not sure why this case is not handled automatically by pg_restore. Maybe this would cause undesirable side-effects when an admin decides to customize template0 and/or change the purpose of public, even if we're not supposed to do that.
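Another way to avoid the message, offered here only as an untested sketch, is to edit the archive's table of contents: pg_restore -l prints the list of entries and pg_restore -L restores only the entries in a given list file, so the entry that issues CREATE SCHEMA public can be filtered out. The function name and the restore.list file name are assumptions, and the pattern assumes the usual listing format in which that entry reads like "3; 2615 2200 SCHEMA - public postgres".

```shell
#!/bin/sh
# Sketch only: reads a "pg_restore -l" listing on stdin and writes it to
# stdout without the entry that would run CREATE SCHEMA public.
# Header lines starting with ';' and all other entries pass through unchanged.
drop_public_schema_entry() {
    grep -v -E '^[0-9]+; [0-9]+ [0-9]+ SCHEMA - public '
}

# Intended use (not executed here):
#   pg_restore -l pg_backup.dump | drop_public_schema_entry > restore.list
#   pg_restore --clean --create --dbname=postgres --username=postgres \
#       -L restore.list pg_backup.dump
```

The design choice here is to leave the dump itself untouched and change only which entries get restored, so the same archive can still be restored in full elsewhere.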
