PostgreSQL Clustering – Perform Clustering with Limited Disk Space


According to the v12 docs, CLUSTER with an index scan requires

"free space on disk at least equal to the sum of the table size and the index sizes"

and CLUSTER with a sequential scan and sort requires

"as much as double the table size, plus the index sizes"
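
To put numbers on those two rules for a concrete table, the built-in size functions can be combined as below. This is only a rough sketch: tbl is a placeholder table name, and the result estimates the space the documentation asks for, it does not guarantee that CLUSTER will succeed.

```sql
-- 'tbl' is a placeholder; replace it with the table to be clustered.
-- pg_table_size() includes the TOAST table, pg_indexes_size() covers all indexes.
SELECT pg_size_pretty(pg_table_size('tbl'))                              AS table_size
     , pg_size_pretty(pg_indexes_size('tbl'))                            AS index_size
     , pg_size_pretty(pg_table_size('tbl') + pg_indexes_size('tbl'))     AS needed_for_index_scan
     , pg_size_pretty(2 * pg_table_size('tbl') + pg_indexes_size('tbl')) AS needed_for_seq_scan_and_sort;
```

Comparing the last two columns with the free space on the volume that holds the table's tablespace shows how far off the server is.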

Is there any way out if I do not have that much free space? i.e. can it be run in portions somehow?

If I run the CLUSTER command and it fails because of insufficient free disk space, will the work performed be rolled back entirely (with or without any parameter, if it matters)?

Best Answer

Yes, PostgreSQL would roll the transaction back.

Almost all commands in PostgreSQL run in a transaction, so if something goes wrong the command is rolled back.

If the PostgreSQL system tables and WAL are on the same disk, do not run this command while disk space is this limited. PostgreSQL would panic and shut down until space is freed up.

A few things to keep in mind:

PostgreSQL will not keep the table in that order, so if the table is heavily updated, the physical sort order will be lost very quickly due to MVCC.
You have to run the CLUSTER command on a regular basis to keep the table's physical order, which recreates the limited disk space problem.
The advantage of clustering is for read-only queries that select a set of records in the order of the cluster.
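
One way to judge how much of the physical order is left is the correlation column of the pg_stats view. A minimal sketch, assuming the table is called tbl, lives in the public schema and was clustered on a column named id (all placeholders). Values near 1 or -1 mean the rows are still stored close to that column's order; values near 0 mean the order is mostly gone.

```sql
-- Refresh planner statistics first; correlation is only as fresh as the last ANALYZE.
ANALYZE tbl;

-- 'public', 'tbl' and 'id' are placeholders for the schema, table and clustered column.
SELECT attname, correlation
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'tbl'
AND    attname    = 'id';
```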

If disk space is limited and the table receives many updates and inserts in a short period of time, the table and indexes will grow to the point where CLUSTER cannot be rerun and VACUUM will not be able to run.

It looks like it's time to start planning on adding disk space and moving these tables to new tablespaces.

Prepare empty pages at the end of a table for testing

The system column ctid represents the physical position of a row. You need to understand that column:

How do I decompose ctid into page and row numbers?
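
For reference, a common way to do that split is to cast the ctid through text to point and read the two coordinates. A sketch, with tbl as a placeholder table name:

```sql
-- Show the physically last five rows with their page number (0-based)
-- and tuple index within the page (1-based).
SELECT ctid
     , (ctid::text::point)[0]::bigint AS page_number
     , (ctid::text::point)[1]::int    AS tuple_index
FROM   tbl
ORDER  BY ctid DESC
LIMIT  5;
```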

We can work with that and prepare a table by deleting all rows from the last page:

DELETE FROM tbl t
USING (
   SELECT (split_part(ctid::text, ',', 1) || ',0)')::tid     AS min_tid
        , (split_part(ctid::text, ',', 1) || ',65535)')::tid AS max_tid
   FROM   tbl
   ORDER  BY ctid DESC
   LIMIT  1
   ) d
WHERE t.ctid BETWEEN d.min_tid AND d.max_tid;

Now, the last page is empty. This ignores concurrent writes. Either you are the only one writing to that table or you need to take a write lock to avoid interference.
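
If other sessions may write to the table, the delete can be wrapped in a transaction that takes a table-level lock first. A sketch, assuming SHARE ROW EXCLUSIVE is strong enough for the case at hand: it blocks concurrent INSERT, UPDATE and DELETE but still lets plain SELECT through.

```sql
BEGIN;

-- Blocks other writers until COMMIT; readers are not blocked.
LOCK TABLE tbl IN SHARE ROW EXCLUSIVE MODE;

-- Run the DELETE ... USING statement from above here.

COMMIT;
```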

The query is optimized to identify qualifying rows quickly. The second number of a tid is the tuple index stored as unsigned int2, and 65535 is the maximum for that type (2^16 - 1), so that's the safe upper bound.

SQL Fiddle (reusing a simple table from a different case.)

Tools to measure row / table size:

Measure the size of a PostgreSQL table row

Disk full

You need wiggle room on disk for any of these operations. There is also the community tool pg_repack as replacement for VACUUM FULL / CLUSTER. It avoids exclusive locks but needs free space to work with as well. The manual:

Requires free disk space twice as large as the target table(s) and indexes.

As a last resort, you can run a dump/restore cycle. That removes all bloat from tables and indexes, too. Closely related question:

I need to run VACUUM FULL with no available disk space

The answer over there is pretty radical. If your situation allows for it (no foreign keys or other references preventing row deletions, and no concurrent access to the table), you can just:

Dump the table to disk connecting from a remote computer with plenty of disk space (-a for --data-only):

From remote shell, dump table data:

pg_dump -h <host_name> -p <port> -t mytbl -a mydb > db_mytbl.sql

In a pg session, TRUNCATE the table:

-- drop all indexes and constraints here for best performance
TRUNCATE mytbl;

From remote shell, restore to same table:

psql -h <host_name> -p <port> mydb -f db_mytbl.sql
-- recreate all indexes and constraints here

It is now free of any dead rows or bloat.

But maybe you can have that simpler?

Can you make enough space on disk by deleting (moving) unrelated files?
Can you VACUUM FULL smaller tables first, one by one, thereby freeing up enough disk space?
Can you run REINDEX TABLE or REINDEX INDEX to free disk space from bloated indexes?
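
To decide where to start with the second and third option, it helps to list the tables together with their index footprint, smallest first. A sketch using only catalog functions; the schema filter is an assumption and may need adjusting:

```sql
-- Ordinary tables outside the system schemas, ordered by total on-disk size.
SELECT c.oid::regclass                        AS table_name
     , pg_size_pretty(pg_table_size(c.oid))   AS table_size
     , pg_size_pretty(pg_indexes_size(c.oid)) AS index_size
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relkind = 'r'
AND    n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER  BY pg_total_relation_size(c.oid);
```

Tables near the top of the list need the least scratch space for VACUUM FULL, and a large index_size relative to table_size can hint at an index worth rebuilding.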

Whatever you do, don't be rash. If in doubt, backup everything to a secure location first.

Postgresql – TOAST Table Growth Out of Control – FULLVAC Does Nothing

This:

INFO: "pg_toast_16874": found 22483 removable, 10475318 nonremovable row versions in 10448587 pages

suggests that the underlying issue is that something can still "see" those rows so they can't be removed.

The candidates for that are:

Lost prepared transactions. Check pg_catalog.pg_prepared_xacts; it should be empty. Also run SHOW max_prepared_transactions; it should report zero.
Long-running sessions with an open, idle transaction. In PostgreSQL 8.4 and above this should only be an issue for SERIALIZABLE transactions. Check pg_catalog.pg_stat_activity for <IDLE> in transaction sessions.
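
Both candidates can be checked with two catalog queries. A sketch, assuming PostgreSQL 9.2 or later, where pg_stat_activity exposes a state column; older releases show <IDLE> in transaction in the current_query column instead.

```sql
-- Prepared transactions that were never committed or rolled back; should return no rows.
SELECT gid, prepared, owner, database
FROM   pg_prepared_xacts
ORDER  BY prepared;

-- Sessions sitting idle inside an open transaction, oldest first.
SELECT pid, usename, xact_start, now() - xact_start AS xact_age
FROM   pg_stat_activity
WHERE  state = 'idle in transaction'
ORDER  BY xact_start;
```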

Most likely you have a client that's failing to commit or rollback transactions during long idle periods.

If this doesn't turn out to be it, the next thing I'd do is sum the octet_length of each column of the table of interest. Compare that to the pg_relation_size of the table and its TOAST side-table. If there's a big difference, then the space is likely consumed by rows that are no longer visible and you probably do have table bloat issues. If they're quite similar, you can start narrowing down where the space goes by summing up the octet lengths per column, getting the top 'n' values, etc.
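
A rough version of that comparison for a single column might look like this, using the built-in octet_length function for the byte length of each value. mytbl and big_col are placeholder names. octet_length reports the uncompressed size, while TOAST may store values compressed, so only a large gap is meaningful.

```sql
-- Bytes held by live values of one wide column ('big_col' is a placeholder).
SELECT sum(octet_length(big_col)) AS live_bytes
FROM   mytbl;

-- On-disk size of the main heap and of the associated TOAST table.
SELECT pg_relation_size(c.oid::regclass)           AS heap_bytes
     , pg_relation_size(c.reltoastrelid::regclass) AS toast_bytes
FROM   pg_class c
WHERE  c.oid = 'mytbl'::regclass;
```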
