PostgreSQL – How to deal with fragmentation in PostgreSQL

Tags: clustered-index, performance, postgresql

What are best practices for dealing with fragmentation in PostgreSQL?


Example:

I have a multi-tenant database whose workload includes on-the-fly aggregation for reports.

CREATE TABLE account (
    id serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE widget (
    id bigserial PRIMARY KEY,
    account_id int NOT NULL REFERENCES account (id),
    length numeric NOT NULL,
    weight numeric NOT NULL,
    color text NOT NULL,
    shape text NOT NULL
);

CREATE INDEX ON widget (account_id);

I want to run account-scoped queries on widget, e.g.

SELECT color, count(*), sum(weight) AS weight
FROM widget
WHERE account_id = 42094 AND length > 3
GROUP BY 1;
Suppose there are 5,000 records in account and 100,000,000 records in widget.

Since PostgreSQL lacks clustered indexes, an individual account's 20,000 widget rows end up scattered across the entire table. With enough rows in the table, PostgreSQL may have to read close to 20,000 heap pages for a query that would otherwise be very fast.
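You can observe this directly with EXPLAIN (ANALYZE, BUFFERS), which reports how many shared buffers the query touched (a sketch; the actual numbers depend on your data and cache state):

EXPLAIN (ANALYZE, BUFFERS)
SELECT color, count(*), sum(weight) AS weight
FROM widget
WHERE account_id = 42094 AND length > 3
GROUP BY 1;

On a badly scattered table, the "Buffers: shared hit=… read=…" line for the heap scan will show on the order of one page per matching row; on a well-ordered table it drops toward the much smaller number of pages those rows actually fit on.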


This would seem to be a common sort of situation; how do users deal with it?

Best Answer

CLUSTER widget USING widget_account_id_idx;

That will rewrite the table in index order, and your query will become faster.
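You can check how well the heap order currently matches the index order by looking at the correlation statistic for account_id in pg_stats: values near 1 or -1 mean the rows are physically well ordered on that column, while values near 0 mean they are scattered. For example (run ANALYZE widget first so the statistics are fresh):

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'widget' AND attname = 'account_id';

The planner also uses this correlation when costing index scans, so a value near 1 after clustering tends to make it favor the index.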

There are downsides:

  • during CLUSTER, the table is locked with an ACCESS EXCLUSIVE lock, so it can be neither read nor written for the duration

  • the order is not maintained as new rows are inserted and updated, so you have to run CLUSTER again every now and then
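Re-running it is cheap to script: once a table has been clustered, PostgreSQL remembers which index was used, so a periodic maintenance job (cron, pg_cron, or similar) can simply issue:

-- re-cluster only the widget table, using the remembered index
CLUSTER widget;

-- or re-cluster every table in the database that has
-- previously been clustered
CLUSTER;

Schedule this for a low-traffic window, since the lock described above applies each time.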