Postgresql – table bloating in databases

postgresqlterminology

Can someone explain what bloating means in database terms? For example what does it mean to say an index is bloated. I tried searching for it but there is no explanation of what bloating is, only what it causes or what it is caused by.

Best Answer

Due to how PostgreSQL handles transactions and concurrency, MVCC - Multi-Version Concurrency Control, you can get bloat. In PostgreSQL, when you do an UPDATE or DELETE, the row isn't actually physically deleted. For a DELETE, it simply marks the row as unavailable for future transactions, and for UPDATE, under the hood it's a combined INSERT then DELETE, where the previous version of the row is marked unavailable.

While the data is marked unavailable, it is still there, and the space cannot be used. To then mark the space as available for use by the database, a vacuum process needs to come along behind the operations, and mark that space available for the database to use. It isn't returned to the operating system, however. That only happens when there are no active rows in an entire page, which can be uncommon in some workloads. This can be a good thing for some workloads, because you can just simply update the space on the individual pages inside of the data files, without needing to add additional data files.

Problems come about with bloat when there are are excessively large numbers of dead tuples versus live tuples. Walking along and checking all of the visibility flags takes time, and having more data files for a relation results in additional unnecessary IO load. Bloat is especially noticeable on indexes, which can also have many dead tuples, sometimes many more than the table. Bloat can slow index lookups and scans, which will show up in slowly increasing query times and changing query plans.

You can restore space by using pg_reorg, pg_repack, CLUSTER, or VACUUM FULL. This will go through and reorganize the files, moving tuples and reorganizing to make sure that there are no dead tuples, which will eliminate the bloat.

Bloat can also be efficiently managed by adjusting VACUUM settings per table, which marks dead tuple space available for reuse by subsequent queries.

You can use queries on the PostgreSQL Wiki related to Show Database Bloat and Index Bloat to determine how much bloat you have, and from there, do a bit of performance analysis to see if you have problems with the amount of bloat you have on your tables.