Postgresql – Extracting ‘hot columns’ into a separate table

database-designpostgresql

I have this table:

CREATE TABLE fragment
(
  fragment_id integer,
  start_date timestamp without time zone,
  end_date timestamp without time zone,
  duration integer,
  -- <10+ more columns>
  revision_1 integer,
  revision_2 integer
)

It is pretty big: 44 million rows, 27 GB of disk space. Daily insert rate is about 70k rows.

The data in this table is almost never updated except for the last two columns named revision_1 and revision_2. They are updated via triggers set on other related tables. Updates come very frequently, especially for new rows in fragment table. Each row can be updated up to 50-100 times. Old rows (let's say 1 week old), however, stop being updated, as they are considered 'processed'.

As far as I know, UPDATE operation in Postgres is implemented as something like DELETE + INSERT. So, when a value in a single column is updated, the whole row is marked as deleted and a new row is created. That's why, I think, my fragment table is autovacuumed every day which takes several hours.

The question is, is it generally a good idea to extract 'hot columns' into a separate table? I mean something like this:

CREATE TABLE fragment_revision
(
  fragment_id integer,
  revision_1 integer,
  revision_2 integer
)

Best Answer

The behaviour you're describing is called MVCC (Multi version concurrency control). Strictly saying it's not delete + insert. It is more like:

  1. copy current version of row
  2. update as requested
  3. append new current version to a chained list of version

In the background this history is being cleaned up depending on how old your oldest transaction is. If you have long running transactions the history can grow pretty big.

This is not postgresql specific. It's a very common method across many databases to handle conccurrency. To see more: MVCC in Postgresql

Therefore it does make sense to move such "hot column" to a separate table. Especially if the row size is significantly bigger than the size of the "hot columns".

Be aware though it has some implications on the select performance:

  • If these columns are present in your queries for filtering or sorting query time will take a hit
  • To retrieve these columns you need to join the table which again has some impact on time

There is an alternative method of splitting where your main table contains only the small, filtering and sorting columns (including these columns too) and "data" is stored separately.

It's always best to try which works best for your dataset and query patterns. I did a benchmark about a year ago comparing one big table with two splitting strategy that can help you get started: http://charlesnagy.info/it/postgresql/split-or-leave-frequently-updated-column-in-postgresql