Postgresql – Extracting ‘hot columns’ into a separate table

database-design postgresql

I have this table:

CREATE TABLE fragment
(
  fragment_id integer,
  start_date timestamp without time zone,
  end_date timestamp without time zone,
  duration integer,
  -- <10+ more columns>
  revision_1 integer,
  revision_2 integer
);

It is pretty big: 44 million rows, 27 GB of disk space. Daily insert rate is about 70k rows.

The data in this table is almost never updated except for the last two columns named revision_1 and revision_2. They are updated via triggers set on other related tables. Updates come very frequently, especially for new rows in fragment table. Each row can be updated up to 50-100 times. Old rows (let's say 1 week old), however, stop being updated, as they are considered 'processed'.

As far as I know, an UPDATE in Postgres is implemented as something like DELETE + INSERT. So when a value in a single column is updated, the whole row is marked as deleted and a new row is created. That's why, I think, my fragment table is autovacuumed every day, which takes several hours.

The question is, is it generally a good idea to extract 'hot columns' into a separate table? I mean something like this:

CREATE TABLE fragment_revision
(
  fragment_id integer,
  revision_1 integer,
  revision_2 integer
);

Best Answer

The behaviour you're describing comes from MVCC (multiversion concurrency control). Strictly speaking, it is not delete + insert. It is more like:

1. Copy the current version of the row.
2. Apply the requested update to the copy.
3. Append the new current version to the row's chain of versions.

In the background this history is cleaned up, and how far the cleanup can go depends on how old your oldest open transaction is. If you have long-running transactions, the history can grow pretty big.
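As a rough way to watch this on the table from the question, the standard statistics views can be queried. This is a sketch, not a tuning recipe: it assumes PostgreSQL 9.2 or later (where the `pg_stat_activity` column is named `pid`), and the `LIMIT 5` is arbitrary.

```sql
-- Dead row versions waiting for cleanup, and when autovacuum last ran.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_upd,
       last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'fragment';

-- Oldest open transactions; old row versions they may still see
-- cannot be removed until they finish.
SELECT pid,
       xact_start,
       now() - xact_start AS xact_age
FROM   pg_stat_activity
WHERE  xact_start IS NOT NULL
ORDER  BY xact_start
LIMIT  5;
```

A large `n_dead_tup` relative to `n_live_tup`, combined with an old `xact_start`, would suggest that cleanup is being held back.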

This is not PostgreSQL-specific. MVCC is a very common method across many databases to handle concurrency. To see more: MVCC in Postgresql

Therefore it does make sense to move such "hot columns" to a separate table, especially if the full row is significantly bigger than the hot columns: each update then copies only a small row instead of the whole wide one.
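A minimal sketch of such a split, using the table names from the question. It assumes that `fragment.fragment_id` is unique (for example, the primary key), which the question does not state, and that the existing triggers can be changed to write to the new table.

```sql
-- Narrow 1:1 side table that holds only the frequently updated columns.
-- Assumption: fragment (fragment_id) has a PRIMARY KEY or UNIQUE constraint.
CREATE TABLE fragment_revision
(
  fragment_id integer PRIMARY KEY REFERENCES fragment (fragment_id),
  revision_1  integer,
  revision_2  integer
);

-- One-time copy of the current values.
INSERT INTO fragment_revision (fragment_id, revision_1, revision_2)
SELECT fragment_id, revision_1, revision_2
FROM   fragment;

-- Once the triggers update fragment_revision instead of fragment,
-- the old columns can be dropped from the wide table.
ALTER TABLE fragment
  DROP COLUMN revision_1,
  DROP COLUMN revision_2;
```

New rows in `fragment` would also need a matching row in `fragment_revision`, created either by the application or by a trigger.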

Be aware, though, that it has some implications for SELECT performance:

- If these columns are used in your queries for filtering or sorting, query time will take a hit.
- To retrieve these columns you need to join the two tables, which again costs some time.
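To illustrate the join cost, reads against the split layout could go through a view like the one below. The view name and the filter value are hypothetical, and only the columns shown in the question are listed.

```sql
-- Hypothetical convenience view that puts the hot columns back next to the row.
CREATE VIEW fragment_with_revision AS
SELECT f.fragment_id,
       f.start_date,
       f.end_date,
       f.duration,
       r.revision_1,
       r.revision_2
FROM   fragment f
LEFT   JOIN fragment_revision r ON r.fragment_id = f.fragment_id;

-- Filtering on a hot column while sorting on a column of the main table
-- now needs both tables; the value 10 is only an example.
SELECT fragment_id, start_date, revision_1
FROM   fragment_with_revision
WHERE  revision_1 > 10
ORDER  BY start_date;
```

Comparing the `EXPLAIN ANALYZE` output of such a query with the single-table version is one way to measure the hit on your own data.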

There is an alternative way of splitting: the main table contains only the small filtering and sorting columns (including the hot columns), and the rest of the "data" is stored in a separate table.
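This alternative could look roughly like the following. The table names are hypothetical, and the choice of `start_date` and `end_date` as filter/sort columns is an assumption; which columns belong in the narrow table depends on your queries.

```sql
-- Narrow main table: key, filter/sort columns and the frequently updated columns.
CREATE TABLE fragment_main
(
  fragment_id integer PRIMARY KEY,
  start_date  timestamp without time zone,
  end_date    timestamp without time zone,
  revision_1  integer,
  revision_2  integer
);

-- Wide, rarely updated payload, one row per fragment.
CREATE TABLE fragment_data
(
  fragment_id integer PRIMARY KEY REFERENCES fragment_main (fragment_id),
  duration    integer
  -- <10+ more columns>
);
```

With this layout, filtering and sorting touch only the narrow table, and the join is needed only when the payload columns are actually selected.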

It's always best to test which approach works best for your dataset and query patterns. I did a benchmark about a year ago comparing one big table with two splitting strategies, which can help you get started: http://charlesnagy.info/it/postgresql/split-or-leave-frequently-updated-column-in-postgresql

Related Solutions

Postgresql – Bulk update of all columns

If you only want to update data, I'm not sure what the INSERT statement is for in your question.

If you just want to update several rows with a single statement, you might be looking for this:

with update_values (ID,PARENT_ID,BOUGHT_IN_FORM_TYPE_ID,PRIORITY,NAME,HEADING,DESCRIPTION,ICON,BOUGHT_IN_CONTROL_PANEL_FILE_ID) as 
(
  VALUES
     (109,1,28,100,'Tooling','Tooling','Enter your Machine Tools here','tooling.png',null), 
     (1,0,1,200,'Bought In','Bought In','','boughtin.png',null)
)
update bought_in_control_panel
   set parent_id = ud.parent_id, 
       bought_in_form_type_id = ud.bought_in_form_type_id,
       ....
from update_values ud
where ud.id = bought_in_control_panel.id;

Postgresql – Delete duplicate records with no change in between

Core feature is the window function lag().
Also pay special attention to avoid deadlocks and race conditions with concurrent deletes and inserts (which can affect which rows to delete!):

CREATE OR REPLACE FUNCTION remove_vendor_price_dupes(_vendor int)
  RETURNS integer AS
$func$
DECLARE
   del_ct int;
BEGIN
   -- this may or may not be necessary:
   -- lock rows to avoid race conditions with concurrent deletes
   PERFORM 1
   FROM   vendor_prices
   WHERE  vendor = _vendor
   ORDER  BY sku, effective_date, id  -- guarantee row locks in consistent order
   FOR    UPDATE;

   -- delete redundant prices
   DELETE FROM vendor_prices v
   USING (
      SELECT id
           , price = lag(price) OVER w  -- same as last row
             AND (lead(id) OVER w) IS NOT NULL AS del  -- not last row
      FROM   vendor_prices
      WHERE  vendor = _vendor
      WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
      ) d
   WHERE v.id = d.id
   AND   d.del;

   GET DIAGNOSTICS del_ct = ROW_COUNT;  -- optional:
   RETURN del_ct;  -- return number of deleted rows
END
$func$  LANGUAGE plpgsql;

Call:

SELECT remove_vendor_price_dupes(1);

Notes

The current version of the 9.3 major release is 9.3.6. The project recommends that all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too.
It might pay to add the otherwise irrelevant price as the last item of the index to get index-only scans out of this. You'll have to test.
Since you have concurrent deletes it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you, this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
```
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
```
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.
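The index suggestions from the notes above could be written as follows. The index names are made up for illustration, and you would normally create only one of the two after testing.

```sql
-- Matches WHERE vendor = ... followed by the window's
-- PARTITION BY sku ORDER BY effective_date, id.
CREATE INDEX vendor_prices_vendor_sku_date_id_idx
  ON vendor_prices (vendor, sku, effective_date, id);

-- Variant with price appended as the last column, which may allow
-- index-only scans for the subquery.
CREATE INDEX vendor_prices_vendor_sku_date_id_price_idx
  ON vendor_prices (vendor, sku, effective_date, id, price);
```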
