PostgreSQL – UPDATE CASE Statement Using Excessive Space

performance, postgresql, postgresql-performance, update

I am running a CASE statement on one of my tables and it consumes a huge amount of hard drive space that is never freed up, even when I run VACUUM ANALYZE. Autovacuum is ON as well.

I am using PostgreSQL 9.3 64-bit on a Windows Server 2012 machine. I have tweaked postgresql.conf using advice from the wiki, so it is as optimised as I can make it.

The table is large (>150 million rows), and I need to add an extra column and populate it based on the contents of 3 other columns. The size of this table is 53 GB without indexes.

Having tested lots of approaches, I am using a CASE statement. Two of the columns are arrays, so I have used GIN indexes and a standard index.

A sample of my CASE statement looks like this; the rules go up to 38:

UPDATE osmm.topographicarea
SET fcode = (CASE
        -- building CASE statements
        WHEN (descriptivegroup @> '{Building}' AND descriptiveterm @> '{Archway}') THEN 1
        WHEN (descriptivegroup @> '{Building}') THEN 2

        WHEN (descriptiveterm @> '{Step}') THEN 3
        WHEN (descriptivegroup @> '{Glasshouse}') THEN 4
        WHEN (descriptivegroup @> '{Historic Interest}') THEN 5
        WHEN (descriptivegroup @> '{Inland Water}') THEN 6
        ELSE 99
    END);
COMMIT;

The process takes over 5 hours but adds a huge 180 GB to the table!

Adding an integer column to this table surely should not do this?

How can I fix this please?

Best Answer

The size of the table is 53 GB without indexes. Obviously you have indexes, one of them a GIN index, which is typically multiple times the size of a plain table storing only the indexed column. I expect you get around 180 GB for:

SELECT pg_total_relation_size('osmm.topographicarea');

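To see how that total breaks down between heap and indexes, the standard size functions help; for example:

SELECT pg_size_pretty(pg_relation_size('osmm.topographicarea'))       AS heap_size
     , pg_size_pretty(pg_indexes_size('osmm.topographicarea'))        AS indexes_size
     , pg_size_pretty(pg_total_relation_size('osmm.topographicarea')) AS total_size;

pg_total_relation_size() covers the heap plus all indexes and TOAST data.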

As has been commented, due to the MVCC model of Postgres, an UPDATE effectively writes a new row version of every changed row (and new entries in affected indexes), which roughly doubles the size of the table in your operation.
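You can watch this happen: the statistics views keep approximate per-table counters of live and dead row versions.

SELECT n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  schemaname = 'osmm'
AND    relname = 'topographicarea';

After the big UPDATE and before any VACUUM, n_dead_tup should be roughly as large as n_live_tup.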

If you don't have to worry about concurrent access, you could go a different route and write to a new table instead of updating in place.

60 GB (the size of the table, rounded up) would be too much for a temporary table in memory; it would spill to disk, which defeats the purpose. Instead, you can simply create a new regular table, drop the old one and rename the new one, provided there is no concurrent access and no dependent objects.
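A minimal sketch of that route, assuming fcode has not yet been added to the old table (the name topographicarea_new is illustrative, and the 38 rules are abbreviated):

BEGIN;

CREATE TABLE osmm.topographicarea_new AS
SELECT t.*
     , CASE
          WHEN descriptivegroup @> '{Building}' AND descriptiveterm @> '{Archway}' THEN 1
          WHEN descriptivegroup @> '{Building}' THEN 2
          -- ... remaining rules from the question ...
          ELSE 99
       END AS fcode
FROM   osmm.topographicarea t;

DROP TABLE osmm.topographicarea;
ALTER TABLE osmm.topographicarea_new RENAME TO topographicarea;

COMMIT;

Remember that CREATE TABLE AS copies no indexes, constraints, defaults or privileges; recreate them on the new table afterwards.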

To reclaim the space you need:

VACUUM FULL osmm.topographicarea;

While you are at it, make that:

VACUUM FULL ANALYZE osmm.topographicarea;

Alternatively, run CLUSTER, which also rewrites the table (in index order). If you cannot afford an exclusive lock on the big table, which both VACUUM FULL and CLUSTER take, use pg_repack or pg_squeeze instead.
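For illustration only, with a hypothetical index name:

CLUSTER osmm.topographicarea USING topographicarea_pkey;
ANALYZE osmm.topographicarea;

Like VACUUM FULL, this rewrites the whole table and holds an ACCESS EXCLUSIVE lock while doing so.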


Optimize UPDATE

Indexes are not helping this UPDATE; quite the contrary. Since you are updating every row and all involved columns are in the same row, indexes are of no use here, yet they still have to be kept up to date the whole time. It is much cheaper to drop all indexes and recreate them after the UPDATE, especially the comparatively expensive GIN indexes.
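A sketch of that order of operations; the index names here are hypothetical:

DROP INDEX osmm.topographicarea_descriptivegroup_gin;
DROP INDEX osmm.topographicarea_descriptiveterm_gin;

-- ... run the big UPDATE here ...

CREATE INDEX topographicarea_descriptivegroup_gin ON osmm.topographicarea USING gin (descriptivegroup);
CREATE INDEX topographicarea_descriptiveterm_gin  ON osmm.topographicarea USING gin (descriptiveterm);

Do the same for the plain B-tree index mentioned in the question.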

Also, you are updating every row unconditionally. If some of the rows already have the right value in fcode, it would be cheaper not to touch them at all: Postgres writes a new row version even when the new value is identical to the old one, unless you filter such rows out in the WHERE clause.
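A minimal sketch of the pattern for a single rule (splitting the CASE into per-rule UPDATEs is its own design decision; this only illustrates the WHERE clause):

UPDATE osmm.topographicarea
SET    fcode = 2
WHERE  descriptivegroup @> '{Building}'
AND    NOT descriptiveterm @> '{Archway}'
AND    fcode IS DISTINCT FROM 2;

IS DISTINCT FROM also catches rows where fcode is still NULL, which a plain fcode <> 2 would miss.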

But it sounds like you are adding a new column, in which case there is nothing to gain here: the new column starts out NULL in every row, so every row has to be written anyway.

Of course, all the usual advice for performance optimization applies as well.