PostgreSQL – Most efficient way to do many UPDATEs in PostgreSQL

Tags: postgresql, update

I have to update about 2.5 million rows in a table with about 20 million rows.

The new value of the updated column is based on the value of another column, but an additional SELECT is required to determine it. The column being updated is also part of an index.

Currently I have generated an SQL script with one UPDATE statement per row, but I imagine this is not very efficient. What is the most efficient way to do this, and are there any post-update actions to perform to keep the table performing well? The table is in constant use (SELECTed from, UPDATEd, INSERTed into), and I don't want this to cause any downtime. The PostgreSQL version is 9.1.

EDIT

The updates in the script are of the form

UPDATE table1 SET fid=123 WHERE id=345;

where the id column is an integer and the primary key of the table.

The schema looks like this:

CREATE TABLE table1 (
    id serial PRIMARY KEY,
    date timestamp,
    rname varchar(50),
    fid integer references table2(id)
);

CREATE INDEX ON table1 (fid, date);

CREATE TABLE table2 (
    id serial PRIMARY KEY,
    rname varchar(50)
);

The fid column was recently added, and I need to update some records in table1 so that they are properly linked by record id rather than by the rname column, which causes data redundancy and other problems.

I've written a Perl script that generates the set of UPDATE statements based on the current state of the tables. Newly inserted records in table1 already have fid filled in properly; I need to update the old ones.
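
For illustration, the mapping the script computes boils down to a join on rname; a minimal set-based sketch, assuming rname uniquely identifies a row in table2 (a single statement like this updates all rows in one transaction, which may not be desirable on a live table):

-- Set-based equivalent of the generated per-row statements
-- (assumption: rname uniquely identifies a row in table2).
UPDATE table1
SET fid = table2.id
FROM table2
WHERE table1.rname = table2.rname
  AND table1.fid IS NULL;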

Best Answer

If you don't want to interfere with other activity, then one UPDATE at a time in autocommit mode is very likely the best option. You should probably set synchronous_commit=off in that session (and only that session).
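
A minimal sketch of that, assuming the generated script is fed to psql:

-- Applies to this session only. A server crash can lose the most recent
-- commits, but cannot corrupt data; interrupted work is simply re-run.
SET synchronous_commit = off;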

The indexes are going to slow you down, perhaps by a lot depending on your RAM and your IO system. But if the index is necessary for the other actions you don't want to interfere with, then there isn't anything you can do about it.

But since fid is not yet correctly populated, the index on it is probably not actually useful to the concurrent processes you want to avoid interfering with, as they haven't yet been changed to rely on that column being accurate. If that is the case, you can drop that index to gain speed and rebuild it in bulk later. The same probably applies to the foreign key constraint.
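
A sketch of that, assuming the auto-generated names table1_fid_date_idx and table1_fid_fkey (check \d table1 for the actual ones):

DROP INDEX table1_fid_date_idx;                      -- assumed name
ALTER TABLE table1 DROP CONSTRAINT table1_fid_fkey;  -- assumed name

-- ... run the generated UPDATE script here ...

-- CONCURRENTLY avoids blocking concurrent writes during the rebuild.
CREATE INDEX CONCURRENTLY table1_fid_date_idx ON table1 (fid, date);

-- On 9.1 the constraint can be added as NOT VALID and validated separately,
-- keeping the exclusive lock on table1 short.
ALTER TABLE table1 ADD CONSTRAINT table1_fid_fkey
    FOREIGN KEY (fid) REFERENCES table2 (id) NOT VALID;
ALTER TABLE table1 VALIDATE CONSTRAINT table1_fid_fkey;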

Once that index is gone, your updates can proceed via HOT (Heap Only Tuples) updates provided each block has enough free space. In that case, the updates will not have to do maintenance on the primary key index, either, saving that much more IO. To maximize the likelihood that this will work optimally, it is important that each UPDATE be its own transaction. That way one UPDATE can reuse space freed up by an earlier one.
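
To verify that the updates are in fact going HOT, you can watch the counters in pg_stat_user_tables before and after a batch:

SELECT n_tup_upd,     -- total rows updated
       n_tup_hot_upd  -- updates that went HOT (no index maintenance)
FROM pg_stat_user_tables
WHERE relname = 'table1';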

Also, your WHERE clause should probably look like this:

WHERE id=345 AND fid IS NULL;

That way, if the script gets interrupted, you can re-run it with minimal damage: rows whose fid has already been set are simply skipped.

Since you seem to be running this on a test system already, an EXPLAIN (ANALYZE, BUFFERS) of some of the updates would be helpful, especially with track_io_timing turned on.
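
A sketch of how to capture that on the test system (EXPLAIN ANALYZE really executes the UPDATE, hence the ROLLBACK; track_io_timing only exists from PostgreSQL 9.2 on, so omit that line on 9.1):

BEGIN;
SET LOCAL track_io_timing = on;  -- 9.2+ only; omit on 9.1
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table1 SET fid = 123 WHERE id = 345 AND fid IS NULL;
ROLLBACK;  -- undo the test update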