First, each time you `UPDATE` the `status` column, you also have to update the index (source). Evaluate your indexing to see whether you really need the index on the `status` column. My guess is no: it has extremely low cardinality, and MySQL probably won't use it anyway.
If you ignore me and think you do need it, follow the advice in the article to drop the index before your loop and re-add it after you're done.
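The drop/re-add pattern from the article looks roughly like this (the table name `data` is from your query, but the index name is an assumption, since yours isn't shown):

```sql
-- Drop the low-cardinality index before the bulk UPDATE loop
ALTER TABLE data DROP INDEX idx_status;

-- ... run the UPDATE loop here ...

-- Re-create it afterwards, if you decide you still need it
ALTER TABLE data ADD INDEX idx_status (status);
```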
Here are some other things you might do if that doesn't help:
You are taking all the columns from the `data` table but only using `number`. Don't do a `SELECT *`; instead do a `SELECT number`. That won't help your writes, but it is good performance practice: only select the columns you're using.
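As a minimal sketch (the `WHERE` clause is an assumption, since your original query isn't shown):

```sql
-- Instead of pulling every column:
SELECT * FROM data WHERE process = X;

-- fetch only what the loop actually uses:
SELECT number FROM data WHERE process = X;
```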
Your `number` index isn't getting used at all, which means it is not unique enough to be useful for updating. (Slight tangent: how many rows does a single `UPDATE` affect?) I would drop it, or at least add it to the `process` index.
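If you do keep an index for these updates, a composite index on both filter columns would let MySQL narrow the rows far better than `number` alone (the index names here are assumptions):

```sql
-- Drop the index that MySQL isn't using anyway
ALTER TABLE data DROP INDEX idx_number;

-- One composite index serves both WHERE process = X and WHERE process = X AND number = Y
ALTER TABLE data ADD INDEX idx_process_number (process, number);
```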
It looks like `process` is unique enough for MySQL to whittle the row count down to 16k instead of 1 million. In light of this, I would add `AND process=x` to your update statements (I'm assuming you know `process` from the original `SELECT` statement):
```sql
-- FAILED --
UPDATE data SET status = 2, error = '$error' WHERE process = X AND number = $data['number'];
-- SUCCESS --
UPDATE data SET status = 1 WHERE process = X AND number = $data['number'];
```
A hint about unnecessary indexes in InnoDB: since you don't have a primary key defined, InnoDB uses a hidden one and includes it when it writes the indexes. So for each index, you add the size of the index plus the size of the hidden primary key to the data file. If you're not using an index (or MySQL can't use it), you are wasting space and adding overhead every time you insert a new `number` (the same goes for `status`, as discussed earlier).
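One way to stop paying for the hidden key is to declare an explicit primary key yourself, so secondary indexes reference a column you control (the column name is an assumption):

```sql
-- InnoDB then clusters on this key instead of a hidden 6-byte row id,
-- and every secondary index stores this value as its row pointer
ALTER TABLE data ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
```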
You write:

> Each customer can have multiple sites, but only one should be displayed in this list.
Yet your query retrieves all rows. That would be a point to optimize. You also do not define which `site` is to be picked.
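In PostgreSQL, one common way to express "one site per customer" is `DISTINCT ON` (table and column names here are assumptions, since your schema isn't shown):

```sql
-- Returns exactly one site row per customer;
-- the second ORDER BY column decides which one "wins"
SELECT DISTINCT ON (customer_id) *
FROM   site
ORDER  BY customer_id, site_id;
```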
Either way, it does not matter much here. Your `EXPLAIN` shows only 5026 rows for the `site` scan (5018 for the `customer` scan), so hardly any customer actually has more than one site. Did you `ANALYZE` your tables before running `EXPLAIN`?
From the numbers I see in your `EXPLAIN`, indexes will give you nothing for this query; sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning? Or maybe the query itself is faster and "half a second" includes network transfer? `EXPLAIN ANALYZE` would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds.
The query itself needs 27 ms, so not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected: the slow part is the network transfer (plus ssh encoding/decoding, and possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column `row_number() OVER (<your sort criteria here>) AS mv_id`.
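A sketch of what that could look like (the sort column and selected columns are placeholders; only `mv_id` and the view name come from the proposal above):

```sql
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name) AS mv_id, c.*
FROM   customer c;

-- The index that makes range queries on mv_id fast
CREATE UNIQUE INDEX ON materialized_view (mv_id);
```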
Then you can query:
```sql
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id < 2800;
```
This will perform very fast. `LIMIT` / `OFFSET` cannot compete: the server has to compute and sort the whole table before it can skip ahead and pick your 100 rows.
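For contrast, this is what the `LIMIT` / `OFFSET` version of the same page would look like (a sketch, using the same assumed view):

```sql
-- The server must sort everything and step over the first 2700 rows
-- before it can return any
SELECT *
FROM   materialized_view
ORDER  BY mv_id
LIMIT  100 OFFSET 2700;
```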
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote the pgAdmin help about that:

> The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.
If you want to see the time on the server, use SQL `EXPLAIN ANALYZE`, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
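For example, prefixing the range query from above (server-side time only; the client transfer is excluded):

```sql
EXPLAIN ANALYZE
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id < 2800;
```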
Best Answer
Having thought a bit more about the problem, I realise there's actually a relatively simple solution. Since each process only updates its 'own' fields, the solution is to split the data into two tables, allowing each half to be updated by its own process; any `SELECT`s can be done by joining on the common primary key.
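A sketch of such a split (all table and column names are assumptions, since the original schema isn't shown):

```sql
-- Each process writes only to its own table, so the two UPDATE
-- streams never contend for the same rows
CREATE TABLE data_status (
    number INT NOT NULL PRIMARY KEY,
    status TINYINT NOT NULL,
    error  VARCHAR(255)
);

CREATE TABLE data_process (
    number  INT NOT NULL PRIMARY KEY,
    process INT NOT NULL
);

-- Reads re-assemble the full row via the shared primary key
SELECT s.status, s.error, p.process
FROM   data_status  s
JOIN   data_process p USING (number);
```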