PostgreSQL – How to Limit Multiple and Concurrent Executions of a Task

concurrency, postgresql, redis

Let's assume you have a task FOO that can be queued once every minute, and a pool of 50 workers that can be paused. The queue is paused for 10 minutes, and 10 FOO tasks are queued. When the queue is resumed, the 10 FOO tasks will be executed almost concurrently (because there are more workers than tasks).

In this case, I need to ensure that no more than 1 FOO task per minute (time can vary) is performed.

One solution, using Redis, is to take advantage of Redis's atomic operations and the TTL of a key. When a FOO task starts, it checks whether the key worker:FOO exists. If it does, the task exits; if it does not, the task sets the value with a TTL equal to the minimum interval between runs. This is easy to achieve using SETNX worker:FOO whatever and then, if that command returned 1, using EXPIRE worker:FOO with the desired number of seconds.

Because SETNX is atomic, I won't fall into the case where two FOO tasks are executed because of the race condition between the GET and the SET.

Now the question is: what is the correct way to achieve the same result using PostgreSQL? I can have a table with a key and an executed_on timestamp value, but how can I ensure that two FOO tasks are never both executed because of the delay between the moment FOO 1 checks the record and the moment it writes a lock?

Best Answer

Since you're trying to serialize work, I'd update a record in a table.

CREATE TABLE task_keys (
  task varchar(10) primary key,
  last_executed timestamp with time zone not null,
  by_worker_id integer
);

INSERT INTO task_keys(task, last_executed) 
VALUES ('FOO', '-infinity');

Then, to see if you can run a task yet:

UPDATE task_keys SET
  last_executed = current_timestamp,
  by_worker_id = $1
WHERE task = 'FOO'
  AND last_executed < (current_timestamp - INTERVAL '1' MINUTE)
RETURNING *;

The isolation rules of READ COMMITTED guarantee that if this query successfully updates the table and returns a row, no other query can have concurrently done so. A row lock is taken on the relevant task_keys row. If another UPDATE tries to affect the same row, it will wait until the row lock is released by a commit or rollback of the holding transaction, and then it will re-check the WHERE clause. If the other transaction committed, the WHERE clause will no longer match, so the waiting UPDATE will affect zero rows.

See the documentation on transaction isolation.
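To make the blocking behaviour concrete, here is a sketch of two workers racing for the same row, written as an annotated transcript of two sessions. The worker ids 1 and 2 are made up for illustration, the column names follow the CREATE TABLE above, and it assumes the last run of FOO was more than a minute ago.

```sql
-- Worker 1 (session A)
BEGIN;
UPDATE task_keys SET
  last_executed = current_timestamp,
  by_worker_id = 1
WHERE task = 'FOO'
  AND last_executed < (current_timestamp - INTERVAL '1' MINUTE)
RETURNING *;
-- One row comes back: worker 1 holds the row lock and may run FOO.

-- Worker 2 (session B), started while A's transaction is still open
BEGIN;
UPDATE task_keys SET
  last_executed = current_timestamp,
  by_worker_id = 2
WHERE task = 'FOO'
  AND last_executed < (current_timestamp - INTERVAL '1' MINUTE)
RETURNING *;
-- This statement blocks, waiting for A's row lock.

-- Worker 1 (session A)
COMMIT;

-- Worker 2 (session B) now unblocks. The WHERE clause is re-checked against
-- the row as committed by A, no longer matches, and zero rows are returned,
-- so worker 2 skips the task.
COMMIT;
```

Had worker 1 rolled back instead of committing, worker 2's UPDATE would have matched the unchanged row and returned it.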

If you need concurrency this gets a little trickier. What you really want is a token pool that refills on a timer, where workers can grab tokens from the pool to do work. That's effectively what we're doing here, with one token, so one option is to add more rows for the same task and grab the first row whose last_executed timestamp is old enough.
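A rough sketch of that multi-row variant follows. The task_tokens table, its token_id column, and the choice of three tokens are assumptions made for illustration; they are not part of the original schema.

```sql
-- Hypothetical table: one row per token, three tokens for FOO.
CREATE TABLE task_tokens (
  token_id serial PRIMARY KEY,
  task varchar(10) NOT NULL,
  last_executed timestamp with time zone NOT NULL DEFAULT '-infinity',
  by_worker_id integer
);

INSERT INTO task_tokens (task)
SELECT 'FOO' FROM generate_series(1, 3);

-- Try to grab one token whose timer has expired.
UPDATE task_tokens SET
  last_executed = current_timestamp,
  by_worker_id = $1
WHERE token_id = (
    SELECT token_id
    FROM task_tokens
    WHERE task = 'FOO'
      AND last_executed < (current_timestamp - INTERVAL '1' MINUTE)
    ORDER BY last_executed
    LIMIT 1
  )
  -- Repeated in the outer WHERE so that, after waiting on a row lock,
  -- a token another worker has just taken no longer matches.
  AND last_executed < (current_timestamp - INTERVAL '1' MINUTE)
RETURNING *;
```

Zero rows here means either that no token is free or that this worker lost a race for the token it picked; in the second case an immediate retry may still find another free token.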

There are two flaws with this whole approach though:

It doesn't know when a task finishes, so long running tasks could overlap; and
It doesn't care if a task succeeds or fails

To solve those you need to use a proper work queue implementation. Those are currently very difficult to implement correctly in the database, so I suggest you look at using an external message queue / work queue system to manage them. In PostgreSQL 9.5 the new FOR UPDATE SKIP LOCKED feature will make implementing work queues like this in the database quite simple, though.
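For reference, a minimal sketch of what a SKIP LOCKED based queue might look like, assuming PostgreSQL 9.5 or later. The job_queue table and its columns are hypothetical, and this version keeps the transaction open while the task runs, which may not suit long tasks.

```sql
-- Hypothetical queue table.
CREATE TABLE job_queue (
  job_id bigserial PRIMARY KEY,
  task varchar(10) NOT NULL,
  payload text,
  done boolean NOT NULL DEFAULT false
);

BEGIN;

-- Claim the oldest unfinished job that no other worker has locked.
SELECT job_id, task, payload
FROM job_queue
WHERE NOT done
ORDER BY job_id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ... the worker runs the task here, keeping the transaction open ...

-- On success, mark the job finished ($1 is the job_id returned above).
UPDATE job_queue SET done = true WHERE job_id = $1;
COMMIT;
-- On failure, ROLLBACK instead: the row lock is released and the job
-- becomes available to other workers again.
```

Because the row lock lives as long as the transaction, the database knows when a task is still running, and a failed task is simply picked up again.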

BTW, advisory locking is often a good choice for this sort of thing, but it won't help you with the need to expire the lock automatically after a certain amount of time.
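As a small illustration of that limitation, here is how a session-level advisory lock would be taken and released. The key 42 is an arbitrary, application-chosen number standing in for task FOO.

```sql
-- Returns true if this session obtained the lock, false if another
-- session already holds it. It never waits.
SELECT pg_try_advisory_lock(42);

-- ... run the task only if the call above returned true ...

-- Release explicitly; otherwise the lock is held until the session ends.
-- There is no built-in timeout after which it expires.
SELECT pg_advisory_unlock(42);
```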

Related Solutions

Postgresql – EC2 – How to correctly back up PostgreSQL data

See the fine manual. If my advice conflicts with its advice in any way, the manual is right.

A sync isn't a bad idea, unless your copy tool fsync()s each WAL file it writes and the directory it's in before copying the next one. An incomplete last WAL file doesn't matter much; at worst, you just delete it. Pg will generally choke on an incomplete WAL - though there's no checksumming done, so you could be really unlucky and have it try to apply garbage data that by sheer insane chance happened to look like real WAL records. In your position I'd be syncing the volume before a snapshot to make sure any unwritten dirty buffers in RAM hit the file system image on disk. A freeze would help avoid messy but non-fatal partially written WALs, so it's not a terrible idea but not vital. What's vital is to have an undamaged timeline up until the point of recovery. Personally, I write my WALs to a temporary file name and rename them to their final name only once fully copied; if you do this, you don't need to freeze.
Sounds correct. A live snapshot is just like doing a plug-pull test on a live system with write-through caching. Your database should recover fine when restored from a live snapshot, same as after a plug-pull. I'd recommend that you automate tests of restores from snapshots. (Note: a snapshot restore test is not a complete substitute for plug-pull testing because it doesn't account for possible write caching in the disk, RAID controller, etc.) Not only the config files and the dump, but the database itself should be fine after your snapshot. Consider syncing the volume before the snapshot to make sure all the dump data etc. has actually hit disk.

2a. Might save some disk space. Little difference otherwise. You'll get to keep the snapshots a lot longer without all the churn of the live database on them.
Why even snapshot your code volume? A plain file level copy may well be just fine. Certainly a live snapshot should be.
This is not a solid backup scheme. It fails in one critical area: There is no restore testing and validation being performed. You should always test your backups on a regular basis to make sure you can really restore them.

Personally, I recommend that you use WAL shipping, or send database dumps, to a different host, preferably one not on Amazon EC2 or at least in a different region. This host should perform automated restore tests, send reports to you of the results, and should also be checked manually.

While your snapshots (containing dumps) will be on S3, and will be safe there, that doesn't mean they'll be accessible when you need them urgently. Amazon's durability claims are reassuring, but your data can still be safe and completely inaccessible to you during a badly timed outage of the S3 service.

Postgresql – How to improve performance on PostgreSQL when using multiple concurrent processes

You sort of answer your own question when you say you have no pooling but...

This is not an out-of-the-box answer; as with all client/database issues, you may need to do some work to determine exactly what is amiss.

Back up postgresql.conf, then change:

log_min_duration_statement = 0
log_destination = 'csvlog'              # Valid values are combinations of
logging_collector = on                # Enable capturing of stderr and csvlog
log_directory = 'pg_log'                # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
debug_print_parse = on
debug_print_rewritten = on
debug_print_plan = on
log_min_messages = info                 # debug1 for all server versions prior to 8.4

Stop and restart your database server (a reload may not pick up the changes). Reproduce your tests, making sure that the server and client clocks match and that you record the start times etc.
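After the restart, one way to confirm that the server actually picked up the new values is to query the pg_settings view. This is only a convenience check, not part of the original procedure.

```sql
-- List the current values of the logging settings changed above.
SELECT name, setting
FROM pg_settings
WHERE name IN (
  'log_min_duration_statement',
  'log_destination',
  'logging_collector',
  'log_directory',
  'log_filename',
  'debug_print_parse',
  'debug_print_rewritten',
  'debug_print_plan',
  'log_min_messages'
)
ORDER BY name;
```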

Copy the log file off the server and import it into an editor of your choice (Excel or another spreadsheet can be useful for more advanced manipulation of the SQL, plans, etc.).

Now examine the timings from the server side and note:

Is the SQL reported on the server the same in each case?

If it is the same, you should see the same timings.

Is the client generating a cursor rather than passing SQL?

Is the query arriving on the server when you believe it should?

Is one driver doing a lot of casting/converting between character sets, or implicit conversion of other types such as dates or timestamps?

And so on.

The plan data will be included for completeness; it may show whether there are gross differences in the SQL submitted by the clients.
