PostgreSQL BDR – Update Conflict Detection

postgresqlpostgresql-bdr

Apparently only the new values of rows are sent between nodes on update. So how can there ever be detectable conflict? Node A receives an update from Node B, and a different update from node C. How can A know that there was a conflict? That the B and C updates were not conducted in some well defined order?

And what does "last-update-wins" mean? Are the updates timestamped? In which case does it rely on synchronized clocks? Or does it just mean whichever update happens to get to Node A last wins?

SUMMARY

Best Answer

The manual discusses much of this.

Yes, commits are timestamped. Conflict resolution uses commit timestamp, not the timestamp of an individual tuple insert/update/delete. See track_commit_timestamps in the postgres docs.

In your example, node A might or might not see a conflict. If B's update happened first (according to B's clock), and B's update reaches A before C's does, A doesn't see a conflict. But if C's update reaches A before B's, and B's happened earlier, then when B's update arrives at A, A needs to know to discard it since it's already applied a newer one. Otherwise it'd become inconsistent with the state of B (which will apply C's update when it receives it) and C (which has overwritten B's older update with its own newer one).

Because it's a loosely coupled replication system with no global lock manager, snapshot manager, transaction manager, etc, there's no one concept of "now" and no one globally consistent state. Or at least it's fuzzy due to latency and replication lag. You'll see such systems (somewhat incorrectly) referred to as "AP", for the "availability" and "partition tolerence" of CAP. It's an eventually consistent system with weaker inter-node guarantees than made by stock PostgreSQL. You can concurrently UPDATE a tuple on two different nodes. You can INSERT the same key into a UNIQUE-constrained column (or same PRIMARY KEY) on two different nodes. etc.

When the nodes replicate and sync up these conflicts need to be resolved so that all nodes have the same outcome and no errors result. We can't just ERROR because the xacts already committed on their origin nodes and we have no way to "un-commit" them; even if we did, we shouldn't, since other xacts on the local nodes might've used those committed changes as inputs into other computations so we're stuck with them.

So we have to do things like pick the newest of two inserts. We use the timestamp of commit for this. It's sent on the protocol messages, and it's recorded by postgres for local commits.

The old row values don't matter for this. For example, we only care that some peer node updated table "x" but we've updated it more recently, so we presume the newer update should be kept. Otherwise when our change replicated to the peer node, it'd apply our change and if we'd also applied theirs, we'd have different final states. To get a consistent result, we must discard their change.

Similarly, if two new rows are inserted, we must consistently pick one of them to keep when they conflict on the same PK. We pick the newest of the two rows based on recorded commit timestamp.

So yes, it relies on clocks. The clocks don't have to be in perfect sync, but roughly right is desirable, otherwise you might overwrite newer tuples with significantly older ones when doing conflict resolution, i.e. pick the wrong "last" update. The manual should discuss clock sync a bit more.


Having access to old row values would be nice in some cases, but is doesn't let you know for sure what the right outcome is. You still have to pick one of multiple possible "correct" results. Consider this:

given table

CREATE TABLE x (
  id integer primary key,
  y integer not null
);

INSERT INTO x (id, y) VALUES (1, 1);

Two nodes run

UPDATE x SET y = y + 1 WHERE id = 1;

at the same time. Both produce a new local row with value y = 2 as a result. Now each replicates to the other.

If we know the old value was 1 and the new value is 2, we might say "well, duh, the correct result is 3 on both nodes".

But we have no way to know that one or both nodes didn't instead run

UPDATE x SET y = 2;

instead of y = y + 1. So it might want the final result to be 2, not 3. The only way to know is to know the application logic. BDR can't tell based on just the old and new tuples.

So really we don't need the old tuple values.