PostgreSQL – Statistical Probability of Incomplete Transaction Commit

postgresql transaction-log

tl;dr

After spending time with some bad databases, which I will not name, I came to question the very point of a database.

I knew that they could serve a purpose to keep permanent data organized, but that was it.

After delving into Postgres, with great help from Craig Ringer and Erwin Brandstetter in particular, I've now come to see what a database could and should be:

  1. Convenient syntax
  2. Customizable storage methods for performance relative to persistence
  3. Organized writes and even reads through serialization
  4. Confidence that writes should never be incomplete

…and so on.

The point

The crash course provided has made me just knowledgeable enough about 1)–3) to be dangerous, but I know no details about 4).

It seems that well-written database software like Postgres is designed to ensure that a transaction will never commit incomplete data.

Speaking out of ignorance, nothing is perfect, so in that vein, how statistically likely is it that Postgres will incompletely commit a transaction and not be able to detect the malfunction?

Best Answer

Speaking out of ignorance, nothing is perfect, so in that vein, how statistically likely is it that Postgres will incompletely commit a transaction and not be able to detect the malfunction?

Barring undetected PostgreSQL or operating system kernel / C library bugs, and in the presence of good quality hardware with ECC memory, nil.

PostgreSQL's write-ahead logging, where it:

  • Records changes first and only to the write-ahead log
  • fsync()s the WAL to ensure it's on disk and persistent
  • and only then writes the changes to the main tables

is extremely robust.
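The three steps above can be sketched as a toy redo-log protocol. This is a deliberately simplified illustration of the ordering (log, fsync, then apply), not PostgreSQL's actual implementation; the file names and record format are invented:

```python
import os

def commit(wal_path, table_path, changes):
    """Durably apply `changes` (a list of strings): WAL first, table second."""
    # 1. Record the changes only in the write-ahead log.
    with open(wal_path, "a") as wal:
        for change in changes:
            wal.write(change + "\n")
        wal.flush()
        # 2. fsync() the WAL so the log records reach stable storage
        #    before the commit is acknowledged.
        os.fsync(wal.fileno())
    # 3. Only then write the changes to the main table file. If we crash
    #    before this point, recovery can replay the WAL to redo the work.
    with open(table_path, "a") as table:
        for change in changes:
            table.write(change + "\n")

commit("wal.log", "table.dat", ["INSERT row 1", "INSERT row 2"])
```

The key design point is the ordering: the commit is durable the moment the fsync on the log returns, regardless of whether the main table write has happened yet.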

You do need to ensure that your operating system and hardware honours fsync requests properly. The pg_test_fsync program is useful for this, though the results require interpretation. Also, beware of cheap SSDs without proper power-failure protection.

The main risks to data integrity are:

  • The system administrator
  • Lack of backups and replication
  • Operating system and PostgreSQL bugs
  • Cheap, poor-quality memory
  • Cheap SSDs without power-failure protection
  • Faulty/overheating hardware, though on modern x86 you'll usually get MCEs warning you about issues

Anyone who gives you "statistics" or numbers on things like the probability of a future bug causing issues is likely full of it.

There's been lots of work done on transient memory errors; the risk is real but the probability is low. I've never seen a bit-flip error in the wild, but I have colleagues who have. Google has published extensive data on DRAM error rates.

The most important thing is proper backup and admin processes.

I wrote a bit more about this topic a while ago.

BTW, there are a few exceptions to the rule that a transaction's effects aren't visible until it commits:

  • Locking (obviously)
  • hash indexes (don't use them)
  • SEQUENCEs (and the SERIAL columns that use them)
  • ... and more I haven't thought of
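The sequence case can be shown with a contrived toy model (plain Python, not PostgreSQL code; the class and names are invented). A nextval()-style counter is incremented outside the transaction's undo machinery, so rolling back the transaction does not give the value back, leaving a gap:

```python
class Sequence:
    """A nextval()-style counter: increments are never rolled back."""
    def __init__(self):
        self.value = 0

    def nextval(self):
        self.value += 1
        return self.value

seq = Sequence()
rows = []

# Transaction 1: consumes id 1, then rolls back its row changes.
tentative = [("row-a", seq.nextval())]
tentative.clear()  # ROLLBACK discards the row, but the increment sticks.

# Transaction 2: commits normally and gets id 2, leaving a gap at id 1.
rows.append(("row-b", seq.nextval()))

print(rows)  # [('row-b', 2)]
```

This is why SERIAL columns can have gaps after aborted transactions: the sequence advance itself is visible (and permanent) even though the inserting transaction never committed.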