PostgreSQL Migration – Use Statement Log and WAL for Compatibility

logs, migration, postgresql, postgresql-9.3, transaction-log

We're migrating from postgres 8.1 to 9.3, since we badly need streaming replication. All the application developers deserted the company long ago, way before my time here, so there is really no one who knows what's going on inside the applications. We've been treating them as black boxes.

As a strategy to migrate and test the applications' compatibility with postgres 9.3, I'm considering:

  1. Do a pg_dump to have a starting point.
  2. Log all SQL statements (see the sketch below).
  3. Replay all the logged SQL statements to the new server from the point where I did the pg_dump. This will take care of synchronization, and…
  4. Check the response of the new server to the statements fed in (3), to see if it can digest all the SQL statements that will be sent to it. This will take care of "input" compatibility. Once we're satisfied with the "input" compatibility, and the new server is up to date, then…
  5. Pick a good sample of the SELECT commands from the logs, replay them to both servers, capture the output in files, and compare the files. This will take care of "output" compatibility.

This would run over 6 months, to make sure there will be no surprises. If you see any holes in my plan, feel free to point them out 🙂
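
For step (2), I'm planning something like per-database statement logging on the old server (myapp is the database name here; a cluster-wide log_statement = 'all' in postgresql.conf would work too):

-- Sketch only: log every statement issued against the application's
-- database. Expect large log volumes on a busy system.
ALTER DATABASE myapp SET log_statement = 'all';

-- Takes effect for new sessions connected to myapp; verify from one:
SHOW log_statement;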

The only doubt I have is about the order in which the statements appear in the log file:

Do the statements in the log file appear in the same order as they do in the WAL file?

Thanks

Best Answer

WAL vs statement log ordering

Do the statements in the log file appear in the same order as they do in the WAL file?

That doesn't really make sense.

Statements do not appear in the WAL at all. Their effects do, in terms of transaction begins/commits/rollbacks and changes to table heap and index blocks, but the statement text itself is never recorded in the WAL.

Many queries run concurrently on most systems. The write-ahead log contains the effects of those queries interleaved, in the order in which they did the work, so the effects of all the concurrent queries are in the WAL at the same time.

PostgreSQL logs queries in the order they're received for parsing and execution. Because the operating system scheduler controls when each PostgreSQL backend runs, this isn't necessarily the exact order in which the queries begin executing, and it's certainly not the order in which they finish.

About all you can do is set a log_line_prefix that records the transaction IDs associated with statements, then look for those transaction IDs in the corresponding WAL entries. To do that you have to parse the disk block structure to find the xmin or xmax fields on changed tuples; a tool like pg_xlogdump will help you with this.
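
As a rough sketch of that correlation (mytable is a stand-in table): say postgresql.conf sets log_line_prefix = '%m [%p] xid=%x ', so the %x escape writes each statement's transaction ID into the log. The xid you see in the log lines is the same value attached to that transaction's WAL records:

BEGIN;
-- txid_current() forces xid assignment and returns an epoch-extended
-- 64-bit value; the low 32 bits are the xid that %x logs and that
-- this transaction's WAL records carry.
SELECT txid_current();
INSERT INTO mytable (id, somevalue) VALUES (1, 'blah');
COMMIT;
-- On the WAL side, pg_xlogdump --xid=<that xid> <segment...> (a contrib
-- tool new in 9.3) dumps just the records for that transaction.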

All in all I don't expect you to gain much from trying to match the WAL against the query log.

Replaying statements to compare results

Pick a good sample of the SELECT commands from the logs, replay them to both servers, capture the output on files, compare the files. This would take care of "output" compatibility.

This would work if your application only ever runs one query at a time, i.e. it only uses one connection.

Almost all real world applications do many concurrent writes. You can't simply replay the query log to replicate the effect of concurrent write operations - especially in read committed transaction isolation, where one transaction's commit can affect the results of other concurrent transactions.

If you're logging transaction boundaries and transaction IDs with an appropriate log_line_prefix, you can probably reassemble an interleaved ordering of the individual statements run by each session and use a custom tool to replay them in that exact order. It won't be perfect, though, as individual statements can still be subject to race conditions with each other.

Take the simple example:

INSERT INTO mytable (id, somevalue)
SELECT 1, 'blah'
WHERE NOT EXISTS (SELECT id FROM mytable WHERE id = 1);

If you run this twice at the same time from two different connections, the effects can vary. They'll change based on exactly when each started and on operating system scheduler timing decisions. Possible outcomes include:

  • The first inserts a row and returns INSERT 1, the second inserts no row and returns INSERT 0. Or vice versa.

  • The first inserts a row and returns INSERT 1. The second attempts to insert the same row, because the SELECT did not find any row with id=1, and fails with a duplicate key error.

So as you can see, it's just not possible to replay the exact state of one database onto another using just the query log. And that's before considering things like the effect of rollbacks on sequences, volatile functions, functions using timestamps, and so on.

Limitations of comparing input and output

Some app bugs won't be detected by the approach you describe.

For example, it's entirely normal for rows to be inserted into a table in different orders based on small details of timing, differences in plan choices, etc. A correct application won't mind this, because it either doesn't care about the row order or specifies an ORDER BY clause when it does care.

In reality, though, some apps try to treat tables as if they have some kind of innate order, and get very upset if the table is not "in order". To PostgreSQL a table is just a heap with no innate order at all; it just happened to be returning the rows in the order the app was expecting because that was fastest. If your app relies on behaviour like this, you won't notice the problem with either of your proposed tests.

If you compare the replay results without sorting them, everything will appear to mismatch because of those minor timing and plan differences. If you do sort the results, though, you might not notice cases where the app is making unsafe assumptions about ordering that it gets away with on 8.1 but not on 9.3.
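
Concretely, only an explicit ORDER BY makes result order part of the query's contract, which is also what makes a cross-server output comparison meaningful (mytable as before):

-- Unsafe to compare across servers: heap order is an implementation
-- accident, and 8.1 and 9.3 may return these rows in different orders.
SELECT id, somevalue FROM mytable;

-- Safe to compare: the order is specified by the query itself.
SELECT id, somevalue FROM mytable ORDER BY id;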

Another case is the change of the default bytea_output format. Your app's inserts will produce the same results, but selecting the data will now return it in a different format. If your app doesn't understand \xDEADBEEF hex-format bytea, it might try to interpret it as escape-format (octal) bytea. You'd only notice the problem when you went to save it back to the DB and it got mangled, or when the app raised an error as a result of the misinterpreted bytea data.
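
You can see the format change, and as a stopgap revert it, per session or per database; on 9.3 bytea_output is an ordinary run-time setting:

SELECT decode('deadbeef', 'hex');  -- 9.3 default output: \xdeadbeef (hex)

SET bytea_output = 'escape';       -- restore the pre-9.0 output format
SELECT decode('deadbeef', 'hex');  -- now: \336\255\276\357 (octal escapes)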

So how do I check compatibility?

If your application doesn't have a good unit and integration testing framework, now is the time to add one.

Read the release notes for each major PostgreSQL release and look for specific compatibility challenges in your queries, like missing from clauses, implicit casts to/from text, etc.
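
Two classic examples of statements that behave differently across those releases (mytable is again a stand-in; the first one's 8.1 behaviour assumes add_missing_from was turned on there):

-- Implicit FROM: accepted by 8.1 with add_missing_from = on; always an
-- error on 9.3, which removed that setting entirely.
SELECT mytable.somevalue WHERE mytable.id = 1;

-- Implicit cast to text: length(42) worked up to 8.2 via an implicit
-- integer-to-text cast; on 8.3 and later it fails with
-- "function length(integer) does not exist".
SELECT length(42);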

Manually test every part of the app you can against a dump of the old DB (taken with 9.3's pg_dump) restored into a test copy of the new DB.

You should also simply review your queries to look for likely issues. Some things, like a LIMIT clause without a corresponding ORDER BY, are pre-existing bugs in the app code that it might have got away with before, and might still get away with on 9.3 ... or might not. PostgreSQL tries to detect obvious bugs, but there are legitimate cases where you might use a LIMIT without an ORDER BY, so in this case and many related ones it can't warn you, because that might really be what you intended.
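
For instance, both of these run on every version, but only the first is a latent bug if the app expects one particular row back:

-- Plan-dependent: which row comes back is unspecified, and 9.3 may pick
-- a different plan (and therefore a different row) than 8.1 did.
SELECT id, somevalue FROM mytable LIMIT 1;

-- A legitimate LIMIT without ORDER BY: grab an arbitrary sample of rows
-- to eyeball, where the order genuinely doesn't matter.
SELECT * FROM mytable LIMIT 100;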

The biggest problems you're going to face are going to be:

  • The switch to standard_conforming_strings = on, changing the handling of backslash escapes in strings (demonstrated below);
  • Removal of the implicit FROM clause, causing previously valid-looking but nonstandard and potentially unsafe queries to fail;
  • Removal of implicit casts to text, causing previously valid-looking (but probably buggy) queries to fail;
  • The change of the default bytea_output (if you use bytea fields).
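
The strings change is easy to demonstrate on the 9.3 side (standard_conforming_strings is read-only on 8.1 but settable from 8.2 on). Note that E'' strings behave the same on both versions, which makes them one migration-safe rewrite:

SET standard_conforming_strings = off;  -- emulate the 8.1 behaviour
SELECT 'one\ntwo';   -- \n becomes a newline (and logs a warning)
SET standard_conforming_strings = on;   -- the 9.3 default
SELECT 'one\ntwo';   -- the backslash is literal: one\ntwo
SELECT E'one\ntwo';  -- E'' strings always interpret escapes, on any version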