PostgreSQL – How to anonymize pg_dump output before it leaves the server

heroku, pg-dump, postgresql

For development purposes, we dump the production database and load it locally. That's fine because the DB is small enough. The company is growing, though, and we want to reduce risk. To that end, we'd like to anonymize the data before it leaves the database server.

One solution we thought of is to run UPDATE statements just before pg_dump, within the same transaction, something like this:

BEGIN;
UPDATE users
SET email = 'dev+' || id || '@example.com'
  , password_hash = '/* hash of "password" */'
  , ...;
-- launch pg_dump as usual, ensuring a ROLLBACK at the end
-- pg_dump must run with the *same* connection, obviously

-- if not already done by pg_dump
ROLLBACK;
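For concreteness, the same idea applied against a throwaway copy of the database, where the UPDATEs can simply be committed before dumping, might look like this; the extra columns are hypothetical:

-- run against a disposable copy restored from a backup, never production
BEGIN;

UPDATE users
SET email = 'dev+' || id || '@example.com'
  , password_hash = 'xxxx'     -- any precomputed hash of a known dev password
  , first_name = 'Dev'         -- hypothetical column
  , last_name = 'User ' || id; -- hypothetical column

COMMIT;

-- then pg_dump the copy as usual, and drop it afterwards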

Is there a ready-made solution for this? Our DB is hosted on Heroku, and we don't have 100% flexibility in how we dump.

I searched for "postgresql anonymize data dump before download" and variations, but I didn't see anything highly relevant.

Best Answer

There are four tough problems to solve when trying to anonymize data. I'm going to summarize an old blog post of mine about it:

1: Anonymized data may grow in size. If you have people's names, for example, you don't really want every anonymous replacement to have exactly the same length as the original value. But if you change the data as it goes out, the solution needs to know each field's maximum length so it doesn't generate data too large to fit.
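A crude sketch of one way to respect the length constraint is a same-length mask - it sidesteps overflow entirely, at the cost of leaking each original value's length (full_name here is a stand-in column):

-- replace each name with a mask of identical length, so the anonymized
-- value can never outgrow the column it came from
UPDATE users
SET full_name = repeat('x', length(full_name));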

2: Anonymized data has a different statistical distribution. Say you've got a date-of-birth field - you could completely randomize the dates, but then the data will be much less meaningful on the other side, especially if you're trying to do performance testing. Plus...
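One hedged sketch of keeping the distribution roughly intact, assuming a date column named date_of_birth: jitter each value by up to a month instead of fully randomizing it.

-- shift each date of birth by -30..+29 days; ages stay realistic enough
-- for performance testing without exposing real birthdays
UPDATE users
SET date_of_birth = date_of_birth + (trunc(random() * 60) - 30)::int;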

3: Anonymized data breaks referential integrity. I know it's a bad idea, but sometimes applications join on fields that need to be anonymized. I hate it, but I've seen folks use email addresses in multiple tables and then expect to join on them or find related orders.
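One common workaround, sketched here with a hypothetical orders table that also stores email: derive the fake value deterministically from the real one, so the same input produces the same output everywhere and those joins still line up.

-- md5 is deterministic, so identical emails stay identical across tables
UPDATE users SET email = md5(lower(email)) || '@example.com';
UPDATE orders SET email = md5(lower(email)) || '@example.com';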

4: Anonymizing slows down the export. You'll need to run some kind of business logic to determine which columns need to be anonymized, and then generate new data in their place.
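As a starting point for that business logic, something like the following query can list candidate columns - the name patterns are just guesses and will need tuning per schema:

-- find columns whose names hint at personal data
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name ~* '(email|name|phone|birth|ssn)';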

I used to say that if you could solve this problem, you'd be rich - but it turns out you'd have some competition. Delphix does data masking for DB2, Oracle, SQL Server, and others, but I don't think they've got Postgres or Heroku covered - yet.