Postgresql – Robust way to build a DWH without downtime

data-warehousepostgresql

I have a process that build a data warehouse using Postgres. The process runs every day at midnight and works in three steps:

  • Copy the production database (prod) to a temporary database (dwh_tmp) using pg_dump
  • Generate the fact and dimension tables in the temporary schema using the data from the production database
  • Rename the current DWH database (dwh -> dwh_old) and move in the temporary database instead (dwh_tmp -> dwh) using ALTER DATABASE dwh_tmp RENAME TO dwh

The problem is that the last step fails if there are still open connections to the dwh database. To get rid of these connections (e.g. using pg_terminate_backend()) I would need to run the DWH generation process with super-user privileges, which I'd like to avoid for security reasons.

Any suggestions on how to implement this in a more robust way? My main goal is to make sure that the DWH is always in a good state and the downtime is as small as possible, hence my idea of first building a temporary DWH and then renaming it. I'm not in direct control of the connecting processes that keep me from renaming the database (as they're openend from a different host).

Best Answer

For reference, here's the solution that I came up with:

  • Create a new schema dwh
  • Build the new version of the DWH in that shema
  • Drop the old public schema
  • Rename the new dwh schema to public

(this assumes that your main schema is named public, of course)

This method works quite reliably and without interrupting/killing existing database connections, so I prefer the solution over my initial "create and rename database" approach. Here's the SQL that I use for building the DWH (it assumes that you're connected to the DWH database):

--we drop the temporary schema if it already exists
DROP SCHEMA IF EXISTS dwh CASCADE;
--we create the temporary schema
CREATE SCHEMA dwh;

--we modify the search path for this session
SET search_path to dwh;

--create your DWH here...

--now we commit the new version

--make sure you're connected to the correct database when executing this ;)
DROP SCHEMA IF EXISTS public CASCADE;

--we rename the DWH schema to public
ALTER SCHEMA dwh RENAME TO public;

--you might need to restore permissions on the new public schema...
GRANT USAGE ON SCHEMA public TO dwh_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dwh_user;

Apart from needing to manually restore permissions and having to recreate extensions in the new schema, I found this solution to be quite optimal as it does not require superuser privileges for the DWH user, and works without creating/dropping any databases. I have not yet figured out a way to copy permissions from one schema to another, which is my only minor issue with this approach.