Postgresql – Replicating PostgreSQL data into Citus/Greenplum

Tags: data-warehouse, postgresql, replication

I need to integrate data from 3 different PostgreSQL databases (OLTP application backends) into a data warehouse. For the data warehouse itself I am considering Citus or Greenplum. There is a requirement that the data from the applications has to be synced to the data warehouse as close to real time as possible (anything above a 3-5 minute delay is unacceptable; real-time replication would be best). In this regard I have the following questions:

  1. Will Postgres logical replication work with Citus? Citus is a Postgres extension, so can you treat a Citus cluster as an ordinary Postgres database? If yes, then logical replication should theoretically work, but how does it handle distributed tables?
  2. Greenplum is a Postgres fork, so will Postgres logical replication work with it at all? I have also read that Greenplum is not optimized for OLTP workloads; does that mean it will break when I try to ingest OLTP data into it?
  3. If logical replication does not work with Citus/Greenplum, how else can I stream data out of Postgres? Do I need to stream logical-level WAL into Kafka and then write custom logic that translates it into SQL statements on the target database? Are there any tools for that?
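For context, native logical replication in plain Postgres is set up with a publication on the source and a subscription on the target; the table names and connection string below are hypothetical placeholders:

```sql
-- On the source OLTP database (requires wal_level = logical in postgresql.conf):
CREATE PUBLICATION app_pub FOR TABLE orders, customers;

-- On the target database:
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=oltp-db dbname=app user=replicator password=secret'
    PUBLICATION app_pub;
```

Whether `CREATE SUBSCRIPTION` applies changes correctly when the target table is a Citus distributed table is exactly the open question here.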

Bonus question: does anyone have experience with both Citus and Greenplum, especially with their SQL limitations? I know that Citus does not fully support correlated subqueries and recursive CTEs, does Greenplum have any similar limitations?

I would appreciate any help with these questions. I tried googling, but there is little to no information on the subject; could you please give at least some direction?

Best Answer

After some research I've managed to come up with an answer. Kudos to @a_horse_with_no_name for pointing me in the right direction.

Real-time sync between different data sources can be done using Kafka Connect.

Kafka Connect lets you send data from various systems to a Kafka cluster via Source Connectors and read data from Kafka via Sink Connectors. All the platform-specific code for getting data from a system into Kafka, or from Kafka into a system, is abstracted away by the maintainers of each connector.
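A connector is deployed by POSTing a JSON configuration to the Kafka Connect REST API. As a rough sketch (hostnames, credentials, and table names are made up; exact property names can vary between Debezium versions), a Debezium Postgres source might look like this:

```json
{
  "name": "app-db-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "oltp-db",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "app",
    "topic.prefix": "app",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders,public.customers"
  }
}
```

Under the hood this uses the same logical decoding machinery as native logical replication (`pgoutput`), but Debezium consumes the change stream and publishes it to Kafka topics (one per table) instead of applying it to a subscriber.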

Connector maintainers usually respect a kind of "contract" between Sinks and Sources, i.e. they agree in advance on a normalized message format, which lets you mix disparate Sources and Sinks (e.g. you can have a PostgreSQL Source feeding Elasticsearch & Neo4j Sinks in parallel, despite the fact that they have completely different paradigms).
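In Debezium's case, that normalized format is a change-event envelope describing each row change. Simplified (real events carry a schema section and more source metadata), an insert into a hypothetical `orders` table would look roughly like:

```json
{
  "payload": {
    "before": null,
    "after": { "id": 42, "status": "shipped" },
    "source": { "db": "app", "schema": "public", "table": "orders" },
    "op": "c",
    "ts_ms": 1700000000000
  }
}
```

`op` distinguishes creates, updates, and deletes (`"c"`, `"u"`, `"d"`), and `before`/`after` carry the row images, which is what allows a generic sink to translate the event into the appropriate DML on its side.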

For the case of syncing Postgres with Citus/Greenplum, this should be possible via the Debezium PostgreSQL Source Connector and the JDBC Sink Connector (both Citus and Greenplum support JDBC).
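A matching JDBC sink could be sketched like this (again, names and credentials are hypothetical; property names are from the Confluent JDBC Sink Connector and may differ in other JDBC sinks). Note the `unwrap` transform: the JDBC sink expects a flat row, so Debezium's envelope is typically flattened with the `ExtractNewRecordState` single-message transform first:

```json
{
  "name": "dwh-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://citus-coordinator:5432/dwh",
    "connection.user": "dwh_writer",
    "connection.password": "secret",
    "topics": "app.public.orders,app.public.customers",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
```

With `insert.mode = upsert` the sink turns updates into `INSERT ... ON CONFLICT` style statements keyed on the record key, which keeps the warehouse tables converging on the source state rather than accumulating duplicates.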