SQL Server – How to Set Up Continuous DB Replication to PostgreSQL on AWS RDS

amazon-web-services, aws, postgresql, sql-server

I am looking for a way to set up continuous data replication from an on-prem Microsoft SQL Server to AWS RDS PostgreSQL. The source table is updated once every 24 hours. The challenge is that I don't have any control over the source (on-prem) DB; all I have is a DB user that I can use to query the database. In a nutshell, my requirements are the following:

  • The source machine should not be modified in any way (i.e. no configuration changes applied to the source machine)
  • Cheap (does not require a migration server running 24/7) and easy to manage (set up once and forget it)
  • Ideally, the solution should only migrate the changes made to the source DB table, such as added/removed records. In other words, avoid emptying the target table and re-filling it with the source table records (which is what an AWS DMS "Full Load" would do)

I was looking at AWS DMS Ongoing Replication, but it requires changes to the source machine, so I had to dismiss that solution (see Using a Microsoft SQL Server database as a source for AWS DMS). That solution would also require a "Replication Instance" running 24/7.

Is there anything else I should be looking at that satisfies the above requirements? Or is my only option to go with a "Full-Load" AWS DMS migration, which would potentially be a costly solution given the frequency of updates to the source database table?

I would appreciate any thoughts on the problem, please.

Best Answer

This is a very rough thought process on one thing you could do (and I'm sure there are ways to fine-tune this idea, which I'll continue to update as I think of them), but you can leverage a couple of the things MS SQL Server and AWS offer to meet most of your needs.

In order of events, these are the steps you can take:

  1. Stand up an additional Microsoft SQL Server, possibly using SQL Server Express Edition: this server will house a job that pulls the data from your primary MS SQL Server and compares it to your AWS PostgreSQL instance to generate the change set.

    There are a lot of limitations with SQL Server Express Edition, so you'll have to read up on it, but it's a free edition of Microsoft SQL Server, so it checks off the cheap checkbox if you are able to leverage it. If you can't, a somewhat cheap alternative would be SQL Server Standard Edition on a two-core server. Make sure to install the Integration Services component during the installation of SQL Server (regardless of which edition you choose).

    You can host this in AWS, on a VM, or however you see fit. (In AWS you can create a job that turns this server on and off as needed, so it doesn't have to run 24/7.)

  2. Create a SQL Server Integration Services project: this will be compiled into an SSIS package that will live on the job server above. In the SSIS project you'll be able to connect to your primary Microsoft SQL Server, pull in the data from the source table, pull in the destination copy of the table from your AWS PostgreSQL instance, and then compare the two datasets for changes. How you implement the comparison is up to you, but one function in Microsoft SQL Server that can be useful is HASHBYTES(). You can hash the whole row and compare its hash to your AWS PostgreSQL instance's hashed rows. (I'm not sure if there's an equivalent function in PostgreSQL; if there isn't, you can import your PostgreSQL table data into your MS SQL job server first and then compare both datasets with HASHBYTES.) You'll then be able to output the resultant rows that have changed to a destination; a minimal T-SQL sketch of this hash comparison is included after this list. Unfortunately I don't think you'll be able to output them directly to your AWS PostgreSQL instance, so the next step will be an intermediary step.

  3. Add a step to your SSIS project to output the rows that changed to either another staging table on your job server or to a raw file such as a CSV. (How to implement this is up to you; a staging table means you'll need a more complex process in AWS to pull in the data. I think a CSV is the simpler route, you just need to set up a file share for the file to be saved to.)

  4. Deploy the SSIS project as a package to your Microsoft SQL job server. Then use SQL Server Agent to schedule a job that runs daily and executes the SSIS package (see the job-creation sketch after this list).

  5. Create an S3 bucket (or an alternative AWS storage solution) in your AWS account, to which the raw file from step 3 will be copied.

  6. Create a scheduled job that copies your raw file from step 3 to your S3 bucket from step 5. This again is up to you and can be accomplished in many ways, for example with the AWS CLI (which is the approach used in the job sketch after this list).

  7. Leverage the AWS SDK, or an alternative service AWS provides, to create an AWS job that loads the data from the file in your S3 bucket into your PostgreSQL instance. I'd recommend importing it into a separate staging table in PostgreSQL first (but how you implement that is up to you). (You might be able to roll step 6 into the same job you create for this step.) See this AWS doc for more info: Importing data into PostgreSQL on Amazon RDS

  8. After importing the data above into a staging table in PostgreSQL, you can (either in the same job or a new one) delete from your destination table the records whose primary keys match records in the staging table, then insert the records from the staging table into the destination table (see the PostgreSQL sketch after this list). Although this is technically doing a DELETE / INSERT on the records that were updated, it's still better than truncating the entire table and inserting everything, so I think it should satisfy your requirement of not doing a full load.
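To make step 2 a bit more concrete, here is a minimal T-SQL sketch of the hash comparison. All table and column names (dbo.SourceTable, dbo.TargetCopy, Id, Col1, Col2) are placeholders, and it assumes the PostgreSQL copy of the table has been staged locally on the job server; PostgreSQL does also have an md5() function if you'd rather compute the hashes on that side.

    -- Hash each source row; a row whose hash differs from the staged copy
    -- (or that has no match at all) is a row that was added or changed.
    SELECT s.Id, s.Col1, s.Col2
    FROM dbo.SourceTable AS s
    LEFT JOIN dbo.TargetCopy AS t
           ON t.Id = s.Id
    WHERE t.Id IS NULL  -- new rows
       OR HASHBYTES('SHA2_256', CONCAT(s.Col1, '|', s.Col2))
       <> HASHBYTES('SHA2_256', CONCAT(t.Col1, '|', t.Col2));  -- changed rows

    -- Deleted rows are the reverse: rows in the staged copy with no match in the source.
    SELECT t.Id
    FROM dbo.TargetCopy AS t
    LEFT JOIN dbo.SourceTable AS s
           ON s.Id = t.Id
    WHERE s.Id IS NULL;

One caveat: CONCAT() treats NULLs as empty strings, so pick a delimiter that can't appear in the data (or hash the columns separately) to avoid two different rows producing the same hash.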
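For steps 4 and 6, this is roughly what creating the daily SQL Agent job looks like in T-SQL (you can do the same thing through the SSMS UI). The job name, package path, file share, bucket name, and schedule are placeholders; it assumes the package is run with dtexec and that the AWS CLI is installed and configured with credentials on the job server. Also note that SQL Server Agent is not included in Express Edition (one of the limitations mentioned in step 1), in which case Windows Task Scheduler is the usual fallback.

    USE msdb;
    GO

    -- Hypothetical job: run the SSIS delta package, then push the CSV to S3.
    EXEC dbo.sp_add_job
        @job_name = N'Daily SQL Server to RDS PostgreSQL delta';

    EXEC dbo.sp_add_jobstep
        @job_name          = N'Daily SQL Server to RDS PostgreSQL delta',
        @step_name         = N'Run SSIS delta package',
        @subsystem         = N'CmdExec',
        @command           = N'dtexec /F "C:\SSIS\DeltaExtract.dtsx"',
        @on_success_action = 3;  -- go to the next step instead of quitting

    EXEC dbo.sp_add_jobstep
        @job_name  = N'Daily SQL Server to RDS PostgreSQL delta',
        @step_name = N'Copy CSV to S3',
        @subsystem = N'CmdExec',
        @command   = N'aws s3 cp "\\fileshare\delta\changed_rows.csv" s3://my-replication-bucket/changed_rows.csv';

    -- Run once a day at 02:00.
    EXEC dbo.sp_add_schedule
        @schedule_name     = N'Daily at 02:00',
        @freq_type         = 4,       -- daily
        @freq_interval     = 1,
        @active_start_time = 020000;  -- HHMMSS

    EXEC dbo.sp_attach_schedule
        @job_name      = N'Daily SQL Server to RDS PostgreSQL delta',
        @schedule_name = N'Daily at 02:00';

    EXEC dbo.sp_add_jobserver
        @job_name = N'Daily SQL Server to RDS PostgreSQL delta';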
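And for steps 7 and 8, a sketch of the PostgreSQL side using the aws_s3 extension covered in the AWS doc linked above. The table, column, bucket, and region names are placeholders, and it assumes the RDS instance has an IAM role (or credentials) allowing it to read from the bucket, and that the destination table's primary key is a single id column.

    -- One-time setup on the RDS PostgreSQL instance (also installs aws_commons).
    CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

    -- Staging table with the same structure as the destination table.
    CREATE TABLE IF NOT EXISTS staging_changed_rows (LIKE destination_table INCLUDING ALL);

    -- Step 7: load the CSV from S3 into the staging table.
    TRUNCATE staging_changed_rows;
    SELECT aws_s3.table_import_from_s3(
        'staging_changed_rows',   -- target table
        '',                       -- column list ('' = all columns)
        '(format csv)',           -- COPY options
        aws_commons.create_s3_uri('my-replication-bucket', 'changed_rows.csv', 'us-east-1')
    );

    -- Step 8: replace the changed rows in the destination table.
    BEGIN;
    DELETE FROM destination_table AS d
    USING staging_changed_rows AS s
    WHERE d.id = s.id;

    INSERT INTO destination_table
    SELECT * FROM staging_changed_rows;
    COMMIT;

If the destination table has a primary key constraint, INSERT ... ON CONFLICT (id) DO UPDATE is an alternative to the DELETE / INSERT pair; rows deleted from the source would still need their own handling (for example, a separate file of deleted keys).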

Implementing this process is going to be a lot of work (more than it took to read :), so if it's too much time to build yourself, you might consider paying an AWS partner company to implement a solution for you, but that's for you to decide.

One other thing that might be useful to be aware of in Microsoft SQL Server is Change Data Capture (CDC). This is a feature of SQL Server that logs changes to tables (essentially it maintains a change table for every table you enable it on). It might be useful in conjunction with, or in place of, SSIS, and it is the main component of Microsoft SQL Server that AWS's out-of-the-box DMS uses to migrate data from SQL Server.
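For reference, CDC has to be enabled on the source database and on each table you want tracked, roughly like the following (which is also why it runs into your "no changes to the source machine" requirement). Database and table names are placeholders.

    -- Run on the source database (requires elevated permissions there).
    USE SourceDb;
    GO
    EXEC sys.sp_cdc_enable_db;
    GO

    -- Track one table; CDC then records its inserts/updates/deletes in a
    -- change table (cdc.dbo_SourceTable_CT by default) that you can query.
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'SourceTable',
        @role_name     = NULL;  -- NULL = no gating role required to read the changes
    GO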

Another similar component in Microsoft SQL Server is called Temporal Tables, which offer functionality similar to Change Data Capture (but are probably less used).
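A temporal table is declared like this (placeholder names); SQL Server 2016+ then keeps every prior version of each row in the associated history table, which you can query with FOR SYSTEM_TIME clauses.

    CREATE TABLE dbo.SourceTable_Versioned
    (
        Id        INT           NOT NULL PRIMARY KEY,
        SomeValue NVARCHAR(100) NOT NULL,
        ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SourceTable_History));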

If you have any questions, I'm happy to elaborate on any of the steps in my rough solution above.
