PostgreSQL Batch Inserts – Random Silent Failures

amazon-rds, pentaho, postgresql, sql-server

I have been using Kettle to import data from a SQL Server database to PostgreSQL on AWS RDS. When I enabled all of my foreign key constraints, I realized that data was missing, often in even hundreds of rows.

But checking the log of the Kettle job, it seems to be doing all reads and writes correctly for a table of 1923 entries:

2017/02/26 21:36:42 - Table output.5 - Finished processing (I=0, O=240, R=240, W=240, U=0, E=0)
2017/02/26 21:36:42 - Table output.3 - Finished processing (I=0, O=240, R=240, W=240, U=0, E=0)
2017/02/26 21:36:42 - Table output.7 - Finished processing (I=0, O=240, R=240, W=240, U=0, E=0)
2017/02/26 21:36:42 - Table output.6 - Finished processing (I=0, O=240, R=240, W=240, U=0, E=0)
2017/02/26 21:36:42 - Table output.1 - Finished processing (I=0, O=241, R=241, W=241, U=0, E=0)
2017/02/26 21:36:42 - Table output.2 - Finished processing (I=0, O=241, R=241, W=241, U=0, E=0)
2017/02/26 21:36:42 - Table output.4 - Finished processing (I=0, O=240, R=240, W=240, U=0, E=0)
2017/02/26 21:36:42 - Table output.0 - Finished processing (I=0, O=241, R=241, W=241, U=0, E=0)

But when I run a count query on Postgres I get only 1423 entries, and I see similar behavior on maybe 10% of all the tables I have transferred.
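For example, a simple count check like the following shows the discrepancy (the table name here is only a placeholder):

-- Count rows on the PostgreSQL side after the transfer
SELECT COUNT(*) FROM my_table;

-- The same count on the SQL Server side, for comparison
SELECT COUNT(*) FROM dbo.my_table;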

As I don't have any prior experience with either Kettle or Postgres, what can cause this issue?

Best Answer

With the help of the author of the script I used (sqlserver2pgsql), we realized it was a (new) issue with how Kettle handled table truncation when the output step runs in parallel.

https://github.com/dalibo/sqlserver2pgsql/issues/57#issuecomment-282675275

What happened, and the reason I was only missing even hundreds of rows, was that in some cases the output step ran its truncate after one or more parallel workers had already inserted rows into the table, so those already-inserted batches were wiped out.
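As a sketch of one way to avoid this kind of race (not necessarily the exact fix applied in the linked issue): truncate the table once, in a step that runs before the parallel Table output copies start, and leave the "Truncate table" option unchecked on the Table output step itself, so no worker can truncate behind another worker's inserts. The table name below is a placeholder.

-- Run once, before the parallel Table output copies start
-- (for example from a separate SQL step earlier in the job)
TRUNCATE TABLE my_table;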