PostgreSQL to Spark SQL – Direct Connection vs HDFS

hadoop, postgresql

If the end goal is to perform computations with Spark, what are the reasons one would first transfer Postgres data into HDFS (with Sqoop) vs. just using Spark SQL against Postgres directly (via JDBC)?
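
For concreteness, by the "direct" option I mean something along these lines in PySpark (host, database, table, and credentials are just placeholders):

```python
from pyspark.sql import SparkSession

# The Postgres JDBC driver jar must be on the Spark classpath, e.g. via --jars.
spark = SparkSession.builder.appName("postgres-direct-read").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "public.orders")
    .option("user", "spark_reader")
    .option("password", "secret")
    # Optional: split the read into parallel JDBC queries.
    .option("numPartitions", 8)
    .option("partitionColumn", "order_id")  # must be numeric or date
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .load()
)

orders.createOrReplaceTempView("orders")
spark.sql("SELECT count(*) FROM orders").show()
```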

The answer to this question (which references MongoDB, not PostgreSQL… but is still applicable) mentions that these are the two options, but I'm wondering what would motivate choosing one over the other.

Best Answer

This might be too broad a question, but as with any system, the fewer steps and dependencies you have, the easier it is to support, as long as it meets current and future (or near-future) business goals.

In your scenario it doesn't look like you have any use for HDFS and Sqoop, but many environments do. For example, a team might want to use HDFS as a file store and keep the records on hand even after they have been processed with Spark. Say they are customer records that get aggregated: by keeping the originals, you could still hand each individual customer their own records if they ask for them.

By going directly into Spark, depending on your job, you may mutate the data and not be able to get the original back. By keeping a copy in HDFS you can also always recreate the end result, because you still have a copy of what produced it.
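
As a rough sketch of that pattern (paths and table names are just examples), you land the raw extract in HDFS untouched and run the transformations only against the landed copy, so the original is always there to reprocess:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("land-then-aggregate").getOrCreate()

# Step 1: pull the table over JDBC and land it in HDFS as-is (raw copy).
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "public.customer_records")
    .option("user", "spark_reader")
    .option("password", "secret")
    .load()
)
raw.write.mode("overwrite").parquet("hdfs:///data/raw/customer_records/")

# Step 2: all downstream aggregation reads from the landed copy, never from
# the live database, so the end result can always be rebuilt from the raw files.
landed = spark.read.parquet("hdfs:///data/raw/customer_records/")
summary = landed.groupBy("customer_id").agg(F.count("*").alias("record_count"))
summary.write.mode("overwrite").parquet("hdfs:///data/aggregated/customer_summary/")
```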

If you don't need the data to persist in the form it was in when you pulled it, and don't need to be able to recreate the output by rerunning the same process, you may not need the extra step. Another consideration is wanting live dev/qa environments: by landing the data in HDFS first, you can follow the same process in every environment using the most recent data. You might not be able to reproduce an issue or scenario if the data gets processed but never stored.
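
One hedged way to picture that: every environment runs the identical job against the same landed snapshot, with only the environment name and snapshot path passed in (the path convention below is hypothetical):

```python
import sys
from pyspark.sql import SparkSession

# Hypothetical convention: environment and snapshot date are arguments, so
# dev/qa/prod all replay the same process over the same landed files.
env, snapshot_date = sys.argv[1], sys.argv[2]  # e.g. "qa" "2024-01-15"

spark = SparkSession.builder.appName(f"pipeline-{env}").getOrCreate()

landed = spark.read.parquet(f"hdfs:///data/raw/customer_records/dt={snapshot_date}/")
result = landed.groupBy("customer_id").count()
result.write.mode("overwrite").parquet(f"hdfs:///data/{env}/customer_summary/")
```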

Those are some of the MANY possible reasons. It really depends.