MongoDB – Fastest way to sync (or keep importing) 3.5TB of data from Hadoop to a sharded MongoDB cluster

Tags: hadoop, import, mongodb, mongodb-3.0

We have 3.5TB of data in our Hadoop cluster (yes, on HDFS). We have also newly built a sharded MongoDB cluster (the latest 3.x) with 3 mongos routers, 3 config servers, and 3 shards (each shard has 1 primary and 2 secondary nodes).

We are looking for the best/fastest way to import this data from Hadoop/HDFS into our newly built sharded MongoDB cluster.

All of this data will go into sharded collections in the MongoDB cluster.
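Not part of the question, but worth noting for a bulk load of this size: before importing, the target collections should be sharded and, ideally, pre-split so the writes spread across all three shards from the start. A minimal sketch, run against a mongos router, assuming a hypothetical database `mydb`, collection `events`, and shard key `user_id` (all placeholders for your own names):

```shell
# Run against one of the mongos routers (hostname is an assumption).
mongo --host mongos1:27017 <<'EOF'
sh.enableSharding("mydb")
// A hashed shard key plus numInitialChunks pre-creates evenly
// distributed chunks across the shards before any data arrives,
// so the bulk load is not bottlenecked on a single shard.
db.adminCommand({
  shardCollection: "mydb.events",
  key: { user_id: "hashed" },
  numInitialChunks: 128
})
EOF
```

With a ranged (non-hashed) shard key you would instead pre-split manually with `sh.splitAt()` based on the known key distribution of your HDFS data.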

We don't have much experience with this and no clue about the fastest way to do it in our environment.

We would appreciate it if anyone could suggest an approach or tools we can leverage; both open source and commercial tools are fine with us.

Joe

Best Answer

Why would you want to copy terabytes of data from one environment to another system? There are mature projects available, such as Apache Drill (to name one), which can seamlessly query MongoDB alongside HDFS without moving the data. Have you explored that option?

When you say you have terabytes of data in HDFS, what is the format of that data? This is a critical question, because the format determines which import path is available to you.
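For example, if the HDFS files turn out to be line-delimited JSON or CSV, one simple approach (a sketch, not a recommendation for every format) is to stream each part file straight from HDFS into `mongoimport` through a mongos, running several imports in parallel. The path `/data/events/`, the hostname `mongos1`, and the database/collection names are all assumptions:

```shell
# Assumes line-delimited JSON part files under /data/events/ (hypothetical
# path) and that mongos1 is one of the mongos routers.
# Each file is streamed from HDFS into mongoimport without landing on
# local disk; running one job per part file, spread across the three
# mongos routers, is usually the main speed lever.
hdfs dfs -ls -C /data/events/part-* | while read -r f; do
  hdfs dfs -cat "$f" | mongoimport \
      --host mongos1:27017 \
      --db mydb --collection events \
      --numInsertionWorkers 8 &
done
wait
```

If the data is instead in a binary format such as Avro, Parquet, or SequenceFiles, a plain `mongoimport` pipeline will not work, and something like the MongoDB Connector for Hadoop (or a Spark job writing to MongoDB) would be the more natural fit.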