MongoDB – Is It Always Faster to Create Indexes After Loading Data?

Tags: index, mongodb, optimization, tokumx

I have a large number of records (~1 billion) that I need to load into MongoDB (actually TokuMX, but whatever). I need to create about 6 different indices on the collection. Is it always faster to load the data and then create the indices? When I look at Mongo's logfile, it seems like Mongo does some kind of large operation (maybe a row count?) before actually starting index creation, and it does this for every index I create.

Will it always be faster to create the indices after loading the data?

If I wait until after loading the data, would it be faster to create all the indices in the background at the same time rather than creating them one by one?

Best Answer

Back in the day we would bulk load our data in this way:

  1. Drop indexes
  2. Load data in the order for which the clustered index would be built (i.e., you export the data in a precise way)
  3. After the load is completed, create the clustered index
  4. Next, create any additional non-clustered indexes
  5. Miller time (this was before I could afford decent beer)

That method always proved faster than leaving the indexes in place. However, this was for Sybase and SQL Server. I imagine other systems would be similar, but I can't say for certain.
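
For what it's worth, here is roughly how the same drop / load / index sequence might look against a MongoDB-style collection. This is only a sketch: it assumes a pymongo-like `collection` object exposing `drop_indexes()`, `insert_many()` and `create_index()`, and the function name, batch size, and index specs are made up for illustration. I haven't measured whether it actually wins on MongoDB or TokuMX, so benchmark it on a sample of your data first.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_then_index(collection, documents, index_specs, batch_size=1000):
    """Drop secondary indexes, bulk insert, then build indexes one by one.

    `collection` is assumed to behave like a pymongo Collection.
    `index_specs` is a list of key lists, e.g. [[("user_id", 1)]]
    (1 = ascending, -1 = descending); the field names are hypothetical.
    """
    # Step 1: drop existing indexes (pymongo keeps the mandatory _id index).
    collection.drop_indexes()

    # Step 2: load the data in batches; ordered=False lets the server
    # continue past individual failures instead of stopping the batch.
    inserted = 0
    for batch in batched(documents, batch_size):
        collection.insert_many(batch, ordered=False)
        inserted += len(batch)

    # Steps 3-4: create the indexes only after the load has finished.
    names = [collection.create_index(keys) for keys in index_specs]
    return inserted, names
```

With a real server you would pass in something like `MongoClient()["mydb"]["records"]` (database and collection names made up here) and a generator that reads your export file, ideally already sorted in the order of the index you care about most, as in step 2 above.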