Mongodb – How to load csv data into mongodb and choose the types I’d like to have created for each column

etlimportloadmongodb

I'm looking for a method to load csv data in mongodb and specify the types that I'd like to have created for each column?

I've been using mongoimport but I find it quite painful because there doesn't appear to be any way to specify the datatype to be used for a particular column in the csv file.

I load the data …

mongoimport --drop -d myDatabase -c myCollection --type csv --fields name --file myFile.csv

Then I do analysis to see what types mongo actually created.

> db.artist.itunes.feed.count()
36545
> db.artist.itunes.feed.count( { 'name' : { $type : 2 } } )
36511
> db.artist.itunes.feed.count( { 'name' : { $type : 1 } } )
1
> db.artist.itunes.feed.count( { 'name' : { $type : 16 } } )
33

Then I do surgery to correct errors like so …

db.artist.itunes.feed.find( { 'name' : { $type : 1 } } ).forEach( function (x) { 
  x.name = new String(x.name); // convert field to string 
  db.artist.itunes.feed.save(x); 
});

This is tedious.

It would be great if there were a method that allowed me to specify at import time what type to create for each column in the csv file like so:

mongoimport --drop -d myDatabase -c myCollection --type csv --fields field1,field2 --types 2,2 --file myFile.csv

http://docs.mongodb.org/manual/reference/operator/query/type/

Best Answer

Ensure you're using mongoimport version 3.4 or higher, and use the --columnsHaveTypes parameter to tell the import what type each column is.

See the MongoDB Reference Documentation for details.

The following types can be specified:

auto()
binary(<arg>)
boolean()
date(<arg>)
date_go(<arg>)
date_ms(<arg>)
date_oracle(<arg>)
decimal()
double()
int32()
int64()
string()

Consider the following CSV data:

Katherine Gray, 1996-02-03, F, 1235, TG9yZW0gaXBzdW0gZG9sb3Igc2l0IGFtZXQsIGNvbnNlY3RldHVyIGFkaXBpc2NpbmcgZWxpdCwgc2VkIGRvIGVpdXNtb2QgdGVtcG9yIGluY2lkaWR1bnQgdXQgbGFib3JlIGV0IGRvbG9yZSBtYWduYSBhbGlxdWEuIFV0IGVuaW0gYWQgbWluaW0gdmVuaWFtLCBxdWlzIG5vc3RydWQgZXhlcmNpdGF0aW9uIHVsbGFtY28gbGFib3JpcyBuaXNpIHV0IGFsaXF1aXAgZXggZWEgY29tbW9kbyBjb25zZXF1YXQuIER1aXMgYXV0ZSBpcnVyZSBkb2xvciBpbiByZXByZWhlbmRlcml0IGluIHZvbHVwdGF0ZSB2ZWxpdCBlc3NlIGNpbGx1bSBkb2xvcmUgZXUgZnVnaWF0IG51bGxhIHBhcmlhdHVyLiBFeGNlcHRldXIgc2ludCBvY2NhZWNhdCBjdXBpZGF0YXQgbm9uIHByb2lkZW50LCBzdW50IGluIGN1bHBhIHF1aSBvZmZpY2lhIGRlc2VydW50IG1vbGxpdCBhbmltIGlkIGVzdCBsYWJvcnVtLg==
Albert Gilbert, 1992-04-24, T, 13, Q3VwY2FrZSBpcHN1bSBkb2xvciBzaXQgYW1ldCB0b290c2llIHJvbGwgYm9uYm9uIHRvZmZlZS4gQ2FuZHkgY2FuZXMgcGllIGNyb2lzc2FudCBjaG9jb2xhdGUgYmFyIGxvbGxpcG9wIGJlYXIgY2xhdyBtYWNhcm9vbi4gU3dlZXQgcm9sbCBjdXBjYWtlIGNoZWVzZWNha2Ugc291ZmZsw6kgYnJvd25pZSBpY2UgY3JlYW0uIEp1anViZXMgY2FrZSBjdXBjYWtlIG1hY2Fyb29uIGRhbmlzaCBqZWxseS1vIHNvdWZmbMOpLiBDYWtlIGFwcGxlIHBpZSBnaW5nZXJicmVhZCBjaG9jb2xhdGUgc3VnYXIgcGx1bS4gU3dlZXQgY2hvY29sYXRlIGNha2UgY2hvY29sYXRlIGNha2UganVqdWJlcyB0aXJhbWlzdSBvYXQgY2FrZS4gU3dlZXQgc291ZmZsw6kgY2hvY29sYXRlLiBMaXF1b3JpY2UgY290dG9uIGNhbmR5IGNob2NvbGF0ZSBtYXJzaG1hbGxvdy4gSmVsbHkgY29va2llIGNha2UgamVsbHkgYm

The --fields option specifies which field type mongoimport will use when importing the data into MongoDB (command-line wrapped for readability):

mongoimport --db users --collection contacts --type csv 
  --columnsHaveTypes --fields "name.string(),birthdate.date(2006-01-02),contacted.boolean(),followerCount.int32(),user thumbnail.binary(base64)" 
  --file /example/file.csv