MongoDB Experience: Bulk Loading Data

MongoDB has a mongoimport command. The docs only show the usage, without any examples. Here are my first examples.



You need to specify your database (-d) and collection (-c) for importing. In my example, I also specified the collection fields with (-f).

The --file option is not required; specifying the filename as the last argument also works.

$ mongoimport -d test -c foo -f a --type csv data
connected to:
imported 10 objects

NOTE: The default type is JSON, so you can get some nasty errors if you forget the csv type.

Wed Jun  9 11:18:26 Assertion: 10340:Failure parsing JSON string near: 1
0x68262 0x23968 0x250563 0x251c7b 0x24cb00 0x250280 0x1af6
 0   mongoimport                         0x00068262 _ZN5mongo11msgassertedEiPKc + 514
 1   mongoimport                         0x00023968 _ZN5mongo8fromjsonEPKc + 520
 2   mongoimport                         0x00250563 _ZN6Import9parseLineEPc + 131
 3   mongoimport                         0x00251c7b _ZN6Import3runEv + 2635
 4   mongoimport                         0x0024cb00 _ZN5mongo4Tool4mainEiPPc + 2880
 5   mongoimport                         0x00250280 main + 496
 6   mongoimport                         0x00001af6 start + 54
exception:Failure parsing JSON string near: 1
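
One way to avoid forgetting the type is to derive it from the file extension before calling mongoimport. The following is only a sketch: guess_type is a hypothetical helper of my own naming, not part of MongoDB, and it assumes your files carry a .csv or .tsv extension (anything else falls back to json, the default).

```shell
# Hypothetical helper: pick a mongoimport type from the file extension.
# Assumes .csv/.tsv extensions; anything else falls back to json (the default type).
guess_type() {
  case "$1" in
    *.csv) echo csv ;;
    *.tsv) echo tsv ;;
    *)     echo json ;;
  esac
}

guess_type data2.csv   # prints: csv
```

The result can then be passed along, for example as --type "$(guess_type data2.csv)". Note that a file without an extension, like the data file above, would still be treated as JSON.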

In my second example I’m adding multiple fields. This time my data file also has a header row, which you can skip with (--headerline).


name, age
Mickey Mouse,65
Minnie Mouse,64
Donald Duck,
Taz Devil,22
Marvin the Martian,45
$ mongoimport -d test -c foo -f name,age --type csv --headerline data2.csv
connected to:
imported 6 objects
> db.foo.find();
{ "_id" : ObjectId("4c0fb0dfa5cd86585be6ca63"), "a" : 0 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca64"), "name" : "Mickey Mouse", "age" : 65 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca65"), "name" : "Minnie Mouse", "age" : 64 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca66"), "name" : "Donald Duck" }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca67"), "name" : "Taz Devil", "age" : 22 }
{ "_id" : ObjectId("4c0fb2bea5cd86585be6ca68"), "name" : "Marvin the Martian", "age" : 45 }
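
Notice that Donald Duck has no age in the data file, and the imported document simply has no "age" field. If you want to spot such rows before loading, something like the sketch below may help. rows_with_empty_fields is a hypothetical helper name, the sample file is a cut-down copy of the data above, and the check assumes a simple CSV with no quoted commas.

```shell
# Hypothetical helper: list CSV rows (after the header) that contain an empty field.
# Assumes a simple CSV with no quoted commas; prints "line number: row".
rows_with_empty_fields() {
  awk -F, 'NR > 1 {
    for (i = 1; i <= NF; i++)
      if ($i == "") { print NR ": " $0; break }
  }' "$1"
}

# Small sample modelled on the data file above
printf 'name,age\nMickey Mouse,65\nMinnie Mouse,64\nDonald Duck,\n' > sample.csv
rows_with_empty_fields sample.csv   # prints: 4: Donald Duck,
```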

You can also use the --drop argument to truncate your collection before loading.

Real Data

I’m going to use the Freebase Olympics data to perform a more robust test.

bunzip2 olympics.tar.bz2
tar xvf olympics.tar
cd olympics

Loading this data via the following convenience script gave me some more meaningful data.

> db.olympic_host_city.find();
{ "_id" : ObjectId("4c0fb666a5cd86585be7d9b6"), "name" : "Vancouver", "id" : "/guid/9202a8c04000641f80000000000401e2", "olympics_hosted" : "2010 Winter Olympics" }
{ "_id" : ObjectId("4c0fb666a5cd86585be7d9b7"), "name" : "Moscow", "id" : "/guid/9202a8c04000641f800000000002636c", "olympics_hosted" : "1980 Summer Olympics" }
{ "_id" : ObjectId("4c0fb666a5cd86585be7d9b8"), "name" : "St. Moritz", "id" : "/guid/9202a8c04000641f80000000001c33e8", "olympics_hosted" : "1948 Winter Olympics,1928 Winter Olympics" }

Here is the simple load script I used.


#!/bin/bash

# Load one TSV file into a collection named after the file
load_file() {
  local INPUT_FILE=$1
  [ -z "${INPUT_FILE}" ] && echo "ERROR: File not specified" && return 1

  echo "Loading file ${INPUT_FILE}"

  # Collection name is the file name up to the first dot
  COLLECTION=`echo ${INPUT_FILE} | cut -d. -f1`

  # Field list from the header row: tabs become commas, spaces become underscores
  FIELDS=`head -1 ${INPUT_FILE} | tr '\t' ',' | tr ' ' '_'`
  echo "mongoimport -d olympics -c ${COLLECTION} --type tsv --headerline -f $FIELDS --drop ${INPUT_FILE}"
  time mongoimport -d olympics -c ${COLLECTION} --type tsv --headerline -f $FIELDS --drop ${INPUT_FILE}
  return 0
}

# Load every .tsv file in the current directory
process_dir() {
  echo "Processing" `pwd`
  for FILE in `ls *.tsv`
  do
    load_file ${FILE}
  done
  return 0
}

main() {
  process_dir
}

main $*
exit 0
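
To see what the FIELDS line in load_file produces, the header conversion can be tried on its own. This is a minimal sketch, done here with tr: fields_from_header is a hypothetical name, and the sample header is made up to resemble the host city file.

```shell
# Hypothetical helper: turn a TSV header row into a mongoimport -f field list.
# Tabs become commas; spaces inside a column name become underscores.
fields_from_header() {
  head -1 "$1" | tr '\t' ',' | tr ' ' '_'
}

# Made-up sample file with a two-word column name
printf 'name\tid\tolympics hosted\nVancouver\t/guid/1\t2010 Winter Olympics\n' > sample.tsv
fields_from_header sample.tsv   # prints: name,id,olympics_hosted
```

This is why the imported documents end up with a field called "olympics_hosted" rather than a name containing a space.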
