I have 2 TB of data, currently spread across many CSV files, where each record looks like:
{
'rec_date': "2010-01-29",
'rec_time': "09:15:00",
'site_no': '46',
'data_owner': '1',
'flow_period': '5',
'vehicle_count': '60',
'detector': '8'
}
I would like to insert this into MongoDB so that I can quickly run queries that typically get the vehicle volume for a site and its surrounding sites for the past n Mondays of the past m years. (I wish to analyse the history of a site to determine whether there is abnormal traffic flow for this time of day at this time of year.)
Initially I wanted to store a document for each site, that contained its entire history, like so:
{
site_no: 46,
history: [{
$date: 1264756500000,
readings: [{sensor:1,vehicle_count:60},
{sensor:2,vehicle_count:32},
...
]
},{
$date: 1264756800000,
readings: [....]
}
]
}
But this would make each document very large: the entire history for each site will be more than 16 MB, which is over the document size limit. More traffic volume data arrives every 5 minutes from every site, so I would need to append to the history array very often.
So my question is, how should I format my data so that I can perform the queries I want while minimising data redundancy? Should I just make a new record for every sensor reading?
Best Answer
Yes, create a new record for every reading.
MongoDB and other document stores are designed for efficient retrieval of denormalized or irregularly structured data. A MongoDB collection is essentially a set of key-value pairs with a rich value type. An update to a value is atomic on that whole document, which means that constantly appending to an ever-growing list within the value is not going to be efficient and, as you mentioned, you will eventually hit the document size limit.
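As a rough sketch of what one record per reading could look like, the Python below turns one CSV row from the question into a flat document and builds a filter for the same time slot on the previous n same weekdays. The field name `ts`, the helper names `row_to_doc` and `same_weekday_filter`, and the choice to treat the timestamps as naive (no timezone) are my assumptions for illustration, not something given in the question.

```python
from datetime import datetime, timedelta


def row_to_doc(row):
    """Convert one CSV row (all values are strings) into a per-reading document."""
    # Merge the separate date and time columns into a single timestamp.
    # Assumption: the source has no timezone, so the datetime is left naive.
    ts = datetime.strptime(
        f"{row['rec_date']} {row['rec_time']}", "%Y-%m-%d %H:%M:%S"
    )
    return {
        "site_no": int(row["site_no"]),
        "detector": int(row["detector"]),
        "ts": ts,  # hypothetical field name
        "flow_period": int(row["flow_period"]),
        "vehicle_count": int(row["vehicle_count"]),
        "data_owner": int(row["data_owner"]),
    }


def same_weekday_filter(site_nos, ref, n_weeks, period_minutes=5):
    """Build a query filter for the slot starting at `ref`'s time of day
    on each of the previous `n_weeks` same weekdays, for the given sites."""
    windows = []
    for k in range(1, n_weeks + 1):
        start = ref - timedelta(weeks=k)
        end = start + timedelta(minutes=period_minutes)
        windows.append({"ts": {"$gte": start, "$lt": end}})
    return {"site_no": {"$in": list(site_nos)}, "$or": windows}


if __name__ == "__main__":
    row = {
        "rec_date": "2010-01-29", "rec_time": "09:15:00", "site_no": "46",
        "data_owner": "1", "flow_period": "5", "vehicle_count": "60",
        "detector": "8",
    }
    print(row_to_doc(row))
    print(same_weekday_filter([46, 47], datetime(2010, 1, 29, 9, 15), 2))
```

With pymongo, documents like these could be loaded in batches with `collection.insert_many(...)`, and the filter passed to `collection.find(...)`. A compound index such as `collection.create_index([("site_no", 1), ("ts", 1)])` would probably be the first thing to try so the site and time-range lookups do not scan the whole collection. Extending the filter to the same week in each of the past m years would just mean generating more windows in the same way.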