Though this is a years-old question...
In short, you can understand ACID as a guarantee of data integrity/safety under any expected circumstances.
As in general programming, most of the headaches come from multi-threading.
The biggest issue with NoSQL is mostly ACI; D(urability) is usually a separate issue.
If your DB is single-threaded - so only one user can access it at a time - it is natively ACI-compliant. But virtually no server has that luxury.
If your DB needs to be multi-threaded - serving multiple users/clients simultaneously - you need ACI-compliant transactions. Otherwise you will get silent data corruption rather than simple data loss, which is far worse. This is exactly the same situation as general multi-threaded programming: if you don't have a proper mechanism such as a lock, you get undefined data. In a DB, that mechanism is full ACID compliance.
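The silent corruption described above is the classic "lost update". A minimal, deterministic Python sketch (the `store` dict and `counter` key are invented for illustration) shows what happens when two clients read-modify-write the same value without any isolation:

```python
# Two clients each read the current value, compute their own update from
# that stale snapshot, and write back. The second write silently destroys
# the first - no error, no crash, just wrong data.
store = {"counter": 0}

a_read = store["counter"]   # client A reads 0
b_read = store["counter"]   # client B reads 0

store["counter"] = a_read + 1   # A writes 1
store["counter"] = b_read + 1   # B writes 1, clobbering A's increment

print(store["counter"])  # 1, not the expected 2
```

An ACID-compliant database prevents exactly this, by locking or by MVCC, so that one of the two transactions is serialized after (or rejected because of) the other.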
Many YesSQL/NoSQL databases advertise themselves as ACID-compliant, but in fact very few of them really are.
No ACID compliance = you always get undefined results in a multi-user (client) environment. I can't even think of a DB that works this way.
Single-row/key ACID compliance = you get a guaranteed result if you modify only a single value at a time, but undefined results (= silent data corruption) for simultaneous multi-row/key updates. This covers most currently popular NoSQL DBs, including Cassandra, MongoDB, CouchDB, and so on. These DBs are safe only for single-row transactions, so you must guarantee that your DB logic never touches multiple rows in one transaction.
Multi-row/key ACID compliance = you always get a guaranteed result for any operation. This is the minimal requirement for an RDBMS. In the NoSQL field, very few offer this: Spanner, MarkLogic, VoltDB, FoundationDB. I am not sure there are many more. These DBs are quite new, so little is known yet about their abilities and limitations.
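As a sketch of what multi-row ACID buys you, here is SQLite via Python's stdlib `sqlite3` (the `accounts` table and the simulated crash are invented for illustration): if anything fails mid-transaction, every row touched so far is rolled back, so you never observe a half-applied transfer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 0)])
con.commit()

try:
    with con:  # opens a transaction: commit on success, rollback on exception
        con.execute("UPDATE accounts SET balance = balance - 50 "
                    "WHERE name = 'alice'")
        raise RuntimeError("simulated crash before crediting bob")
except RuntimeError:
    pass

# The partial debit was rolled back; neither row was changed.
print(dict(con.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

A single-row-ACID store would have durably applied the debit and lost the credit, which is precisely the silent corruption discussed above.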
Anyway, the comparison above excludes D(urability), so don't forget to check the durability attribute too. Durability is very hard to compare because the range becomes so wide, and I don't know this topic well, but roughly:
No durability = you can lose data at any time.
Safely stored on disk = once you get COMMIT OK, the data is guaranteed to be on disk. You lose data only if the disk breaks.
Also, there are differences even among ACID-compliant DBs.
"Sometimes ACID-compliant" / "you need configuration" / "no automatic something" / "some components are not ACID-compliant" / "very fast, but you need to turn something off for this" / "ACID-compliant if you use a specific module" = we do not bundle data safety by default. It's an add-on, an option, or sold separately. Don't forget to download, assemble, set up, and issue the proper commands - and data safety may still be silently ignored. Do it yourself, check it yourself, and good luck not making any mistakes. Everyone on your team must be a flawless DBA to use this kind of DB safely. Example: MySQL.
Always ACID-compliant = we don't trade data safety for performance or anything else. Data safety is a forced bundle with this DB package. Most commercial RDBMSs, PostgreSQL.
The above covers a typical DB implementation. But other hardware failures can still corrupt the database: memory errors, data-channel errors, and so on. So you need extra redundancy, and a real production-quality DB must offer fault-tolerance features.
No redundancy = you lose all your data if it becomes corrupted.
Backup = you make snapshot copies and restore from them. You lose any data written after the last backup.
Online backup = you can take a snapshot backup while the database is running.
Asynchronous replication = the copy is updated every second (or other specified period). If the machine goes down, the DB is guaranteed to get the data back after a reboot, but you lose data written after the last replicated period.
Synchronous replication = the copy is updated immediately on every data update. You always have an exact copy of the original data; use the copy if the original breaks.
So far I have seen many DB implementations that lack several of these, and I think that without proper ACID and redundancy support, users will eventually lose data.
A 'data logging' approach to the problem
For anything involving 50,000 inserts a second you will need to batch your inserts. If you have to run it on cheap hardware then you need to look at the efficiency of your I/O.
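A minimal sketch of batched inserts, using stdlib `sqlite3` and an invented `readings` table: buffer a second's worth of incoming rows and commit them in one transaction, rather than paying per-insert transaction overhead 50,000 times.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (ts REAL, device INTEGER, value REAL)")

# One second's worth of buffered readings (values are made up).
buffer = [(time.time(), dev, dev * 0.5) for dev in range(50_000)]

with con:  # a single transaction - and a single fsync - for the whole batch
    con.executemany("INSERT INTO readings VALUES (?, ?, ?)", buffer)

print(con.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 50000
```

The same flush-once-per-second pattern applies whatever store you pick; the point is that the I/O cost is dominated by commits, not rows.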
If you assume:
Up-to-the-second latency is not necessary (i.e. batched writes are acceptable)
You have no requirement for analytics within the application except your graphs by device.
Any further analytic requirements can be met by batching data into an external system
Some administrative overhead is acceptable in adding devices to the system (i.e. they have to be registered).
A polled once per second read is acceptable - incoming items can be batched and recorded on a per-second basis with a null recorded if a datum is not produced by a given device within that period.
Then you can store one row per second with all the devices de-normalised into an array of readings. The application maintains a registry of offsets into this array by device. As you add more devices the BLOB simply expands.
This changes your problem to one of storing one 50K-ish BLOB per second indexed on time, which can be done by pretty much any DBMS platform that supports BLOBs. You might even be able to use a key-value pair system such as Berkeley DB.
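The de-normalised record might be sketched like this, assuming fixed-width readings and a hypothetical offset registry; `NaN` stands in for the null recorded when a device produces no datum in a given second:

```python
import struct

# Hypothetical registry: device ID -> index in the packed array.
offsets = {"dev-a": 0, "dev-b": 1, "dev-c": 2}

def pack_second(readings):
    """Pack one second's readings into a BLOB; NaN marks a missing datum."""
    values = [float("nan")] * len(offsets)
    for device, value in readings.items():
        values[offsets[device]] = value
    return struct.pack(f"<{len(values)}d", *values)

def read_device(blob, device):
    """Pull a single device's reading back out via its array offset."""
    (value,) = struct.unpack_from("<d", blob, offsets[device] * 8)
    return value

blob = pack_second({"dev-a": 21.5, "dev-c": 3.0})
print(read_device(blob, "dev-a"))  # 21.5
```

Note that the 50K-per-second figure below implies roughly one byte per device; 8-byte doubles are used here only to keep the illustration simple, and a real implementation would pick the narrowest width the sensor data allows.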
50K per second is 180MB per hour, 4.3GB per day and approximately 1.5TB per year. This should be possible to manage even with fairly modest hardware. Depending on how much you need to archive you can periodically clear down historical data. You will need something that supports partitioned tables to do this efficiently, though.
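The arithmetic checks out (decimal units, one 50 KB record per second):

```python
# Back-of-envelope storage figures for one 50,000-byte blob per second.
per_hour = 50_000 * 3600        # 180,000,000 bytes = 180 MB
per_day = per_hour * 24         # 4.32 GB
per_year = per_day * 365        # ~1.58 TB

print(per_hour // 10**6, round(per_day / 10**9, 2),
      round(per_year / 10**12, 2))  # 180 4.32 1.58
```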
Getting the data back out
One disadvantage of this approach is that you would have to read your entire data set to query the statistics for any given device, which would imply scanning 4GB of data for a single day. If you need to support a lot of users querying the devices on an ad-hoc basis you will need to find a way to supplement this store with a fast querying capability. Some possibilities for this are:
One hour's data is approximately 180MB in memory. Cache a rolling period of a few hours - either within your application, or in leading partitions on a fast store (maybe a SSD) that can be archived off on an hourly basis.
If your applications that displayed the graphs queried this on a rolling basis (i.e. only got the most recent data) then this structure could be keyed on time and still support a significant query workload.
You may need to benchmark the BLOB reader performance of your platform to judge whether you need to build additional caching within your middle-tier, although you will almost certainly need a middle tier server application of some description to avoid shipping the entire BLOB record out to the client.
If you only expect to monitor a small subset of sensors at any given time, then you may be able to run caching on a per-sensor basis, with the cache living for some period (e.g. 10 minutes) after the last read for that sensor. The main cache would push values out into the per-sensor cache until the per-sensor cache expires. Per-sensor caching could be implemented fairly efficiently as a ring buffer of 1-2 hours (or whatever period seems appropriate) worth of sensor readings with supplementary data for the beginning and end periods. A time period should then resolve into an offset within the ring buffer. In this case the memory overhead per sensor amounts to a few bytes of header information.
Calculate aggregates of average readings for each minute, 10 minutes, hour or suitable period, and store those in a supplementary table. Graphing data over longer time periods can use the aggregates.
Take the aggregate values and un-pivot them into a dimensional structure in a supplementary data mart (i.e. with time and sensor ID dimensions). This can be done on (for example) a nightly basis. Ad-hoc queries on historical data can come from the data mart.
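The per-sensor ring-buffer cache suggested above could be sketched as follows (the class and window size are hypothetical); because there is one slot per second, a timestamp resolves to an offset simply by taking it modulo the window length:

```python
import math

class SensorCache:
    """Fixed window of one-per-second readings for a single sensor."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.buf = [math.nan] * window_seconds  # NaN = no reading yet
        self.latest_ts = None                   # timestamp of newest slot

    def push(self, ts, value):
        self.buf[ts % self.window] = value      # overwrite the oldest slot
        self.latest_ts = ts

    def get(self, ts):
        # Only the most recent `window` seconds are resident.
        if self.latest_ts is None or not (
                self.latest_ts - self.window < ts <= self.latest_ts):
            raise KeyError("timestamp outside cached window")
        return self.buf[ts % self.window]

cache = SensorCache(window_seconds=3600)  # one hour of readings
cache.push(1000, 21.5)
cache.push(1001, 21.6)
print(cache.get(1000))  # 21.5
```

A production version would also have to blank slots for seconds a sensor skipped (so a wrapped-around stale value is never returned); the sketch ignores that for brevity.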
Some combination of these approaches will allow quick access to near-realtime data with a fair amount of read traffic, even if you have to cache it within your application and write something hand-built to do it.
Administrative overhead
A sensor will have to be registered with the system and given an ID (essentially the ID is - or maps to - an array index within the BLOB). As the BLOB is opaque the database doesn't need to care about columns. A KVP store that can deal with BLOBs efficiently may be sufficient.
A sensor can be registered with the system, which will then start recording data. You may need to explicitly record start/end dates of sensor connections, and maintain metadata that links sensor IDs to offsets within the array. You will need to recycle array slots as it's a static structure. Any middle-tier caching will have to be aware of this as well.
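A minimal sketch of such a registry (all names invented), with a free-list so that slots from disconnected sensors are recycled rather than growing the fixed-width array:

```python
class SensorRegistry:
    """Maps sensor IDs to slots in the packed readings array."""

    def __init__(self):
        self.offsets = {}      # sensor ID -> slot index
        self.free_slots = []   # recycled slots, reused before growing
        self.size = 0          # current array width

    def register(self, sensor_id):
        # Reuse a vacated slot if one exists, otherwise widen the array.
        slot = self.free_slots.pop() if self.free_slots else self.size
        if slot == self.size:
            self.size += 1
        self.offsets[sensor_id] = slot
        return slot

    def unregister(self, sensor_id):
        self.free_slots.append(self.offsets.pop(sensor_id))

reg = SensorRegistry()
reg.register("s1")         # slot 0
reg.register("s2")         # slot 1
reg.unregister("s1")
print(reg.register("s3"))  # 0 - s1's slot is recycled
```

Persisting the start/end dates alongside each (sensor, slot) assignment is what lets historical queries attribute a slot's old values to the sensor that held it at the time.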
Message queues
You may wish to consider an asynchronous architecture using a transactional queue manager such as RabbitMQ to manage spikes in the read workload with the regular writes. The logging process and a query fronting process would collect requests to be processed in a batch queue. This would smooth out spikes in the system load. If you anticipate a significant ad-hoc query workload you may wish to look into this.
Pros and Cons - summary
This structure is optimised to minimise the storage space requirements, which should in turn minimise the I/O overhead on the system and allow it to run on relatively cheap hardware - in particular hardware with relatively cheap storage subsystems. In principle an ordinary PC should be able to handle the write workload from the sensors, although you may need faster disk hardware or SSDs to cope with any substantial query workload.
However, this architecture requires a fair bit of plumbing to get the data back out. The Sensor ID cannot be indexed in the database, as it is just an offset into an array of sensor readings.1 This means an entire period's worth of data must be read to get a single sensor reading out. The querying functionality may therefore require a fairly sophisticated caching function in the middle tier to get decent performance, so this approach comes at the price of some development overhead.
1 This is an egregious violation of first normal form, although a relational store may not be necessary or even desirable for this application.
Best Answer
If the two families of data are related then I would generally recommend keeping them in the same database - it makes maintaining integrity easier; for instance, taking a reliable and consistent backup is (depending on the database engine, of course) a single operation.
Of course, if the two families of data are very loosely coupled anyway, this need not be a concern, and using different databases - or even completely different technologies - for each becomes more likely to be a good solution, particularly as you scale out (your first hardware scaling step being to put each family of data on its own (virtual) machine). Remember, though, that if you do split the data over more than one storage technology, you are increasing the amount of knowledge needed to maintain the system.
In MSSQL or similar you could break data with significantly different IO patterns into separate file groups in the same database and store them on different drives if needed for performance - this way all the data can be dealt with atomically for transaction or backup purposes and so forth while still being segregated for performance reasons.