SQL Server – Storing massive amounts of data from a sensor array

database-design, sql-server

I have been tasked with implementing a solution (app and db) to store the data samples from a huge sensor array. The array currently consists of about 20,000 sensors, but it will soon grow to 100,000 sensors. Each sensor sends a data sample every 10 seconds, and each sample is 28 bytes in size.

Doing the sums leads to:

  • 8,640 samples per sensor per day
  • 242 kB of data per sensor per day
  • 864 million samples per day (at 100,000 sensors), roughly 24 GB of data

Now I have been wondering what the best way would be to store/retrieve the data. I "joined" this project after the software had already been specified, so it needs to be implemented on a Windows platform using SQL Server.

The current solution in my head is to create a database with two tables to store the data samples. The first serves as a sort of index into the second, which stores the collated samples in a binary field on a per-day, per-sensor basis:

Table 1:

  RecordID - BigInt - Identity
  SensorID - BigInt - Primary Key
  Date - DateTime - Primary Key (yyyy-mm-dd)

Table 2:

  RecordID - BigInt - Primary Key (from an insert into Table 1)
  Data - Binary 
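
For concreteness, a rough T-SQL sketch of that schema might look like this (the table names are mine; note that a day's 242 kB blob exceeds the 8,000-byte limit of plain BINARY/VARBINARY, so the Data column has to be VARBINARY(MAX)):

    CREATE TABLE dbo.SensorDay
    (
        RecordID  BIGINT IDENTITY(1,1) NOT NULL UNIQUE,
        SensorID  BIGINT NOT NULL,
        [Date]    DATE   NOT NULL,  -- only yyyy-mm-dd is needed, so DATE rather than DATETIME
        CONSTRAINT PK_SensorDay PRIMARY KEY (SensorID, [Date])
    );

    CREATE TABLE dbo.SensorDayData
    (
        RecordID  BIGINT         NOT NULL,
        Data      VARBINARY(MAX) NOT NULL,  -- one day's collated samples for one sensor
        CONSTRAINT PK_SensorDayData PRIMARY KEY (RecordID),
        CONSTRAINT FK_SensorDayData FOREIGN KEY (RecordID)
            REFERENCES dbo.SensorDay (RecordID)
    );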

Basically I will write the samples from all sensors into temporary files (one per sensor). At the end of each day I will create an entry in Table 1, use the generated RecordID, and dump the file into the Data field in Table 2.
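
Assuming the sketch above, the end-of-day load for a single sensor could look roughly like this; OPENROWSET(BULK ..., SINGLE_BLOB) reads a file straight into a binary column (the sensor ID, date, and file path are placeholders):

    DECLARE @SensorID BIGINT = 42,
            @Date     DATE   = '2012-01-31',
            @RecordID BIGINT;

    -- create the index row and capture its generated identity
    INSERT INTO dbo.SensorDay (SensorID, [Date])
    VALUES (@SensorID, @Date);

    SET @RecordID = SCOPE_IDENTITY();

    -- dump the day's temporary file into the Data column
    INSERT INTO dbo.SensorDayData (RecordID, Data)
    SELECT @RecordID, f.BulkColumn
    FROM OPENROWSET(BULK 'D:\SensorTemp\sensor_42.bin', SINGLE_BLOB) AS f;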

This way I end up with only 100,000 rows in the table per day, instead of 864 million. The data should be available on the LAN or a high-speed WAN, so retrieval of the sensor data on a whole-day basis would be acceptable.

Although all data has to be stored, most of it will probably never be read, so the number of reads against the table(s) will not be hugely more than the number of writes.

I know that I could implement something using the file system by just storing the path to the data files, but I have read that SQL Server outperforms NTFS while your binary fields are less than 256 kB. (A gray area exists between 256 kB and 1 MB, while NTFS far outperforms SQL Server for binary sizes > 1 MB.)

I am also slightly wary of storing the data from 100,000 sensors in their own files without causing problems in the file system, either by having huge numbers of files in a folder or by having a complex tree structure with a few files in each folder, and that is without even taking file fragmentation into account.

  1. Can anyone offer me some practical advice/comments on the above?

  2. Are there obvious pitfalls that I am going to fall into?

  3. The sample data does compress quite nicely. A 242 kB file compresses to about 85 kB. Can I, however, implement some type of compression at the database level so that the sample data (column) is compressed automatically?

  4. Is SQL Server an obviously wrong choice for this project?

  5. Is my design of the two tables wise, or could I just as well combine it into a single table that would still be as "performant" as the two tables?

Best Answer

Yes, there is a pretty big pitfall you're going to run into fairly quickly, and that is the size and maintenance of the tables. You are somewhat on the right track in saying that you want to put your data into a temporary table daily and then move it into your permanent table, but you'll soon run into trouble with this scheme.

For example, let's say you want to "roll off" the oldest month's worth of data after two years. In your design, you would have to issue a DELETE statement against your big, big table. This will likely be somewhat slow, depending on the number of indexes you have. It will also cause index fragmentation, and the only way to fix that is to rebuild or reorganize the indexes on this very large table, which would itself cause performance problems. There are a whole host of other issues with a big single-table design as well. For example, with a big, single table, you can't do FILEGROUP-based backups, which means that if you want to have a full backup of your database, it's gonna be BIG, and it's gonna take a LONG time to complete.

What's the solution? Table partitioning. Read about this in depth, in as many places as you can. Basically, partitioning allows you to split your data into "tables within tables" -- each partition shares the same schema and is accessed through the table object, but can be indexed and maintained differently. Partitions are basically tables, cut up by some useful key. In your case it will likely be date. They can be dropped just like (and just as fast as) tables, which means that if you partition your big data tables by date, you can simply drop old partitions instantly, with no adverse effect on the indexes of any of the other partitions. You can put partitions on different filegroups, which means that older partitions can be rolled off, or rolled onto cheaper commodity storage if they're not commonly used. Last but not least, in SQL 2012 you'll be able to create COLUMNSTORE indexes on your older, read-only partitions, while keeping a different, more insert-oriented indexing scheme on the active partition where you're inserting all your sensor data.
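
As a minimal sketch, assuming monthly partitions on a date column and the single-table layout suggested in the P.P.S. below (all names and boundary dates are illustrative, and real partitions would go onto separate filegroups rather than PRIMARY):

    -- one partition per month; new boundary values are split in as time moves on
    CREATE PARTITION FUNCTION pfSensorMonth (DATE)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

    CREATE PARTITION SCHEME psSensorMonth
    AS PARTITION pfSensorMonth ALL TO ([PRIMARY]);

    -- the data table is then created ON psSensorMonth ([Date]),
    -- e.g. the dbo.SensorData table sketched in the P.P.S. below

    -- rolling off the oldest month: switch the partition out to an empty
    -- table with identical structure on the same filegroup, then drop or
    -- archive that table, and merge away the now-empty boundary
    ALTER TABLE dbo.SensorData
        SWITCH PARTITION 1 TO dbo.SensorData_Archive;
    ALTER PARTITION FUNCTION pfSensorMonth() MERGE RANGE ('2012-01-01');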

Hope this helps. You have a good amount of research to do regarding partitioning and partitioning schemes, but hopefully now you know the direction you need to be looking.

P.S.: Oh, and I forgot your list of questions... Answers 1, 2, and 5: see above. Answer 3: in SQL Server, you can compress on a partition-by-partition basis, so compress your older partitions aggressively using PAGE compression. But I believe your out-of-row large data types will not be compressed if you do this -- again, you may want to alleviate this problem by normalizing your sensor values. Answer 4: absolutely not, but if all you want to do is store static data by day and never search on it any other way, compressed flat files may be a much easier way to go.
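
For answer 3 in concrete terms, the per-partition rebuild looks like this (table name and partition number carried over from the sketch above):

    -- rebuild only the oldest partition with PAGE compression; note that
    -- data stored off-row (LOB pages) is not affected by ROW/PAGE compression
    ALTER TABLE dbo.SensorData
        REBUILD PARTITION = 1
        WITH (DATA_COMPRESSION = PAGE);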

P.P.S.: Oh, and another thing. You don't need your two-table solution to make this all work. Large binary sensor data should be of type VARBINARY(MAX), because its values can be stored "out of row" while still being a column in a single table (see the sp_tableoption documentation). You may want to consider normalizing some of your sensor data out of the binary blobs you have in the table, though, because your database won't be good for much beyond retrieving chunks of sensor data by time if you don't.
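
A sketch of that single-table variant, with the sp_tableoption call that forces large values out of row (the table name is the same hypothetical one used above):

    CREATE TABLE dbo.SensorData
    (
        SensorID  BIGINT         NOT NULL,
        [Date]    DATE           NOT NULL,
        Data      VARBINARY(MAX) NOT NULL,
        CONSTRAINT PK_SensorData PRIMARY KEY (SensorID, [Date])
    ) ON psSensorMonth ([Date]);  -- partitioned by date, as recommended above

    -- keep the blobs off the data pages so the in-row data stays small
    EXEC sp_tableoption 'dbo.SensorData', 'large value types out of row', 1;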