SQL Server 2016 – Storing PLC Data in Database

Tags: sql-server-2016, storage, table

I'm developing a software solution that requires storing raw data points originating from many PLCs (programmable logic controllers, used to control industrial machines such as conveyor belts and robots) in a SQL Server database. My main concern is how to properly store large amounts of raw numeric data.

Each data point has the following properties:

  1. Timestamped with a datetime
  2. May be a boolean, integer, floating point, or string
  3. A data point may be stored for each second, all year (86,400 seconds/day * 365 days = ~31,536,000 data points per year)

Each PLC I'm collecting data from may have as many as 500 data points, and I want to plan on supporting thousands of PLCs. That means at least 15,768,000,000,000 data points per year (31,536,000 points per year * 500 points * 1,000 PLCs). I currently have a table for each data type (bool, int, float, string), and in an effort to reduce the number of records and the storage required, I store 4 data points per record: [ID, DateTime, DataGroupID, Value1, Value2, Value3, Value4]. A sketch of this layout is below.
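
For reference, a minimal sketch of what one of these wide tables might look like (table and column names are illustrative, not my exact schema):

    -- One table per data type; four values packed into each row.
    -- Table and column names here are hypothetical.
    CREATE TABLE dbo.FloatData
    (
        ID          BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        [DateTime]  DATETIME2(0)         NOT NULL,
        DataGroupID INT                  NOT NULL,  -- which group of 4 points this row holds
        Value1      FLOAT                NULL,
        Value2      FLOAT                NULL,
        Value3      FLOAT                NULL,
        Value4      FLOAT                NULL
    );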

I have this implemented, and it's working well (fast queries, with my largest table at ~72 million records), though it isn't very flexible: changing a point's data type requires moving data from table to table, regrouping points requires table/column changes, etc.

So, my thought to improve flexibility is to keep the data-type-specific tables, but to simply store each data point as its own record. Some quick math on the max value of a bigint (the data type of the record ID) showed that I have plenty of IDs available: bigint tops out at 9,223,372,036,854,775,807, so at ~15.8 trillion rows per year the ID space would last for hundreds of thousands of years. Storage requirements and query performance are still a concern, though. Should I be concerned about anything with this method? Is there a better option?
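
A minimal sketch of that narrow, one-row-per-point layout (again, names are illustrative):

    -- One row per data point; still one table per data type.
    -- PLCID/PointID are hypothetical columns identifying the source.
    CREATE TABLE dbo.FloatDataPoint
    (
        ID         BIGINT IDENTITY(1,1) NOT NULL,
        PLCID      INT                  NOT NULL,
        PointID    INT                  NOT NULL,
        [DateTime] DATETIME2(0)         NOT NULL,
        Value      FLOAT                NOT NULL,
        CONSTRAINT PK_FloatDataPoint PRIMARY KEY CLUSTERED (ID)
        -- Clustering on ([DateTime], PLCID, PointID) instead may serve
        -- time-range queries better; that's a design decision to test.
    );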

The root of my concern is that I simply have no experience with gigantic databases/tables, so while I know that a few million records is small, I still don't know what an over-the-top table looks like. In addition, the storage of the data is the foundation for my software, so changing the design later will require a lot of work.

Best Answer

Let's start with the basics:

Should you store 16 trillion raw data points per year in a relational database? Probably not. Think about what you're going to do with the data afterwards, and what those queries will look like. Relational databases are great for data that has relationships to other tables and needs to be inserted/updated/deleted frequently, or read by common reporting tools, but they're not as good a fit here.

If you don't do it in an RDBMS, where do you do it? Meet the time series database, something designed for this exact purpose. Examples include Graphite and InfluxDB.
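
For a sense of the shape of the data in such a system, a single PLC reading written to InfluxDB via its line protocol might look like this (the measurement, tag, and field names are made up for illustration):

    # Line protocol: measurement,tags fields timestamp
    # "plc_data", "plc_id", "point", and "value" are hypothetical names;
    # the trailing number is a nanosecond Unix timestamp.
    plc_data,plc_id=42,point=conveyor_speed value=12.5 1504224000000000000

Tags are indexed for fast filtering (e.g., per PLC), and the timestamp is first-class, which matches this workload's access pattern exactly.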

If you had to do it, what should you look out for? Any kind of maintenance will be extremely challenging - index rebuilds, statistics updates, backups, DBCC CHECKDB, etc. Instead, consider a dynamic sharding design where data is split into smaller volumes. In the data warehousing realm, for example, large tables are typically split by date range (2017Q3, 2017Q2, 2017Q1, etc.) so that if you need to add new columns for new incoming data, you only have to modify the current table. For reporting purposes, you can union all the tables together in a view - but be warned that any kind of ad-hoc query against a trillion-row view can prove challenging. (Heaven help you if someone wants an ORDER BY and there's no supporting index - goodbye, tempdb drive space.)
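
A rough sketch of that date-range split, with hypothetical table names - the CHECK constraints let the optimizer skip tables whose range can't match the query's date filter:

    -- Hypothetical quarterly shards sharing one schema. The CHECK
    -- constraint on [DateTime] enables partitioned-view elimination.
    CREATE TABLE dbo.FloatData_2017Q2
    (
        PLCID      INT          NOT NULL,
        PointID    INT          NOT NULL,
        [DateTime] DATETIME2(0) NOT NULL
            CHECK ([DateTime] >= '2017-04-01' AND [DateTime] < '2017-07-01'),
        Value      FLOAT        NOT NULL
    );

    CREATE TABLE dbo.FloatData_2017Q3
    (
        PLCID      INT          NOT NULL,
        PointID    INT          NOT NULL,
        [DateTime] DATETIME2(0) NOT NULL
            CHECK ([DateTime] >= '2017-07-01' AND [DateTime] < '2017-10-01'),
        Value      FLOAT        NOT NULL
    );
    GO

    -- Reporting view over all shards. Queries should still filter on
    -- [DateTime] so only the relevant quarterly tables are touched.
    CREATE VIEW dbo.FloatDataAll
    AS
    SELECT PLCID, PointID, [DateTime], Value FROM dbo.FloatData_2017Q2
    UNION ALL
    SELECT PLCID, PointID, [DateTime], Value FROM dbo.FloatData_2017Q3;
    GO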