Sql-server – How to store many smallish time-series in a relational DB

database-designoraclescalabilityschemasql server

Input Data

Multiple Tests Beds generate measurement data of various complexity.

In it's most basic form, not considering any meta-data, one measurement on a Test Bed will be a small (1 – a few thousand samples) time-series with a couple of dozen channels/signals/attributes per sample.

Measurements across time and Test Beds will have a similar set of signals, but not always the same as sensors are added and removed for the test setups.

Data volume

Currently we estimate our data rate at 6 testbeds x 4 test per hour x 12 hours a day x 4000 samples per test == 1,152,000 samples per day x 365 == 420,480,000 samples per year

_ x 48 columns per sample (currently 32 bit floats, mostly) ~~~ 75 GB per year

(columns in this case refers to channel/signal)

If/When more testbeds are added the data volume might increase accordingly.

Data Input

The test beds generate the data locally and the data is then imported asynchronously into the db. (A few thousand samples might be generated in a the time of one second, then reviewed locally and then either scratched or imported.)

Queries

We expect queries to be mostly on aggregates of the single measurements. I.e., you like to find all measurements (each having 4k samples) where e.g. the mean of channel_output_voltage is within a certain range.

Database layout?

What is a good way to set up tables for this? What factors have to be taken into account?

Theoretically I could go with one table per measurement generating 100,000 tables per year, but it doesn't strike me as a good idea.

Or I could stick everything into one big table (with hundreds of columns) that has room for all channels and channels get added as needed: One row per sample. Unused channels remain NULL.

MEASUREMENTS
------------
measurement_id } PK
time_stamp     }
channel_1 (may be NULL for a certain measurement_id ...)
channel_2
...
channel_n(+1)

Or I could go with an approach of having one table for the samples (timestamps) and one table containing all the values: (one row per sample in MEASUREMENTS table and n rows per sample in the SAMPLE_VALUES table)

SAMPLES            SAMPLE_VALUES
------------       -------------
measurement_id     sample_id
time_stamp         channel_id (links to a channels table where there is a name etc.)
sample_id          channel_value

What other options are there? How to further investigate which option we should choose?

Database products

Due to customer constraints we would like to put this in MS SQLS or Oracle.


From one answer:

Don't store the raw data, only store aggregates. Seriously.

This assumes that there is a meaningful way to determine ex antes what queries the customer is going to want to run against their data. No way 🙂

Best Answer

Are your queries supposed to collect data for each month/year?

You can use partitions to store your information in different physical files. Partitions can increase the speed of SELECT statements when you only need information about a specific period. http://msdn.microsoft.com/en-us/library/ms345146%28v=sql.90%29.aspx

When creating a partitioned table in Microsoft SQL Server you can also create different file groups on different physical locations and back those up separately.

With regards to your question about the database design, you may want to read about normalization here: http://en.wikipedia.org/wiki/Database_normalization