Sql-server – How to store many smallish time-series in a relational DB

Tags: database-design, oracle, scalability, schema, sql-server

Input Data

Multiple test beds generate measurement data of varying complexity.

In its most basic form, not considering any metadata, one measurement on a test bed is a small time series (from one to a few thousand samples) with a couple of dozen channels/signals/attributes per sample.

Measurements across time and test beds will have a similar set of signals, but not always the same set, because sensors are added to and removed from the test setups.

Data volume

Currently we estimate our data rate at 6 test beds x 4 tests per hour x 12 hours per day x 4000 samples per test = 1,152,000 samples per day; x 365 days = 420,480,000 samples per year.

At 48 columns per sample (currently mostly 32-bit floats, i.e. 4 bytes each), that is about 81 GB (roughly 75 GiB) of raw values per year.

(columns in this case refers to channel/signal)

If and when more test beds are added, the data volume will increase accordingly.
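The estimate above can be re-derived with a few lines of Python. This is only a back-of-the-envelope sketch: the constants are the figures quoted in the question, the variable names are made up for illustration, and index, row, and page overhead in a real database is not included.

```python
# Figures quoted in the question; names are illustrative only.
TEST_BEDS = 6
TESTS_PER_HOUR = 4
HOURS_PER_DAY = 12
SAMPLES_PER_TEST = 4000
DAYS_PER_YEAR = 365
CHANNELS = 48
BYTES_PER_VALUE = 4  # 32-bit float

samples_per_day = TEST_BEDS * TESTS_PER_HOUR * HOURS_PER_DAY * SAMPLES_PER_TEST
samples_per_year = samples_per_day * DAYS_PER_YEAR
raw_bytes_per_year = samples_per_year * CHANNELS * BYTES_PER_VALUE
measurements_per_year = TEST_BEDS * TESTS_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR

print(f"{samples_per_day:,} samples per day")
print(f"{samples_per_year:,} samples per year")
print(f"{raw_bytes_per_year / 1e9:.1f} GB ({raw_bytes_per_year / 2**30:.1f} GiB) of raw values per year")
print(f"{measurements_per_year:,} measurements per year")
```

The last figure is also the number of tables a one-table-per-measurement design would create each year.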

Data Input

The test beds generate the data locally, and the data is then imported asynchronously into the DB. (A few thousand samples might be generated within one second, then reviewed locally, and then either discarded or imported.)

Queries

We expect queries to be mostly on aggregates of the single measurements. For example, we would like to find all measurements (each having about 4k samples) where the mean of channel_output_voltage is within a certain range.

Database layout?

What is a good way to set up tables for this? What factors have to be taken into account?

Theoretically I could go with one table per measurement, which would generate about 100,000 tables per year, but that doesn't strike me as a good idea.

Or I could put everything into one big table (with hundreds of columns) that has room for all channels, with columns added as new channels appear: one row per sample, and unused channels remain NULL.

MEASUREMENTS
------------
measurement_id } PK
time_stamp     }
channel_1 (may be NULL for a certain measurement_id ...)
channel_2
...
channel_n(+1)
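To get a feel for how the typical query looks against this wide layout, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a runnable stand-in for SQL Server or Oracle; the channel_temperature column, the sample values, and the helper function name are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        measurement_id         INTEGER NOT NULL,
        time_stamp             REAL    NOT NULL,
        channel_output_voltage REAL,
        channel_temperature    REAL,   -- hypothetical channel, NULL when not recorded
        PRIMARY KEY (measurement_id, time_stamp)
    )
""")
# Measurement 2 was taken without the temperature sensor, so that column is NULL.
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [(1, 0.0, 4.9, 20.0), (1, 0.1, 5.1, 20.5),
     (2, 0.0, 11.8, None), (2, 0.1, 12.2, None)],
)

def measurements_with_mean_voltage_between(conn, low, high):
    """Return (measurement_id, mean) for measurements whose mean voltage is in range."""
    return conn.execute("""
        SELECT measurement_id, AVG(channel_output_voltage) AS mean_voltage
        FROM measurements
        GROUP BY measurement_id
        HAVING AVG(channel_output_voltage) BETWEEN ? AND ?
        ORDER BY measurement_id
    """, (low, high)).fetchall()

wide_hits = measurements_with_mean_voltage_between(conn, 4.5, 5.5)
print(wide_hits)  # only measurement 1 qualifies
```

The query itself is simple in this layout, but the channel name is baked into the SQL text, so a query on a different channel needs a different statement, and every new sensor needs a schema change.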

Or I could go with an approach of having one table for the samples (timestamps) and one table containing all the values (one row per sample in the SAMPLES table and n rows per sample in the SAMPLE_VALUES table):

SAMPLES            SAMPLE_VALUES
------------       -------------
measurement_id     sample_id
time_stamp         channel_id (links to a channels table where there is a name etc.)
sample_id          channel_value
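The same kind of aggregate query against this narrow layout could look like the sketch below. Again, sqlite3 is only a runnable stand-in for the target DBMS, and the channels table columns, the sample data, and the function name are assumptions for illustration, not part of the original design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE channels (
        channel_id   INTEGER PRIMARY KEY,
        channel_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE samples (
        sample_id      INTEGER PRIMARY KEY,
        measurement_id INTEGER NOT NULL,
        time_stamp     REAL    NOT NULL
    );
    CREATE TABLE sample_values (
        sample_id     INTEGER NOT NULL REFERENCES samples (sample_id),
        channel_id    INTEGER NOT NULL REFERENCES channels (channel_id),
        channel_value REAL    NOT NULL,
        PRIMARY KEY (sample_id, channel_id)
    );
""")
db.executemany("INSERT INTO channels VALUES (?, ?)",
               [(1, "channel_output_voltage"), (2, "channel_temperature")])
db.executemany("INSERT INTO samples VALUES (?, ?, ?)",
               [(1, 1, 0.0), (2, 1, 0.1), (3, 2, 0.0), (4, 2, 0.1)])
# Measurement 2 has no temperature sensor: its rows are simply absent, not NULL.
db.executemany("INSERT INTO sample_values VALUES (?, ?, ?)",
               [(1, 1, 4.9), (2, 1, 5.1), (1, 2, 20.0), (2, 2, 20.5),
                (3, 1, 11.8), (4, 1, 12.2)])

def find_measurements(db, channel_name, low, high):
    """Return (measurement_id, mean) where the mean of the named channel is in range."""
    return db.execute("""
        SELECT s.measurement_id, AVG(v.channel_value) AS mean_value
        FROM sample_values AS v
        JOIN samples  AS s ON s.sample_id  = v.sample_id
        JOIN channels AS c ON c.channel_id = v.channel_id
        WHERE c.channel_name = ?
        GROUP BY s.measurement_id
        HAVING AVG(v.channel_value) BETWEEN ? AND ?
        ORDER BY s.measurement_id
    """, (channel_name, low, high)).fetchall()

narrow_hits = find_measurements(db, "channel_output_voltage", 11.0, 13.0)
print(narrow_hits)  # only measurement 2 qualifies
```

Here the channel is a query parameter, so adding a sensor is an INSERT into the channels table rather than a schema change. The price is two joins and n rows per sample, which at the volumes estimated above means on the order of 20 billion rows per year in SAMPLE_VALUES.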

What other options are there? How can we investigate further which option to choose?

Database products

Due to customer constraints we would like to put this in MS SQL Server or Oracle.

From one answer:

Don't store the raw data, only store aggregates. Seriously.

This assumes that there is a meaningful way to determine ex ante which queries the customer is going to want to run against their data. No way 🙂

Best Answer

Are your queries supposed to collect data for each month/year?

You can use table partitioning to store your data in different physical files. Partitioning can speed up SELECT statements that only need data from a specific period, because only the relevant partitions have to be read. http://msdn.microsoft.com/en-us/library/ms345146%28v=sql.90%29.aspx

When creating a partitioned table in Microsoft SQL Server, you can also place the partitions in different filegroups at different physical locations and back those up separately.

With regard to your question about the database design, you may want to read about normalization here: http://en.wikipedia.org/wiki/Database_normalization

Related Solutions

How to enforce the structural constraints of rectangularly arrayed data

All current RDBMSs let you define CONSTRAINTS on table columns. These constraints are checked every time data is inserted into the table, and they can also check the data against other tables.

We know that each Plate Type has a certain number of Rows and Columns. We can enumerate all Rows and Columns for each Plate Type. So, when data is inserted, the DB can check whether a given row/column combination exists for a given Plate Type.

Let's create a set of tables:

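create table Plate_Types (
Plate_Type_id int not null,
Plate_Size int not null,
Plate_row int not null,
Plate_col int not null,
-- a foreign key can only reference a primary key or unique constraint
primary key (Plate_Type_id, Plate_row, Plate_col));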

This table holds one row for every valid row/column position of every Plate Type, like this:

Id     Size   Row     Col
1       6       1       1    -- 2x3
1       6       1       2
...
1       6       2       3
5       1536    1       1    -- 32x48
...
5       1536    32      48

Then, in your main table from Alternative 1, we introduce a Foreign Key, a "link" to another table, to check whether the row and column are valid for this Plate size.

create table MyTable (
well_id int,
plate_id int,
plate_size_id int,  -- matches Plate_Types.Plate_Type_id
row_id int,
column_id int,
value real);

ALTER TABLE MyTable
ADD CONSTRAINT FK_Plate_SizeCheck
   FOREIGN KEY (plate_size_id, row_id, column_id)
    REFERENCES Plate_Types (Plate_Type_id, Plate_row, Plate_col);

This constraint does the following: for every row inserted into MyTable, the DB looks in table Plate_Types for a row whose (Plate_Type_id, Plate_row, Plate_col) matches the inserted (Plate_size_id, Row_id, Column_id). In other words, it checks whether this Plate Size can have row I and column J. If there is no match, the DB raises an error.

Please note that this is only one of several possible ways to enforce data integrity for your example. Medical data often comes in huge volumes, and the performance of this particular design is a separate question.
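The behaviour of such a composite foreign key can be tried out with the sketch below. It runs on SQLite through Python's sqlite3 module as a stand-in for a server DBMS, so details differ (for example, SQLite only enforces foreign keys after PRAGMA foreign_keys = ON). The inserted well values and the plate_size_id, row_id, column_id column names follow the constraint definition above and are otherwise made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enforcement is off by default
con.executescript("""
    CREATE TABLE Plate_Types (
        Plate_Type_id INTEGER NOT NULL,
        Plate_Size    INTEGER NOT NULL,
        Plate_row     INTEGER NOT NULL,
        Plate_col     INTEGER NOT NULL,
        PRIMARY KEY (Plate_Type_id, Plate_row, Plate_col)
    );
    CREATE TABLE MyTable (
        well_id       INTEGER PRIMARY KEY,
        plate_id      INTEGER NOT NULL,
        plate_size_id INTEGER NOT NULL,
        row_id        INTEGER NOT NULL,
        column_id     INTEGER NOT NULL,
        value         REAL,
        FOREIGN KEY (plate_size_id, row_id, column_id)
            REFERENCES Plate_Types (Plate_Type_id, Plate_row, Plate_col)
    );
""")

# Enumerate every valid position of plate type 1, a 2x3 plate with 6 wells.
con.executemany(
    "INSERT INTO Plate_Types VALUES (1, 6, ?, ?)",
    [(r, c) for r in range(1, 3) for c in range(1, 4)],
)

# Row 2, column 3 exists on a 2x3 plate, so this insert is accepted.
con.execute("INSERT INTO MyTable VALUES (1, 100, 1, 2, 3, 0.42)")

# Row 3 does not exist on a 2x3 plate, so the foreign key rejects the insert.
try:
    con.execute("INSERT INTO MyTable VALUES (2, 100, 1, 3, 1, 0.17)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)

stored_wells = con.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]
print(stored_wells, "well stored")
```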

PS. This is a shortened explanation for non-developers. The code and the table design are for illustrating the concept only.

Sql-server – Identity column value falling behind randomly

Randomly, one of these tables' identity values will fall behind, stopping any inserts from happening and we have no idea why.

Inserts are probably stopping because of an attempt to reuse a value that already exists in the PRIMARY KEY, which triggers an error like:

Msg 2627, Level 14, State 1, Line 27
Violation of PRIMARY KEY constraint 'PK__ID__1234'. Cannot insert duplicate key in object 'dbo.MyTable'. The duplicate key value is (8).
The statement has been terminated.

Note that if your IDENTITY column does not have a UNIQUE index or constraint, it is possible to reseed repeatedly and have many identical ID values. You do not want to do that, of course.

I have not personally found an error that, in itself, would reseed the IDENTITY value. Of course, it is possible to reset the SEED to a range where there will soon be a conflict by running a reseed that is lower than the current seed:

DBCC CHECKIDENT ('MyTable', RESEED, 7) WITH NO_INFOMSGS;

It could be that code somewhere in one of the processes actually does a RESEED on the table under some unusual circumstances.

(For example, this could be from a merge of two data sets, where the code reads the high value from one data set and after the import RESEEDs to the lower of the two high values that were merged.)
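The mechanism can be illustrated with a toy model in plain Python. This is not SQL Server code: the IdentityTable class and its methods are invented for the illustration. It only mimics two documented behaviours, namely that reseeding a non-empty table makes the next insert use the reseed value plus the increment, and that an identity value is consumed even when the insert fails.

```python
class DuplicateKeyError(Exception):
    """Stands in for SQL Server error 2627 (PRIMARY KEY violation)."""


class IdentityTable:
    """Toy model of a table whose primary key is an IDENTITY(1, 1) column."""

    def __init__(self):
        self.keys = set()   # key values currently stored in the table
        self.current = 0    # last identity value handed out

    def insert(self):
        candidate = self.current + 1
        self.current = candidate  # consumed even if the insert fails below
        if candidate in self.keys:
            raise DuplicateKeyError(f"The duplicate key value is ({candidate}).")
        self.keys.add(candidate)
        return candidate

    def reseed(self, value):
        # Models DBCC CHECKIDENT (..., RESEED, value) on a non-empty table.
        self.current = value


table = IdentityTable()
for _ in range(10):
    table.insert()          # the table now holds keys 1..10

table.reseed(7)             # some process reseeds below the current maximum

errors = []
new_id = None
while new_id is None:       # keep retrying, as an application might
    try:
        new_id = table.insert()
    except DuplicateKeyError as exc:
        errors.append(str(exc))

print(errors)
print("first successful id after the reseed:", new_id)
```

In this model the inserts fail for 8, 9 and 10 and succeed again at 11. On a real table with a large gap between the reseed value and the highest existing key, the same effect would look like inserts being blocked for a long time.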

You should also read Martin Smith's post at: https://stackoverflow.com/questions/14146148/identity-increment-is-jumping-in-sql-server-database