Recommendation for storage of a series of time series

database-recommendation, time-series-database

Just a few words describing the data:
In my application, there are acceleration measurements (for example at 25 kHz) for an exemplary duration of one second. These measurements are repeated at not necessarily equidistant time steps for each measurement point (maybe every five or ten minutes). This is a kind of interrupted permanent monitoring, with two periodicities:

  • short period: the sampling rate within a burst is 25,000 Hz (the resolution of the measurements)
  • long period: a new burst roughly every 5 minutes (not strictly periodic; the interval may vary)

There are 20 or more of these measurement points.

Since this is time-series data, the first idea might be to use a time-series DB. On the other hand, it seems to me that the main purpose of a time-series DB is the storage of scalar values. Of course, my measurements are scalar values, but I'm not sure it would be a good idea to store every scalar value as a (time/value/measpos_id) triple – that leads to an enormous number of entries, and I think single entries of this kind would never be evaluated.
Another idea could be to store the measurement vector (all values from that second) together with the start time and the measpos_id. But how to do that? Store all values as a blob? Not every system is capable of dealing with vectors – and the vectors may differ in length. Are there concepts in time-series DBs for such problems that I'm not aware of?
Further, for evaluation (extraction), I think retrieving the complete vector would be the most common case; a sketch of the blob idea follows below.
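
For illustration, a minimal sketch of the blob idea, using SQLite and NumPy – the table layout, column names, and the float32 encoding are my own assumptions, not features of any particular time-series DB:

```
# Sketch only: store each ~1 s burst as one row, the whole vector as a blob.
import sqlite3
import numpy as np

conn = sqlite3.connect("bursts.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS burst (
        measpos_id  INTEGER NOT NULL,
        start_time  TEXT    NOT NULL,   -- timestamp of the first sample
        sample_rate REAL    NOT NULL,   -- e.g. 25000.0 Hz
        n_samples   INTEGER NOT NULL,   -- stored because vectors may differ in length
        samples     BLOB    NOT NULL,   -- raw float32 array
        PRIMARY KEY (measpos_id, start_time)
    )
""")

def store_burst(measpos_id, start_time, rate, values):
    arr = np.asarray(values, dtype=np.float32)
    conn.execute("INSERT INTO burst VALUES (?, ?, ?, ?, ?)",
                 (measpos_id, start_time, rate, len(arr), arr.tobytes()))
    conn.commit()

def load_burst(measpos_id, start_time):
    # The most common access pattern: pull one complete vector.
    row = conn.execute(
        "SELECT samples FROM burst WHERE measpos_id = ? AND start_time = ?",
        (measpos_id, start_time)).fetchone()
    return np.frombuffer(row[0], dtype=np.float32)

store_burst(7, "2016-05-01T12:00:00", 25000.0, np.random.randn(25000))
print(load_burst(7, "2016-05-01T12:00:00").shape)   # (25000,)
```

Storing the sample rate and length per row keeps bursts of differing length unambiguous.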
Please feel free to ask if my description is incomplete or if more details would help in finding a good solution.
What are your recommendations? NoSQL or relational SQL? Further ideas? Every hint is welcome. Thanks in advance.

additions:

  • A rough idea of the volume: steady growth of about 1 TB per year
  • Giving a sample is not that easy – I'll try to describe:
    Think of one column with 25,000 float values for each measurement (roughly each minute and for each measurement position), with each of these columns timestamped (at its beginning).
  • Usage for big-data evaluation (meaning: testing many kinds of algorithms): windowing data, FFT (spectral analysis), comparison, aggregations like energetic sum, value of the max amplitude, position (frequency) of the max amplitude, and many more – see the FFT sketch after this list.
  • Purpose (focus) of the evaluation: wear detection for condition monitoring of, for example, rotating machinery (gears, generator sets, turbines, shafts, bearings)
  • Evaluation would (from today's point of view) focus on each separate column and maybe compare it to others – but not combine (stack) columns together.
  • Data size example: 25,000 float values in each column for 20 measured engines every 5 minutes (12 per hour) results in 6e6 floats per hour, or about 5.26e10 floats per year – see the size check after this list.
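
To make the evaluation side concrete, here is a sketch of the per-burst processing listed above, run on a synthetic 25 kHz burst. Reading "energetic sum" as the sum of squared samples is my assumption:

```
# Windowing, FFT (spectral analysis), and two of the aggregates mentioned above.
import numpy as np

rate = 25_000.0                              # sampling rate in Hz
t = np.arange(25_000) / rate                 # one 1-second burst
burst = np.sin(2 * np.pi * 4_000 * t) + 0.1 * np.random.randn(t.size)

windowed = burst * np.hanning(burst.size)    # windowing
spectrum = np.abs(np.fft.rfft(windowed))     # spectral analysis
freqs = np.fft.rfftfreq(burst.size, d=1 / rate)

energetic_sum = np.sum(burst ** 2)           # aggregate: signal energy
peak = np.argmax(spectrum)                   # max amplitude and its position (freq)
print(f"energy={energetic_sum:.1f}, max amplitude at {freqs[peak]:.0f} Hz")
# -> max amplitude at ~4000 Hz for this test signal
```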
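
And a quick check of the volume numbers from the last bullet:

```
# Back-of-the-envelope check of the data size example above.
floats_per_hour = 25_000 * 20 * 12              # 20 engines, 12 bursts/hour
floats_per_year = floats_per_hour * 24 * 365
print(floats_per_hour)                          # 6_000_000  -> 6e6 floats/hour
print(f"{floats_per_year:.3g}")                 # 5.26e+10   -> floats/year
print(f"{floats_per_year * 4 / 1e12:.2f} TB")   # ~0.21 TB/year raw float32
print(f"{floats_per_year * 8 / 1e12:.2f} TB")   # ~0.42 TB/year raw float64
```

The raw samples alone come to well under 1 TB/year, so the ~1 TB estimate presumably includes timestamps and index overhead.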

Best Answer

I can suggest Akumuli. It's a time-series database that supports compression and high-throughput data ingestion. With a 25 kHz measurement frequency and 20 engines, you will need to write 500K data points per second in the worst case. Akumuli can handle throughput more than an order of magnitude larger (the largest throughput ever recorded is around 16M data points per second).

Also, because of compression, the database needs only around 3-9 bytes per data point. Each data point is a timestamp with nanosecond precision plus a 64-bit floating-point value. There is automatic data retention that deletes old data only when there is not enough disk space to store the new data.
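
Applied to the volume from the question, those figures give a rough disk budget (the 3-9 bytes per point are the numbers quoted above; the rest is back-of-the-envelope):

```
# Rough yearly disk budget under the quoted 3-9 bytes per data point.
points_per_year = 25_000 * 20 * 12 * 24 * 365   # ~5.26e10, as in the question
for bytes_per_point in (3, 9):
    tb = points_per_year * bytes_per_point / 1e12
    print(f"{bytes_per_point} B/point -> ~{tb:.2f} TB/year")
# 3 B/point -> ~0.16 TB/year
# 9 B/point -> ~0.47 TB/year
```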

You can store the data from each engine in the same time series, or you can create a new time series per burst; an illustrative sketch of both layouts follows.
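
As an illustration of the two layouts – the metric and tag names below are hypothetical, assuming a metric-plus-tags series naming scheme:

```
# Hypothetical series keys for the two layouts.

# Option 1: one long-lived series per measurement position;
# bursts are simply dense stretches within that series.
per_engine = "acceleration measpos=7"

# Option 2: one series per burst; the burst start becomes a tag,
# which keeps vectors separate but creates many short series.
per_burst = "acceleration measpos=7 burst=20160501T120000"
```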

A real time-series database can be a big win because you won't need all these fancy tricks (like the blob workaround discussed in the question). There are downsides, of course: for example, there is no clustering and no backfill.

Disclaimer: I'm the author so I'm a bit biased.