Recommendation for storage of a series of time series

database-recommendation, time-series-database

Just a few words describing the data:
In my application, there are acceleration measurements (for example at 25 kHz) for an exemplary duration of one second. These measurements are repeated at not necessarily equidistant time steps for each measurement point (maybe every five or ten minutes). This is a kind of interrupted permanent monitoring, with two periodicities:

  • short period: the sampling rate within a burst is 25,000 Hz (the resolution of the measurements)
  • long period: a new burst roughly every 5 minutes (not strictly periodic; the interval may vary)

There are 20 or more of these measurement points.

Since this is time-series data, the first idea might be to use a time-series DB. On the other hand, it seems to me that the main purpose of a time-series DB is the storage of scalar values. Of course, my measurements are scalar values, but I'm not sure it would be a good idea to store every scalar value as a (time/value/measpos_id) triple – that leads to an enormous number of entries, and I think single entries of this kind would never be evaluated.
Another idea could be to store the measurement vector (all values from that second) together with the start time and the measpos_id. But how to do that? Store all values as a blob? Not every system is capable of dealing with vectors – and the vectors may differ in length. Are there concepts in time-series DBs for such problems that I'm not aware of?
Further, for evaluation (extraction), I think retrieving the complete vector would be the most common case; a sketch of the blob idea follows below.
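
For illustration, a minimal sketch of the blob idea, using SQLite and NumPy – the table layout, column names, and the float32 encoding are my own assumptions, not features of any particular time-series DB:

```
# Sketch only: store each ~1 s burst as one row, the whole vector as a blob.
import sqlite3
import numpy as np

conn = sqlite3.connect("bursts.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS burst (
        measpos_id  INTEGER NOT NULL,
        start_time  TEXT    NOT NULL,   -- timestamp of the first sample
        sample_rate REAL    NOT NULL,   -- e.g. 25000.0 Hz
        n_samples   INTEGER NOT NULL,   -- stored because vectors may differ in length
        samples     BLOB    NOT NULL,   -- raw float32 array
        PRIMARY KEY (measpos_id, start_time)
    )
""")

def store_burst(measpos_id, start_time, rate, values):
    arr = np.asarray(values, dtype=np.float32)
    conn.execute("INSERT INTO burst VALUES (?, ?, ?, ?, ?)",
                 (measpos_id, start_time, rate, len(arr), arr.tobytes()))
    conn.commit()

def load_burst(measpos_id, start_time):
    # The most common access pattern: pull one complete vector.
    row = conn.execute(
        "SELECT samples FROM burst WHERE measpos_id = ? AND start_time = ?",
        (measpos_id, start_time)).fetchone()
    return np.frombuffer(row[0], dtype=np.float32)

store_burst(7, "2016-05-01T12:00:00", 25000.0, np.random.randn(25000))
print(load_burst(7, "2016-05-01T12:00:00").shape)   # (25000,)
```

Storing the sample rate and length per row keeps bursts of differing length unambiguous.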
Please feel free to ask if my description is incomplete or if more details would help in finding a good solution.
What are your recommendations? NoSQL or relational SQL? Further ideas? Every hint is welcome. Thanks in advance.

additions:

  • A rough idea of the volume: steady growth of about 1 TB per year
  • Giving a sample is not that easy – I'll try to describe:
    Think of one column with 25,000 float values for each measurement (roughly each minute and for each measurement position), with each of these columns timestamped (at its beginning).
  • Usage for big-data evaluation (meaning: testing many kinds of algorithms): windowing data, FFT (spectral analysis), comparison, aggregations like energetic sum, value of the max amplitude, position (frequency) of the max amplitude, and many more – see the FFT sketch after this list.
  • Purpose (focus) of the evaluation: wear detection for condition monitoring of, for example, rotating machinery (gears, generator sets, turbines, shafts, bearings)
  • Evaluation would (from today's point of view) focus on each separate column and maybe compare it to others – but not combine (stack) columns together.
  • Data size example: 25,000 float values in each column for 20 measured engines every 5 minutes (12 per hour) results in 6e6 floats per hour, or about 5.26e10 floats per year – see the size check after this list.
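
To make the evaluation side concrete, here is a sketch of the per-burst processing listed above, run on a synthetic 25 kHz burst. Reading "energetic sum" as the sum of squared samples is my assumption:

```
# Windowing, FFT (spectral analysis), and two of the aggregates mentioned above.
import numpy as np

rate = 25_000.0                              # sampling rate in Hz
t = np.arange(25_000) / rate                 # one 1-second burst
burst = np.sin(2 * np.pi * 4_000 * t) + 0.1 * np.random.randn(t.size)

windowed = burst * np.hanning(burst.size)    # windowing
spectrum = np.abs(np.fft.rfft(windowed))     # spectral analysis
freqs = np.fft.rfftfreq(burst.size, d=1 / rate)

energetic_sum = np.sum(burst ** 2)           # aggregate: signal energy
peak = np.argmax(spectrum)                   # max amplitude and its position (freq)
print(f"energy={energetic_sum:.1f}, max amplitude at {freqs[peak]:.0f} Hz")
# -> max amplitude at ~4000 Hz for this test signal
```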
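
And a quick check of the volume numbers from the last bullet:

```
# Back-of-the-envelope check of the data size example above.
floats_per_hour = 25_000 * 20 * 12              # 20 engines, 12 bursts/hour
floats_per_year = floats_per_hour * 24 * 365
print(floats_per_hour)                          # 6_000_000  -> 6e6 floats/hour
print(f"{floats_per_year:.3g}")                 # 5.26e+10   -> floats/year
print(f"{floats_per_year * 4 / 1e12:.2f} TB")   # ~0.21 TB/year raw float32
print(f"{floats_per_year * 8 / 1e12:.2f} TB")   # ~0.42 TB/year raw float64
```

The raw samples alone come to well under 1 TB/year, so the ~1 TB estimate presumably includes timestamps and index overhead.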

Best Answer

I can suggest Akumuli. It's a time-series database that supports compression and high-throughput data ingestion. With a 25 kHz measurement frequency and 20 engines, you will need to write 500K data points per second in the worst case. Akumuli can handle throughput more than an order of magnitude larger (the largest throughput ever recorded is around 16M data points per second).

Also, because of compression, the database needs only around 3-9 bytes per data point. Each data point is a timestamp with nanosecond precision plus a 64-bit floating-point value. There is automatic data retention that deletes old data only when there is not enough disk space to store the new data.
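
Applied to the volume from the question, those figures give a rough disk budget (the 3-9 bytes per point are the numbers quoted above; the rest is back-of-the-envelope):

```
# Rough yearly disk budget under the quoted 3-9 bytes per data point.
points_per_year = 25_000 * 20 * 12 * 24 * 365   # ~5.26e10, as in the question
for bytes_per_point in (3, 9):
    tb = points_per_year * bytes_per_point / 1e12
    print(f"{bytes_per_point} B/point -> ~{tb:.2f} TB/year")
# 3 B/point -> ~0.16 TB/year
# 9 B/point -> ~0.47 TB/year
```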

You can store the data from each engine in the same time series, or you can create a new time series per burst; an illustrative sketch of both layouts follows.
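
As an illustration of the two layouts – the metric and tag names below are hypothetical, assuming a metric-plus-tags series naming scheme:

```
# Hypothetical series keys for the two layouts.

# Option 1: one long-lived series per measurement position;
# bursts are simply dense stretches within that series.
per_engine = "acceleration measpos=7"

# Option 2: one series per burst; the burst start becomes a tag,
# which keeps vectors separate but creates many short series.
per_burst = "acceleration measpos=7 burst=20160501T120000"
```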

A real time-series database can be a big win because you won't need all these fancy tricks (like the blob workaround discussed in the question). There are downsides, of course: for example, there is no clustering and no backfill.

Disclaimer: I'm the author so I'm a bit biased.