PostgreSQL – Which data model/schema to apply to a time-series data store for data sources with varying fields

database-design postgresql schema time-series-database timescaledb

I'm asked to develop a data store for time-series data, yet despite significant research I'm unsure about the data model and storage technology to choose.

About the data

The source data to be stored is provided by physical measurement units. Each unit may have a different subset of variables, with up to 300 variables per measurement station (e.g. fuel type, fuel consumption, speed), while the number of distinct signals across all stations is on the order of 1500. The expected subset of variables per station is known beforehand; however, additional sensors might be added to a station over time, so schema changes may be required. The stations provide data at varying rates, ranging from 0.2 Hz to 20 Hz.

In addition, a fair amount of metadata is available for each of these measurement stations, of which we will have about 500 in the end.

The data usually comes in batches and not as a "realtime" stream, with batch sizes ranging from hourly to monthly.

About the queries

Querying is done for two main purposes: reporting and statistical analysis on the data of a single measurement station, and cross-station comparison. About 80% of the queries concern data that arrived during the last 30 days. Querying is done on a daily basis, so the SELECT load exceeds the INSERT load.

Ideally queries like

SELECT var1, var2, ... varN FROM station_data WHERE (station_id = X OR station_id = Y) AND timestamp BETWEEN ... AND ...;

would be possible, for ease of data access for non-SQL specialists. Moreover, simple time-based aggregations (AVG, MAX, etc.) should be possible.
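For illustration, such a time-based aggregation over a wide station_data table (a minimal sketch; the table layout and column names are the hypothetical ones from the query above) could look like this in plain PostgreSQL:

-- hypothetical wide table station_data("timestamp", station_id, var1, var2, ...)
SELECT station_id,
       date_trunc('day', "timestamp") AS day,
       avg(var1) AS var1_avg,
       max(var2) AS var2_max
FROM station_data
WHERE "timestamp" >= now() - interval '30 days'
GROUP BY station_id, day
ORDER BY station_id, day;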

Current situation

Currently, a highly normalized structure is used to store the data in a PostgreSQL database, which has by now grown to about 6 TB, with one table per variable. Each of the roughly 1500 data tables is of the form

(timestamp, station_id, value)

with indexes on (station_id), (station_id, timestamp), (timestamp) and a unique constraint on (station_id, timestamp, value).

This structure requires heavy outer joining (up to 300 outer joins), which makes data retrieval cumbersome and computationally expensive.
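To make the pain point concrete, here is a minimal sketch of the current layout and the kind of join it forces (var1_data, var2_data and the column types are placeholders, not the real names):

-- one table per variable, ~1500 of these
CREATE TABLE var1_data (
    "timestamp" timestamptz NOT NULL,
    station_id  int         NOT NULL,
    value       double precision,
    UNIQUE (station_id, "timestamp", value)
);
-- plus indexes on (station_id), (station_id, "timestamp") and ("timestamp")

-- reassembling N variables for one station means N-1 outer joins
SELECT "timestamp", v1.value AS var1, v2.value AS var2
FROM var1_data v1
FULL OUTER JOIN var2_data v2 USING (station_id, "timestamp")
WHERE station_id = 42
  AND "timestamp" BETWEEN '2021-01-01' AND '2021-01-31';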

Research

So far the following considerations were made:

DB Technology

  1. While NoSQL would give the required schema flexibility, the tooling for ensuring data integrity, access control and management of the metadata seems challenging, and no NoSQL experience exists in house. Further, comments and answers along the lines of this one seem to favour an SQL solution for our use case.
  2. Different time-series-optimized databases were considered (mainly CrateDB and TimescaleDB). Both look promising with regard to their "automatic" partitioning and sharding, with TimescaleDB slightly favoured because it is based on PostgreSQL.
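If TimescaleDB were chosen, a wide table could be turned into a hypertable with a single call. The following is only a sketch assuming the long-standing create_hypertable() signature, with placeholder table and column names and an illustrative chunk interval:

CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE station_data (
    ts         timestamptz NOT NULL,
    station_id int         NOT NULL,
    var1       double precision,
    var2       double precision
    -- ... further per-station variables ...
);

-- monthly chunks; space partitioning on station_id would also be possible
SELECT create_hypertable('station_data', 'ts', chunk_time_interval => interval '1 month');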

Data Model/Schema

So far, two different schemas have been worked out which work in principle. However, both have significant drawbacks that I need to find a way around.

  1. EAV (anti-)pattern: one huge vertical data table, sharded on station_id and partitioned monthly on timestamp. While this would give the required schema flexibility, it would not provide the required ease of access, as it still relies heavily on inner joins. Furthermore, type safety for the different data types is not ensured on the DB side and access control is not possible (a sketch of both layouts follows this list).
  2. One table per station_id, with a horizontally growing schema whenever a sensor is added to that station. This denormalized structure is at first sight appealing from an application point of view (fast inserts, little indexing required, simple queries against a single station). However, querying would require dynamic SQL, since the end user might not know the table name for a specific station, and cross-station comparison would only be possible with elaborate SQL queries or client-side code.
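For illustration, a minimal sketch of the two candidate layouts (all names are placeholders):

-- Option 1: EAV-style vertical table
CREATE TABLE measurements (
    "timestamp" timestamptz NOT NULL,
    station_id  int         NOT NULL,
    variable_id int         NOT NULL,      -- which of the ~1500 signals
    value       double precision,          -- no per-variable type safety
    PRIMARY KEY (station_id, variable_id, "timestamp")
);

-- Option 2: one wide table per station, grown column by column
CREATE TABLE station_0001 (
    "timestamp"      timestamptz PRIMARY KEY,
    fuel_type        text,
    fuel_consumption double precision,
    speed            double precision
    -- ALTER TABLE station_0001 ADD COLUMN ... whenever a sensor is added
);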

General considerations

While storage capacity is of no concern, reliability, uptime and speed of data retrieval are.

Question

Which of the proposed data models would be preferred in order to meet the requirements while maintaining scalability? Suggestions for any other schema that fits the requirements are highly welcome.

Thank you.

Best Answer

I had a pretty similar situation with my data, except for the varying number of variables, but as TmTron said, JSON might work for you. Here is the schema I had (adapted to your data):

Table "sensor: containing whatever metadata you want about 1k+ rows regularly on some cases 7k+ no actual difference.

Table "sensor_data":

  • timestamp,
  • sensor_id int, -- FK to sensor
  • measurement_id int (I had 14),
  • var1, var2, var3, var4, var5 -- for me it is a set of 5 int8; in your case it's non-columnable data, let's say JSON
  • index on (sensor_id, measurement_id, timestamp) (about 1/3 of the table size)
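A minimal DDL sketch of that original layout, with names and types assumed from the description above:

CREATE TABLE sensor (
    sensor_id int PRIMARY KEY
    -- ... whatever metadata you want ...
);

CREATE TABLE sensor_data (
    "timestamp"    timestamptz NOT NULL,
    sensor_id      int NOT NULL REFERENCES sensor,
    measurement_id int NOT NULL,
    var1 int8, var2 int8, var3 int8, var4 int8, var5 int8
);

CREATE INDEX ON sensor_data (sensor_id, measurement_id, "timestamp");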

Tons of queries like

select timestamp, var1, var2, var3, var4, var5 from sensor_data where sensor_id = xx and timestamp between xxxx and xxxx;

The table became bigger, querying slower, customers angrier, and so on.

The first attempt at optimisation was partitioning by ranges of sensor_id (20 per partition): space consumption remained the same, the schema became more complex, and queries became faster, but not by much.

So, here is the schema that is still working:

custom data type "metric" (timestamp, var1,var2,var3,var4,var5)

table sensor_data:

  • date
  • sensor_id
  • measurement_id
  • dataset -- it's a column of type "metric[]", an array containing all data for a date
  • unique index on (date, sensor_id, measurement_id)
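A minimal DDL sketch of this layout, reusing the assumed names and types from above (the five int8 fields stand in for your payload; a single JSON field would work the same way):

-- composite element type and array-based table, replacing the row-per-sample layout
CREATE TYPE metric AS (
    ts   timestamptz,
    var1 int8, var2 int8, var3 int8, var4 int8, var5 int8
);

CREATE TABLE sensor_data (
    date           date     NOT NULL,
    sensor_id      int      NOT NULL,
    measurement_id int      NOT NULL,
    dataset        metric[] NOT NULL,
    UNIQUE (date, sensor_id, measurement_id)
);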

The select query was replaced with a function get_data(sensor_id, measurement_id, from_time, to_time) wrapping

select (unnest(dataset)).* from sensor_data where sensor_id = xx and date between from_time::date and to_time::date and measurement_id = xxx
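One possible implementation of that wrapper, assuming the type and table sketched above; the final filter on m.ts is an addition that trims results to the exact time window rather than whole days:

CREATE OR REPLACE FUNCTION get_data(
    p_sensor_id      int,
    p_measurement_id int,
    p_from           timestamptz,
    p_to             timestamptz
) RETURNS SETOF metric
LANGUAGE sql STABLE AS $$
    SELECT m.*
    FROM sensor_data,
         unnest(dataset) AS m                        -- expand the per-day arrays into rows
    WHERE sensor_id = p_sensor_id
      AND measurement_id = p_measurement_id
      AND date BETWEEN p_from::date AND p_to::date   -- prune to the relevant days first
      AND m.ts BETWEEN p_from AND p_to;              -- then trim to the exact window
$$;

It would then be called like select * from get_data(42, 3, '2021-05-01', '2021-05-08'); (the ids here are made up).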

The insert became more complex:

insert into sensor_data (date, sensor_id, measurement_id, dataset)
values (timestamp::date, sensor, measurement,
        array[(timestamp, var1, var2, var3, var4, var5)::metric])
on conflict (date, sensor_id, measurement_id) do update
set dataset = sensor_data.dataset || excluded.dataset;

Space consumption is ~10 times less, querying is more complex but dramatically faster.

If you do not request data by measurement_id, just remove it from the index and the query. If you have significantly more data per day, you can store data per hour instead, replacing the "date" column with an "hour" column computed as date_trunc('hour', timestamp), and partition the table per month, so you will have at most 744 (31*24) rows per sensor and measurement in each partition. That's a pretty reasonable number of rows and will work fast enough.

Obviously, you have to compose your own data type (for most cases a type of (timestamp, JSON) will work).

The main idea is that Postgres stores these arrays outside the table (TOASTed, and compressed) and reads them only when they are needed. So the table becomes a kind of index to data stored elsewhere, while still remaining a table that you can index and partition.

The limitation is that you can't control the content of the dataset array with constraints, nor aggregate the data directly. But for simple aggregations (like max, min, avg) you can pre-aggregate the data and still store it at row level.
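A minimal sketch of such row-level pre-aggregates under the same assumed schema; storing sum and count (rather than a per-day average) keeps averages combinable across days:

ALTER TABLE sensor_data
    ADD COLUMN var1_min   int8,
    ADD COLUMN var1_max   int8,
    ADD COLUMN var1_sum   numeric,
    ADD COLUMN var1_count int8;

-- a monthly report then never has to unnest the arrays
SELECT sensor_id,
       min(var1_min)                   AS var1_min,
       max(var1_max)                   AS var1_max,
       sum(var1_sum) / sum(var1_count) AS var1_avg
FROM sensor_data
WHERE date BETWEEN date '2021-01-01' AND date '2021-01-31'
GROUP BY sensor_id;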