PostgreSQL – Combining Multiple SELECT/WHERE into Result with Multiple Columns

pivotpostgresql

I have a Timescale DB / PostgreSQL table (DataTable) which looks like this

I'm trying to find a query which would return different columns for different subsystem/metrics combination like this

Furthermore I would like to reduce the amount of requested data points using the Timescale DB function time_bucket_gapfill() to keep the response time low for big time ranges. I would also like to use different aggregation functions for different subsystem/metric combinations.
I use the queried data to plot the curves in Grafana.
What is the most time efficient way to achieve that?

At the moment I use separate queries for each metric which quickly gets very slow. For the example the queries could look like that

SELECT
    time_bucket_gapfill('30s',time,start=>'2021-07-19T09:06:26.605Z',finish=>'2021-07-19T20:11:12.340Z') AS "time",
    max(value) AS "Latitude"
FROM DataTable
WHERE
    time BETWEEN '2021-07-19T09:06:26.605Z' AND '2021-07-19T20:11:12.340Z' AND
    subsystem = 'position' AND
    metric = 'lat'
GROUP BY 1,metric,subsystem
ORDER BY time

SELECT
    time_bucket_gapfill('30s',time,start=>'2021-07-19T09:06:26.605Z',finish=>'2021-07-19T20:11:12.340Z') AS "time",
    min(value) AS "Longitude"
FROM DataTable
WHERE
    time BETWEEN '2021-07-19T09:06:26.605Z' AND '2021-07-19T20:11:12.340Z' AND
    subsystem = 'position' AND
    metric = 'lon'
GROUP BY 1,metric,subsystem
ORDER BY time

SELECT
    time_bucket_gapfill('30s',time,start=>'2021-07-19T09:06:26.605Z',finish=>'2021-07-19T20:11:12.340Z') AS "time",
    avg(value) AS "Temperature"
FROM DataTable
WHERE
    time BETWEEN '2021-07-19T09:06:26.605Z' AND '2021-07-19T20:11:12.340Z' AND
    subsystem = 'health' AND
    metric = 'temperature'
GROUP BY 1,metric,subsystem
ORDER BY time

I would much appreciate if somebody could point me in the right direction.

Update:

Using the FILTER clause as suggested by @Charlieface doesn't work for me, because it doesn't remove the other rows but instead places NULLs there. Here an example with some real data (the example before was using simplified data and names).

I also noticed that the timestamps of the metrics belonging together have tiny differences which would prevent different metrics being on the same row, but that is easy to changes in the unit which feeds the data into the DB.

Update 2:

The timestamps issue was fixed and the WHERE statement was added again as suggested by @Charlieface, but there are still NULL in the data which makes Grafana plot the data points without lines in between.

I would like the result to look like this instead.

Update 3:

I missed that @Charlieface answer only groups by time. If I do that, then I get the result I wanted.

Best Answer

You can use conditional aggregation for this.

In PostgreSQL you can use the FILTER clause.

In other DBMSs you can use a condition inside the aggregation
avg(case when ... then value end)

SELECT
    time_bucket_gapfill('30s', time, start => '2021-07-19T09:06:26.605Z', finish => '2021-07-19T20:11:12.340Z') AS "time",
    max(value) FILTER (WHERE subsystem = 'position' AND metric = 'lat') AS "Latitude",
    min(value) FILTER (WHERE subsystem = 'position' AND metric = 'lon') AS "Longitude",
    avg(value) FILTER (WHERE subsystem = 'health' AND metric = 'temperature') AS "Temperature"
FROM DataTable
WHERE
    time BETWEEN '2021-07-19T09:06:26.605Z' AND '2021-07-19T20:11:12.340Z'
  AND (subsystem = 'position' AND metric IN ('lat', 'lon') OR
       subsystem = 'health' AND metric = 'temperature')
GROUP BY 1
ORDER BY time;

Flat Table

The option to use a flat table has a lot more to do with question (1), where if this isn't a serious or large-scale project, you'll find it much easier not to think too much about the schema, and just use a flat table, as:

CREATE flat_table(
  trip_id integer,
  tstamp timestamptz,
  speed float,
  distance float,
  temperature float,
  ,...);

There aren't many cases where I'd recommend this course, only if this is a tiny project which doesn't warrant much of your time.

Dimensions and Facts

So, if you've cleared the hurdle of question (1), and you want a more performance schema, this is one of the first options to consider. It includes some basic normailization, but extracting the 'dimensional' quantities from the measured 'fact' quantities.

Essentially, you'll want a table to record info about the trips,

CREATE trips(
  trip_id integer,
  other_info text);

and a table to record timestamps,

CREATE tstamps(
  tstamp_id integer,
  tstamp timestamptz);

and finally all of your measured facts, with foreign key references to the dimension tables (that is meas_facts(trip_id) references trips(trip_id) & meas_facts(tstamp_id) references tstamps(tstamp_id))

CREATE meas_facts(
  trip_id integer,
  tstamp_id integer,
  speed float,
  distance float,
  temperature float,
  ,...);

This may not seem like it is all that helpful at first, but if you have for example thousands of concurrent trips, then they may all be taking measurements once per second, on the second. In that case, you'd have to re-record the time stamp each time for each trip, rather than just using a single entry in the tstamps table.

Use case: This case will be good if there are many concurrent trips for which you are recording data, and you don't mind accessing all of the measurement types all together.

Since Postgres reads by rows, any time you wanted, for example, the speed measurements over a given time range, you must read the whole row from the meas_facts table, which will definitely slow down a query, though if the data set you are working with is not too large, then you wouldn't even notice the difference.

Splitting Up Your Measured Facts

To extend the last section just a bit further, you could break apart your measurements into separate tables, where for example I'll show the tables for speed and distance:

CREATE speed_facts(
  trip_id integer,
  tstamp_id integer,
  speed float);

and

CREATE distance_facts(
  trip_id integer,
  tstamp_id integer,
  distance float);

Of course, you can see how this might be extended to the other measurements.

Use case: So this won't give you a tremendous speed up for a query, perhaps only a linear increase in speed when you are querying about one measurement type. This is because when you want to look up info about speed, you need only to read rows from the speed_facts table, rather than all the extra, unneeded info that would be present in a row of the meas_facts table.

So, you need to read huge bulks of data about one measurement type only, you could get some benefit. With your proposed case of 10 hours of data at one second intervals, you'd only be reading 36,000 rows, so you'd never really find a significant benefit from doing this. However, if you were to be looking at speed measurement data for 5,000 trips that were all around 10 hours, now you're looking at reading 180 million rows. A linear increase in speed for such a query could yield some benefit, so long as you only need to access one or two of the measurement types at a time.

Arrays/HStore/ & TOAST

You probably don't need to worry about this part, but I know of cases where it does matter. If you need to access HUGE amounts of time series data, and you know you need to access all of it in one huge block, you can use a structure which will make use of the TOAST Tables, which essentially stores your data in larger, compressed segments. This leads to quicker access to the data, as long as your goal is to access all of the data.

One example implementation could be

CREATE uber_table(
  trip_id integer,
  tstart timestamptz,
  speed float[],
  distance float[],
  temperature float[],
  ,...);

In this table, tstart would store the time stamp for the first entry in the array, and each subsequent entry would be the value of a reading for the next second. This requires you to manage the relevant time stamp for each array value in a piece of application software.

Another possibility is

CREATE uber_table(
  trip_id integer,
  speed hstore,
  distance hstore,
  temperature hstore,
  ,...);

where you add your measurement values as (key,value) pairs of (timestamp, measurement).

Use case: This is an implementation probably better left to someone who is more comfortable with PostgreSQL, and only if you are sure about your access patterns needing to be bulk access patterns.

Conclusions?

Wow, this got much longer than I expected, sorry. :)

Essentially, there are a number of options, but you'll probably get the biggest bang for your buck by using the second or third, as they fit the more general case.

P.S.: Your initial question implied that you will be bulk loading your data after it has all been collected. If you are streaming the data in to your PostgreSQL instance, you will need to do some further work to handle both your data ingestion and query workload, but we'll leave that for another time. ;)

Best Answer

Related Solutions

Postgresql – Group Database Entries by time difference

PostgreSQL – How to Store Time Series Data

Flat Table

Dimensions and Facts

Splitting Up Your Measured Facts

Arrays/HStore/ & TOAST

Conclusions?

Related Question