Postgresql – Pivot time series into date period columns with array aggregate cells

aggregatepivotpostgresql

Given a time series of events where each event has a successful or unsuccessful outcome, how do I pivot the ratio of events by entity and time period columns with an aggregate array cell value? I suspect this can be done with crosstab query and array_agg.

Akin to uptime status reports, I'm trying to calculate something like this in SQL:

The volume of data is small enough that I could do the reduction client-side in a general purpose language, but it would be nice to do this efficiently in the database for a larger amount of data.

Time Series Example

+-----------------------------------------+
|  DateTime             Entity    Result  |
+-----------------------------------------+
| 2016-01-01 11:00...  :a        :success |
| 2016-01-01 17:00...  :a        :success |  -- two events for :a on same day
| 2016-01-01 11:01...  :b        :fail    |
| 2016-01-01 11:03...  :c        :success |
| 2016-01-01 13:00...  :d        :success |  -- only one event for :d
| 2016-01-02 11:00...  :a        :success |
| 2016-01-02 11:01...  :b        :fail    |
| 2016-01-02 11:03...  :c        :success |
| ...                                     |
+-----------------------------------------+

Desired Aggregate

Where each value cell after the key column is an array with shape [cnt_total cnt_success cnt_fail]:

+-----------------------------------------------+
| Entity     2016-01-01  2016-01-02  2016-01-xx |
+-----------------------------------------------+
| :a            [2 2 0]     [1 1 0]         ... |
| :b            [1 0 1]     [1 0 1]         ... |
| :c            [1 1 0]     [1 0 1]         ... |
| :d            [1 1 0]     [0 0 0]         ... |
+-----------------------------------------------+

To keep things simple, this report will never need more than 10 date window columns and I can dynamically template the SQL pivot output.

If I had to decompose this transformation:

Aggregate time series by date window (hour/day/week/month/quarter/year) and Result.
Accumulate the counted aggregate into some data structure like a hash-map or array of [count_total count_success count_fail]
Return the accumulated two-dimensional result as [entity period1 period2 ...] for displaying % on the client.

Best Answer

This question is old, but you haven’t accepted any answer yet, so I will add another one.

You need some aggregation of your data, and a pivot table. The most elegant way to do the former is by means of a CTE, and the most elegant way to do the latter is with CROSSTAB; however, as of Postgres 9.6, and unlike in other DBMS, you cannot reference a CTE from CROSSTAB. I will show an example for each of the two possible ways out: 1) Use a CTE, and reimplement a poor man’s pivoting on your own 2) Instead of a CTE, create a view once for all and reference it in your CROSSTAB queries. In both cases you’ll have to issue only one query per report, and you won’t have to create any temporary table.

The general problem of pivots is that, in pure SQL, you cannot define a query whose result has a variable number of columns, and you cannot define column headings dynamically. If you want that, you have to build your query with a procedural language, either server-side (plpgsql, as in Abelisto’s answer) or client-side (PHP, java, whatever...). My examples below are in pure SQL, so they have a fixed number of days (three, as in your example data), with fixed column headings ("day 1", "day 2", "day 3"), but they are built in a way that minimizes the needed edits when you change them.

First, the initial data. I started from the ones joanolo used, but my approach is different, because I use SMALLINT instead of BOOLEAN for result. My reason for doing this will become clear in the following.

CREATE TABLE time_series (
  date_time TIMESTAMP NOT NULL,
  entity TEXT NOT NULL,
  result SMALLINT DEFAULT 0 -- 1 means success, 0 failure.
);

INSERT INTO time_series VALUES
  ('2016-01-01 11:00', 'a', 1),
  ('2016-01-01 17:00', 'a', 1),
  ('2016-01-01 11:01', 'b', 0),
  ('2016-01-01 11:03', 'c', 1),
  ('2016-01-01 13:00', 'd', 1),
  ('2016-01-02 11:00', 'a', 1),
  ('2016-01-02 11:01', 'b', 0),
  ('2016-01-03 11:03', 'e', 1),
  ('2016-01-03 11:04', 'e', 1),
  ('2016-01-03 11:05', 'e', 1),
  ('2016-01-03 11:06', 'e', 0);

You only really need an array of two integers (a in my examples): a[1] (total count) and a[2] (success count). The failure count is simply a[1] - a[2] and the success percentage is 100*(a[2]::float)/a[1]. You can compute the total count by COUNT(result); if you define result SMALLINT you can simply use SUM(result) to keep track of the success count. If you store result as BOOLEAN, you have to use SUM(CASE WHEN result THEN 1 ELSE 0 END). If you store them as strings, SUM(CASE WHEN result = 'success' THEN 1 ELSE 0 END). If you cannot change your time_series table, edit the code below as appropriate.

Solution 1

This is inefficient ad rather ugly, but it’s worth being shown to illustrate how to use a CTE and what pains we had to suffer before CROSSTAB came along. When changing intervals, it has to be modified in four places: initial day, final day, rows in the main selection list, and rows in the list of joined tables. However, using a numeric column rn allows not to write explicit dates in the joined tables, which simplifies the task.

WITH ct AS (
  SELECT EXTRACT('days' FROM day - MIN(day) OVER()) + 1 AS rn, sub.* 
    FROM (
     SELECT 
          entity, 
          DATE_TRUNC('day', date_time) AS day, 
          ARRAY[COUNT(result), SUM(result)] AS a
       FROM time_series
      WHERE date_time BETWEEN TIMESTAMP '2016-01-01'                 -- initial day
                          AND TIMESTAMP '2016-01-03 23:59:59'        -- last second of final day
      GROUP BY 1,2
    ) AS sub
)
SELECT e.entity
       , d1.a AS "day 1"                                             -- add as many as you need
       , d2.a AS "day 2"
       , d3.a AS "day 3"
  FROM (SELECT DISTINCT entity FROM ct) e
  LEFT JOIN (SELECT entity, a FROM ct WHERE rn = 1) d1 USING(entity) -- add as many as you need
  LEFT JOIN (SELECT entity, a FROM ct WHERE rn = 2) d2 USING(entity)
  LEFT JOIN (SELECT entity, a FROM ct WHERE rn = 3) d3 USING(entity)
  ORDER BY e.entity;


 entity | day 1 | day 2 | day 3 
--------+-------+-------+-------
 a      | {2,2} | {1,1} | 
 b      | {1,0} | {1,0} | 
 c      | {1,1} |       | 
 d      | {1,1} |       | 
 e      |       |       | {4,3}

Solution 2

The code for the view is essentially the CTE code of the previous example, but it is simpler, because CROSSTAB allows us using GENERATE_SERIES with timestamp values, so we don’t need a numeric rn column to categorize the data. Note that this view, once created, will not need modifications.

CREATE VIEW ts_view AS
  SELECT 
      entity,
      DATE_TRUNC('day', date_time) AS day, 
      ARRAY[COUNT(result), SUM(result)] AS a
    FROM time_series
    GROUP BY 1,2;

This is the main query. When changing intervals, it has to be modified in three places: initial day, final day, and output columns. Formatting is best done client-side, but in this case I have done it server-side. Instructions on how to change this are in the comments

SELECT * FROM CROSSTAB ($$
    SELECT 
        entity, 
        day,
        -- You have to repeat the result type of the following expression
        -- as the type of the "day N" columns below.
        -- e.g.  a --> INTEGER[] ,  100*(a[2]::FLOAT)/a[1] --> FLOAT , etc. 
        -- In TO_CHAR, D is changed to your locale's decimal point
        TO_CHAR(100*(a[2]::float)/a[1], '990D99')||'%' 
    FROM ts_view ORDER BY 1
  $$,$$
    SELECT GENERATE_SERIES (
      TIMESTAMP '2016-01-01',  -- initial day
      TIMESTAMP '2016-01-03',  -- final day
      '1 day'
    )
  $$
) AS (
    entity TEXT
    , "day 1" TEXT             -- add as many as you need
    , "day 2" TEXT
    , "day 3" TEXT
);

 entity |  day 1   |  day 2   |  day 3   
--------+----------+----------+----------
 a      |  100.00% |  100.00% | 
 b      |    0.00% |    0.00% | 
 c      |  100.00% |          | 
 d      |  100.00% |          | 
 e      |          |          |   75.00%

As a final remark, the second argument of CROSSTAB can be something like

$$ 
WITH n(ow) AS (VALUES(DATE_TRUNC('day', NOW()))) 
SELECT GENERATE_SERIES(n.ow + '-2 days', n.ow, '1 day') FROM n
$$

and the query would always return a dynamic report for the last three days (today included).

Related Solutions

Pivot rows into multiple columns

There are a few ways that you can perform this data transformation. You have access to the PIVOT function then that will be the easiest, but if not then you can use an aggregate function and a CASE.

Aggregate /Case version:

select personid,
  max(case when optionid = 'A' then 1 else 0 end) OptionA,
  max(case when optionid = 'B' then 1 else 0 end) OptionB,
  max(case when optionid = 'C' then 1 else 0 end) OptionC
from PersonOptions
group by personid
order by personid;

See SQL Fiddle with Demo

Static Pivot:

select *
from
(
  select personid, optionid
  from PersonOptions
) src
pivot
(
  count(optionid)
  for optionid in ('A' as OptionA, 'B' OptionB, 'C' OptionC)
) piv
order by personid

See SQL Fiddle with Demo

Dynamic Version:

The two versions above work great if you have a known number of values, but if your values are unknown, then you will want to implement dynamic sql and in Oracle you can use a procedure:

CREATE OR REPLACE procedure dynamic_pivot_po(p_cursor in out sys_refcursor)
as
    sql_query varchar2(1000) := 'select personid ';

    begin
        for x in (select distinct OptionID from PersonOptions order by 1)
        loop
            sql_query := sql_query ||
                ' , min(case when OptionID = '''||x.OptionID||''' then 1 else null end) as Option_'||x.OptionID;

                dbms_output.put_line(sql_query);
        end loop;

        sql_query := sql_query || ' from PersonOptions group by personid order by personid';
        dbms_output.put_line(sql_query);

        open p_cursor for sql_query;
    end;
/

Then you return the results, you will use:

variable x refcursor
exec dynamic_pivot_po(:x)
print x

The results are the same with all versions:

| PERSONID | OPTIONA | OPTIONB | OPTIONC |
------------------------------------------
|        1 |       1 |       1 |       0 |
|        2 |       0 |       0 |       1 |
|        3 |       0 |       1 |       0 |
|        4 |       1 |       0 |       1 |

Postgresql – Select into specific array positions with array_agg()

Your answer basically gets the job done:

SELECT b.id, array_agg(b.stock) AS stock
FROM  (
   SELECT i.id, COALESCE(i_s.stock, 0) AS stock
   FROM   item i
   CROSS  JOIN unnest('{1,2}'::int[]) n
   LEFT   JOIN item_stock i_s ON i.id = i_s.item_id AND n.n = i_s.shop_id
   ORDER  BY i.id, n.n
   ) b
GROUP  BY b.id;

Two notable changes:

Order is not guaranteed without ORDER BY in the subquery or as additional clause to array_aggregate() (typically more expensive). And that's the core element of your question.
unnest('{1,2}'::int[]) instead of generate_series(1,2) as requested shop IDs will hardly be sequential all the time.

I also moved the set-returning function from the SELECT list to a separate table expression attached with CROSS JOIN. Standard SQL form, but that's just a matter of clarity and taste, not a necessity. At least in Postgres 10 or later. See:

What is the expected behaviour for multiple set-returning functions in SELECT clause?

Doing the same with LEFT JOIN LATERAL and an ARRAY constructor might be a bit faster as we don't need the outer GROUP BY and the ARRAY constructor is typically faster, too:

SELECT i.id, s.stock
FROM   item i
CROSS  JOIN LATERAL (
   SELECT ARRAY(
      SELECT COALESCE(i_s.stock, 0)
      FROM   unnest('{1,2}'::int[]) n
      LEFT   JOIN item_stock i_s ON i_s.shop_id = n.n
                                AND i_s.item_id = i.id
      ORDER  BY n.n
      ) AS stock
   ) s;

And if you have more than just the two shops, a nested crosstab() should provide top performance:

SELECT i.id, COALESCE(stock, '{0,0}') AS stock
FROM   item i
LEFT   JOIN (
   SELECT id, ARRAY[COALESCE(shop1, 0), COALESCE(shop2, 0)] AS stock
   FROM   crosstab(
     $$SELECT item_id, shop_id, stock
       FROM   item_stock
       WHERE  shop_id = ANY ('{1,2}'::int[])
       ORDER  BY 1,2$$

     , $$SELECT unnest('{1,2}'::int[])$$
      ) AS ct (id int, shop1 int, shop2 int)
   ) i_s USING (id);

Needs to be adapted in more places to cater for different shop IDs.

PostgreSQL Crosstab Query

db<>fiddle here

Index

Make sure you have at least an index on item_stock (shop_id, item_id) - typically provided by a PRIMARY KEY on those columns. For the crosstab query, it also matters that shop_id comes first. See:

Is a composite index also good for queries on the first field?

Adding stock as another index expression may allow faster index-only scans. In Postgres 11 or later consider an INCLUDE item to the PK like so:

PRIMARY KEY (shop_id, item_id) INCLUDE (stock)

But only if you need it a lot, as it makes the index a bit bigger and possibly more susceptible to bloat from updates.