Postgresql – Using start and end event logs to create a table/view containing spans between the times of each log

aggregate-filterpivotpostgresqlredshift

To be specific, I have an event table that logs when a user joins or leaves a team. It looks something like the following:

-------------------------------------
| user | event  | team | timestamp |
-------------------------------------
| A    | joined | 1    | 2016-1-1  |
| B    | joined | 1    | 2016-1-1  |
| C    | left   | 1    | 2016-1-1  |
| C    | joined | 2    | 2016-1-1  |
| A    | left   | 1    | 2016-1-2  |
| A    | joined | 2    | 2016-1-2  |
| B    | left   | 1    | 2016-1-3  |
| A    | left   | 2    | 2016-1-3  |
-------------------------------------

I need to restructure this in a view to look something like the following

--------------------------------------
| user | team | joined    | left     |
--------------------------------------
| A    | 1    | 2016-1-1  | 2016-1-2 |
| A    | 2    | 2016-1-2  | 2016-1-3 |
| B    | 1    | 2016-1-1  | 2016-1-3 |
| C    | 1    | null      | 2016-1-1 |
| C    | 2    | 2016-1-1  | null     |
--------------------------------------

How can I achieve this?

For more detail, I'm attempting to do this in Amazon Redshift (PostgreSQL)

Best Answer

Assuming all columns NOT NULL. And 'left' is never earlier than the associated 'joined'.

Simple case

If users can join teams only once (which would be enforced by a UNIQUE constraint on ("user", team), ideally), then the solution is a simple GROUP BY and works in Redshift as well as in most any RDBMS:

SELECT "user", team
     , min(CASE WHEN event = 'joined' THEN timestamp END) AS joined
     , max(CASE WHEN event = 'left'   THEN timestamp END) AS "left"
FROM   event
GROUP  BY "user", team
ORDER  BY "user", joined NULLS FIRST;

Note the NULLS FIRST clause. Seems like you want to sort an open start with joined IS NULL first. Redshift supports that, too.

Apart from that it's the most basic form of a crosstab / pivot query.

Not so simple

Judging from your column names and example data it might not be so simple. If users can join teams more than once (non-overlapping), you have to do more. You wouldn't want to merge multiple team memberships into one row like in this related answer:

Postgres: Use result of group by as column names

Instead you have to pair up adjacent 'joined' and 'left' rows somehow. There are many ways ...

Postgres 9.4+

For modern Postgres I like this best:

SELECT "user", team
     , min(timestamp) FILTER (WHERE event = 'joined') AS joined
     , max(timestamp) FILTER (WHERE event = 'left'  ) AS "left"
FROM  (
   SELECT *, count(*) FILTER (WHERE event = 'joined')
                      OVER (PARTITION BY "user", team ORDER BY timestamp) AS ct
   FROM   event
   ) sub
GROUP  BY "user", team, ct
ORDER  BY "user", joined NULLS FIRST;

Using the aggregate FILTER clause in a window function as well as in an aggregate function. Related (with links to alternatives):

Return counts for multiple ranges in a single SELECT statement

This way we count how many times the same user joined the same team, so we can group adjacent rows. Works for a missing 'joined' at the begin or a missing 'left' at the end as well.

Redshift

... does not support the new FILTER clause. We can substitute with a plain old CASE:

SELECT "user", team
     , min(CASE WHEN event = 'joined' THEN timestamp END) AS joined
     , max(CASE WHEN event = 'left'   THEN timestamp END) AS "left"
FROM  (
   SELECT *, count(CASE WHEN event = 'joined' THEN 1 END)
                      OVER (PARTITION BY "user", team ORDER BY timestamp, event) AS ct
   FROM   event
   ) sub
GROUP  BY "user", team, ct
ORDER  BY "user", joined NULLS FIRST;

SQL Fiddle.

Aside: you shouldn't use reserved words as identifiers, even if it's permitted by Redshift (or Postgres).

Related Solutions

PostgreSQL Query – Find Parent Records Where All Child Records Have a Given Value

I don't have a copy of PostgreSQL in front of me. Does this compile in your version:

...
HAVING count(*) = (select count(*) from participatns as x where x.event_id = events.id)

An alternative approach would be to remove from this list of all events those which have a participant who's status is not 'present', using the EXCEPT clause:

select
    event.id
from events

EXCEPT

select
    event.id
from event
INNER JOIN participants 
ON participants.event_id = events.id 
WHERE (participants.status != present)

The EXCEPT syntax is usually more efficiently expressed as NOT EXISTS or as a LEFT JOIN with the child identifer ends up as null. I'm sure you can re-factor the query to suit your needs.

Median Absolute Deviation for Time Series Outlier Detection in Amazon Redshift

I believe this is very doable with a CTE or subquery!

MAD is can be calculated by composing several Redshift functions:

the median of a list of values
absolute value of the difference between each value and that median
the median of those values

I wrote this in the form of:

WITH
medians AS (
  SELECT
    t.values,
    MEDIAN(t.values) OVER () as median_value,
    ABS(t.values-MEDIAN(t.values) OVER ()) AS absolute_deviation
  FROM table AS t
  GROUP BY t.values
)
SELECT
  MEDIAN(absolute_deviation) OVER () AS median_absolute_deviation
FROM medians

Hope this helps