Postgresql – Querying sums of grouped consecutive rows in PostgreSQL 9

postgresql, window functions

I have data about people traveling in different countries like this:

country | begintimestamp      | distance
--------+---------------------+---------
Germany | 2015-01-01 00:00:00 | 100
Germany | 2015-01-01 01:12:13 | 30
France  | 2015-01-01 02:13:14 | 40
France  | 2015-01-01 03:14:15 | 20
Spain   | 2015-01-01 04:15:16 | 10
France  | 2015-01-01 05:16:17 | 30
France  | 2015-01-01 05:17:18 | 5
Germany | 2015-01-01 06:18:19 | 3

I need a result like this: the distances of consecutive rows with the same country summed, together with the earliest begintimestamp of each run:

country | begintimestamp      | distance
--------+---------------------+---------
Germany | 2015-01-01 00:00:00 | 130  // 100+30, the distances of the first two rows summed
France  | 2015-01-01 02:13:14 | 60   // 40+20
Spain   | 2015-01-01 04:15:16 | 10
France  | 2015-01-01 05:16:17 | 35   // 30+5
Germany | 2015-01-01 06:18:19 | 3

I've tried to play around with PG window functions but have not been able to come up with anything that would lead me closer to the result.

Best Answer

select min(country) as country,
       min(begintimestamp) as first_begin_ts,
       sum(distance) as distance
from (
  -- t2: running sum of the flags numbers each run of consecutive rows
  select t1.*, 
         sum(group_flag) over (order by begintimestamp) as grp
  from (
      -- t1: flag a row with 1 when its country differs from the previous row
      select *,
             case
                when lag(country) over (order by begintimestamp) = country then null
                else 1
              end as group_flag
      from travel
  ) t1
) t2
group by grp
order by first_begin_ts;

The innermost query (alias t1) produces a flag of 1 each time the country changes. The second-level query (alias t2) then computes a running sum over those flags, which gives each run of consecutive rows with the same country its own number. The outermost query groups by that number and sums the distance. The min(country) is only needed to satisfy GROUP BY; since all rows with the same grp have the same country anyway, the aggregate does not change the value.
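To make the three steps easier to follow, here is a minimal Python sketch of the same logic applied to the sample rows from the question. It is only an illustration of what the query computes, not a replacement for it; the variable names (flags, grp, summary) are made up for this sketch.

```python
from itertools import accumulate

# Sample rows from the question: (country, begintimestamp, distance),
# already sorted by begintimestamp.
rows = [
    ("Germany", "2015-01-01 00:00:00", 100),
    ("Germany", "2015-01-01 01:12:13", 30),
    ("France",  "2015-01-01 02:13:14", 40),
    ("France",  "2015-01-01 03:14:15", 20),
    ("Spain",   "2015-01-01 04:15:16", 10),
    ("France",  "2015-01-01 05:16:17", 30),
    ("France",  "2015-01-01 05:17:18", 5),
    ("Germany", "2015-01-01 06:18:19", 3),
]

# Step 1 (t1): flag is 1 when the country differs from the previous row.
# The SQL uses NULL where this sketch uses 0; sum() skips NULLs,
# so the running sum comes out the same.
flags = [
    1 if i == 0 or rows[i - 1][0] != row[0] else 0
    for i, row in enumerate(rows)
]

# Step 2 (t2): the running sum of the flags gives every run its own number.
grp = list(accumulate(flags))

# Step 3 (outer query): group by that number, keep the earliest
# timestamp and sum the distance.
result = {}
for (country, ts, distance), g in zip(rows, grp):
    if g not in result:
        result[g] = [country, ts, 0]
    result[g][1] = min(result[g][1], ts)
    result[g][2] += distance

summary = [tuple(v) for _, v in sorted(result.items())]
for line in summary:
    print(line)
```

The grp values come out as 1, 1, 2, 2, 3, 4, 4, 5, so the second France run gets a different number than the first one, which is what keeps the two runs apart in the final grouping.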

SQLFiddle: http://sqlfiddle.com/#!15/fe341/1

Related Solutions

Postgresql – Find “n” consecutive free numbers from table

This is a gaps-and-islands problem. Assuming there are no gaps or duplicates in the same id_set set:

WITH partitioned AS (
  SELECT
    *,
    number - ROW_NUMBER() OVER (PARTITION BY id_set ORDER BY number) AS grp
  FROM atable
  WHERE status = 'FREE'
),
counted AS (
  SELECT
    *,
    COUNT(*) OVER (PARTITION BY id_set, grp) AS cnt
  FROM partitioned
)
SELECT
  id_set,
  number
FROM counted
WHERE cnt >= 3
;

Here's a SQL Fiddle demo* link for this query: http://sqlfiddle.com/#!1/a2633/1.
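The grouping trick relies on one property: when rows are numbered in ascending order of number, the difference between number and the row number stays constant inside a run of consecutive values and changes at every gap. A small Python sketch of that idea, using made-up FREE numbers for a single id_set (the data is an assumption, not taken from the original question):

```python
from itertools import groupby

# Hypothetical FREE numbers for a single id_set.
free_numbers = [1, 2, 5, 6, 7, 8, 10, 12, 13, 14]

# number - row_number is constant within a run of consecutive numbers,
# so it can serve as the island key (grp in the query).
keyed = [(n - i, n) for i, n in enumerate(sorted(free_numbers), start=1)]

# Collect the numbers that share the same key into one island each.
islands = [
    [n for _, n in members]
    for _, members in groupby(keyed, key=lambda pair: pair[0])
]

# Keep only islands with at least 3 members, like WHERE cnt >= 3.
long_enough = [island for island in islands if len(island) >= 3]
print(long_enough)
```

Here the keys are 0, 0, 2, 2, 2, 2, 3, 4, 4, 4, so the islands are 1-2, 5-8, 10 and 12-14, and only 5-8 and 12-14 pass the size filter.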

UPDATE

To return only one set, you could add in one more round of ranking:

WITH partitioned AS (
  SELECT
    *,
    number - ROW_NUMBER() OVER (PARTITION BY id_set ORDER BY number) AS grp
  FROM atable
  WHERE status = 'FREE'
),
counted AS (
  SELECT
    *,
    COUNT(*) OVER (PARTITION BY id_set, grp) AS cnt
  FROM partitioned
),
ranked AS (
  SELECT
    *,
    RANK() OVER (ORDER BY id_set, grp) AS rnk
  FROM counted
  WHERE cnt >= 3
)
SELECT
  id_set,
  number
FROM ranked
WHERE rnk = 1
;

Here's a demo for this one too: http://sqlfiddle.com/#!1/a2633/2.

If you ever need to make it one set per id_set, change the RANK() call like this:

RANK() OVER (PARTITION BY id_set ORDER BY grp) AS rnk

Additionally, you could make the query return the smallest matching set (i.e. first try to return the first set of exactly three consecutive numbers if it exists, otherwise four, five etc.), like this:

RANK() OVER (ORDER BY cnt, id_set, grp) AS rnk

or like this (one per id_set):

RANK() OVER (PARTITION BY id_set ORDER BY cnt, grp) AS rnk
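The RANK() variants above differ only in which island ends up with rank 1. In the query every row of one island shares the same id_set, grp and cnt, so all of its rows tie and receive the same rank. The following Python sketch works on whole islands instead of rows and uses made-up values to show which island each ordering would pick:

```python
# Hypothetical islands that already passed the cnt >= 3 filter,
# written as (id_set, grp, cnt). The values are invented for illustration.
islands = [
    (1, 2, 4),
    (1, 4, 3),
    (2, 0, 5),
    (2, 7, 3),
]

# RANK() OVER (ORDER BY id_set, grp): rank 1 is the first island overall.
first_overall = min(islands, key=lambda i: (i[0], i[1]))

# RANK() OVER (ORDER BY cnt, id_set, grp): rank 1 is the smallest island.
smallest_overall = min(islands, key=lambda i: (i[2], i[0], i[1]))

# RANK() OVER (PARTITION BY id_set ORDER BY cnt, grp):
# rank 1 is the smallest island within each id_set.
smallest_per_set = {}
for island in islands:
    best = smallest_per_set.get(island[0])
    if best is None or (island[2], island[1]) < (best[2], best[1]):
        smallest_per_set[island[0]] = island

print(first_overall, smallest_overall, smallest_per_set)
```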

* The SQL Fiddle demos linked in this answer use the 9.1.8 instance, as the 9.2.1 one doesn't appear to be working at the moment.

Postgresql querying trends

Data model

You are saving avg. and max. speed per race, not per driver? Normally this would be something like:

CREATE TABLE race (
  race_id   serial PRIMARY KEY
 ,race_date timestamp
);

CREATE TABLE driver (
  driver_id    serial PRIMARY KEY
 ,driver_name  text
 ,date_started timestamp
);

CREATE TABLE driver_race (
  driver_id int REFERENCES driver(driver_id)
 ,race_id   int REFERENCES race(race_id)
 ,average_speed numeric
 ,max_speed numeric
);

This is also how you should provide your data model: as CREATE scripts ready to use for testing.

numeric and decimal are the same in Postgres.

Query

Your query extended to all drivers:

SELECT driver_id, d.driver_name, x.slope
FROM   driver d
JOIN  (
   SELECT dr.driver_id
         ,regr_slope(dr.average_speed, EXTRACT(EPOCH FROM r.race_date)) AS slope
   FROM   race r
   JOIN   driver_race dr USING (race_id)
   WHERE  r.race_date >= '2013-03-01'  -- race_date is a timestamp, so use a
   AND    r.race_date <  '2013-04-01'  -- half-open range to include all of March 31
   GROUP  BY dr.driver_id
   ) x USING (driver_id)
ORDER  BY slope DESC;
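For readers unfamiliar with regr_slope(Y, X): it returns the slope of the least-squares line fitted through the (X, Y) pairs. Because X here is EXTRACT(EPOCH FROM race_date), the slope is expressed in speed units per second. A rough Python sketch of that calculation, with an invented set of races for a single driver (the data and the per-day conversion are assumptions added for illustration):

```python
from datetime import datetime, timezone

def regr_slope(ys, xs):
    """Least-squares slope of y on x; None when the slope is undefined."""
    n = len(xs)
    if n == 0:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all x values identical: no slope can be fitted
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical races for one driver: (race_date, average_speed).
races = [
    (datetime(2013, 3, 1, tzinfo=timezone.utc), 100.0),
    (datetime(2013, 3, 11, tzinfo=timezone.utc), 110.0),
    (datetime(2013, 3, 21, tzinfo=timezone.utc), 120.0),
]
epochs = [d.timestamp() for d, _ in races]  # seconds, like EXTRACT(EPOCH ...)
speeds = [s for _, s in races]

slope_per_second = regr_slope(speeds, epochs)
slope_per_day = slope_per_second * 86400  # easier to read than per second
print(round(slope_per_day, 6))
```

A positive slope means the driver's average speed is trending upward over the period; multiplying by 86400 converts it to a change per day.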

Simple alternative

A simple alternative would be to compare the average speed in the first half of the period with the average in the second half:

SELECT driver_id, d.driver_name
      ,round(x.avg1::numeric, 2) AS avg1
      ,round(x.avg2::numeric, 2) AS avg2
      ,round((x.avg2 / x.avg1 - 1) * 100, 2) AS pct_change
FROM   driver d
JOIN  (
   SELECT dr.driver_id
         ,avg(CASE WHEN r.race_date <  '2013-03-16'
              THEN dr.average_speed END) AS avg1
         ,avg(CASE WHEN r.race_date >= '2013-03-16'
              THEN dr.average_speed END) AS avg2
   FROM   race r
   JOIN   driver_race dr USING (race_id)
   WHERE  r.race_date >= '2013-03-01'  -- half-open ranges: race_date is a timestamp,
   AND    r.race_date <  '2013-04-01'  -- so BETWEEN on dates would skip March 15 and 31
   GROUP  BY dr.driver_id
   ) x USING (driver_id)
ORDER  BY pct_change DESC;

The CASE expression has no ELSE branch, because the default NULL serves just fine: avg() ignores NULL values.
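The conditional aggregation can be pictured like this in Python: each CASE yields the speed for rows in its half and NULL (None here) for the others, and the average simply skips the missing values. The rows below are invented for one driver, and splitting the month at March 16 is an assumption of this sketch:

```python
from datetime import datetime

def avg(values):
    """Average that skips None, like SQL avg(); None if nothing is left."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None

# Hypothetical rows for one driver: (race_date, average_speed).
rows = [
    (datetime(2013, 3, 2), 100.0),
    (datetime(2013, 3, 9), 104.0),
    (datetime(2013, 3, 20), 110.0),
    (datetime(2013, 3, 27), 112.0),
]
split = datetime(2013, 3, 16)

# CASE without ELSE: rows outside the half contribute None (NULL).
avg1 = avg([s if d < split else None for d, s in rows])
avg2 = avg([s if d >= split else None for d, s in rows])
pct_change = round((avg2 / avg1 - 1) * 100, 2)
print(avg1, avg2, pct_change)
```

If a driver has no race in one half, the corresponding average is None (NULL in SQL), and the percentage change for that driver would be NULL as well.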

-> SQLfiddle demo.