Postgresql querying trends

postgresqlpostgresql-9.0

Firstly apologies if this is a duplicate, I am fairly new to SQL and so Im not sure what the correct terminology to use in my searches

So I have a database which records motor races, with the following simplified schema

race_table
==========
race_id PK
race_date timestamp
average_speed Decimal
max_speed Decimal

drivers
=======
driver_id PK
driver_name text
date_started timestamp

driver_races
============
driver_id FK
race_id FK

If each driver has ~1000 races spread over 2/3 years

How would I go about querying the overall % change (positive of negative) in their average race speed for a given date range, for example

% Change in first 6 months

Joe Smith - 5% increase
Andy James - 4% increase

% Change in first 12 months

Joe Smith - 8% increase
Lewis May - 6% increase

UPDATE: More detail on % Change

The % Change could be calculated using linear regression, (least-squares-fit would be suitable), the average change is effectivly the y-difference on a line-of-best-fit, where each point is a race, x is the race_date and y is the average_speed for that race.

Postgres's regr_slope will give the gradient of the line which is effectivly the same as the %change

SELECT regr_slope(r.average_speed, EXTRACT(EPOCH FROM r.race_date)) as slope
    FROM race_table as r, driver_races as dr
    WHERE dr.race_id = r.race_id
    AND d.driver_id = 1

This gives the figure I want, but I now need to apply is against all users, sorted by 'slope'

Best Answer

Data model

You are saving avg. and max. speed per race, not per driver? Normally this would be something like:

CREATE TABLE race (
  race_id   serial PRIMARY KEY
 ,race_date timestamp
);

CREATE TABLE driver (
  driver_id    serial PRIMARY KEY
 ,driver_name  text
 ,date_started timestamp
);

CREATE TABLE driver_race (
  driver_id int REFERENCES driver(driver_id)
 ,race_id   int REFERENCES race(race_id)
 ,average_speed numeric
 ,max_speed numeric
);

This is also how you should provide your data model: as CREATE scripts ready to use for testing.

numeric and decimal are the same in Postgres.

Query

Your query extended to all drivers:

SELECT driver_id, d.driver_name, x.slope
FROM   driver d
JOIN  (
   SELECT dr.driver_id
         ,regr_slope(dr.average_speed, EXTRACT(EPOCH FROM r.race_date)) AS slope
   FROM   race r
   JOIN   driver_race dr USING (race_id)
   WHERE  r.race_date BETWEEN '2013-3-1'::date AND '2013-3-31'::date 
   GROUP  BY dr.driver_id
   ) x USING (driver_id)
ORDER  BY slope DESC;

Simple alternative

A simple alternative would be to match the average speed of the first half of a time period to the average of the second half:

SELECT driver_id, d.driver_name
      ,round(x.avg1::numeric, 2) AS avg1
      ,round(x.avg2::numeric, 2) AS avg2
      ,round((x.avg2 / x.avg1 - 1) * 100, 2) AS pct_change
FROM   driver d
JOIN  (
   SELECT dr.driver_id
         ,avg(CASE WHEN r.race_date BETWEEN '2013-3-01'::date AND '2013-3-15'::date
              THEN dr.average_speed END) AS avg1
         ,avg(CASE WHEN r.race_date BETWEEN '2013-3-16'::date AND '2013-3-31'::date
              THEN dr.average_speed END) AS avg2
   FROM    race r
   JOIN   driver_race dr USING (race_id)
   WHERE  r.race_date BETWEEN '2013-3-1'::date AND '2013-3-31'::date 
   GROUP  BY dr.driver_id
   ) x USING (driver_id)
ORDER  BY pct_change DESC;

The CASE expression has no ELSE branch, because the default NULL serves just fine: avg() ignores NULL values.

-> SQLfiddle demo.