PostgreSQL – Improve SQL query

postgresql

I'm looking for ideas on how to speed up my query, which is based on only two tables. I've added indexes and run VACUUM.

Tables:

CREATE TABLE vintage_unit
(
  id integer NOT NULL,
  vintage_unit_date date,
  origination_month timestamp with time zone,
  product character varying,
  customer_type character varying,
  balance numeric,
  CONSTRAINT pk_id_loan PRIMARY KEY (id)
);

CREATE INDEX id_vintage_unit_date  ON vintage_unit USING btree (vintage_unit_date);   
CREATE INDEX ind_customer_type ON vintage_unit USING btree (customer_type COLLATE pg_catalog."default");
CREATE INDEX ind_origination_month ON vintage_unit USING btree  (origination_month);    
CREATE INDEX ind_product ON vintage_unit USING btree (product COLLATE pg_catalog."default");


CREATE TABLE performance_event
(
  id integer,
  event_date date
);

CREATE INDEX ind_event_date ON performance_event USING btree (event_date);
CREATE INDEX ind_id ON performance_event USING btree (id);

Each of the two tables contains 10,000 rows.

Explain plan: http://explain.depesz.com/s/2di

Query: http://sqlfiddle.com/#!1/edece/1

PostgreSQL 9.1.9

Note: the query uses the function time_distance:

CREATE OR REPLACE FUNCTION time_distance(to_date date, from_date date, granularity character varying)
  RETURNS integer AS
$BODY$

declare
  distance  int;

begin

   if granularity = 'month' then
         distance := (extract(months from age(date_trunc('month',to_date), date_trunc('month',from_date))))::int + 12*(extract(years from age(date_trunc('month',to_date), date_trunc('month',from_date))))::int;
   elsif granularity = 'quarter' then
       distance := trunc(((extract(months from age(date_trunc('quarter',to_date), date_trunc('quarter',from_date)))) + 12*(extract(years from age(date_trunc('quarter',to_date), date_trunc('quarter',from_date)))))/3)::integer;
   elsif granularity = 'year' then
       distance := ((trunc(extract(years from age(date_trunc('month',to_date), date_trunc('month',from_date)))) ))::integer;
   else
       distance:= -1;
   end if;

    return distance;

end;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;
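
For reference, a couple of calls showing what the function returns (the dates are arbitrary examples, not taken from my data):

  -- Arbitrary example dates; expected results are shown in the comments.
  select
    time_distance(date '2013-03-15', date '2013-01-10', 'month')   as m,  -- 2
    time_distance(date '2013-05-01', date '2013-01-01', 'quarter') as q,  -- 1
    time_distance(date '2014-02-01', date '2013-01-01', 'year')    as y,  -- 1
    time_distance(date '2014-02-01', date '2013-01-01', 'week')    as w;  -- -1 (unknown granularity)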

Query:

  with vintage_descriptors as (
    select
      origination_month,product,customer_type,balance ,
      vintage_unit_date, 
      generate_series(date_trunc('month',vintage_unit_date),max_vintage_date,'1 month')::date as event_date,
      time_distance(generate_series(date_trunc('month',vintage_unit_date),max_vintage_date,'1 month')::date, vintage_unit_date,'month') as distance
    from(
      select 
        origination_month,product,customer_type,balance ,
        vintage_unit_date, max_vintage_date
      from 
        vintage_unit, 
        (select max(event_date) as max_vintage_date from performance_event) x
      group by 
        origination_month,product,customer_type,balance ,
        vintage_unit_date,
        max_vintage_date
    ) a
  )

  ,vintage_unit_sums as(
      select
        origination_month,product,customer_type,balance ,
        vintage_unit_date, 
        sum(1::int) as vintage_unit_weight,
        count(*) as vintage_unit_count
      from
        vintage_unit
      group by 
        origination_month,product,customer_type,balance ,
        vintage_unit_date
  )

  ,performance_event_sums as(
      select
        origination_month,product,customer_type,balance ,
        vintage_unit_date, 
        date_trunc('month',event_date)::date as event_date, 
        sum(1::int) as event_weight
      from
        vintage_unit
        join performance_event ve using (id)
      group by 
        origination_month,product,customer_type,balance , 
        vintage_unit_date, 
        date_trunc('month',event_date)::date
  )

  ,vintage_csums as (
      select 
        vd.*,
        vs.event_weight,
        sum(coalesce(event_weight,0)) 
          over(partition by origination_month,product,customer_type,balance ,vintage_unit_date order by event_date) as event_weight_csum
      from 
        vintage_descriptors vd
        left join performance_event_sums vs using (origination_month,product,customer_type,balance ,vintage_unit_date,event_date)
  )

  ,aggregation as (
     select
       origination_month,product,customer_type,balance ,
        vd.distance,
        sum(vintage_unit_weight) vintage_unit_weight,
        sum(vintage_unit_count) vintage_unit_count,
        sum(coalesce(event_weight,0)) as event_weight,
        sum(coalesce(event_weight,0)) / sum(vintage_unit_weight) as event_weight_pct,
        sum(coalesce(event_weight_csum,0)) as event_weight_csum,
        sum(coalesce(event_weight_csum,0))/sum(coalesce(vintage_unit_weight,0)) as event_weight_csum_pct
     from
        vintage_descriptors  vd
        join vintage_unit_sums using  (origination_month,product,customer_type,balance , vintage_unit_date)
        left join vintage_csums using (origination_month,product,customer_type,balance , vintage_unit_date, event_date)
     group by
        origination_month,product,customer_type,balance ,
        vd.distance
     order by
        origination_month,product,customer_type,balance ,
        distance
  ) 

  select * , row_number() over(partition by origination_month,product,customer_type , distance order by balance) as rn from aggregation 

Query Explanation
The purpose of this query is to calculate data for vintage analysis. Generally it works with the following data structure:

[Image: general data structure input]

The query I'm using here is actually dynamically created on the fly based on user input.

Process:

  • every combination of origination_month, product, customer_type and balance forms one group I would like to observe (statistics are calculated in the CTE vintage_unit_sums)
  • every group has a number of events (aggregated in performance_event_sums)
  • the goal is to calculate the ratio between the number of events that happened a certain number of months after vintage_unit_date (expressed as the distance between vintage_unit_date and event_date, calculated with the time_distance function) and the number of units (rows) in the given group, counting only those rows where the event could potentially have happened, with the last available data point taken to be the maximum vintage_unit_date across the whole dataset. (Example: if vintage_unit_date is 2013-10-01 for a given group and the maximum vintage_unit_date is 2013-11-01, then the maximum available distance is 1. However, when the same group also contains a row with vintage_unit_date 2013-09-01, then its maximum distance is 2. So when calculating the above ratio for distance 2, I do not want to include rows whose maximum distance is lower than 2.)
  • the CTE vintage_descriptors generates a placeholder for every possible combination of available group and potential distance (because it is not true that every group has an event at every distance); see the sketch after this list.
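
To make the role of vintage_descriptors concrete, here is a minimal stand-alone sketch of that expansion, using the dates from the example above: one vintage_unit_date of 2013-09-01 fanned out to one event_date placeholder per month up to a maximum date of 2013-11-01, together with its distance.

  -- One vintage_unit_date expanded to a placeholder row per month,
  -- up to the maximum available date; the distances come out as 0, 1, 2.
  select
    d::date as event_date,
    time_distance(d::date, date '2013-09-01', 'month') as distance
  from generate_series(date_trunc('month', date '2013-09-01'),
                       timestamp '2013-11-01', interval '1 month') as d;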

About a year ago I thought this could be solved in OLAP and asked this question on StackOverflow. Since then I've created my own solution in R, where the SQL is built dynamically based on user input and the result is sent to PostgreSQL.

Best Answer

It looks like raising work_mem by a few MB (say 30) would speed things up significantly. A few further observations:

  • You could remove the ORDER BY clause from the aggregation CTE.
  • Similarly, the GROUP BY in the subquery of the first CTE looks useless.
  • sum(1) is basically a count over a non-null column, for example the column used in the inner join, so count(*) does the same job.
  • You carry a bunch of non-aggregated, non-grouping columns through the CTEs (for example the vd.* in vintage_csums). It is usually a good idea to join these in at the latest possible step; that way you decrease the memory consumption of the intermediate steps.
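
To make two of those points concrete, a sketch (the 30MB value is only a starting point and should be verified with EXPLAIN ANALYZE on your data, not a measured figure):

  -- Raise work_mem for the current session only, then re-check the plan.
  SET work_mem = '30MB';

  -- performance_event_sums with count(*) instead of sum(1::int):
  -- it counts one per joined row just the same, without the per-row constant.
  select
    origination_month, product, customer_type, balance,
    vintage_unit_date,
    date_trunc('month', event_date)::date as event_date,
    count(*) as event_weight
  from vintage_unit
  join performance_event using (id)
  group by
    origination_month, product, customer_type, balance,
    vintage_unit_date,
    date_trunc('month', event_date)::date;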