I'm looking for ideas on how to speed up my query, which is based on only two tables. I've already added indexes and run VACUUM.
Tables:
CREATE TABLE vintage_unit
(
    id integer NOT NULL,
    vintage_unit_date date,
    origination_month timestamp with time zone,
    product character varying,
    customer_type character varying,
    balance numeric,
    CONSTRAINT pk_id_loan PRIMARY KEY (id)
);
CREATE INDEX id_vintage_unit_date ON vintage_unit USING btree (vintage_unit_date);
CREATE INDEX ind_customer_type ON vintage_unit USING btree (customer_type COLLATE pg_catalog."default");
CREATE INDEX ind_origination_month ON vintage_unit USING btree (origination_month);
CREATE INDEX ind_product ON vintage_unit USING btree (product COLLATE pg_catalog."default");
CREATE TABLE performance_event
(
    id integer,
    event_date date
);
CREATE INDEX ind_event_date ON performance_event USING btree (event_date);
CREATE INDEX ind_id ON performance_event USING btree (id);
Each table contains 10,000 rows.
Explain plan: http://explain.depesz.com/s/2di
Query: http://sqlfiddle.com/#!1/edece/1
PostgreSQL 9.1.9
Note: the query uses the function time_distance:
CREATE OR REPLACE FUNCTION time_distance(to_date date, from_date date, granularity character varying)
RETURNS integer AS
$BODY$
declare
    distance int;
begin
    if granularity = 'month' then
        -- whole months between the first days of the two months
        distance := (extract(months from age(date_trunc('month', to_date), date_trunc('month', from_date))))::int
                    + 12 * (extract(years from age(date_trunc('month', to_date), date_trunc('month', from_date))))::int;
    elsif granularity = 'quarter' then
        -- whole quarters between the first days of the two quarters
        distance := trunc((
                        (extract(months from age(date_trunc('quarter', to_date), date_trunc('quarter', from_date))))
                        + 12 * (extract(years from age(date_trunc('quarter', to_date), date_trunc('quarter', from_date))))
                    ) / 3)::integer;
    elsif granularity = 'year' then
        -- whole years between the first days of the two months
        distance := (trunc(extract(years from age(date_trunc('month', to_date), date_trunc('month', from_date)))))::integer;
    else
        distance := -1;  -- unrecognized granularity
    end if;
    return distance;
end;
$BODY$
LANGUAGE plpgsql IMMUTABLE
COST 100;
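For illustration, the function counts whole calendar periods between the truncated dates and returns -1 for an unrecognized granularity (expected results in the comments):
select time_distance(date '2013-11-15', date '2013-09-01', 'month');   -- 2
select time_distance(date '2013-11-15', date '2013-09-01', 'quarter'); -- 1 (Q4 2013 vs Q3 2013)
select time_distance(date '2013-11-15', date '2013-09-01', 'year');    -- 0
select time_distance(date '2013-11-15', date '2013-09-01', 'week');    -- -1 ('week' is not handled)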
Query:
with vintage_descriptors as (
    select
        origination_month, product, customer_type, balance,
        vintage_unit_date,
        generate_series(date_trunc('month', vintage_unit_date), max_vintage_date, '1 month')::date as event_date,
        time_distance(generate_series(date_trunc('month', vintage_unit_date), max_vintage_date, '1 month')::date, vintage_unit_date, 'month') as distance
    from (
        select
            origination_month, product, customer_type, balance,
            vintage_unit_date, max_vintage_date
        from
            vintage_unit,
            (select max(event_date) as max_vintage_date from performance_event) x
        group by
            origination_month, product, customer_type, balance,
            vintage_unit_date,
            max_vintage_date
    ) a
)
, vintage_unit_sums as (
    select
        origination_month, product, customer_type, balance,
        vintage_unit_date,
        sum(1::int) as vintage_unit_weight,
        count(*) as vintage_unit_count
    from
        vintage_unit
    group by
        origination_month, product, customer_type, balance,
        vintage_unit_date
)
, performance_event_sums as (
    select
        origination_month, product, customer_type, balance,
        vintage_unit_date,
        date_trunc('month', event_date)::date as event_date,
        sum(1::int) as event_weight
    from
        vintage_unit
        join performance_event ve using (id)
    group by
        origination_month, product, customer_type, balance,
        vintage_unit_date,
        date_trunc('month', event_date)::date
)
, vintage_csums as (
    select
        vd.*,
        vs.event_weight,
        sum(coalesce(event_weight, 0))
            over (partition by origination_month, product, customer_type, balance, vintage_unit_date
                  order by event_date) as event_weight_csum
    from
        vintage_descriptors vd
        left join performance_event_sums vs using (origination_month, product, customer_type, balance, vintage_unit_date, event_date)
)
, aggregation as (
    select
        origination_month, product, customer_type, balance,
        vd.distance,
        sum(vintage_unit_weight) as vintage_unit_weight,
        sum(vintage_unit_count) as vintage_unit_count,
        sum(coalesce(event_weight, 0)) as event_weight,
        sum(coalesce(event_weight, 0)) / sum(vintage_unit_weight) as event_weight_pct,
        sum(coalesce(event_weight_csum, 0)) as event_weight_csum,
        sum(coalesce(event_weight_csum, 0)) / sum(coalesce(vintage_unit_weight, 0)) as event_weight_csum_pct
    from
        vintage_descriptors vd
        join vintage_unit_sums using (origination_month, product, customer_type, balance, vintage_unit_date)
        left join vintage_csums using (origination_month, product, customer_type, balance, vintage_unit_date, event_date)
    group by
        origination_month, product, customer_type, balance,
        vd.distance
    order by
        origination_month, product, customer_type, balance,
        distance
)
select *,
       row_number() over (partition by origination_month, product, customer_type, distance order by balance) as rn
from aggregation
Query Explanation
The purpose of this query is to calculate data for vintage analysis. Generally it works with the data structure shown in the two tables above.
The query I'm using here is actually dynamically created on the fly based on user input.
Process:
- Every combination of origination_month, product, customer_type and balance is one group I would like to observe (statistics are calculated in CTE vintage_unit_sums).
- Every group has a number of events (aggregated in performance_event_sums).
- The goal is to calculate the ratio between the events which happened a certain number of months after vintage_unit_date (expressed as the distance between vintage_unit_date and event_date, calculated using the time_distance function) and the number of units (rows) in the given group, counting only those rows where the event could potentially have happened in the past; the last available data point is considered to be the maximum value of vintage_unit_date across the whole dataset. (Example: if vintage_unit_date is 2013-10-01 for a given group and the maximum vintage_unit_date is 2013-11-01, the maximum available distance is 1. However, when the same group also has a row with vintage_unit_date 2013-09-01, the maximum distance is 2. When calculating the above-mentioned ratio for distance 2, I do not want to include the rows whose maximum distance is lower than 2.)
- CTE vintage_descriptors is used to generate a placeholder for every possible combination of available group and potential distance (because it is not true that there is an event for every distance in every group); a minimal sketch of this expansion is shown after this list.
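To make the placeholder expansion concrete, here is a minimal, self-contained sketch (the dates are invented for illustration; in the real query the upper bound comes from max(event_date) over performance_event). A unit dated 2013-09-15 expands into one row per candidate month:
select gs::date as event_date,
       time_distance(gs::date, date '2013-09-15', 'month') as distance
from generate_series(date_trunc('month', date '2013-09-15'),
                     date '2013-11-01', interval '1 month') as gs;
-- event_date | distance
-- 2013-09-01 | 0
-- 2013-10-01 | 1
-- 2013-11-01 | 2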
About a year ago I thought this could be solved in OLAP and asked this question on StackOverflow. Since then I've created my own solution in R, where the SQL is created dynamically based on user input and sent to PostgreSQL.
Best Answer
It looks like raising work_mem by some MBs (say 30) would speed things up significantly. You could also remove the ORDER BY clause from the aggregation CTE. Similarly, the GROUP BY in the subquery of the first CTE looks useless. Also, sum(1) is basically a count on a non-null column, for example the column used in the inner join. Furthermore, it looks like you carry a bunch of non-aggregated, non-grouping columns through the CTEs; it is usually a good idea to join these in at the latest possible step, which decreases memory consumption in the intermediate steps. I mean, for example, the vd.* in vintage_csums.
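For illustration, a sketch of two of these suggestions, using the vintage_unit_sums CTE from the question (the 30MB value is a starting point to experiment with, not a tuned setting):
-- raise work_mem for the current session only
SET work_mem = '30MB';

-- sum(1::int) replaced by count(*): both count the grouped rows,
-- count(*) just states it directly and avoids the per-row addition
with vintage_unit_sums as (
    select origination_month, product, customer_type, balance,
           vintage_unit_date,
           count(*) as vintage_unit_weight,
           count(*) as vintage_unit_count
    from vintage_unit
    group by origination_month, product, customer_type, balance,
             vintage_unit_date
)
select * from vintage_unit_sums;
The ORDER BY inside the aggregation CTE can simply be deleted, since the window function in the final select specifies its own ordering.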