SQL Server – better way to create dynamic time averaged series over a dynamic frequency

Tags: query, sql-server, t-sql

I've got a series of tables with lots of high-precision data collected from various devices. The interval at which they were collected varies and even wanders over the time series. My users want the ability to pick a date range and get the average / min / max of these variables at a specific frequency. This is the second stab I've taken at this, and it works, but I wonder if there is a better/faster way to accomplish it?

declare @start datetime
declare @end datetime
set @start = '3/1/2012'
set @end = '3/3/2012'
declare @interval int
set @interval = 300   -- bucket width in seconds

declare @tpart table(
dt datetime           -- one row per bucket midpoint
);

-- generate one timestamp every @interval seconds from @start to @end
with CTE_TimeTable
as
(
select @start as [date]
union all
select dateadd(ss,@interval, [date])
from CTE_TimeTable
where DateAdd(ss,@interval, [date]) <= @end
)
insert into @tpart
select [date] from CTE_TimeTable
OPTION (MAXRECURSION 0);


select t.dt, avg(c.x1), min(c.x1), max(c.x1), avg(c.x2), min(c.x2), max(c.x2)
from clean.data c,
     @tpart t
where ABS(DATEDIFF(ss, t.dt, c.Date)) <= @interval / 2
  and c.Date >= @start
  and c.Date <= @end
group by t.dt

Right now, over the 32,721 rows in this 3-day period, this query takes about 43 seconds to run and gives me the 577 rows I expect, but I'd like to get it faster. The big hit comes from the nested loop performing the inner join.
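For what it's worth, here is roughly how I'm measuring it; these are standard session settings, nothing specific to my schema:

set statistics io on;
set statistics time on;

-- ... run the aggregation query above ...

set statistics time off;
set statistics io off;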

Best Answer

Your join between the tables is wrapped in a function, making it really hard for the optimizer to do anything smart with it. I guess it has to compare every row in one table against every row in the other table.
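To make that concrete, compare the two forms of the predicate (same columns as in your query; the second form is the one used in the full query below):

-- the function hides the range, so it must be evaluated for every row pair
where abs(datediff(ss, t.dt, c.Date)) <= @interval / 2

-- a plain range lets the optimizer seek on c.Date
where c.Date >= t.dt
  and c.Date < dateadd(second, @interval, t.dt)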

Rewriting your join with a range check should be a lot faster. I also added a primary key to the time table to remove a sort operation from the query plan, and I turned your table variable into a temp table; the difference in my tests was that the query plan started to use parallelism.

declare @start datetime;
declare @end datetime;
set @start = '20120301';
set @end = '20120303';
declare @interval int;
set @interval = 300;

create table #tpart
(
  dt datetime primary key   -- removes a sort operation from the plan
);

with CTE_TimeTable
as
(
  select @start as [date]
  union all
  select dateadd(second, @interval, [date])
  from CTE_TimeTable
  where dateadd(second, @interval, [date]) <= @end
)
insert into #tpart
select [date]
from CTE_TimeTable
option (maxrecursion 0);

select t.dt, avg(c.x1), min(c.x1), max(c.x1), avg(c.x2), min(c.x2), max(c.x2)
from clean.data c
  inner join #tpart t 
    on c.Date >= t.dt and
       c.Date < dateadd(second, @interval, t.dt)
group by t.dt;

drop table #tpart;

Note: this query does not return exactly the same intervals as yours does. It divides the date range into equally sized parts, where your query had one half interval at the start and one half interval at the end of the range. It is of course possible to modify the query to be equivalent to yours if that is desired.
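For example, a minimal sketch of that variant, shifting each bucket back by half an interval so t.dt is the bucket midpoint again; the boundary behaviour still differs slightly, since your ABS(DATEDIFF(...)) test was inclusive at both ends:

select t.dt, avg(c.x1), min(c.x1), max(c.x1), avg(c.x2), min(c.x2), max(c.x2)
from clean.data c
  inner join #tpart t
    on c.Date >= dateadd(second, -@interval / 2, t.dt) and
       c.Date < dateadd(second, @interval / 2, t.dt)
group by t.dt;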

Update

I tested on a table with a total of 1,036,801 rows, 34,560 of which fall in the interval 2012-03-01 to 2012-03-03. In my tests the original query takes 4.1 seconds; the query above takes 0.1 seconds.
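As a side note (I did not include this in the timings above): if the time table is only used for grouping, you can skip it entirely and compute each row's bucket start arithmetically. A sketch:

-- assign each row to a bucket via integer division; empty buckets
-- simply don't appear (the inner join version drops them as well)
select dateadd(second,
               (datediff(second, @start, c.Date) / @interval) * @interval,
               @start) as bucket_start,
       avg(c.x1), min(c.x1), max(c.x1),
       avg(c.x2), min(c.x2), max(c.x2)
from clean.data c
where c.Date >= @start
  and c.Date < @end
group by dateadd(second,
                 (datediff(second, @start, c.Date) / @interval) * @interval,
                 @start);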

Script to generate the test data:

create table clean.data   -- assumes the clean schema already exists
(
    Date datetime primary key,
    x1 int,
    x2 int
);

go

-- one row every 5 seconds from 2012-02-01 through 2012-04-01 (1,036,801 rows)
with C as
(
  select cast('20120201' as datetime) as D
  union all
  select dateadd(second, 5, D)
  from C
  where D < '20120401'
)
insert into clean.data(Date, x1, x2)
select D, checksum(newid()) % 1000, checksum(newid()) % 1000
from C
option (maxrecursion 0);
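
A quick sanity check of the generated data against the row counts quoted above (60 days at one row per 5 seconds overall, and 2 × 86,400 / 5 = 34,560 rows in the tested range):

select count(*) from clean.data;                     -- expect 1036801
select count(*) from clean.data
where Date >= '20120301' and Date < '20120303';      -- expect 34560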