SQL Server – better way to create dynamic time averaged series over a dynamic frequency

Tags: query, sql-server, t-sql

I've got a series of tables with lots of high-precision data collected from various devices. The interval at which they were collected varies and even wanders over the time series. My users want the ability to pick a date range and get the average / min / max of these variables at a specific frequency. This is the second stab I've taken at this, and it works, but I wonder if there is a better/faster way to accomplish it?

declare @start datetime
declare @end datetime
set @start = '3/1/2012'
set @end = '3/3/2012'
declare @interval int
set @interval = 300   -- bucket width in seconds

declare @tpart table(
dt datetime           -- one row per bucket midpoint
);

-- generate one timestamp every @interval seconds from @start to @end
with CTE_TimeTable
as
(
select @start as [date]
union all
select dateadd(ss,@interval, [date])
from CTE_TimeTable
where DateAdd(ss,@interval, [date]) <= @end
)
insert into @tpart
select [date] from CTE_TimeTable
OPTION (MAXRECURSION 0);


select t.dt, avg(c.x1), min(c.x1), max(c.x1), avg(c.x2), min(c.x2), max(c.x2)
from clean.data c,
     @tpart t
where ABS(DATEDIFF(ss, t.dt, c.Date)) <= @interval / 2
  and c.Date >= @start
  and c.Date <= @end
group by t.dt

Right now, over the 32,721 rows in this 3-day period, this query takes about 43 seconds to run and gives me the 577 rows I expect, but I'd like to get it faster. The big hit comes from the nested loop performing the inner join.
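For what it's worth, here is roughly how I'm measuring it; these are standard session settings, nothing specific to my schema:

set statistics io on;
set statistics time on;

-- ... run the aggregation query above ...

set statistics time off;
set statistics io off;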

Best Answer

Your join between the tables is wrapped in a function, making it really hard for the optimizer to do anything smart with it. I guess it has to compare every row in one table against every row in the other table.
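To make that concrete, compare the two forms of the predicate (same columns as in your query; the second form is the one used in the full query below):

-- the function hides the range, so it must be evaluated for every row pair
where abs(datediff(ss, t.dt, c.Date)) <= @interval / 2

-- a plain range lets the optimizer seek on c.Date
where c.Date >= t.dt
  and c.Date < dateadd(second, @interval, t.dt)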

Rewriting your join with a range check should be a lot faster. I also added a primary key to the time table to remove a sort operation from the query plan, and I turned your table variable into a temp table; the difference in my tests was that the query plan started to use parallelism.

declare @start datetime;
declare @end datetime;
set @start = '20120301';
set @end = '20120303';
declare @interval int;
set @interval = 300;

create table #tpart
(
  dt datetime primary key   -- removes a sort operation from the plan
);

with CTE_TimeTable
as
(
  select @start as [date]
  union all
  select dateadd(second, @interval, [date])
  from CTE_TimeTable
  where dateadd(second, @interval, [date]) <= @end
)
insert into #tpart
select [date]
from CTE_TimeTable
option (maxrecursion 0);

select t.dt, avg(c.x1), min(c.x1), max(c.x1), avg(c.x2), min(c.x2), max(c.x2)
from clean.data c
  inner join #tpart t 
    on c.Date >= t.dt and
       c.Date < dateadd(second, @interval, t.dt)
group by t.dt;

drop table #tpart;

Note: this query does not return exactly the same intervals as yours does. It divides the date range into equally sized parts, where your query had one half interval at the start and one half interval at the end of the range. It is of course possible to modify the query to be equivalent to yours if that is desired.
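For example, a minimal sketch of that variant, shifting each bucket back by half an interval so t.dt is the bucket midpoint again; the boundary behaviour still differs slightly, since your ABS(DATEDIFF(...)) test was inclusive at both ends:

select t.dt, avg(c.x1), min(c.x1), max(c.x1), avg(c.x2), min(c.x2), max(c.x2)
from clean.data c
  inner join #tpart t
    on c.Date >= dateadd(second, -@interval / 2, t.dt) and
       c.Date < dateadd(second, @interval / 2, t.dt)
group by t.dt;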

Update

I tested on a table with a total of 1,036,801 rows, 34,560 of which fall in the interval 2012-03-01 to 2012-03-03. In my tests the original query takes 4.1 seconds; the query above takes 0.1 seconds.
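As a side note (I did not include this in the timings above): if the time table is only used for grouping, you can skip it entirely and compute each row's bucket start arithmetically. A sketch:

-- assign each row to a bucket via integer division; empty buckets
-- simply don't appear (the inner join version drops them as well)
select dateadd(second,
               (datediff(second, @start, c.Date) / @interval) * @interval,
               @start) as bucket_start,
       avg(c.x1), min(c.x1), max(c.x1),
       avg(c.x2), min(c.x2), max(c.x2)
from clean.data c
where c.Date >= @start
  and c.Date < @end
group by dateadd(second,
                 (datediff(second, @start, c.Date) / @interval) * @interval,
                 @start);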

Script to generate the test data:

create table clean.data   -- assumes the clean schema already exists
(
    Date datetime primary key,
    x1 int,
    x2 int
);

go

-- one row every 5 seconds from 2012-02-01 through 2012-04-01 (1,036,801 rows)
with C as
(
  select cast('20120201' as datetime) as D
  union all
  select dateadd(second, 5, D)
  from C
  where D < '20120401'
)
insert into clean.data(Date, x1, x2)
select D, checksum(newid()) % 1000, checksum(newid()) % 1000
from C
option (maxrecursion 0);
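
A quick sanity check of the generated data against the row counts quoted above (60 days at one row per 5 seconds overall, and 2 × 86,400 / 5 = 34,560 rows in the tested range):

select count(*) from clean.data;                     -- expect 1036801
select count(*) from clean.data
where Date >= '20120301' and Date < '20120303';      -- expect 34560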