SQL Server – Return Most Recent Batch of Sequential Rows

sql server, sql server 2014

Version:
Microsoft SQL Server 2014 (SP3-CU4) (KB4500181)

What I'm trying to accomplish here is to pull in all the rows where the gap to the next row is less than, let's say, 40 days.

In the first example the rows arrive roughly every month, with no gap of more than 40 days, so I would like the query to pull in all of the records.

Example 1

CREATE table #Temp1
(
id Varchar(50),
DT8 int,
series int
) 

INSERT INTO #Temp1
VALUES
('147501',20181022,1),
('147502',20181115,1),
('147503',20181211,1),
('147504',20190110,1),
('147506',20190306,1),
('147505',20190205,1),
('147507',20190402,1),
('147508',20190429,1),
('147509',20190529,1),
('147510',20190603,1),
('147511',20190703,1),
('147512',20190802,1),
('177901',20190905,1),
('177902',20191001,1),
('177903',20191108,1),
('177904',20191205,1),
('190401',20200103,1),
('190402',20200205,1)

In example two there is a gap greater than 40 days between September 25, 2019 and January 29, 2020, so I would like the query to pull in only the most recent consecutive block. In this case that is just the latest record (20200129).

Example 2

INSERT INTO #Temp1
VALUES
('164701',20190605,2),
('164702',20190703,2),
('164703',20190802,2),
('164704',20190904,2),
('164705',20190925,2),
('189101',20200129,2)

I have started down this road and was looking at using LEAD to calculate the number of days between the current row and the next one. I realize I probably need to either break the years out to handle the move into a new year, or convert the value to a real date so that SQL date functions can calculate the difference in days for me.

After that I wasn't sure how to go about returning only the most recent consecutive block. I thought I would ask here to see if anyone has any insight on how to accomplish this.
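
For the date conversion mentioned above, a minimal T-SQL sketch follows. It assumes DT8 always holds a valid yyyymmdd integer; the variable names are made up for illustration. Casting to char(8) and converting with style 112 yields a real date, so DATEDIFF handles the year boundary without any special casing.

```sql
-- DT8 stores dates as yyyymmdd integers, e.g. 20191205.
-- CAST to char(8), then CONVERT with style 112 (yyyymmdd) gives a date.
DECLARE @d1 int = 20191205,   -- hypothetical sample values taken
        @d2 int = 20200103;   -- from series 1 in the question

SELECT
    DaysBetween = DATEDIFF(DAY,
                      CONVERT(date, CAST(@d1 AS char(8)), 112),
                      CONVERT(date, CAST(@d2 AS char(8)), 112));
-- Returns 29, even though the two values fall in different years.
```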

Best Answer

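-- Note: written in MySQL 8 syntax (backticks, STR_TO_DATE, two-argument DATEDIFF).
-- The table test and column `date` apparently stand in for #Temp1 and DT8.
WITH 
cte1 AS 
(
-- diff = days from each row to the next (later) row in the same series;
-- NULL for the latest row of a series
SELECT series, 
       id, 
       `date`, 
       DATEDIFF(LEAD(STR_TO_DATE(`date`, '%Y%m%d')) OVER (PARTITION BY series
                                                          ORDER BY `date`),
                STR_TO_DATE(`date`, '%Y%m%d')) diff
FROM test
),
cte2 AS 
(
-- grp = running count, from the latest row backwards, of gaps over 40 days
-- (MySQL evaluates diff > 40 as 1 or 0)
SELECT series, 
       id, 
       `date`, 
       COALESCE(SUM(diff > 40) OVER (PARTITION BY series
                                     ORDER BY `date` DESC), 0) grp
FROM cte1
)
-- grp = 0 keeps only the rows after the most recent gap
SELECT series, 
       id, 
       `date` 
FROM cte2 
WHERE grp = 0
ORDER BY 1,3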
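
The answer above is written in MySQL-style syntax, while the question targets SQL Server 2014. A possible T-SQL translation is sketched below. It is not part of the original answer: the CTE names, the @Rows table variable (a stand-in for #Temp1, loaded with a few rows copied from the question), and the style-112 date conversion are all choices made here for illustration.

```sql
-- Stand-in for #Temp1 with a few rows copied from the question.
DECLARE @Rows table (id varchar(50), DT8 int, series int);

INSERT @Rows (id, DT8, series)
VALUES
    ('177904', 20191205, 1),
    ('190401', 20200103, 1),
    ('190402', 20200205, 1),
    ('164704', 20190904, 2),
    ('164705', 20190925, 2),
    ('189101', 20200129, 2);

WITH Dated AS
(
    -- Turn the yyyymmdd integer into a real date
    SELECT R.series, R.id, R.DT8,
        TheDate = CONVERT(date, CAST(R.DT8 AS char(8)), 112)
    FROM @Rows AS R
),
Gaps AS
(
    -- diff = days to the next (later) row in the series; NULL for the latest row
    SELECT D.series, D.id, D.DT8,
        diff = DATEDIFF(DAY, D.TheDate,
                   LEAD(D.TheDate) OVER (PARTITION BY D.series
                                         ORDER BY D.TheDate))
    FROM Dated AS D
),
Grouped AS
(
    -- grp = running count, from the latest row backwards, of gaps over 40 days
    SELECT G.series, G.id, G.DT8,
        grp = SUM(CASE WHEN G.diff > 40 THEN 1 ELSE 0 END)
              OVER (PARTITION BY G.series
                    ORDER BY G.DT8 DESC
                    ROWS UNBOUNDED PRECEDING)
    FROM Gaps AS G
)
SELECT G.series, G.id, G.DT8
FROM Grouped AS G
WHERE G.grp = 0          -- rows after the most recent gap
ORDER BY G.series, G.DT8;
```

With these sample rows the query should return all three rows of series 1 and only 189101 for series 2. To run it against the question's data, replace @Rows with #Temp1.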


Related Solutions

SQL Server Recursive Query – How to Find 90-Day Gaps Between Rows

As I read the question, the basic recursive algorithm required is:

1. Return the row with the earliest date in the set
2. Set that date as "current"
3. Find the row with the earliest date more than 90 days after the current date
4. Repeat from step 2 until no more rows are found

This is relatively easy to implement with a recursive common table expression.

For example, using the following sample data (based on the question):

DECLARE @T AS table (TheDate datetime PRIMARY KEY);

INSERT @T (TheDate)
VALUES
    ('2014-01-01 11:00'),
    ('2014-01-03 10:00'),
    ('2014-01-04 09:30'),
    ('2014-04-01 10:00'),
    ('2014-05-01 11:00'),
    ('2014-07-01 09:00'),
    ('2014-07-31 08:00');

The recursive code is:

WITH CTE AS
(
    -- Anchor:
    -- Start with the earliest date in the table
    SELECT TOP (1)
        T.TheDate
    FROM @T AS T
    ORDER BY
        T.TheDate

    UNION ALL

    -- Recursive part   
    SELECT
        SQ1.TheDate
    FROM 
    (
        -- Recursively find the earliest date that is 
        -- more than 90 days after the "current" date
        -- and set the new date as "current".
        -- ROW_NUMBER + rn = 1 is a trick to get
        -- TOP in the recursive part of the CTE
        SELECT
            T.TheDate,
            rn = ROW_NUMBER() OVER (
                ORDER BY T.TheDate)
        FROM CTE
        JOIN @T AS T
            ON T.TheDate > DATEADD(DAY, 90, CTE.TheDate)
    ) AS SQ1
    WHERE
        SQ1.rn = 1
)
SELECT 
    CTE.TheDate 
FROM CTE
OPTION (MAXRECURSION 0);

The results are:

╔═════════════════════════╗
║         TheDate         ║
╠═════════════════════════╣
║ 2014-01-01 11:00:00.000 ║
║ 2014-05-01 11:00:00.000 ║
║ 2014-07-31 08:00:00.000 ║
╚═════════════════════════╝

With an index having TheDate as a leading key, the execution plan is very efficient:

[Image: execution plan]

You could choose to wrap this in a function and execute it directly against the view mentioned in the question, but my instincts are against it. Usually, performance is better when you select rows from a view into a temporary table, provide the appropriate index on the temporary table, then apply the logic above. The details depend on the details of the view, but this is my general experience.
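
The temporary-table approach described above could be sketched as follows. The VALUES list is only a stand-in for the view from the question, and the names #Dates and CUQ_Dates are made up for this example. The unique index assumes the dates are distinct, as they are in the sample data.

```sql
-- Stand-in for "SELECT ... INTO #Dates FROM <the view>".
SELECT TheDate = CONVERT(datetime, V.TheDate, 126)
INTO #Dates
FROM (VALUES
    ('2014-01-01T11:00:00'),
    ('2014-01-03T10:00:00'),
    ('2014-01-04T09:30:00'),
    ('2014-04-01T10:00:00'),
    ('2014-05-01T11:00:00'),
    ('2014-07-01T09:00:00'),
    ('2014-07-31T08:00:00')
) AS V (TheDate);

-- Index with TheDate as the leading key (assumes distinct dates).
CREATE UNIQUE CLUSTERED INDEX CUQ_Dates ON #Dates (TheDate);

-- Same recursive logic as above, reading from the temporary table.
WITH CTE AS
(
    SELECT TOP (1)
        T.TheDate
    FROM #Dates AS T
    ORDER BY
        T.TheDate

    UNION ALL

    SELECT
        SQ1.TheDate
    FROM
    (
        SELECT
            T.TheDate,
            rn = ROW_NUMBER() OVER (
                ORDER BY T.TheDate)
        FROM CTE
        JOIN #Dates AS T
            ON T.TheDate > DATEADD(DAY, 90, CTE.TheDate)
    ) AS SQ1
    WHERE
        SQ1.rn = 1
)
SELECT
    CTE.TheDate
FROM CTE
OPTION (MAXRECURSION 0);

DROP TABLE #Dates;
```

With the sample rows this should return the same three dates as the results shown earlier.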

For completeness (and prompted by ypercube's answer) I should mention that my other go-to solution for this type of problem (until T-SQL gets proper ordered set functions) is a SQLCLR cursor (see my answer here for an example of the technique). This performs much better than a T-SQL cursor, and is convenient for those with skills in .NET languages and the ability to run SQLCLR in their production environment. It may not offer much in this scenario over the recursive solution because the majority of the cost is the sort, but it is worth mentioning.

MDX Functions – Fixing Incorrect Max, Average, and StDev Results

I tried doing it in pure MDX, but to no avail, so I had to use an SSAS user-defined function (UDF) to calculate the Avg of Max. The UDF takes as parameters the set to average across (DateSet), the set to take the Max across (TimeSet), and a third set containing the records to be excluded from the Max operation.

The basic code looks like this (showing only the Avg of Max functionality, without error checking):

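using System.Collections.Generic;
using System.Linq;
using Microsoft.AnalysisServices.AdomdServer;

public static class AvgOfMaxMethods
{
    // For each member of avgAcrossSet, takes the max of measureExpression over
    // maxAcrossSet, skipping pairs listed in setToExclude, then averages the maxima.
    public static double AvgOfMax(Set maxAcrossSet, Set avgAcrossSet, Set setToExclude, Expression measureExpression)
    {
        var setToExcludeQueryable = setToExclude.Tuples.OfType<Microsoft.AnalysisServices.AdomdServer.Tuple>();
        IList<double> maxMembers = new List<double>();
        foreach (var avgAcrossTuple in avgAcrossSet.Tuples)
        {
            var max = double.MinValue;
            foreach (var maxAcrossTuple in maxAcrossSet.Tuples)
            {
                // Excluded tuples are expected in (avg member, max member) order
                if (!setToExcludeQueryable.Any(tuple => tuple.Members[0].UniqueName.Equals(avgAcrossTuple.Members[0].UniqueName) && tuple.Members[1].UniqueName.Equals(maxAcrossTuple.Members[0].UniqueName)))
                {
                    // Evaluate the measure at the (avg member, max member) tuple
                    TupleBuilder tb = new TupleBuilder(avgAcrossTuple.Members[0]);
                    tb.Add(maxAcrossTuple.Members[0]);
                    var calculatedVal = measureExpression.Calculate(tb.ToTuple()).ToDouble();
                    max = calculatedVal > max ? calculatedVal : max;
                }
            }
            // Skip members for which every tuple was excluded
            if (!max.Equals(double.MinValue))
            {
                maxMembers.Add(max);
            }
        }
        // Note: Average() throws if maxMembers is empty (no error checking here)
        return maxMembers.Average();
    }
}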

And with this, the query becomes (keeping the original Max and StDev for comparison):

WITH
MEMBER MaxMember as MAX(TimeSet, [Measures].[Value])
MEMBER AvgOfMax as Avg(DateSet, MAX(TimeSet, [Measures].[Value]))
MEMBER StDevOfMax as StDev(DateSet, MAX(TimeSet, [Measures].[Value]))
Member MaxFromUDF as UDFPoc.UDFPoc.AvgOfMaxMethods.AvgOfMax(TimeSet, DateSet, SetToExclude, [Measures].[Value])

SET DateSet as EXISTING [Dim Date].[Date Id].[Date Id]
SET TimeSet as EXISTING [Dim Time].[Time Id].[Time Id]
SET SetToExclude as {[Dim Date].[Date Id].&[20150103] * [Dim Time].[Time Id].&[1] : [Dim Time].[Time Id].&[2]}

select
{
    [Measures].[Value], MaxMember, AvgOfMax, StDevOfMax, MaxFromUDF
}
on 0,
{
    DateSet * TimeSet
} on 1

FROM 
(
    SELECT ({[Dim Date].[Date Id].&[20150101] : [Dim Date].[Date Id].&[20150103]}, {[Dim Time].[Time Id].&[1] : [Dim Time].[Time Id].&[4]})
    on 0
    FROM [Test Date Time Dimensions]
)

And the results look like this (the tuples of the excluded set are highlighted in red):

[Image: query results]

As can be seen, the UDF excludes the tuples in the SetToExclude set when calculating the Max and the Average, so the average is now calculated correctly with the excluded set taken into account. StDev can be calculated in a similar manner.

See this as a guide to create SSAS User Defined Functions.
