Sql-server – Query to calculate start and end dates from a list of only start dates

sql serversql-server-2008-r2t-sql

I have data in the following form that needs to be combined such that I create a single row from two rows of the input data where the second column is the value of the next row less one day.

I've struggled trying to come up with a way to do this, but not gotten very far. I hope this illustration explains the problem:

Table structure

CREATE TABLE Process(
    ProcessOrder int NOT NULL IDENTITY (1, 1),
    ProcessDate date NOT NULL
);

Query

SELECT ProcessOrder, ProcessDate From Process ORDER BY ProcessOrder;

Data

ProcessOrder  ProcessDate
------------  -----------
208           2016-01-04
209           2016-01-11
210           2016-01-18
211           2016-01-25
212           2016-02-01
213           2016-02-08

I need to get the above data into the following shape (note that ProcessEnd is one day less than ProcessBegin from the subsequent record):

ProcessOrder  ProcessBegin  ProcessEnd
------------  ------------  ----------
208           2016-01-04    2016-01-10
209           2016-01-11    2016-01-17
210           2016-01-18    2016-01-24
211           2016-01-25    2016-01-31
212           2016-02-01    2016-02-07
213           2016-02-08    *NULL*

EDIT

I am guaranteed that values in the ProcessOrder column are consecutive with no missing values.

Best Answer

Since you're on 2008, you don't have the LEAD function yet (I believe), so you can do it yourself with a CTE:

WITH RankedProcess AS
(
  SELECT ProcessOrder, ProcessDate, ROW_NUMBER() OVER(ORDER BY ProcessDate) AS Seq
  FROM Process
)
SELECT P1.ProcessOrder, P1.ProcessDate AS ProcessBegin, DATEADD(day, -1, P2.ProcessDate) AS ProcessEnd
FROM RankedProcess AS P1
  LEFT OUTER JOIN RankedProcess AS P2
    ON P1.Seq = P2.Seq - 1

In 2012 and later, you can just use LEAD:

SELECT ProcessOrder, ProcessDate AS ProcessBegin, LEAD(DATEADD(day, -1, ProcessDate)) OVER(ORDER BY ProcessDate) AS ProcessEnd
FROM Process

Given the update that ProcessOrder is guaranteed to be consecutive with no missing values (and I'm assuming they are in the same order as the dates), you can get away without the CTE and just use the ProcessOrder column. However, the solution with the CTE is more general, which I think is still useful since I am generally skeptical of data requirements that aren't enforced by the database.

Related Solutions

Sql-server – Store n:m data with most n:1 tuples

Using an intermediate table is the proper way to do this. With proper indexes and good maintenance it won't hurt performance much at all.

Avoid the multi-column solution (refid1, refid2 etc.) - that's repeating groups and not properly normalised. Plus your queries will degenerate into a dog's breakfast of OR clauses and CASE statements, which will hurt performance.

You may get some benefit from subtyping your data rows so most use your current arrangement but those with multiple reference use a different one. Your queries will UNION ALL the two sets to get the total. I really doubt the overhead from this will compensate for the mostly-redundant join to data_reference.

The subtyped schema would look like this:

create table data (
    id int identity not null,
    ref int not null,       -- set to a standard value for rows with >1 reference.
    is_single_reference bit not null
);

create table reference (
    refid int not null
);

create table data_reference (
     dataid int not null,
     refid int not null
);

The query would be something like:

select
    <whatever>
from data as d
inner join reference as r
    on d.ref = r.refid
where d.is_single_reference = 1

union all

select
    <whatever>
from data as d
inner join data_reference as dr
    on dr.dataid= d.id
inner join reference as r
    on r.refid = dr.refid
where d.is_single_reference = 0;

This is fairly ugly. You'd be lucky to avoid two scans of data. Ideally you'd want filtered indexes but IIRC these aren't availble until SQL Server 2008. The applicaiton would have to know how many reference rows there were per data, set flags accordingly and maintain those if the number changed. Just use the n:m approach.

Sql-server – varchar storage and comparisons in SQL Server 2008

Add a persistent calculated field that contains a CHECKSUM on the 5 fields, and use that to perform the comparisons.

The CHECKSUM field will be unique for that specific combination of fields, and is stored as an INT that results in a much easier target for comparisons in a WHERE clause.

USE tempdb; /* create this in tempdb since it is just a demo */

CREATE TABLE dbo.t1
(
    Id       bigint constraint PK_t1 primary key clustered identity(1,1)
    , Sequence int
    , Parent   int not null constraint df_T1_Parent DEFAULT ((0))
    , Data1    varchar(20)
    , Data2    varchar(20)
    , Data3    varchar(20)
    , Data4    varchar(20)
    , Data5    varchar(20)
    , CK AS CHECKSUM(Data1, Data2, Data3, Data4, Data5) PERSISTED
);

GO

INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
VALUES (1,1,'test','test2','test3','test4','test5');

SELECT *
FROM dbo.t1;
GO

enter image description here

/* this row will NOT get inserted since it already exists in dbo.t1 */
INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
SELECT 2, 3, 'test', 'test2', 'test3', 'test4', 'test5'
WHERE Checksum('test','test2','test3','test4','test5') NOT IN (SELECT CK FROM t1);

/* still only shows the original row, since the checksum for the row already
exists in dbo.t1 */
SELECT *
FROM dbo.t1;

In order to support a large number of rows, you'd want to create an NON-UNIQUE index on the CK field.

By the way, you neglected to mention the number of rows you are expecting in this table; that information would be instrumental in making great recommendations.

In-row data is limited to a maximum of 8060 bytes, which is the size of a single page of data, less the required overhead for each page. Any single row larger than that will result in some off-page storage of row data. I'm certain other contributors to http://dba.stackexchange.com can give you a much more concise definition of the engine internals regarding storage of large rows. How big is your largest row, presently?

If items in Data1, Data2, Data3... have the same values occurring in a different order, the checksum will be different, so you may want to take that into consideration.

Following a brief discussion with the fantastic Mark Storey-Smith on The Heap, I'd like to offer a similar, although potentially better choice for calculating a hash on the fields in question. You could alternately use the HASHBYTES() function in the calculated column. HASHBYTES() has some gotchas, such as the necessity to concatenate your fields together, including some type of delimiter between the field values, in order to pass HASHBYTES() a single value. For more information about HASHBYTES(), Mark recommended this site. Clearly, MSDN also has some great info at http://msdn.microsoft.com/en-us/library/ms174415.aspx

Best Answer

Related Solutions

Sql-server – Store n:m data with most n:1 tuples

Sql-server – varchar storage and comparisons in SQL Server 2008

Related Question