Sql-server – Writing and reading date ranges for a million items

datemergesql serversql-server-2012

I have a table with one row per day for about a million slowly changing items.

The items are looked up by Item, Date to see if they're present in the table. This is going somewhat slowly for lookups involving many items with different dates–my table has about 260 million rows right now. However, it's valuable to me to be able to insert the data quickly as well as query it quickly.

I'd like to convert this into a table that has usually one row per item, with a date range. (There might be two rows for an item if the date ranges are not contiguous.) A raw import file is available with each day's items by date, matching the existing schema. Each time a new import file is available, I'll update the date ranges of existing rows to cover the new import data. I also need to be able to reimport a file so the information is appropriately updated. In such a file, a date might be added or removed that had previously been present for an item.

While this is a bit of an annoying task, I know that I can come up with the right MERGE statement to update the data appropriately, but it will take a couple of hours of fiddling to be sure I've gotten it right, and it may not be a very fast import process.

So I'm toying with a different schema design that is a bit unusual and wonder if anyone could comment on it—good idea, bad idea, how to work with it, performance characteristics, pitfalls, etc?

Current data layout example

Item   Date 
----   ----------
SLAM   2016-11-01
SLAM   2016-11-02
SLAM   2016-11-03
SLAM   2016-11-05
GULP   2016-11-01
GULP   2016-11-02
GULP   2016-11-03
GULP   2016-11-05

(Notice that 2016-11-04 was skipped so far and will need to be loaded soon.)

This makes inserting and removing data super easy, however, the database is growing very fast–this is not the largest table in size but at almost 8 GB already, it's growing at something like 1 GB a month (there are actually two distinct item codes under different classification systems, so the index for the second item code doubles the space used). Also, querying is taking longer than I want to tolerate.

Intended new data layout

Item   FromDate     ToDateExclusive
----   ----------   ---------------
SLAM   2016-11-01   2016-11-04
SLAM   2016-11-05   2016-11-06
GULP   2016-11-01   2016-11-04
GULP   2016-11-05   2016-11-06

While this is great for space used and querying, when a new import file for 2016-11-04 comes in that has both SLAM and GULP in it, a merge process has to figure out that the new date contiguously joins to the two existing date ranges, then update one row each to contain the entire range and delete the other. That's tricky.

If later a new import file for 2016-11-04 is missing GULP, the single row would have to be split into two rows again, just like above. This situation would be rare, but isn't impossible, especially given files may be imported out of order—they are only fetched on demand—and correction to more than one historical file could yield this situation.

Unusual idea data layout

Item   DateBitPattern (varbinary)
----   --------------
SLAM   0xE8    -- bits: 11101000
GULP   0xE8    -- bits: 11101000

The idea for this unusual layout is that there is exactly one row per item. An anchor date is chosen before which no data will ever be collected (here, 2016-11-01). For every date that the item is valid, a bit will be set for that item representing that date (indexed by the number of days from the anchor date). Updates are easy, as to turn on or off a date for an item, we can just adjust one byte with some bit arithmetic and not worry what the other bytes and bits are. At lookup time, my application can just fetch all the date bit patterns for every desired item, and do its own math to figure out if that item is present for that date. Both inserting dates, removing dates, and lookup are fast–at the cost of denormalizing and having to deal with some clunkiness (opacity about what the data means, being much harder to query for a specific date directly in the database, potential consistency problems or interpretation bugs, etc.).

As time goes on, the varbinary value would get longer and longer. In 20 years, assuming daily imports, it would be about 913 bytes long. That's acceptable to me, as the most items that are looked up at once is in the tens of thousands. Alternately, I could add a Year column that would allow each row to have exactly binary(46). This keeps the table from growing so fast.

What do you think? Alternate strategies? Ideas on how to make the merge easier instead of using this denormalized schema?

Additional Information

Here's the table as it is now.

CREATE TABLE dbo.ItemDate (
   RecordId int identity(1, 1) NOT NULL,
   ReferenceDate date NOT NULL,
   ReferenceType bit NOT NULL,
   ItemOfType1Code char(12) NOT NULL,
   ItemOfType2Code char(7) NOT NULL
);

These two item codes are always exactly 12 characters and exactly 7 characters respectively, thus why char is used in the table.

The data actually arrives in two files, one for each reference type, and I have to record which one it came from so I can keep track of what data has been loaded or not. I glossed over that because it's not important to my current problem. The ItemOfType1Code or ItemOfType2Code always fully implies the ReferenceType—there are no rows with the same ItemOfType1Code that have a different ReferenceType.
The RecordId needs to die. I added it under time pressure originally so the covering non-clustered index on ItemOfType2Code did NOT have to repeat the 12 bytes of the ItemOfType1Code in every row. I now realize my mistake and am splitting this table into two, one for each item type. This should help with performance in any case.
There is another table where I record what has been loaded and what hasn't, it looks like this:
```
CREATE TABLE [dbo].[ItemDatePresence](
   [ReferenceDate] [date] NOT NULL,
   [ReferenceType] [bit] NOT NULL,
   CONSTRAINT [PK_ItemDatePresence] PRIMARY KEY CLUSTERED (
      [ReferenceDate] ASC,
      [ReferenceType] ASC
   )
);
```
I have no plans at this time to get rid of this table as it is very useful to me and helps me to know what is loaded. Though, as I think about it, perhaps I will do so, but when checking whether data is loaded for a date, I'd like to not go out to the millions of rows table but be able to quickly check for presence of loaded data in this table with only 170 x 2 rows, for 170 days of data loaded, skipping weekends. I glossed over the weekends thing because it's not relevant and I can handle that on my own.
The indexes are currently as follows:
```
CREATE UNIQUE CLUSTERED INDEX [CI_ItemDate] ON [dbo].[ItemDate] (
   [ReferenceDate] ASC,
   [ReferenceType] ASC,
   [ItemOfType1Code] ASC,
   [ItemOfType2Code] ASC,
   [RecordId] ASC
);

CREATE NONCLUSTERED INDEX [IX_ItemDate_ItemOfType2Code] ON [dbo].[ItemDate] (
   [ReferenceDate] ASC,
   [ItemOfType2Code] ASC
);
```
There are no other indexes or keys on this table.

I instantly see that my clustered index is hokey. I chose it originally because I needed to be able to load the data in segments corresponding to the ReferenceType-specific input files which can arrive at separate times or which can succeed or fail independently. But the whole reason for adding the RecordId in the first place was to NOT repeat the ItemOfType1Code in every row of the non-clustered index, which I ended up doing anyway, plus the RecordId.
One of the problems with the current data is that there can be multiple of ItemOfType2Code that correspond to a single ItemOfType1Code. This makes the table inefficient for looking up ItemOfType1Code even if my indexes were right (which I fully admit they are not). They can even be blank. For example:
```
ItemOfType1Code  ItemOfType2Code
-----------  -----------
ABC          ZORKMID
ABC          VERMICIOUS
ABC          BOROGOVE
             XYZZY
DEF
```
Thus in the new schema, I will have two tables, one for each type, into which I'll insert the distinct list of codes. There will be a third table which simply has each particular instance of each valid relation between code types 1 and 2—but I don't need help with that.

So far, my new date-ranged tables will probably look close to this:

CREATE TABLE dbo.ItemOfType1DateRanges (
   ItemOfType1Code char(12) NOT NULL,
   ReferenceType bit NOT NULL,
   FromBusinessDate date NOT NULL,
   ToBusinessDateExclusive date NOT NULL,
   PRIMARY KEY CLUSTERED (ItemOfType1Code, FromBusinessDate)
);

CREATE TABLE dbo.ItemOfType2DateRanges (
   ItemOfType2Code char(7) NOT NULL,
   ReferenceType bit NOT NULL,
   FromBusinessDate date NOT NULL,
   ToBusinessDateExclusive date NOT NULL,
   PRIMARY KEY CLUSTERED (ItemOfType2Code, FromBusinessDate)
);

These implicitly reveal that the ReferenceType is is fully implied by the Item codes–I have verified this in the data and logically, knowing the business, it seems impossible that a code could have two different reference types as each code refers to a specific kind of thing that can only be one type.

Now, the above noted index and data problems can easily be resolved and I am confident I can tune those aspects up–please focus on the actually-asked question except where this additional information may help make a recommendation.

To specifically answer @mendosi's question, space IS a concern. I'd prefer to use a strategy that does not put another row in for every item code + day, unless it truly performs well and is easy to maintain, and takes up significantly less than the current 1 GB of data per month. At run-time I only have a list of paired values Date + ItemOfType1Code or Date + ItemOfType2Code and am checking to see if each one is in the table or not. I do not know the ReferenceType—at least at this point I am not sure I can reliably produce or calculate that for each item, although I may be able to in the future from other metadata.

Another Weird Idea

SQL Server 2016 updateable column-store index, Item Code as PK and with the dates as the column-store column?

Of course it will likely be years before we get 2016. Only upgraded all remaining instances to SQL Server 2012 this year.

Best Answer

It looks like you're on the right path. Your datebit pattern idea is an interesting option, I'd like to hear if you make any more progress towards that design.

My suggestion is a pretty simple one. Why not split the ItemOfType out to their own tables and reference them with a surrogate id?. I know it isn't always a popular idea if you have good natural keys available, but in terms of space saving it could reduce your storage requirement a bit.

This was too much for a comment, please don't crucify me.

create table dbo.ItemOfType1 (
    ItemOfType1Id int identity(1,1) not null
  , ItemOfType1Code char(12) not null
  , ReferenceType bit not null
  , constraint pkc_ItemOfType1 primary key clustered (ItemOfType1Id)
  , constraint uq_ItemOfType1_ItemOfType1Code_ReferenceType unique nonclustered (ItemOfType1Code, ReferenceType)
);
create table dbo.ItemOfType2 (
    ItemOfType2Id int identity(1,1) not null
  , ItemOfType2Code char(7) not null
  , ReferenceType bit not null
  , constraint pkc_ItemOfType2 primary key clustered (ItemOfType2Id)
  , constraint uq_ItemOfType2_ItemOfType2Code_ReferenceType unique nonclustered (ItemOfType2Code, ReferenceType)
);
create table dbo.ItemOfType1Dates (
    ItemOfType1Id int  not null
  , ReferenceDate date not null
  , constraint pkc_ItemOfType1Dates primary key clustered (ReferenceDate,ItemOfType1Id)
  , constraint fk_ItemOfType1Dates_ItemOfType1_ItemOfType1Id foreign key (ItemOfType1Id) references dbo.ItemOfType1 (ItemOfType1Id)
);
create table dbo.ItemOfType2Dates (
    ItemOfType2Id int  not null
  , ReferenceDate date not null
  , constraint pkc_ItemOfType2Dates primary key clustered (ReferenceDate,ItemOfType2Id)
  , constraint fk_ItemOfType2Dates_ItemOfType2_ItemOfType2Id foreign key (ItemOfType2Id) references dbo.ItemOfType2 (ItemOfType2Id)
);
/*  -- date ranges alternate
create table dbo.ItemOfType1DateRanges (
    ItemOfType1Id int  not null
  , FromBusinessDate date not null
  , ToBusinessDate date not null     
  , constraint pkc_ItemOfType1DateRanges primary key clustered (FromBusinessDate,ItemOfType1Id)
  , constraint fk_ItemOfType1DateRanges_ItemOfType1_ItemOfType1Id foreign key (ItemOfType1Id) references dbo.ItemOfType1 (ItemOfType1Id)
);
create table dbo.ItemOfType2DateRanges (
    ItemOfType2Id int  not null
  , FromBusinessDate date not null
  , ToBusinessDate date not null     
  , constraint pkc_ItemOfType2DateRanges primary key clustered (FromBusinessDate,ItemOfType2Id)
  , constraint fk_ItemOfType2DateRanges_ItemOfType2_ItemOfType2Id foreign key (ItemOfType2Id) references dbo.ItemOfType2 (ItemOfType2Id)
);    
*/

Related Solutions

Sql-server – Best table schema to store daily hits for many items

I would want to attempt to insert a row for every hit and use SQL partitioning to store the table in managable chronological slices (by week or by month). This way, by keeping the original timestamp for each hit, you'll be able to aggregate the data using different methods as needed (by year one day, by hour the next, etc). For doing the aggregation, I would build a SSAS cube or Excel PivotTable, or T-SQL grouping functions. By using paritioning, you can store data for different time periods into individual filegroups spread across different disks to maintain scalability. When it comes to archival, you can perform what's called a sliding window to detach and later drop old unneeded time periods of data without expensive DELETE operations.

Sql-server – Optimizing queries for 25+ million rows

The question is mainly about how to optimize the select statement:

SELECT [TABLE], [FIELD], [AFTER], [DATE]
FROM mytbl WITH (NOLOCK)
WHERE [TABLE] = 'OTB' AND
[FIELD] = 'STATUS'

Removing the redundant projections and adding the presumed dbo schema:

SELECT [AFTER], [DATE] 
FROM dbo.mytbl WITH (NOLOCK)
WHERE [TABLE] = 'OTB'
AND FIELD = 'STATUS';

Without an index like ([TABLE],[FIELD]) INCLUDE ([AFTER],[DATE]) SQL Server has two main options:

Scan the heap entirely (3GB+); or
Locate rows matching [TABLE] = 'OTB' and [FIELD] = 'STATUS' (using IDX6), then perform a heap (RID) lookup per row to retrieve the [AFTER] and [DATE] columns.

Whether the optimizer chooses a heap scan or index seek with RID lookup depends on the estimated selectivity of the [TABLE] = 'OTB' and [FIELD] = 'STATUS'predicates. Check to see if the estimated number of rows from the seek matches reality. If not, update your statistics. Test the query with an table hint forcing the use of the index, if that condition is reasonably selective. If the optimizer is currently choosing the index seek, test performance with an INDEX(0) or FORCESCAN hint to scan the heap.

Beyond that, you could look to improve the scan of the heap a little by removing some of the unused space (370MB). In SQL Server 2008 this can be done by rebuilding the heap. Unused space in heaps often results from deletes performed without a table lock being taken (without a table lock, empty pages are not deallocated from a heap). Tables that experience frequent deletions are often better stored as a clustered table for this reason.

The performance of the heap scan depends on how much of the table is stored in memory, how much must be read from disk, how full the pages are, the speed of the persistent storage, whether the scan is I/O or CPU bound (parallelism can help).

If performance is still unacceptable after you have investigated all of the above, try to make the case for a new index. If available on your version of SQL Server, a possible filtered index for the given query would be:

CREATE INDEX index_name
ON dbo.mytbl ([DATE],[AFTER])
WHERE [TABLE] = 'OTB'
AND [FIELD] = 'STATUS';

Also consider index compression, if that is available and beneficial. Without a new index of some kind, there's relatively little you can do to improve the performance of the given query.