SQL Server – varchar storage and comparisons in SQL Server 2008

Tags: sql-server, sql-server-2008, varchar

I have a large table with a structure similar to this:

Id       bigint primary key
Sequence int
Parent   int foreign key
Data1    varchar(20)
Data2    varchar(20)
Data3    varchar(20)
Data4    varchar(20)
Data5    varchar(20)

Data1-5 are text fields (mostly numbers and dashes) that are used to help prevent accidentally recording duplicates, but the data can legitimately be duplicated occasionally. We have to let the user confirm that, yes, it's OK that it's a duplicate.

This duplicate check can be time-consuming with millions of rows. I've been tasked with ensuring we can deal with tripling the amount of data we keep in this database. Previous developers have told me the current process will not work with that much data (there is of course a lot more going on with the data than I'm including here, and though the actual query for finding duplicates is pretty simple, this part of the system is apparently sensitive to additional latency). I have not run experiments myself to prove this, but I trust their judgement, and regardless I'd like to reduce the impact as much as possible. Unfortunately, I will not have a large volume of data to test with until fairly late in the process, so while I have time I'd like to be set up to test a few scenarios when I get a window on the test server.

All 5 fields must match for a row to be counted as a duplicate. I'm skeptical that this will never change, but I'm told that for the foreseeable future all 5 must match exactly. I'm thinking that by adding an indexed column that is a hash of the combined values, I will be able to find potential duplicates much faster. I would still have to compare the individual values to account for hash collisions, though. Would I gain anything by also combining the values themselves? That is, a hash column, plus a single column holding all 5 values in a delimited string that I compare against, rather than comparing each value separately?
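To make that concrete, here is a rough sketch of the lookup I have in mind (the table name, the HashCol column, and the @d1-@d5 variables are placeholders, and CHECKSUM() stands in for whatever hash function is ultimately used): the indexed hash narrows the search to a handful of candidates, and the raw columns are compared only for those.

/* Sketch only: dbo.BigTable and HashCol are placeholder names; HashCol is
   assumed to be an indexed, persisted hash of (Data1..Data5). */
DECLARE @d1 varchar(20) = '123-45',
        @d2 varchar(20) = '678',
        @d3 varchar(20) = '9-0',
        @d4 varchar(20) = '11',
        @d5 varchar(20) = '22-33';

SELECT Id
FROM dbo.BigTable
WHERE HashCol = CHECKSUM(@d1, @d2, @d3, @d4, @d5)  -- index seek on the hash
  AND Data1 = @d1 AND Data2 = @d2 AND Data3 = @d3  -- verify the raw values
  AND Data4 = @d4 AND Data5 = @d5;                 -- to rule out collisions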

It seems like I would only benefit from this if varchars were stored separately from the rest of the row, but I always assumed they were stored in-row, and that does appear to be the case. I'm having a hard time finding specific information, but it seems that if my row is less than 8060 bytes, the full values (up to 20 characters each) will be stored in the row; only if the row exceeds 8060 bytes will some varchar columns be chosen to be moved to a separate page.

Any documentation that explains how small varchars are stored, and any suggestions regarding whether or not combining the columns into a single column will have a performance impact would be much appreciated.

EDIT:
This table will have millions of rows (say 10 million as an estimate). Each row will definitely be less than 8060 bytes, though I can't give you an exact number right now.

Best Answer

Add a persisted computed column that contains a CHECKSUM of the 5 fields, and use that to perform the comparisons.

The CHECKSUM column gives you a single INT for that specific combination of fields, which is a much cheaper target for comparisons in a WHERE clause. Note that a checksum is not guaranteed to be unique, so when the checksums match you still compare the underlying values to rule out collisions.

USE tempdb; /* create this in tempdb since it is just a demo */

CREATE TABLE dbo.t1
(
    Id       bigint identity(1,1) not null constraint PK_t1 primary key clustered
    , Sequence int
    , Parent   int not null constraint df_T1_Parent DEFAULT ((0))
    , Data1    varchar(20)
    , Data2    varchar(20)
    , Data3    varchar(20)
    , Data4    varchar(20)
    , Data5    varchar(20)
    , CK AS CHECKSUM(Data1, Data2, Data3, Data4, Data5) PERSISTED
);

GO

INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
VALUES (1,1,'test','test2','test3','test4','test5');

SELECT *
FROM dbo.t1;
GO

(Screenshot: the SELECT returns the single inserted row, including its computed CK value.)

/* this row will NOT get inserted since it already exists in dbo.t1 */
INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
SELECT 2, 3, 'test', 'test2', 'test3', 'test4', 'test5'
WHERE CHECKSUM('test','test2','test3','test4','test5') NOT IN (SELECT CK FROM dbo.t1);

/* still only shows the original row, since the checksum for the row already
exists in dbo.t1 */
SELECT *
FROM dbo.t1;

In order to support a large number of rows, you'd want to create a NON-UNIQUE index on the CK column, for example:
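Something along these lines would do it (the index name is just illustrative, following the demo table above):

CREATE NONCLUSTERED INDEX IX_t1_CK
    ON dbo.t1 (CK);
GO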

By the way, you neglected to mention the number of rows you are expecting in this table; that information would be instrumental in making great recommendations.

In-row data is limited to a maximum of 8060 bytes per row, which is what fits on a single 8 KB data page after the required per-page overhead. Any single row larger than that will have some of its data pushed into off-row (row-overflow) storage. I'm certain other contributors to http://dba.stackexchange.com can give you a much more precise description of the engine internals regarding storage of large rows. How big is your largest row, presently?
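If you want to check that against the real table, one option (a sketch, shown here against the dbo.t1 demo; point it at your actual table) is sys.dm_db_index_physical_stats in DETAILED mode, which reports whether any row-overflow allocation units exist and how big the largest record is:

/* Reports in-row vs. row-overflow allocation units and the largest record size. */
SELECT index_id,
       alloc_unit_type_desc,       -- IN_ROW_DATA vs. ROW_OVERFLOW_DATA
       page_count,
       record_count,
       max_record_size_in_bytes    -- size of the widest row, in bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.t1'), NULL, NULL, 'DETAILED');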

If Data1, Data2, Data3... contain the same values but in a different order, the checksum will be different, so you may want to take that into consideration.
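You can see the order sensitivity for yourself with a quick test (the literal values are arbitrary):

SELECT CHECKSUM('a', 'b') AS ab,
       CHECKSUM('b', 'a') AS ba;   -- the two results will generally differ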

Following a brief discussion with the fantastic Mark Storey-Smith on The Heap, I'd like to offer a similar, though potentially better, choice for calculating a hash on the fields in question. You could alternatively use the HASHBYTES() function in the computed column. HASHBYTES() has some gotchas, such as the need to concatenate your fields together, including some type of delimiter between the field values, in order to pass HASHBYTES() a single value. For more information about HASHBYTES(), Mark recommended this site. Clearly, MSDN also has some great info at http://msdn.microsoft.com/en-us/library/ms174415.aspx
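For illustration only (this sketch is mine, not Mark's; the '|' delimiter and the ISNULL() wrappers are assumptions to cope with NULLs and to keep values like ('ab','c') from hashing the same as ('a','bc')), such a column on the demo table could look like this:

/* Sketch: a persisted HASHBYTES() column. SHA1 output is 20 bytes, so the
   CAST keeps the indexed column narrow. HASHBYTES() accepts a single value,
   hence the delimited concatenation. */
ALTER TABLE dbo.t1
    ADD HB AS CAST(HASHBYTES('SHA1',
          ISNULL(Data1, '') + '|' +
          ISNULL(Data2, '') + '|' +
          ISNULL(Data3, '') + '|' +
          ISNULL(Data4, '') + '|' +
          ISNULL(Data5, '')) AS varbinary(20)) PERSISTED;

CREATE NONCLUSTERED INDEX IX_t1_HB ON dbo.t1 (HB);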