Sql-server – SQL Server: compare names ignoring non-alpha

sql-server sql-server-2008-r2

I have a database table with a text column "title".
New rows are inserted all the time, and every time I would like to check for potential duplicates on this column.

I define a duplicate as "having the same sequence of alpha characters". So, for example, if I already have the string "Lisa's Nights":

"LISASNIGHTS" is a duplicate, because case, spaces, and apostrophes don't matter.
"Lisa's Night" is not a duplicate, because the final 's' is missing.

Of course I could store an altered version of the title and search there. But is there any other solution?

Best Answer

You can do it on the fly without storing the stripped-down name, but that means recalculating the stripped strings (e.g. for Lisa's Nights, Lisa's Night, LISASNIGHTS) each time a new entry has to be checked against the existing rows.
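As a rough sketch of the on-the-fly comparison, assuming that spaces and apostrophes are the only non-alpha characters that occur (a real title column would need more stripping than this), both sides can be normalized inside the query. The table variable @t and the variable @new are made up for the demo:

```sql
-- Demo data: one existing title (hypothetical table variable).
DECLARE @t TABLE (title varchar(100));
INSERT @t VALUES ('Lisa''s Nights');

-- The candidate title to check.
DECLARE @new varchar(100) = 'LISASNIGHTS';

-- Strip spaces and apostrophes from both sides and compare in upper case.
-- The stripping is recomputed for every existing row on every check.
SELECT title
FROM @t
WHERE UPPER(REPLACE(REPLACE(title, ' ', ''), '''', ''))
    = UPPER(REPLACE(REPLACE(@new,  ' ', ''), '''', ''));
```

Because the expression wraps the column, an ordinary index on title cannot be used for this lookup, which is the cost mentioned above.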

On the other hand, if your table also has a column for the STRIPPEDDOWNNAME, you can leverage SQL Server's constraints. Create a UNIQUE CONSTRAINT (or a UNIQUE INDEX) on the STRIPPEDDOWNNAME column. Then you calculate the stripped-down name only once, when you insert the row into the table. With the UNIQUE CONSTRAINT in place, SQL Server will immediately let you know if an inserted string is a duplicate.
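One possible way to set this up, sketched under some assumptions: the title fits in nvarchar(400), "alpha" means the ASCII letters A-Z only, and the stripped value is maintained as a persisted computed column rather than by the application. The names dbo.StripNonAlpha, dbo.Titles, StrippedTitle and UX_Titles_StrippedTitle are invented for illustration:

```sql
-- Hypothetical helper: removes every character outside A-Z/a-z, then upper-cases.
-- SCHEMABINDING is needed so the function can back a persisted, indexed column.
CREATE FUNCTION dbo.StripNonAlpha (@s nvarchar(400))
RETURNS nvarchar(400)
WITH SCHEMABINDING
AS
BEGIN
    -- A binary collation makes the range [A-Za-z] match ASCII letters only.
    DECLARE @pos int = PATINDEX('%[^A-Za-z]%', @s COLLATE Latin1_General_BIN);
    WHILE @pos > 0
    BEGIN
        SET @s = STUFF(@s, @pos, 1, '');   -- delete the offending character
        SET @pos = PATINDEX('%[^A-Za-z]%', @s COLLATE Latin1_General_BIN);
    END;
    RETURN UPPER(@s);
END;
GO

CREATE TABLE dbo.Titles
(
    Id            int IDENTITY(1,1) CONSTRAINT PK_Titles PRIMARY KEY,
    Title         nvarchar(400) NOT NULL,
    StrippedTitle AS dbo.StripNonAlpha(Title) PERSISTED
);
GO

-- Rejects any row whose stripped title already exists.
CREATE UNIQUE INDEX UX_Titles_StrippedTitle ON dbo.Titles (StrippedTitle);
GO
```

With this in place, inserting 'Lisa''s Nights' and then 'LISASNIGHTS' should fail on the second insert, while 'Lisa''s Night' is accepted. Note that titles containing no letters at all would all strip to an empty string and collide with each other.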

You can then capture the error and respond as appropriate to your application.
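A minimal sketch of catching the error in T-SQL, assuming a hypothetical table dbo.Titles with a unique index or constraint on its stripped-down column. Error 2601 is raised for a duplicate key in a unique index and 2627 for a violated unique constraint; THROW is not available in SQL Server 2008 R2, so other errors are re-raised with RAISERROR:

```sql
BEGIN TRY
    -- dbo.Titles is an assumed table name, not part of the original question.
    INSERT INTO dbo.Titles (Title) VALUES (N'LISASNIGHTS');
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() IN (2601, 2627)
        -- Duplicate detected by the unique index/constraint.
        PRINT 'Duplicate title rejected.';
    ELSE
    BEGIN
        -- Anything else is unexpected: pass it on to the caller.
        DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
        RAISERROR('%s', 16, 1, @msg);
    END
END CATCH;
```

An application can equally let the error propagate and inspect the same error numbers on the client side.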

I encourage you to let SQL Server manage the duplicate check whenever possible.

Related Solutions

Sql-server – Finding rows where a column contains [ and ] characters

One option is to use ESCAPE:

DECLARE @x TABLE(s VARCHAR(32)); INSERT @x SELECT 'Hi [Sydney]!'

SELECT s FROM @x WHERE s LIKE '%\[Sydney\]%' ESCAPE '\';

Of course, you'll want to choose a character that is unlikely to appear in the data naturally, or this could mess up other searches.

Another option is to escape only the opening bracket by enclosing it in its own character class, [[]. The closing ] is then treated as a literal character, because the bracket that would have opened a wildcard sequence has already been consumed:

DECLARE @x TABLE(s VARCHAR(32)); INSERT @x SELECT 'Hi [Sydney]!'

SELECT s FROM @x WHERE s LIKE '%[[]Sydney]%';

One more option is to not use patterns that support wildcards or wildcard sequences. For example, CHARINDEX:

DECLARE @x TABLE(s VARCHAR(32)); INSERT @x SELECT 'Hi [Sydney]!'

SELECT s FROM @x WHERE CHARINDEX('[Sydney]', s) > 0;

SQL Server – varchar Storage and Comparisons in SQL Server 2008

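Add a persisted computed column that contains a CHECKSUM of the 5 columns, and use that to perform the comparisons.

The CHECKSUM value is always the same for a given combination of values and is stored as an INT, which makes it a much cheaper target for comparisons in a WHERE clause. It is not guaranteed to be unique, though: two different combinations can occasionally produce the same checksum, so a match on the checksum alone can be a false positive.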

USE tempdb; /* create this in tempdb since it is just a demo */

CREATE TABLE dbo.t1
(
    Id       bigint constraint PK_t1 primary key clustered identity(1,1)
    , Sequence int
    , Parent   int not null constraint df_T1_Parent DEFAULT ((0))
    , Data1    varchar(20)
    , Data2    varchar(20)
    , Data3    varchar(20)
    , Data4    varchar(20)
    , Data5    varchar(20)
    , CK AS CHECKSUM(Data1, Data2, Data3, Data4, Data5) PERSISTED
);

GO

INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
VALUES (1,1,'test','test2','test3','test4','test5');

SELECT *
FROM dbo.t1;
GO


/* this row will NOT get inserted since it already exists in dbo.t1 */
INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
SELECT 2, 3, 'test', 'test2', 'test3', 'test4', 'test5'
WHERE Checksum('test','test2','test3','test4','test5') NOT IN (SELECT CK FROM t1);

/* still only shows the original row, since the checksum for the row already
exists in dbo.t1 */
SELECT *
FROM dbo.t1;

To support a large number of rows, you'd want to create a non-unique index on the CK column.
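A sketch of what that could look like against the dbo.t1 demo table above. The index name IX_t1_CK is made up, and the extra column comparisons are an addition of this sketch: they guard against two different rows sharing one checksum, and they assume the Data columns are not NULL:

```sql
-- Non-unique index so lookups on the checksum can seek instead of scan.
CREATE NONCLUSTERED INDEX IX_t1_CK ON dbo.t1 (CK);

-- Insert only if no existing row has the same checksum AND the same values.
INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
SELECT 2, 3, 'test', 'test2', 'test3', 'test4', 'test5'
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.t1 AS t
    WHERE t.CK = CHECKSUM('test', 'test2', 'test3', 'test4', 'test5')
      AND t.Data1 = 'test'
      AND t.Data2 = 'test2'
      AND t.Data3 = 'test3'
      AND t.Data4 = 'test4'
      AND t.Data5 = 'test5'
);
```

The checksum predicate narrows the search to a handful of candidate rows; the column comparisons then confirm that a candidate really is the same data.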

By the way, you neglected to mention the number of rows you are expecting in this table; that information would be instrumental in making great recommendations.

In-row data is limited to a maximum of 8060 bytes, which is the size of a single page of data, less the required overhead for each page. Any single row larger than that will result in some off-page storage of row data. I'm certain other contributors to http://dba.stackexchange.com can give you a much more concise definition of the engine internals regarding storage of large rows. How big is your largest row, presently?

If items in Data1, Data2, Data3... have the same values occurring in a different order, the checksum will be different, so you may want to take that into consideration.

Following a brief discussion with the fantastic Mark Storey-Smith on The Heap, I'd like to offer a similar, although potentially better, choice for calculating a hash on the fields in question. You could alternatively use the HASHBYTES() function in the calculated column. HASHBYTES() has some gotchas: it accepts a single value, so you have to concatenate your fields together, and you need some type of delimiter between the field values so that different combinations don't concatenate to the same string. For more information about HASHBYTES(), Mark recommended this site. Clearly, MSDN also has some great info at http://msdn.microsoft.com/en-us/library/ms174415.aspx
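As an illustration only, a HASHBYTES() version of the computed column might look like the following. The table name dbo.t1_hashed, the column name HK, the '|' delimiter and the choice of SHA1 (available in SQL Server 2008) are all assumptions of this sketch:

```sql
CREATE TABLE dbo.t1_hashed
(
    Id      bigint IDENTITY(1,1) CONSTRAINT PK_t1_hashed PRIMARY KEY CLUSTERED
    , Data1 varchar(20)
    , Data2 varchar(20)
    , Data3 varchar(20)
    , Data4 varchar(20)
    , Data5 varchar(20)
    -- Concatenate with a delimiter so ('ab','c') and ('a','bc') hash differently.
    -- ISNULL keeps one NULL column from turning the whole input into NULL;
    -- as a side effect, NULL and '' produce the same hash.
    , HK AS CAST(HASHBYTES('SHA1',
            ISNULL(Data1, '') + '|' + ISNULL(Data2, '') + '|' +
            ISNULL(Data3, '') + '|' + ISNULL(Data4, '') + '|' +
            ISNULL(Data5, '')) AS binary(20)) PERSISTED
);
```

The delimiter should be a character that cannot appear in the data; if '|' can occur in the Data columns, pick something else.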
