Sql-server – Speeding up “duplicate” checks

index-tuningquerysql serversql-server-2008-r2

We have an events table that is reasonably large (~35 million rows)

The original table is like this,

CREATE TABLE [dbo].[Events](
[EventID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[EventTypeID] int,
[SourceID] int,
[EventTitle] [varchar](max) NOT NULL,
[EventContent] [varchar](max) NULL,
[EventDate] [datetime] NOT NULL,
CONSTRAINT [PK_Events] PRIMARY KEY CLUSTERED 
([EventID] ASC)
WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

and it has an index that looks like

CREATE NONCLUSTERED INDEX [ix_EventDate] ON [dbo].[Events] 
(
    [EventTypeID] ASC,
    [SourceID] ASC,
    [EventDate] ASC
)
INCLUDE (EventTitle, EventContent) 

When adding new events, there is a business requirement to check to see if there's a "duplicate" event filed within the last 7 days (duplicate defined as having the same title and content). The above index is used in the check query, but it is still very slow and only rarely finds any duplicates.

The query looks like

SELECT *
FROM #NewEvents
WHERE EXISTS (SELECT *
    FROM [dbo].[Events] e
    WHERE 
    DATEDIFF(dd, #NewEvents.EventXML.value('(/Event/EventDate)[1]', 'DATETIME'), e.EventDate) BETWEEN -7 AND 0
    AND e.EventTitle = #NewEvents.EventXML.value('(/Event/EventTitle)[1]', 'VARCHAR(MAX)')
    AND e.EventContent = #NewEvents.EventXML.value('(/Event/EventContent)[1]', 'VARCHAR(MAX)')
)

I had thought maybe a couple of computed columns with hashes on the title and content might speed the check (by making the indexes much more compact). Not wanting to alter the underlying table on a theory (possibly mucking with the existing replication setup), I tried to create an indexed view with those computed columns.

I created an index on the view like the one above INCLUDING the computed columns, but for whatever reason I can't get the query to use that index on the view; it always defaults back to the base table index, and to make matters worse doesn't use the INCLUDE values for the computed values; it joins with the PK to fetch the original columns.

The view index looked like this (where TitleHash and ContentHash are HASHBYTES of the full column text)

CREATE NONCLUSTERED INDEX [ix_viewEvents_EventDate] ON [dbo].[viewEvents] 
(
    [EventTypeID] ASC,
    [SourceID] ASC,
    [EventDate] ASC
)
INCLUDE (TitleHash, ContentHash) 

Anyway, I was wondering if anyone had any suggestions to speed up the query given the business constraints.

EDIT: The first thing I tried was to change that DATEDIFF expression on the idea that it might have been blocking the use of the old index, but the plan for the old query showed that it was doing an Index Seek (not scan).

As Josh pointed out below, yeah the old query was doing an Index Seek – but only on the 2 leading columns. The date was an output column and that filtering was done later. But that stays that way no matter if I phrase the date filter as DATEDIFF() or "e.EventDate BETWEEN DATEADD(DAY, -7, #NewEvents.EventXML.value('(/Event/EventDate)1', 'DATETIME')) AND #NewEvents.EventXML.value('(/Event/EventDate)1', 'DATETIME')"

enter image description here

EDIT 2: Using WITH (NOEXPAND) did, indeed, get sql server to use the new view and the new index, though to my disappointment running the new view-based query and the old query in the same batch came back with the Actual Plan saying the view was 99% of the 2 query batch. So the experiment doesn't appear to have been successful. Changing the phrasing of the date filtering here, though, does seem to be reflected in the Index Seek node.

enter image description here

Best Answer

Improving the Original Query

The query as written will use the nonclustered index that you created, but it won't seek into it - the whole index gets scanned.

I reproduced the query and data locally - the DATEDIFF version has a plan that looks like this:

enter image description here

Table 'Events'. Scan count 1, logical reads 28323

 SQL Server Execution Times:
   CPU time = 1891 ms,  elapsed time = 1943 ms.

This plan will scan the Events table once for each row in your #NewEvents temp table. It filters rows at the Nested Loops join operator after they've already been read. This is not ideal.

To get a seek, you could write your query like this:

SELECT *
FROM #NewEvents
WHERE EXISTS (SELECT *
    FROM [dbo].[Events] e
    WHERE 
    e.EventDate >= DATEADD(dd, -7, #NewEvents.EventDate)
    AND e.EventDate <= #NewEvents.EventDate
    AND e.EventTitle = #NewEvents.EventTitle
    AND e.EventContent = #NewEvents.EventContent
);

That plans looks like this:

enter image description here

Table 'Events'. Scan count 1, logical reads 7

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 149 ms.

For each row in the temp table, there is a seek into your nonclustered index to just get the events that are within 7 days of the new event. A filter is applied to qualifying rows to check if the title and content are equal.

The nice thing about this seek is that will, at most, read all the events within a 7 day span. Which should be a fairly small number.

Thoughts about the Indexed View

I created an index on the view like the one above INCLUDING the computed columns, but for whatever reason I can't get the query to use that index on the view

I don't think the view is really necessary, if the change mentioned above provides reasonable performance. But you could try adding a WITH (NOEXPAND) hint after the view name in the query to see if you can force the view to be used.