Sql-server – Speeding up “duplicate” checks

index-tuningquerysql serversql-server-2008-r2

We have an events table that is reasonably large (~35 million rows)

The original table is like this,

CREATE TABLE [dbo].[Events](
[EventID] [int] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
[EventTypeID] int,
[SourceID] int,
[EventTitle] [varchar](max) NOT NULL,
[EventContent] [varchar](max) NULL,
[EventDate] [datetime] NOT NULL,
CONSTRAINT [PK_Events] PRIMARY KEY CLUSTERED 
([EventID] ASC)
WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

and it has an index that looks like

CREATE NONCLUSTERED INDEX [ix_EventDate] ON [dbo].[Events] 
(
    [EventTypeID] ASC,
    [SourceID] ASC,
    [EventDate] ASC
)
INCLUDE (EventTitle, EventContent)

When adding new events, there is a business requirement to check to see if there's a "duplicate" event filed within the last 7 days (duplicate defined as having the same title and content). The above index is used in the check query, but it is still very slow and only rarely finds any duplicates.

The query looks like

SELECT *
FROM #NewEvents
WHERE EXISTS (SELECT *
    FROM [dbo].[Events] e
    WHERE 
    DATEDIFF(dd, #NewEvents.EventXML.value('(/Event/EventDate)[1]', 'DATETIME'), e.EventDate) BETWEEN -7 AND 0
    AND e.EventTitle = #NewEvents.EventXML.value('(/Event/EventTitle)[1]', 'VARCHAR(MAX)')
    AND e.EventContent = #NewEvents.EventXML.value('(/Event/EventContent)[1]', 'VARCHAR(MAX)')
)

I had thought maybe a couple of computed columns with hashes on the title and content might speed the check (by making the indexes much more compact). Not wanting to alter the underlying table on a theory (possibly mucking with the existing replication setup), I tried to create an indexed view with those computed columns.

I created an index on the view like the one above INCLUDING the computed columns, but for whatever reason I can't get the query to use that index on the view; it always defaults back to the base table index, and to make matters worse doesn't use the INCLUDE values for the computed values; it joins with the PK to fetch the original columns.

The view index looked like this (where TitleHash and ContentHash are HASHBYTES of the full column text)

CREATE NONCLUSTERED INDEX [ix_viewEvents_EventDate] ON [dbo].[viewEvents] 
(
    [EventTypeID] ASC,
    [SourceID] ASC,
    [EventDate] ASC
)
INCLUDE (TitleHash, ContentHash)

Anyway, I was wondering if anyone had any suggestions to speed up the query given the business constraints.

EDIT: The first thing I tried was to change that DATEDIFF expression on the idea that it might have been blocking the use of the old index, but the plan for the old query showed that it was doing an Index Seek (not scan).

As Josh pointed out below, yeah the old query was doing an Index Seek – but only on the 2 leading columns. The date was an output column and that filtering was done later. But that stays that way no matter if I phrase the date filter as DATEDIFF() or "e.EventDate BETWEEN DATEADD(DAY, -7, #NewEvents.EventXML.value('(/Event/EventDate)1', 'DATETIME')) AND #NewEvents.EventXML.value('(/Event/EventDate)1', 'DATETIME')"

EDIT 2: Using WITH (NOEXPAND) did, indeed, get sql server to use the new view and the new index, though to my disappointment running the new view-based query and the old query in the same batch came back with the Actual Plan saying the view was 99% of the 2 query batch. So the experiment doesn't appear to have been successful. Changing the phrasing of the date filtering here, though, does seem to be reflected in the Index Seek node.

Best Answer

Improving the Original Query

The query as written will use the nonclustered index that you created, but it won't seek into it - the whole index gets scanned.

I reproduced the query and data locally - the DATEDIFF version has a plan that looks like this:

Table 'Events'. Scan count 1, logical reads 28323

 SQL Server Execution Times:
   CPU time = 1891 ms,  elapsed time = 1943 ms.

This plan will scan the Events table once for each row in your #NewEvents temp table. It filters rows at the Nested Loops join operator after they've already been read. This is not ideal.

To get a seek, you could write your query like this:

SELECT *
FROM #NewEvents
WHERE EXISTS (SELECT *
    FROM [dbo].[Events] e
    WHERE 
    e.EventDate >= DATEADD(dd, -7, #NewEvents.EventDate)
    AND e.EventDate <= #NewEvents.EventDate
    AND e.EventTitle = #NewEvents.EventTitle
    AND e.EventContent = #NewEvents.EventContent
);

That plans looks like this:

Table 'Events'. Scan count 1, logical reads 7

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 149 ms.

For each row in the temp table, there is a seek into your nonclustered index to just get the events that are within 7 days of the new event. A filter is applied to qualifying rows to check if the title and content are equal.

The nice thing about this seek is that will, at most, read all the events within a 7 day span. Which should be a fairly small number.

Thoughts about the Indexed View

I created an index on the view like the one above INCLUDING the computed columns, but for whatever reason I can't get the query to use that index on the view

I don't think the view is really necessary, if the change mentioned above provides reasonable performance. But you could try adding a WITH (NOEXPAND) hint after the view name in the query to see if you can force the view to be used.

Related Solutions

Sql-server – Query to normalize table/combine row text

This should work, I will clean it up later so its more efficient.

DECLARE @Old TABLE ( 
  id         INT, 
  rank       INT, 
  linenumber INT, 
  sometext   VARCHAR(1000)) 
DECLARE @New TABLE ( 
  id           INT, 
  rank         INT, 
  combinedtext VARCHAR(1000)) 


;WITH combinedresults(ctid, id, rank, linenumber, combinedtext) 
     AS (SELECT 0, 
                id, 
                rank, 
                linenumber, 
                CAST (sometext AS VARCHAR(8000)) 
         FROM   @Old o 
         WHERE  NOT EXISTS (SELECT TOP 1 1 
                            FROM   @Old 
                            WHERE  id = o.id 
                                   AND rank = o.rank 
                                   AND linenumber < o.linenumber) 
         UNION ALL 
         SELECT ctid + 1, 
                o.id, 
                o.rank, 
                o.linenumber, 
                ct.combinedtext + o.sometext 
         FROM   @Old o 
                INNER JOIN combinedresults ct 
                  ON ct.id = o.id 
                     AND ct.rank = o.rank 
         WHERE  o.linenumber > ct.linenumber) 

UPDATE n 
SET    combinedtext = ct.combinedtext 
FROM   @New n 
       INNER JOIN (SELECT n.id, 
                          n.rank, 
                          MAX(o.rank) orank 
                   FROM   @new n 
                          INNER JOIN @Old o 
                            ON n.id = o.id 
                               AND o.rank <= n.rank 
                   GROUP  BY n.id, 
                             n.rank) r 
         ON n.id = r.id 
            AND n.rank = r.rank 
       INNER JOIN (SELECT id, 
                          ct.rank, 
                          MAX(ctid) ctid 
                   FROM   combinedresults ct 
                   GROUP  BY ct.id, 
                             ct.rank) r2 
         ON r2.id = r.id 
            AND r2.rank = r.orank 
       INNER JOIN combinedresults ct 
         ON r.id = ct.id 
            AND ct.rank = r.orank 
            AND ct.ctid = r2.ctid 

SELECT * 
FROM   @New

Sql-server – Best way to index a table with two datetime columns that are equally queried

There is no substitute for benchmarking. To answer the question, I would create and populate several possible tables. Then I would expose these tables to your typical workload, and benchmark.

Including additional columns in you NCIs will slow down modifications and speed up selects. Based on frequency of both, we can choose which approach uses less resources. If a row is on average read twice per year, your conclusions might be different as compared to the case when every row is on average read twice per minute.

Besides, not all queries are born equal. If some queries must complete in certain time no matter what, then you must make sure these requirements are met. Obviously such requirement trump the common good approach described above. Only you can know the actual requirements.