Sql-server – Why isn’t SQL Server using the non-clustered index and doing a clustered index scan

Here are my full tables:

    CREATE TABLE [dbo].[tblCrawlUrls](
    [cl_IdUrl] [int] IDENTITY(1,1) NOT NULL,
    [cl_CrawlNormalizedUrl] [nvarchar](200) NOT NULL,
    [cl_RooSiteId] [smallint] NOT NULL,
    [cl_ExploreDate] [datetime] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_ExploreDate]  DEFAULT (sysutcdatetime()),
    [cl_LastCrawlDate] [datetime] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_LastCrawlDate]  DEFAULT ('2000-08-11 15:18:47.407'),
    [cl_CrawlSource] [nvarchar](max) NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_CrawlSource]  DEFAULT ('null'),
    [cl_CrawlOrgUrl] [nvarchar](200) NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_CrawlOrgUrl]  DEFAULT ('null'),
    [cl_ExploredURL] [nvarchar](200) NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_ExploredURL]  DEFAULT ('null'),
    [cl_Ignored_By_Containing_Word] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_Ignored_By_Containing_Word]  DEFAULT ((0)),
    [cl_CrawlFailedTimes] [int] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_CrawlFailedTimes]  DEFAULT ((0)),
    [cl_TotalCrawlTimes] [int] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_TotalCrawlTimes]  DEFAULT ((0)),
    [cl_UpdatedTimes] [int] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_UpdatedTimes]  DEFAULT ((0)),
    [cl_DuplicateUrl_ByCanonical] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_DuplicateUrl_ByCanonical]  DEFAULT ((0)),
    [cl_PageProcessed] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_PageProcessed]  DEFAULT ((0)),
    [cl_LastProcessDate] [datetime] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_LastProcessDate]  DEFAULT ('2000-08-11 15:18:47.407'),
    [cl_PossibleProductPage] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_PossibleProductPage]  DEFAULT ((0)),
    [cl_CertainlyNotProductPage] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_CertainlyNotProductPage]  DEFAULT ((0)),
    [cl_IsProductPage] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_IsProductPage]  DEFAULT ((0)),
    [cl_Determined_Not_A_Product_Page] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_NotProduct_Page]  DEFAULT ((0)),
    [cl_FreeCargo] [bit] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_FreeCargo]  DEFAULT ((0)),
    [cl_ProductPrice_TL] [int] NOT NULL CONSTRAINT [DF_tblCrawlUrls_cl_ProductPrice]  DEFAULT ((0)),
    [cl_ProductCode] [nvarchar](200) NULL,
    [cl_ProductImageLink] [nvarchar](200) NULL,
    [cl_ProductIdCode] [nvarchar](200) NULL,
    [cl_ProductCategoriesAsText] [nvarchar](200) NULL,
    [cl_ProductDetailedExplanation] [nvarchar](max) NULL,
    [cl_ProductFeatures_Wrapped] [nvarchar](max) NULL,
 CONSTRAINT [PK_tblCrawlUrls] PRIMARY KEY CLUSTERED 
(
    [cl_IdUrl] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]


CREATE TABLE [dbo].[tblRootSites](
    [cl_RootSiteId] [smallint] NOT NULL,
    [cl_SiteRootUrl] [nvarchar](200) NOT NULL,
    [cl_Disabled] [bit] NOT NULL CONSTRAINT [DF_tblRootSites_Disabled]  DEFAULT ((0)),
    [cl_AlexaRank_TR] [int] NOT NULL CONSTRAINT [DF_tblRootSites_cl_AlexaRank]  DEFAULT ((0)),
    [cl_RegisterTime] [datetime] NOT NULL CONSTRAINT [DF_tblRootSites_cl_RegisterTime]  DEFAULT (sysutcdatetime()),
    [cl_PriceDelimeter] [varchar](1) NOT NULL CONSTRAINT [DF_tblRootSites_cl_PriceDelimeter]  DEFAULT ('.'),
    [cl_PriceIgnoreDelimeter] [varchar](1) NOT NULL CONSTRAINT [DF_tblRootSites_cl_PriceIgnoreDelimeter]  DEFAULT (','),
    [cl_CertainProductPageDefiner] [nvarchar](200) NOT NULL CONSTRAINT [DF_tblRootSites_cl_CertainProductPageDefiner]  DEFAULT ('null'),
    [cl_ProductCategoryListing_Priority] [smallint] NOT NULL CONSTRAINT [DF_tblRootSites_cl_ProductCategoryListing_Priority]  DEFAULT ((0)),
    [cl_Ignore_Words_From_Category] [nvarchar](200) NOT NULL CONSTRAINT [DF_tblRootSites_cl_Ignore_From_Category]  DEFAULT ('null'),
    [cl_Ignore_Words_From_Urls] [nvarchar](200) NOT NULL CONSTRAINT [DF_tblRootSites_cl_Ignore_Urls_Parameters]  DEFAULT ('null'),
    [cl_Use_Last_Category_As_A_Product_Code] [bit] NOT NULL CONSTRAINT [DF_tblRootSites_cl_Use_Last_Category_As_A_Product_Code]  DEFAULT ((0)),
    [cl_Use_Custom_Character_Encoding] [int] NOT NULL CONSTRAINT [DF_tblRootSites_cl_Use_Custom_Character_Encoding]  DEFAULT ((0)),
    [cl_IgnorePages_Words] [nvarchar](200) NOT NULL CONSTRAINT [DF_tblRootSites_cl_IgnorePages_Words]  DEFAULT ('null'),
    [cl_IgnoreLastCategory] [bit] NOT NULL CONSTRAINT [DF_tblRootSites_cl_IgnoreLastCategory]  DEFAULT ((0)),
 CONSTRAINT [PK_tblRootSites] PRIMARY KEY CLUSTERED 
(
    [cl_RootSiteId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85) ON [PRIMARY]
) ON [PRIMARY]

Here is my SQL query:

 SELECT TOP 1000 cl_idurl,
                cl_roositeid,
                cl_crawlsource,
                cl_crawlorgurl
FROM   tblcrawlurls
WHERE  cl_pageprocessed = 0
       AND cl_totalcrawltimes > 0
       AND cl_certainlynotproductpage = 0
       AND cl_duplicateurl_bycanonical = 0
       AND cl_roositeid IN (SELECT cl_rootsiteid
                            FROM   tblrootsites
                            WHERE  cl_disabled = 0)
ORDER  BY cl_lastprocessdate ASC

Here is the index which I assumed would be used, but it is not being used.

    CREATE NONCLUSTERED INDEX [TCT-PP-LPD-CNPP-DUBC] ON [dbo].[tblCrawlUrls]
(
    [cl_TotalCrawlTimes] ASC,
    [cl_PageProcessed] ASC,
    [cl_LastProcessDate] ASC,
    [cl_CertainlyNotProductPage] ASC,
    [cl_DuplicateUrl_ByCanonical] ASC
)
INCLUDE (   [cl_IdUrl],
    [cl_RooSiteId],
    [cl_CrawlOrgUrl]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 85)

In my query I use these as restrictions (in the where clause)

cl_pageprocessed
cl_totalcrawltimes
cl_certainlynotproductpage
cl_duplicateurl_bycanonical

I am assuming that using an index that includes these columns to determine possible primary key ids should be faster than total clustered index scan. But SQL Server is not doing that.

So my logic is not correct and that is what I want to learn. Why does SQL Server choose not to use my index? Not using my index means extra unnecessary load on my database and my hard drives.

Here is the execution plan:

enter image description here

Best Answer

Your first column in the index is cl_totalcrawltimes. In the WHERE clause you're using this column to define a lower range (cl_totalcrawltimes > 0).

Depending on the data distribution this most probably will lead to an index scan instead of an index seek. Furthermore, cl_crawlsource is not part of the index (the include section, preferrably), so this would require a clustered index key lookup anyway.

Therefore, SQL Server ignores the index since it seems to be more efficient to perform a clustered index scan.

For best results, make those columns on which you query for an exact match the first columns of your nonclustered index. (Whether the next column should be your inequality column or your order by column depends on what the optimizer likes, if any.)

Best Answer

Related Solutions

Sql-server – Why is Clustered Index on Primary Key compulsory

Sql-server – PAD_INDEX and FILLFACTOR on clustered Identity index

Related Question