Sql-server – Performance refactoring: Trying to avoid the table scan

index-tuningperformancequery-performancesql serversql-server-2008

I have a proc containing a query joining a few tables and I am having some performance issues.

The main table (which is a huge table) has a PK and a few NC indexes.

CREATE TABLE [dbo].[TableA]
(
    [TableAID] [bigint] NOT NULL,
    [UserID] [int] NOT NULL,
    [IP1] [tinyint] NOT NULL,
    [IP2] [tinyint] NOT NULL,
    [IP3] [tinyint] NOT NULL,
    [IP4] [tinyint] NOT NULL
    CONSTRAINT [PK_TableA] 
        PRIMARY KEY CLUSTERED  ([TableAID] ASC)
) ON [PRIMARY]


CREATE NONCLUSTERED INDEX [idx_1] ON [dbo].[TableA] 
(
    [UserID] ASC
)


CREATE NONCLUSTERED INDEX [idx_2] ON [dbo].[TableA] 
(
    [IP1] ASC,
    [IP2] ASC,
    [IP3] ASC,
    [IP4] ASC
)

This is the query with the poor perf:

SELECT DISTINCT a.UserID, a.IP1, a.IP2, a.IP3, a.IP4
    FROM [dbo].[TableA] a WITH (NOLOCK) 
    JOIN [dbo].[TableB] b WITH (NOLOCK) ON b.UserID = a.UserID 
    JOIN [dbo].[Tablec] c WITH (NOLOCK) ON b.CountryID = c.CountryID
    JOIN ( 
        SELECT 
            IP1, IP2, IP3, IP4          
            from 
            @IPs 
            ) as ip ON 
                ((ip.IP1 is NULL) OR (ip.IP1=a.IP1)) AND 
                ((ip.IP2 is NULL) OR (ip.IP2=a.IP2)) AND
                ((ip.IP3 is NULL) OR (ip.IP3=a.IP3)) AND
                ((ip.IP4 is NULL) OR (ip.IP4=a.IP4))

The definition of @IPs table:

DECLARE @IPs TABLE (
    IP1 int,
    IP2 int,
    IP3 int,
    IP4 int
    )


INSERT INTO @IPs(IP1,IP2,IP3,IP4)
SELECT      T.v.value('(IP1/node())[1]', 'int'),
            T.v.value('(IP2/node())[1]', 'int'),
            T.v.value('(IP3/node())[1]', 'int'),
            T.v.value('(IP4/node())[1]', 'int')
FROM @IPAddresses.nodes('//IPAddresses/IPAddress') T(v)

@IPAddresses is xml. Just found out that xml could send more IPs, so that means more than one row in the IPs table.

The issue is the number of reads on TableA. Even if I have a NC index for the IP columns, that join condition is forcing a table scan…

How can I improve the performance? How can I refactor this table/query?

I am still thinking if there is a simpler and better way to rewrite this code:

SELECT 
        IP1, IP2, IP3, IP4          
        from 
        @IPs 
        ) as ip ON 
            ((ip.IP1 is NULL) OR (ip.IP1=a.IP1)) AND 
            ((ip.IP2 is NULL) OR (ip.IP2=a.IP2)) AND
            ((ip.IP3 is NULL) OR (ip.IP3=a.IP3)) AND
            ((ip.IP4 is NULL) OR (ip.IP4=a.IP4))

…in case when we have more IPs.

Best Answer

There are a few things that make this challenging. If you aren't careful the NULL checks can prevent index seeks. Also, when the columns are NULL you obviously can't search against them. So if IP1 is NULL then the four column index idx_2 won't be very useful. It doesn't really seem to be possible to define one index that will be selective for any combination of NULL variables. Also, SQL Server cannot continue an index seek after it seeks on an inequality predicate:

In the same way, if we have an index on two columns, we can only use the index to satisfy a predicate on the second column if we have an equality predicate on the first column.

That means that tricks that use the bounds of the TINYINT data type aren't likely to be effective, such as the following:

a.IP1 >= NULLIF(ip.IP1, 0) AND a.IP1 <= NULLIF(ip.IP1, 255)

In addition to that, the strategy that I'm using seems to work much better with the new cardinality estimator introduced in SQL Server 2014 and your question is tagged with SQL Server 2008.

I strongly recommend splitting up the table variable into rows and processing one row at a time. This is mostly to deal with the NULL values and because your comments hinted that most of the time you only get one row. As long as performance is good enough for one row and you don't have too many rows it should be fine. If that isn't acceptable perhaps you can check if any of the rows in the temp table (don't use a table variable) are NULL and branch your code in that case.

With all of that said, it appears to be possible to get very good performance without a clustered index scan as long as at most two of the IP address pieces aren't NULL. When three of them are NULL you're getting back most of the table and at that point it probably makes sense to do a clustered index scan.

Below is the data that I mocked up to test various solutions. I generated 100 million IP addresses by randomly picking an integer between 0 and 255 for each piece. I know that IP address distribution in real life isn't quite that random but didn't have a way to generate better data.

CREATE TABLE [dbo].[TableA](
    [TableAID] [bigint] NOT NULL,
    [UserID] [int] NOT NULL,
    [IP1] [tinyint] NOT NULL,
    [IP2] [tinyint] NOT NULL,
    [IP3] [tinyint] NOT NULL,
    [IP4] [tinyint] NOT NULL
CONSTRAINT [PK_TableA] PRIMARY KEY CLUSTERED 
( [TableAID] ASC )
);

-- insert 100 million random IP addresses
INSERT INTO [dbo].[TableA] WITH (TABLOCK)
SELECT TOP (100000000)
      ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    , ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 10000
    , ABS(BINARY_CHECKSUM(NEWID()) % 256)
    , ABS(BINARY_CHECKSUM(NEWID()) % 256)
    , ABS(BINARY_CHECKSUM(NEWID()) % 256)
    , ABS(BINARY_CHECKSUM(NEWID()) % 256)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3;

CREATE TABLE [dbo].[TableB] (
    [UserID] [int] NOT NULL,
    FILLER VARCHAR(100),
    PRIMARY KEY (UserId)
);

-- insert 10k users
INSERT INTO [dbo].[TableB] WITH (TABLOCK)
SELECT TOP (10000) -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, REPLICATE('Z', 100)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;

Notice how I didn't create either of the nonclustered indexes that you already have. Instead I'll create one index for each IP piece:

CREATE NONCLUSTERED INDEX [idx_IP1] ON [dbo].[TableA] ([IP1] ASC);
CREATE NONCLUSTERED INDEX [idx_IP2] ON [dbo].[TableA] ([IP2] ASC);
CREATE NONCLUSTERED INDEX [idx_IP3] ON [dbo].[TableA] ([IP3] ASC);
CREATE NONCLUSTERED INDEX [idx_IP4] ON [dbo].[TableA] ([IP4] ASC);

The space required for the indexes is not insignificant. Here is the comparison:

╔═══════════╦═════════════╗
║ IndexName ║ IndexSizeKB ║
╠═══════════╬═════════════╣
║ idx_1     ║     1786488 ║
║ idx_2     ║     1786480 ║
║ idx_IP1   ║     1487616 ║
║ idx_IP2   ║     1487616 ║
║ idx_IP3   ║     1487632 ║
║ idx_IP4   ║     1487608 ║
║ PK_TableA ║     2482056 ║
╚═══════════╩═════════════╝

If necessary you can weigh the pros and cons of using row or page compression to reduce the size of the indexes. However, if you don't know which IP pieces will be NULL and you need to avoid a clustered index scan then I don't see a better alternative to the four indexes. The strategy that I'm going to employ is called an index join. The nonclustered index includes the clustered key TableAID which makes it possible to join the indexes together. Each index should have a selectivity of around 0.4% and it should be relatively cheap to find those rows with a nonclustered index seek. Joining together all of the indexes should greatly reduce the result set and at that point you can do clustered index seeks against the table to get the other column values that you need such as UserID.

Here's the query:

DECLARE @ip1 TINYINT = ?;
DECLARE @ip2 TINYINT = ?;
DECLARE @ip3 TINYINT = ?;
DECLARE @ip4 TINYINT = ?;

SELECT DISTINCT a.UserID, a.IP1, a.IP2, a.IP3, a.IP4
FROM [dbo].[TableA] a 
JOIN [dbo].[TableB] b ON b.UserID = a.UserID 
WHERE
((@ip1 is NULL) OR (@ip1=a.IP1)) AND 
((@ip2 is NULL) OR (@ip2=a.IP2)) AND
((@ip3 is NULL) OR (@ip3=a.IP3)) AND
((@ip4 is NULL) OR (@ip4=a.IP4)) 
OPTION (RECOMPILE, QUERYTRACEON 9481);

With the RECOMPILE hint I'm taking advantage of the parameter embedding optimization. This optimization is only available with a certain service pack (SP4?) so make sure that you're patched up. The query optimizer is able to split up the single table access on TableA into an index join if it sees fit. Note that the estimated plan is very likely to be misleading here. You want to pay attention to the actual plan.

The QUERYTRACEON 9481 hint should not be included in your version of the query. I'm using it to force SQL Server to use the legacy CE which is only necessary because I'm testing against SQL Server 2016.

Let's run a few tests. With the following parameters:

DECLARE @ip1 TINYINT = 1;
DECLARE @ip2 TINYINT = 102;
DECLARE @ip3 TINYINT = 234;
DECLARE @ip4 TINYINT = 172;

I get merge join and loop joins.

With the following parameters:

DECLARE @ip1 TINYINT = NULL;
DECLARE @ip2 TINYINT = 102;
DECLARE @ip3 TINYINT = 234;
DECLARE @ip4 TINYINT = 172;

I get a very similar plan, except that the index on IP1 is not used in the query. The query still finishes in around 125 ms.

With the following parameters:

DECLARE @ip1 TINYINT = 88;
DECLARE @ip2 TINYINT = NULL;
DECLARE @ip3 TINYINT = NULL;
DECLARE @ip4 TINYINT = NULL;

The query finishes in around 5 seconds. I get a hash join along with a clustered index scan. This isn't necessarily a bad thing but can be avoided with some effort (see the following paragraph):

In case there is a need to effectively force index joins (perhaps SQL Server 2008 is missing an optimization that I'm taking advantage of), it can be done, but it is much more complicated.

Best Answer

Related Solutions

Sql-server – Alternative query to this (avoid DISTINCT)

Mysql – Identical query, tables, but different EXPLAIN and performance

Related Question