Why Creating Index and Running Query is Faster in SQL Server

query-performancesql serversql-server-2017

I have table with few million rows. It contains logs from external service, so I've decided not to index it (lot of inserts, sparse reads).

When I run query that reads from the table without index, it takes (unsurprisingly) very long time.

However, when I create index and run the query and then drop the index, it is considerably faster (even with creating and dropping index).

Why is it faster to create index ad hoc, instead of letting SQL Server do it's thing? It seems unintuitive (Why wouldn't SQL Server create the index itself?).
Are there any downsides to this approach?

The query in question looks something like this, but I do not think it is necessarily relevant, as I've seen similar behavior elsewhere as well.

    UPDATE Device
    SET Col1 = l.Col1
        ,Col2 = l.Col2
        ,Col3 = l.Col3
    FROM dbo.Device
        OUTER APPLY (
            SELECT MAX(Id) AS [Id]
            FROM dbo.Logs 
            WHERE Logs.Device_FK = Device.Id
            GROUP BY Logs.Device_FK
        ) lastLog
        OUTER APPLY (
            SELECT Col1, Col2, FORMAT(Col3) AS "Col3"
            FROM dbo.Logs
            WHERE Logs.Id = lastLog.Id
        ) l

Best Answer

There's not necessarily any downsides to your approach, it's completely dependent on how frequently you're writing to the table vs updating and reading from it, which is why SQL Server let's the developer choose what and when to index things instead of trying to guess what they want. (It doesn't know your future intentions with a particular table any better than you do.)

I'm sure you understand how indexing works so I won't go into too much detail, but generally speaking, indexing stores the data sorted in a B-Tree data structure. So it's very efficient when having to look up a specific set of data that is covered by that index.

Because of how B-Trees work and the algorithm used to building one with the indexed data, it's generally fastest to have all the data upfront and then index it into a B-Tree. When you already have an index (B-Tree) in place and new data is added or deleted, then that can cause additional "shuffling" events to reorganize the B-Tree which can be less efficient. (The "shuffling" that occurs makes me think of MergeSort algorithm by the way. That helps me visualize the difference of having all the data up front vs adding new data after it's been sorted already.)

Obviously usually it's not the case that Tables generally stay the same size with the exact same records, which is why more times than not it's recommended to create your indexes on the Table upfront and SQL Server does it's best to efficiently update the underlying B-Tree as changes occur in the Table (and it does a great job at it up to a certain point).

In certain cases though, yours maybe being one, if the Table has a very high frequency of INSERTs and DELETEs and low frequency of UPDATEs and SELECTs against it, then creating and dropping the indexes on that Table ad-hoc (just before and after reading from the Table) might make sense.

At the end of the day you'd have to test both ways to see what works best for your environment. The size of the Table doesn't matter so much when it comes to choosing when to index it, as it makes no difference to SQL Server if you want to store a billion records in a single index or ten records, but moreso it depends how frequently you're inserting and deleting from that Table vs updating and selecting from it. (E.g. if you insert 100,000 records every minute into that Table but only SELECT from it once a day then it may be better to create your index ad-hoc.)

Related Solutions

Sql-server – How to create implicit select statements

Not sure that this is simpler than a cursor solution, but here is what I would do:

Before we begin we need a few tables in a SQL Fiddle:

MS SQL Server 2012 Schema Setup

CREATE TABLE dbo.tbl1(c1 INT, c2 INT, c3 INT);
CREATE TABLE dbo.tbl2(c4 INT, c5 INT, c6 INT);
CREATE TABLE dbo.tbl3(c7 INT, c8 INT, c9 INT);

First we need the list of all user tables. For that we can use the sys.tables catalog view. The OBJECT_SCHEMA_NAME() function gets us the schema name, the OBJECT_NAME() function the table name. The QUOTENAME() function quotes the names correctly, in case some of those contain special characters, key words or spaces. (In this example the use of OBJECT_NAME() is not strictly necessary as sys.tables has a name column, but I left it in as you can use this pattern with any catalog view that has an object_id column.)

Query 1:

SELECT 
  QUOTENAME(OBJECT_SCHEMA_NAME(T.object_id))+'.'+
  QUOTENAME(OBJECT_NAME(T.object_id)) AS quoted_table_name,
  T.object_id
FROM sys.tables AS T;

Results:

| QUOTED_TABLE_NAME | OBJECT_ID |
|-------------------|-----------|
|      [dbo].[tbl1] | 245575913 |
|      [dbo].[tbl2] | 261575970 |
|      [dbo].[tbl3] | 277576027 |

The next step is to get the list of column names, again quoted. We can use the sys.columns catalog view for that.

Query 2:

SELECT C.name,C.column_id 
  FROM sys.columns AS C
 WHERE C.object_id = OBJECT_ID('dbo.tbl1');

Results:

| NAME | COLUMN_ID |
|------|-----------|
|   c1 |         1 |
|   c2 |         2 |
|   c3 |         3 |

The next hurdle is to get those columns in a comma separated list. There is no string concatenation aggregate function build in so we have to use a trick:

Query 3:

SELECT STUFF((
  SELECT ','+QUOTENAME(name)
    FROM sys.columns AS C
   WHERE C.object_id = OBJECT_ID('dbo.tbl1')
   ORDER BY C.column_id
     FOR XML PATH(''),TYPE
  ).value('.','NVARCHAR(MAX)'),1,1,'') AS clomun_list;

Results:

|    CLOMUN_LIST |
|----------------|
| [c1],[c2],[c3] |

With that all pieces are in place and we just have to put them all together:

Query 4:

SELECT 'SELECT ' +
       CL.column_list +
       ' FROM ' +
       QUOTENAME(OBJECT_SCHEMA_NAME(T.object_id)) + '.' +
       QUOTENAME(OBJECT_NAME(T.object_id)) +
       ';' AS select_statement
FROM sys.tables AS T
CROSS APPLY (
  SELECT STUFF((
    SELECT ','+QUOTENAME(name)
      FROM sys.columns AS C
     WHERE C.object_id = T.object_id
     ORDER BY C.column_id
       FOR XML PATH(''),TYPE
    ).value('.','NVARCHAR(MAX)'),1,1,'') AS column_list
  )CL;

Results:

|                         SELECT_STATEMENT |
|------------------------------------------|
| SELECT [c1],[c2],[c3] FROM [dbo].[tbl1]; |
| SELECT [c4],[c5],[c6] FROM [dbo].[tbl2]; |
| SELECT [c7],[c8],[c9] FROM [dbo].[tbl3]; |

This example will work in SQL 2005 and later, assuming you are on the latest service pack. There is no clean solution to achieve this with SQL Server 2000. In that case you need to go back to your cursor solution.

SQL Server Optimization – New Query Slower Despite Fewer Logical Reads

The percentage costs on an execution plan are from the optimizer estimates, even when an actual execution plan produced. The actual execution plan does use the exact plan and include both the estimated rows and the actual row counts. Discrepancies between the row counts can be useful to determine how accurate the estimate was.

Somewhere between the subquery and comparing it to the column in derived, the optimizer wasn't able to correctly estimate how many rows would match. It guessed that there would be 18 rows from derived when there were actually over 220,000. An additional clue is the warning message Cardnality Estimate: CONVERT(nvarchar(35),[mssqlsystemresource].[sys].[spt_values].[name],0) on the SELECT node.

If you were to check the query run length with something else, such as STATISTICS TIME, I would expect them to be much closer, and likely the second query running faster.

Here's another plan analysis with a somewhat similar situation. (SQL Server Plan Explorer) (with hat tip to Kendra Little on how to fool the optimizer)

The estimate shows a 93%/7% cost split, but by looking at the actual CPU, time, or IO, the difference is not that extreme. IO is about 75%/25% and CPU is roughly 60%/40%. (I tried to come up with something more even, but wasn't able to.)

Best Answer

Related Solutions

Sql-server – How to create implicit select statements

SQL Server Optimization – New Query Slower Despite Fewer Logical Reads

Related Question