Sql-server – Understanding about SQL index on DateTime Column

indexsql server

I have about 1 million records. When searching through the date range, it is surprisingly slow. Then I do the google search and someone suggested to do the index on the "DateTime" column. I am not convinced because I always thought index is for O(1) search – which means equality search; and since my query is a date range search, how can it be possible to improve the performance?

But, out of curiosity, I tried anyway and to my surprise, it does improve the performance significantly.

Here is my table

tblOrder
--------
OrderId (Primary Key)
CurrencyId
ProductId
...
...
...
OrderDate DATETIME

I am using linqtosql and the code is:

return (from rec in tblOrders.AsNoTracking()
        where (rec.CurrencyId == CurrencyId) && (rec.OrderDate >= startDate) && (rec.OrderDate < endDate)
        select rec).ToList();

I created a composite index on both CurrencyId and OrderDate. The result is a huge improvement. My question is, how is it possible? Why do indexes work on range search (not equal but >= or <=).

Can someone explain to me how does SQL index work in regards to date range search?

Thank you…

Best Answer

B-tree indexes don't work by comparing only for equality. They're designed (the algorithm is designed) to use ranges.

Figure 1.3 shows an index fragment to illustrate a search for the key “57”. The tree traversal starts at the root node on the left-hand side. Each entry is processed in ascending order until a value is greater than or equal to (>=) the search term (57). In the figure it is the entry 83. The database follows the reference to the corresponding branch node and repeats the procedure until the tree traversal reaches a leaf node.

References

The Search Tree (B-Tree) Makes the Index Fast See especially section 1.3, B-Tree traversal
SQL Server index structure and concepts

Related Solutions

Sql-server – Will an index that is a partial key of the begining of another index perform any better than the longer index

There really isn't even any guarantee that the optimizer will use the index in the first place. The difference of only one column between the two indexes is (for the most part) trivial. But if it did, any performance gain you would see (if any) would be trivial.

The reason for this is in how the B-Tree index is implemented by SQL Server. Both indexes are equally capable of satisfying the predicate (WHERE clause) and therefore locating only the rows that meet the search parameters. The only difference between the two would be how many pages in the index SQL Server would have to read in order to locate the required rows. And how many pages will ultimately be determined by the size of the index rows (Storage size in bytes of (TransactionDate, ClientID) vs. the alternative, with only the size of the State column being the difference between the two.

Just as an example, if a single index row was 60 bytes, SQL Server would only have to read 2 pages in order to locate a row up until there were around 2.5 million rows in the index. Then it would only require 3 pages to be read until the table reached more than 300 million rows.

So, what does that mean? Whether it is looking through an index of 2 or all 3 columns, SQL Server is still only looking through the same number of pages to locate the rows, unless the State column is so large as to create a significant difference in the size of a single row, thus causing the index pages to fill considerably faster. Any noticeable performance difference would only be caused if SQL Server had to read through more levels of the index to satisfy the query (3 levels vs. 4 levels, etc) and a single column of text, especially if State is a 2 char state abbreviation, simply won't be enough to make a significant difference.

If you really want to boost performance of the query, and the table is quite large, you might be better off exploring table partitioning, or if not on Enterprise Edition perhaps a partitioned view, all depending on whether there are certain ranges that are primarily searched and others that are searched much less often.

Sql-server – How to avoid using variables in WHERE clause

Parameter sniffing is your friend almost all of the time and you should write your queries so that it can be used. Parameter sniffing helps building the plan for you using the parameter values available when the query is compiled. The dark side of parameter sniffing is when the values used when compiling the query is not optimal for the queries to come.

The query in a stored procedure is compiled when the stored procedure is executed, not when the query is executed so the values that SQL Server has to deal with here...

CREATE PROCEDURE WeeklyProc(@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate)
  SELECT
    -- Stuff
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
END

is a known value for @endDate and an unknown value for @startDate. That will leave SQL Server to guessing on 30% of the rows returned for the filter on @startDate combined with whatever the statistics tells it for @endDate. If you have a big table with a lot of rows that could give you a scan operation where you would benefit most from a seek.

Your wrapper procedure solution makes sure that SQL Server sees the values when DateRangeProc is compiled so it can use known values for both @endDate and @startDate.

Both your dynamic queries leads to the same thing, the values are known at compile-time.

The one with a default null value is a bit special. The values known to SQL Server at compile-time is a known value for @endDate and null for @startDate. Using a null in a between will give you 0 rows but SQL Server always guess at 1 in those cases. That might be a good thing in this case but if you call the stored procedure with a large date interval where a scan would have been the best choice it may end up doing a bunch of seeks.

I left "Use the DATEADD() function directly" to the end of this answer because it is the one I would use and there is something strange with it as well.

First off, SQL Server does not call the function multiple times when it is used in the where clause. DATEADD is considered runtime constant.

And I would think that DATEADD is evaluated when the query is compiled so that you would get a good estimate on the number of rows returned. But it is not so in this case.
SQL Server estimates based on the value in the parameter regardless of what you do with DATEADD (tested on SQL Server 2012) so in your case the estimate will be the number of rows that is registered on @endDate. Why it does that I don't know but it has to do with the use of the datatype DATE. Shift to DATETIME in the stored procedure and the table and the estimate will be accurate, meaning that DATEADD is considered at compile time for DATETIME not for DATE.

So to summarize this rather lengthy answer I would recommend the wrapper procedure solution. It will always allow SQL Server to use the values provided when compiling the the query without the hassle of using dynamic SQL.

PS:

In comments you got two suggestions.

OPTION (OPTIMIZE FOR UNKNOWN) will give you an estimate of 9% of rows returned and OPTION (RECOMPILE) will make SQL Server see the parameter values since the query is recompiled every time.

Best Answer

Related Solutions

Sql-server – Will an index that is a partial key of the begining of another index perform any better than the longer index

Sql-server – How to avoid using variables in WHERE clause

Related Question