If you need columns in the output that aren't covered by the index, the optimizer has to make a choice:
- Perform a table / clustered index scan (therefore all columns are there)
- Perform a seek, then perform lookups to retrieve the columns not covered
Which way it will choose depends on a variety of things, including how narrow the index is, how many rows match the predicate, etc. You can force a seek with the FORCESEEK hint, but I suspect it will end up performing the same as or worse than the scan SQL Server has chosen in your case.
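For illustration, this is roughly what the hint looks like (the index name and column here are assumptions, not from your schema):

```sql
-- Hypothetical sketch: force a seek on an assumed index IX_Employee_Status.
SELECT [IdNumber], [Code], [Status]
FROM dbo.Employee WITH (FORCESEEK (IX_Employee_Status ([Status])))
WHERE [Status] = 1;
```

If the optimizer can't honor the hint (for example, no usable index exists for the predicate), the query fails with an error rather than silently falling back to a scan.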
Some options:
- Change the app to run a proper query. I listed this first for a reason.
- Create a view that selects only the columns you need:
CREATE VIEW dbo.myview
WITH SCHEMABINDING
AS
SELECT col1, col2, col3 FROM dbo.tablename;
Then you can change the app to SELECT * from this view. Or you can get even more creative: rename the original table, and rename this view to what the table used to be called. Breaking change, obviously; proceed with caution.
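A minimal sketch of that rename trick, assuming the original table is dbo.tablename (the replacement name tablename_base is invented here):

```sql
-- Rename the real table out of the way, then put a view in its place.
EXEC sys.sp_rename 'dbo.tablename', 'tablename_base';
GO
-- The view takes the table's old name, so the app's SELECT * now
-- only sees the columns listed here.
CREATE VIEW dbo.tablename
WITH SCHEMABINDING
AS
SELECT col1, col2, col3 FROM dbo.tablename_base;
GO
```

Note the ordering matters: rename first, then create the schema-bound view, since SCHEMABINDING would block renaming a table the view already references.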
- Add all of the other columns to the key or INCLUDE list for the index. If these are hard-coded values and always the ones used, you may consider a filtered index.
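As a sketch of those two index shapes (index names and column choices here are assumptions; pick the columns your query actually outputs and filters on):

```sql
-- Covering index: seek on [Status], extra output columns carried in the leaf
-- via INCLUDE so no lookup is needed.
CREATE NONCLUSTERED INDEX IX_Employee_Status_Covering
    ON dbo.Employee ([Status])
    INCLUDE ([FirstName], [LastName], [HireDate]);

-- Filtered index: only useful if the app always applies this exact predicate.
CREATE NONCLUSTERED INDEX IX_Employee_HasADUser
    ON dbo.Employee ([Status])
    INCLUDE ([FirstName], [LastName])
    WHERE [ActiveDirectoryUser] <> '';
</imports>
```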
I don't think the scan is caused by a search for an empty string (and while you could add a filtered index for that case, it will only help very specific variations of the query). You are more likely a victim of parameter sniffing and a single plan not optimized for all of the various combinations of parameters (and parameter values) that you will be providing to this query.
I call this the "kitchen sink" procedure, because you are expecting one query to provide all the things, including the kitchen sink.
I have videos about my solution to this here and here as well as a blog post about it, but essentially, the best experience I have for such queries is to:
- Build the statement dynamically - this will allow you to leave out clauses mentioning columns for which no parameters were supplied, and ensures that you will have a plan that is optimized precisely for the actual parameters that were passed with values.
- Use OPTION (RECOMPILE) - this prevents specific parameter values from forcing the wrong type of plan, especially helpful when you have data skew, bad statistics, or when the first execution of a statement uses an atypical value that will lead to a different plan than later and more frequent executions.
- Use the server option optimize for ad hoc workloads - this prevents query variations that are only used once from polluting your plan cache.
Enable optimize for ad hoc workloads:
EXEC sys.sp_configure 'show advanced options', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
EXEC sys.sp_configure 'optimize for ad hoc workloads', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
EXEC sys.sp_configure 'show advanced options', 0;
GO
RECONFIGURE WITH OVERRIDE;
Change your procedure:
ALTER PROCEDURE dbo.Whatever
@Status INT = NULL,
@IsUserGotAnActiveDirectoryUser BIT = NULL
AS
BEGIN
SET NOCOUNT ON;
DECLARE @sql NVARCHAR(MAX) = N'SELECT [IdNumber], [Code], [Status],
[Sex], [FirstName], [LastName], [Profession],
[BirthDate], [HireDate], [ActiveDirectoryUser]
FROM dbo.Employee -- please, ALWAYS schema prefix
WHERE 1 = 1';
IF @Status IS NOT NULL
SET @sql += N' AND ([Status] = @Status)';
IF @IsUserGotAnActiveDirectoryUser = 1
SET @sql += N' AND ActiveDirectoryUser <> ''''';
IF @IsUserGotAnActiveDirectoryUser = 0
SET @sql += N' AND ActiveDirectoryUser = ''''';
SET @sql += N' OPTION (RECOMPILE);';
EXEC sys.sp_executesql @sql, N'@Status INT', @Status;
END
GO
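Because the statement text differs per parameter combination, each of these calls gets its own plan, optimized for exactly the clauses it contains:

```sql
-- Exercise the different query shapes the dynamic SQL can produce.
EXEC dbo.Whatever @Status = 1;                           -- status filter only
EXEC dbo.Whatever @IsUserGotAnActiveDirectoryUser = 1;   -- AD-user filter only
EXEC dbo.Whatever @Status = 1,
                  @IsUserGotAnActiveDirectoryUser = 0;   -- both filters
EXEC dbo.Whatever;                                       -- no filters: full result
```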
Once you have a workload based on that set of queries that you can monitor, you can analyze the executions and see which ones would most benefit from additional or different indexes - you can do this from a variety of angles, from simple "which combinations of parameters are provided most often?" to "which individual queries have the longest runtimes?" We can't answer those questions based just on your code; we can only suggest that any index will only be helpful for a subset of all of the possible parameter combinations you're attempting to support. For example, if @Status is NULL, then no seek against that non-clustered index is possible. So for those cases where users don't care about status, you're going to get a scan, unless you have an index that caters to the other clauses (but such an index won't be useful either, given your current query logic - either empty string or not empty string is not exactly selective).
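One hedged way to start that analysis is the plan cache DMVs; this sketch surfaces the most expensive cached statements touching the table (the LIKE filter on dbo.Employee is an assumption about your naming):

```sql
-- Top cached statements against the table by total elapsed time.
SELECT TOP (20)
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%dbo.Employee%'
ORDER BY qs.total_elapsed_time DESC;
```

Keep in mind OPTION (RECOMPILE) statements may not leave much behind in these DMVs, so Query Store or an Extended Events session may be a better fit on newer versions.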
In this case, depending on the set of possible Status values and how distributed those values are, the OPTION (RECOMPILE) might not be necessary. But if you do have some values that will yield 100 rows and some values that will yield hundreds of thousands, you might want it there (even at the CPU cost, which should be marginal given the complexity of this query), so that you can get seeks in as many cases as possible. If the range of values is finite enough, you could even do something tricky with the dynamic SQL, where you say "I have this very selective value for @Status, so when that specific value is passed, make this slight alteration to the query text so that this is considered a different query and optimized for that param value."
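A sketch of that trick inside the procedure's branching (the value 5 standing in for a known, highly selective status is purely an assumption):

```sql
-- Hypothetical: embed a known selective value as a literal so it becomes
-- a distinct query text with its own plan.
IF @Status = 5          -- assumed rare/selective status value
    SET @sql += N' AND [Status] = 5';
ELSE IF @Status IS NOT NULL
    SET @sql += N' AND [Status] = @Status';
```

Use this sparingly: every literal variant is another plan in cache, which is exactly what parameterization normally avoids.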
Best Answer
Short version: seek is much better
Less short version: seek is generally much better, but a great many seeks (caused by bad query design with nasty correlated sub-queries for instance, or because you are making many queries in a cursor operation or other loop) can be worse than a scan, especially if your query may end up returning data from most of the rows in the affected table.
It helps to cover the whole family of data-finding operations to fully understand the performance implications.
Table Scans: With no indexes at all that are relevant to your query the planner is forced to use a table scan meaning that every row is looked at. This can result in every page relating to the table's data being read from disk which is often the worst case. Note that for some queries it will use a table scan even when a useful index is present - this is usually because the data in the table is so small that it is more hassle to traverse the indexes (if this is the case you would expect the plan to change as the data grows, assuming the selectivity measure of the index is good).
Index Scans with Row Lookups: When no index that can be directly used for a seek is found, but an index containing the right columns is present, an index scan may be used. For instance, if you have a large table with 20 columns, with an index on col1, col2, col3, and you issue
SELECT col4 FROM exampletable WHERE col2=616
then scanning the index to query col2 is better than scanning the whole table. Once matching rows are found, the data pages need to be read to pick up col4 for output (or further joining), which is what the "bookmark lookup" stage is when you see it in query plans.
Index Scans without Row Lookups: If the above example was
SELECT col1, col2, col3 FROM exampletable WHERE col2=616
then the extra effort to read data pages is not needed: once index rows matching col2=616 are found, all the requested data is known. This is why you sometimes see columns that will never be searched on, but are likely to be requested for output, added to the end of indexes - it can save row lookups. When adding columns to an index for this reason and this reason only, add them with the INCLUDE clause to tell the engine that it doesn't need to optimise index layout for querying based on these columns (this can speed up updates made to those columns). Index scans can result from queries with no filtering clauses too:
SELECT col2 FROM exampletable
will scan this example index instead of the table pages.
Index Seeks (with or without row lookups): In a seek, not all of the index is considered. For the query
SELECT * FROM exampletable WHERE c1 BETWEEN 1234 AND 4567
the query engine can find the first row that will match by doing a tree-based search on the index on c1, then it can navigate the index in order until it gets to the end of the range (this is the same for a query for c1=1234, as there could be many rows matching the condition even for an = operation). This means that only the relevant index pages (plus a few needed for the initial search) need to be read, instead of every page in the index (or table).
Clustered Indexes: With a clustered index, the table data is stored in the leaf nodes of that index instead of in a separate heap structure. This means that there will never need to be any extra row lookups after finding rows using that index, no matter what columns are needed [unless you have off-page data like TEXT columns or VARCHAR(MAX) columns containing long data]. You can only have one clustered index for this reason[1]: the clustered index is your table, instead of a separate heap structure, so if you use one[2], choose where you put it carefully in order to get maximum gain.
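The examples above can be sketched as DDL (table and index names here are illustrative, matching the exampletable used throughout):

```sql
-- Narrow, ever-increasing clustering key: a common default choice.
CREATE TABLE dbo.exampletable
(
    id   INT IDENTITY(1,1) NOT NULL,
    col1 INT NOT NULL,
    col2 INT NOT NULL,
    col3 INT NOT NULL,
    col4 VARCHAR(50) NULL,
    CONSTRAINT PK_exampletable PRIMARY KEY CLUSTERED (id)
);

-- Supports a seek on col2; col1/col3 ride along in the leaf level,
-- so SELECT col1, col2, col3 ... WHERE col2 = 616 needs no lookup.
CREATE NONCLUSTERED INDEX IX_exampletable_col2
    ON dbo.exampletable (col2)
    INCLUDE (col1, col3);
```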
Also note that the clustered index key becomes the "clustering key" for the table and is included in every non-clustered index on the table, so a wide clustered index is generally not a good idea.
[1] Actually, you can effectively have multiple clustered indexes by defining non-clustered indexes that cover or include every column on the table, but this is likely to be wasteful of space and has a write performance impact, so if you consider doing it, make sure you really need to.
[2] When I say "if you use a clustered index", do note that it is generally recommended that you do have one on each table. There are exceptions as with all rules-of-thumb, tables that see little other than bulk inserts and unordered reads (staging tables for ETL processes perhaps) being the most common counter example.
Additional point: Incomplete Scans:
It is important to remember that depending on the rest of the query a table/index scan may not actually scan the whole table - if the logic allows the query plan may be able to cause it to abort early. The simplest example of this is
SELECT TOP(1) * FROM HugeTable
- if you look at the query plan for that you'll see that only one row was returned from the scan, and if you watch the IO statistics (SET STATISTICS IO ON; SELECT TOP(1) * FROM HugeTable) you'll see that it only read a very small number of pages (perhaps just one).
The same can happen if the predicate of a WHERE or JOIN ... ON clause can be run concurrently with the scan that is the source of its data. The query planner/runner can sometimes be very clever about pushing predicates back towards the data sources to allow early termination of scans in this way (and sometimes you can be clever in rearranging queries to help it do so!). While the data flows right-to-left as per the arrows in the standard query plan display, the logic runs left-to-right, and each step (right-to-left) is not necessarily run to completion before the next can start. In the simple example above, if you look at each block in the query plan as an agent: the SELECT agent asks the TOP agent for a row, which in turn asks the TABLE SCAN agent for one; then the SELECT agent asks for another, but the TOP agent knows there is no need and doesn't bother to even ask the table reader; the SELECT agent gets a "no more is relevant" response and knows all the work is done. Many operations block this sort of optimisation of course, so in more complicated examples a table/index scan often really does read every row, but be careful not to jump to the conclusion that any scan must be an expensive operation.