In a world where the query optimizer considered all possible join orders, and contained all possible logical transformations, the syntax we use for our queries would not matter at all.
As it is, the optimizer generally uses heuristics to pick an initial join order and explores a number of join order rewrites from there. It does this to avoid excessive compilation time and resource usage. It doesn't take all that many joins for the number of possible combinations to become unreasonable to explore exhaustively.
To take an extreme example, 42 joins are enough to generate more alternatives than there are atoms in the observable universe. More realistically, even 7 tables are enough to produce 665,280 alternatives. Although this is not a mind-boggling number, it would still take very significant time (and memory) to explore those alternatives completely.
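For reference, the 665,280 figure is the count of bushy join trees for seven tables, where operand order matters. A sketch of the combinatorics (assuming that standard counting):

```latex
% Number of join trees (bushy, operand order significant) for n tables:
T(n) = \frac{(2n-2)!}{(n-1)!}
% For n = 7:
T(7) = \frac{12!}{6!} = \frac{479{,}001{,}600}{720} = 665{,}280
```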
Although the heuristics are largely based on the type of join (inner, outer, cross...) and cardinality estimates, the textual order of the query can also have an impact. Sometimes, this is an optimizer limitation - NOT EXISTS
clauses are not reordered, and outer join reordering is very limited. Even with simple inner joins, the interaction between textual order, initial join order heuristics, and optimizer internals can be difficult to predict with certainty.
To take an example using the AdventureWorks sample database, I can write a query using the a common syntax form as:
SELECT
P.Name,
PS.Name,
SUM(TH.Quantity),
SUM(INV.Quantity)
FROM Production.Product AS P
JOIN Production.ProductSubcategory AS PS
ON PS.ProductSubcategoryID = P.ProductSubcategoryID
JOIN Production.TransactionHistory AS TH
ON TH.ProductID = P.ProductID
JOIN Production.ProductInventory AS INV
ON INV.ProductID = P.ProductID
GROUP BY
P.ProductID,
P.Name,
PS.ProductSubcategoryID,
PS.Name;
Before cost-based optimization, the logical query tree looks like this (note the join order is not the same as the written order):
I can (carefully) rewrite the query to use 'nested' syntax:
SELECT
P.Name,
PS.Name,
SUM(TH.Quantity),
SUM(INV.Quantity)
FROM Production.ProductSubcategory AS PS
JOIN Production.Product AS P
JOIN Production.TransactionHistory AS TH
JOIN Production.ProductInventory AS INV
ON INV.ProductID = TH.ProductID
ON TH.ProductID = P.ProductID
ON P.ProductSubcategoryID = PS.ProductSubcategoryID
GROUP BY
P.ProductID,
P.Name,
PS.ProductSubcategoryID,
PS.Name;
In which case the logical tree at the same point is:
The two different syntaxes produce a different initial join order in this case. After cost-based optimization, both produce the same output plan shape:
There are detailed differences between the two plans, with the 'nested' syntax producing a plan with a somewhat lower estimated cost:
The two inputs took a slightly different path through the optimizer, so it isn't all that surprising there are slight differences.
In general, using different syntax will sometimes (definitely not always!) produce different plan results. There is no broad correlation between one syntax and better plans. Most people write and maintain queries using something like the non-nested join syntax, so it often makes practical sense to use that.
To summarize, my advice is to write queries using whichever syntax seems most natural (and maintainable!) to you and your peers. If you get a better plan for a specific query using a particular syntax, by all means use it - but be sure to test that you still get the better plan whenever you patch or upgrade SQL Server :)
Even though the index is suggested by SQL Server, why does it slow things down so significantly?
Index suggestions are made by the query optimizer. If it comes across a logical selection from a table which is not well served by an existing index, it may add a "missing index" suggestion to its output. These suggestions are opportunistic; they are not based on a full analysis of the query, and do not take account of wider considerations. At best, they are an indication that more helpful indexing may be possible, and a skilled DBA should take a look.
The other thing to say about missing index suggestions is that they are based on the optimizer's costing model, and the optimizer estimates by how much the suggested index might reduce the estimated cost of the query. The key words here are "model" and "estimates". The query optimizer knows little about your hardware configuration or other system configuration options - its model is largely based on fixed numbers that happen to produce reasonable plan outcomes for most people on most systems most of the time. Aside from issues with the exact cost numbers used, the results are always estimates - and estimates can be wrong.
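As an aside, the optimizer's accumulated suggestions can be inspected through the missing index DMVs. A sketch (the `improvement_measure` expression is a common heuristic, not an official metric):

```sql
-- Inspect the optimizer's accumulated missing index suggestions.
-- These reset on instance restart and carry all the caveats above.
SELECT
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact,
    migs.avg_total_user_cost * migs.user_seeks * (migs.avg_user_impact / 100.0)
        AS improvement_measure
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
ORDER BY improvement_measure DESC;
```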
What is the Nested Loop join which is taking most of the time and how to improve its execution time?
There is little to be done to improve the performance of the cross join operation itself; nested loops is the only physical implementation possible for a cross join. The table spool on the inner side of the join is an optimization to avoid rescanning the inner side for each outer row. Whether this is a useful performance optimization depends on various factors, but in my tests the query is better off without it. Again, this is a consequence of using a cost model - my CPU and memory system likely has different performance characteristics than yours. There is no specific query hint to avoid the table spool, but there is an undocumented trace flag (8690) that you can use to test execution performance with and without the spool. If this were a real production system problem, the plan without the spool could be forced using a plan guide based on the plan produced with TF 8690 enabled. Using undocumented trace flags in production is not advised because the installation becomes technically unsupported and trace flags can have undesirable side-effects.
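For testing, the trace flag can be scoped to a single statement with `QUERYTRACEON`. The query below is a hypothetical sketch (table and column names invented for illustration):

```sql
-- Test-only: compare plans and run times with and without the table spool.
-- QUERYTRACEON requires elevated permissions, and TF 8690 is undocumented,
-- so this belongs in a test environment, not in production code.
SELECT T1.col1, T2.col2
FROM dbo.Table1 AS T1
CROSS JOIN dbo.Table2 AS T2
OPTION (QUERYTRACEON 8690);
```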
Is there something that I am doing wrong or have missed?
The main thing you are missing is that although the plan using the nonclustered index has a lower estimated cost according to the optimizer's model, it has a significant execution-time problem. If you look at the distribution of rows across threads in the plan using the Clustered Index, you will likely see a reasonably good distribution:
In the plan using the Nonclustered Index Seek, the work ends up being performed entirely by one thread:
This is a consequence of the way work is distributed among threads by parallel scan/seek operations. It is not always the case that a parallel scan will distribute work better than an index seek - but it does in this case. More complex plans might include repartitioning exchanges to redistribute work across threads. This plan has no such exchanges, so once rows are assigned to a thread, all related work is performed on that same thread. If you look at the work distribution for the other operators in the execution plan, you will see that all work is performed by the same thread as shown for the index seek.
There are no query hints to affect row distribution among threads, the important thing is to be aware of the possibility and to be able to read enough detail in the execution plan to determine when it is causing a problem.
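One way to see the per-thread distribution is to capture the actual execution plan; the per-operator, per-thread row counts appear in the showplan XML:

```sql
-- Capture the actual plan as XML. Per-thread row counts appear in the
-- <RunTimeCountersPerThread> elements under each operator's
-- <RunTimeInformation> node in the returned showplan.
SET STATISTICS XML ON;

-- ...run the query of interest here...

SET STATISTICS XML OFF;
```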
With the default index (on the primary key only), why does it take less time? With the nonclustered index present, each joined row should be found more quickly, because the join is on the Name column on which the index has been created. This is reflected in the query execution plan, and the Index Seek cost is lower when IndexA is active, so why is it still slower? Also, what is it in the Nested Loops left outer join that is causing the slowdown?
It should now be clear that the nonclustered index plan is potentially more efficient, as you would expect; it is just poor distribution of work across threads at execution time that accounts for the performance issue.
For the sake of completing the example and illustrating some of the things I have mentioned, one way to get a better work distribution is to use a temporary table to drive parallel execution:
SELECT
val1,
val2
INTO #Temp
FROM dbo.IndexTestTable AS ITT
WHERE Name = N'Name1';
SELECT
N'Name1',
SUM(T.val1),
SUM(T.val2),
MIN(I2.Name),
SUM(I2.val1),
SUM(I2.val2)
FROM #Temp AS T
CROSS JOIN dbo.IndexTestTable AS I2
WHERE
I2.Name = N'Name1'
OPTION (FORCE ORDER, QUERYTRACEON 8690);
DROP TABLE #Temp;
This results in a plan that uses the more efficient index seeks, does not feature a table spool, and distributes work across threads well:
On my system, this plan executes significantly faster than the Clustered Index Scan version.
If you're interested in learning more about the internals of parallel query execution, you might like to watch my PASS Summit 2013 session recording.
Best Answer
A RID Lookup is a lookup into a heap table using a Row ID (RID). The RID is included in every nonclustered index on a heap so that the rest of the row's data can be found in the heap. Since a heap is a table without a clustered index, its rows are stored in no particular order, so the RID is required for the correlation.
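A minimal repro of a RID Lookup (table and column names are invented for illustration):

```sql
-- A heap (no clustered index) plus a nonclustered index on one column.
CREATE TABLE dbo.HeapDemo
(
    id      integer NOT NULL,
    name    nvarchar(50) NOT NULL,
    payload nvarchar(200) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_HeapDemo_name
ON dbo.HeapDemo (name);

-- Selecting a column that is not in the index forces SQL Server to follow
-- the RID from each matching index entry back into the heap: a RID Lookup.
SELECT payload
FROM dbo.HeapDemo
WHERE name = N'example';
```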
My guess is that view3.UID is found easily via a nonclustered index on that column. When you start asking for other columns (by specifying them in the SELECT list) that are not part of that nonclustered index, SQL Server has to go find the other data items in the unordered heap. That is the RID Lookup, and it can be quite expensive depending on the amount of data you are dealing with.
You might see improved performance if you can identify the source tables referenced in view3 and include your SELECT-list columns in supporting indexes on those tables. This is called a 'covering index': the index is able to 'cover' (retrieve) all of the information you are asking for without having to go elsewhere (to the clustered index or heap).
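A covering index for that shape might look like this (the table and column names are hypothetical placeholders for the real source table behind view3):

```sql
-- INCLUDE stores the extra columns at the leaf level of the index only,
-- so the index 'covers' the query and no RID Lookup into the heap is needed.
CREATE NONCLUSTERED INDEX IX_SourceTable_UID
ON dbo.SourceTable (UID)
INCLUDE (col1, col2);
```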
Something else to try: use a Common Table Expression (CTE) in an attempt to force SQL Server to materialize view3. Before your main SELECT, add the CTE - this assumes that no more than 2,147,483,647 rows (the maximum int value) could be returned - adjust as needed.
Then, use the CTE View3Materialized in place of view3 in the regular joins. I have used this technique with some success; it may or may not help you.
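The TOP-based CTE described above might be sketched as follows. The view name comes from the question; the outer query is a placeholder for your real SELECT list and joins, and note that a CTE does not guarantee materialization - SQL Server may still inline it:

```sql
-- TOP with a constant row count discourages SQL Server from merging the
-- view's query into the outer query, encouraging early evaluation.
-- 2147483647 is the maximum int value.
WITH View3Materialized AS
(
    SELECT TOP (2147483647) V3.*
    FROM dbo.view3 AS V3
)
SELECT COUNT(*)   -- placeholder: use your original SELECT list and joins here
FROM View3Materialized AS V3M;
```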