SampleTable occupies 4714 pages, about 36 MB. Case 1 scans them all, which is why we see 4714 reads. It must also compute 1 million hashes, which are CPU-intensive, and it is all this hashing that drives the time up in case 1.
There is a start-up cost to a hash join (building the hash table, which is also a blocking operation), but hash join ultimately has the lowest theoretical per-row cost of the three physical join types supported by SQL Server, both in terms of IO and CPU. Hash join really comes into its own with a relatively small build input and a large probe input. That said, no physical join type is 'better' in all scenarios.
Now consider case 2. It is not doing any hashing, but it is doing 50000 separate seeks, which is what drives up the reads. But how expensive are those reads by comparison? One might say that if they are physical reads, they could be quite expensive. But keep in mind that 1) only the first read of a given page can be physical, and 2) even so, case 1 would have the same or worse problem, since it is guaranteed to touch every page.
Each seek requires navigating a b-tree from the root, which is computationally expensive compared with a single hash probe. In addition, the general IO pattern for the inner side of a nested loops join is random, compared with the sequential access pattern of the probe-side scan input to a hash join. Depending on the underlying physical IO subsystem, sequential reads may be faster than random reads. Also, the SQL Server read-ahead mechanism works better with sequential IO, issuing larger reads.
So accounting for the fact that both cases have to access each page at least once, it seems to be a question of which is faster, 1 million hashes or about 155000 reads against memory? My tests seem to say the latter, but SQL Server consistently chooses the former.
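One way to make that comparison concrete is to force each plan and measure both. The sketch below is hypothetical: the probe table and column names are assumptions, not taken from the original test setup.

```sql
-- Compare the two physical join strategies directly.
-- dbo.SampleTable and dbo.Probe are placeholder names.
SET STATISTICS IO, TIME ON;

-- Case 1: hash join (the optimizer's default choice here)
SELECT s.ID
FROM dbo.SampleTable AS s
JOIN dbo.Probe AS p ON p.ID = s.ID
OPTION (HASH JOIN);

-- Case 2: nested loops join (seek-driven plan)
SELECT s.ID
FROM dbo.SampleTable AS s
JOIN dbo.Probe AS p ON p.ID = s.ID
OPTION (LOOP JOIN);

SET STATISTICS IO, TIME OFF;
```

Running both with warm and cold caches (DBCC DROPCLEANBUFFERS on a test box) shows how much of the difference comes from physical IO versus CPU.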
The SQL Server query optimizer makes a number of assumptions. One is that the first access to a page made by a query will result in a physical IO (the 'cold cache assumption'). The chance that a later read will come from a page already read into memory by the same query is modelled, but this is no more than an educated guess.
The reason the optimizer's model works this way is that it is generally better to optimize for the worst case (physical IO is needed). Many shortcomings can be covered up by parallelism and running things in memory. The query plans the optimizer would produce if it assumed all data was in memory might perform very poorly if that assumption proved to be invalid.
The plan produced using the cold cache assumption may not perform quite as well as if a warm cache were assumed instead, but its worst-case performance will usually be superior.
Should I keep on forcing this LOOP JOIN hint when testing shows these kinds of results, or am I missing something in my analysis? I am hesitant to go against SQL Server's optimizer, but it feels like it switches to using a hash join much earlier than it should in cases like these.
You should be very careful about doing this for two reasons. First, join hints also silently force the physical join order to match the written order of the query (just as if you had also specified OPTION (FORCE ORDER)). This severely limits the alternatives available to the optimizer, and may not always be what you want. OPTION (LOOP JOIN) forces nested loops joins for the query, but does not enforce the written join order.
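A minimal sketch of that difference, using hypothetical table and column names:

```sql
-- Join hint: forces nested loops for this join AND fixes the written join
-- order, as if OPTION (FORCE ORDER) had also been specified.
SELECT b.ID
FROM dbo.BigTable AS b
INNER LOOP JOIN dbo.SmallTable AS s ON s.ID = b.SmallID;

-- Query hint: forces nested loops for all joins in the query,
-- but the optimizer may still reorder the joins.
SELECT b.ID
FROM dbo.BigTable AS b
JOIN dbo.SmallTable AS s ON s.ID = b.SmallID
OPTION (LOOP JOIN);
```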
Second, you are making the assumption that the data set size will remain small, and most of the logical reads will come from cache. If these assumptions become invalid (perhaps over time), performance will degrade. The built-in query optimizer is quite good at reacting to changing circumstances; removing that freedom is something you should think hard about.
Overall, unless there is a compelling reason to force loops joins, I would avoid it. The default plans are usually fairly close to optimal, and tend to be more resilient in the face of changing circumstances.
What Bob said.
There are a couple of ways to get around this.
Firstly, you could copy the files from the FTP server to your test server and then use xp_cmdshell and the DIR command to read in the contents of that folder (full, diff and log backups) and order them accordingly (you need the date/time in the filename to achieve this). Once you have this information, you can dynamically build your restore statements.
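A rough sketch of that approach, assuming the files have been copied to a local folder, the filenames carry a date/time stamp, and xp_cmdshell is enabled (the path and naming convention here are made up):

```sql
-- Read the folder contents into a table variable via xp_cmdshell.
DECLARE @files TABLE (id int IDENTITY(1,1), fname nvarchar(400));

INSERT @files (fname)
EXEC master..xp_cmdshell 'DIR /B /O:N "D:\Restores\*.bak" "D:\Restores\*.trn"';

-- With a yyyymmdd_hhmmss stamp in each name, alphabetical order is
-- chronological order, so restore statements can be generated in sequence.
SELECT 'RESTORE DATABASE [TestDB] FROM DISK = ''D:\Restores\'
       + fname + ''' WITH NORECOVERY;'
FROM @files
WHERE fname IS NOT NULL
ORDER BY fname;
```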
I did this once, but I have xp_cmdshell disabled in my environment here, so it's not an option.
The way I currently do it: I build the restore statements based on the metadata in msdb and export them to a text file, which is stored on the FTP server. I simply read that text file in along with my backup files on my test server and away I go.
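A minimal sketch of generating those restore statements from msdb backup history (the database name and the WITH options are assumptions):

```sql
-- Generate RESTORE statements from msdb backup history, in order.
-- backupset.type: D = full, I = differential, L = log.
SELECT bs.backup_finish_date,
       CASE WHEN bs.type IN ('D', 'I') THEN 'RESTORE DATABASE'
            ELSE 'RESTORE LOG' END
       + ' [MyDB] FROM DISK = ''' + bmf.physical_device_name
       + ''' WITH NORECOVERY;' AS restore_stmt
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.database_name = 'MyDB'
ORDER BY bs.backup_finish_date;
```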
My solution performs a point-in-time restore to my test box on a daily basis. What it can't guarantee is that the subsequent log backups are valid and that I can restore from them.
I also perform manual restores to my test box (another instance) just to keep myself fresh when it comes to a disaster - I vary the restores too.
It seems that you would like to reference ZipCode in the outer query from the scalar subquery. You cannot reference the alias z, since it has only just been defined in the same SELECT clause. The correct way to do this is to use the base column, aliasing the tables so that you can prefix the column names unambiguously.
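An illustration of the difference, using made-up table and column names:

```sql
-- Fails: the alias z is defined in the same SELECT clause,
-- so the scalar subquery cannot reference it.
SELECT a.PostalCode AS z,
       (SELECT COUNT(*) FROM dbo.Customers AS c
        WHERE c.PostalCode = z) AS CustomerCount
FROM dbo.Addresses AS a;

-- Works: reference the base column through the outer table's alias instead.
SELECT a.PostalCode AS ZipCode,
       (SELECT COUNT(*) FROM dbo.Customers AS c
        WHERE c.PostalCode = a.PostalCode) AS CustomerCount
FROM dbo.Addresses AS a;
```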