I prefer TCP/IP over Named Pipes, even though in most situations there will be no noticeable difference. You can do this by adjusting the protocols supported by the instance in SQL Server Configuration Manager rather than hard-coding things in your connection string (this makes it easier to make changes or to troubleshoot).
Essentially, the routing and other overhead involved with named pipes makes it the less efficient option (unless your apps are on the same machine as SQL Server, in which case there is only a little extra overhead), especially at scale, on a slower network (100 Mbps or less), or when your workloads come in bursts.
If your apps are on the same box as SQL Server, also keep shared memory in mind: for applications communicating directly with a local instance, it is the most efficient option of all.
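While I recommend controlling protocols in SQL Server Configuration Manager, it can be handy when troubleshooting to force a specific protocol from the client side using a protocol prefix in the connection string. A sketch (server, port, and database names are placeholders):

```
# Default: let the instance's enabled protocols and the client's protocol order decide
Server=MyServer;Database=MyDb;Integrated Security=true;

# Force a protocol explicitly for troubleshooting:
Server=tcp:MyServer,1433;Database=MyDb;Integrated Security=true;   # TCP/IP
Server=np:MyServer;Database=MyDb;Integrated Security=true;         # Named Pipes
Server=lpc:MyServer;Database=MyDb;Integrated Security=true;        # Shared Memory (local only)
```

Again, prefer leaving the prefix out in production so the choice stays configurable on the server.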
You can read about the performance advantages of TCP/IP in more detail.
The HASHBYTES function accepts at most 8000 bytes of input (this limit was removed starting with SQL Server 2016). Because your inputs are potentially larger than that, duplicates in the portion of the field that gets hashed will cause collisions, regardless of the algorithm chosen. Carefully consider the range of data you plan to hash -- using the first 4000 characters is the obvious choice, but may not be the best choice for your data.
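To avoid recomputing the hash on every query, one approach is a persisted computed column over the chosen range, which can then be indexed. A sketch with hypothetical table and column names, assuming the pre-2016 8000-byte limit (hence hashing only the first 4000 characters of an nvarchar column):

```sql
-- BodyHash is computed once per row change and persisted, so joins
-- and lookups don't re-run HASHBYTES over the large column.
ALTER TABLE dbo.Documents
    ADD BodyHash AS CAST(HASHBYTES('MD5', LEFT(Body, 4000)) AS binary(16)) PERSISTED;

-- A 16-byte binary key indexes cheaply compared to the full string.
CREATE INDEX IX_Documents_BodyHash ON dbo.Documents (BodyHash);
```

MD5 output is 16 bytes, which keeps the index small relative to indexing a long string prefix directly.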
In any event, because of what a hash function is, even if the inputs are 8000 bytes or less, the only way to ensure 100% correctness in the results is to compare the base values at some point (read: not necessarily first). Period.
The business will dictate whether or not 100% accuracy is required. That decision tells you either (a) comparing the base values is required, or (b) comparing the base values can be skipped -- in which case you must decide how much accuracy to trade away for performance.
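The hash-first, verify-second pattern looks something like this (table and column names are hypothetical, and assume a precomputed hash column as described above):

```sql
-- The hash equality narrows the join cheaply via an index seek;
-- the comparison on the base values guarantees correctness.
SELECT a.Id AS IdA, b.Id AS IdB
FROM dbo.DocumentsA AS a
INNER JOIN dbo.DocumentsB AS b
    ON  a.BodyHash = b.BodyHash   -- cheap filter; rare collisions can slip through
    AND a.Body     = b.Body;      -- final check, required for 100% accuracy
```

Drop the second predicate only if the business has explicitly accepted the collision risk.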
While hash collisions are possible in a unique input set, they are infinitesimally rare, regardless of the algorithm chosen. The whole idea of using a hash value in this scenario is to efficiently narrow down the join results to a more manageable set, not to necessarily arrive at the final set of results immediately. Again, for 100% accuracy, this cannot be the final step in the process. This scenario isn't using hashing for the purpose of cryptography, so an algorithm such as MD5 will work fine.
It would be extremely hard for me to justify moving up to a SHA-x algorithm for "accuracy" purposes because if the business is going to freak out about the minuscule collision possibilities of MD5, chances are they're also going to freak out that the SHA-x algorithms aren't perfect either. They either have to come to terms with the slight inaccuracy, or mandate that the query be 100% accurate and live with the associated technical implications. I suppose if the CEO sleeps better at night knowing you used SHA-x instead of MD5, well, fine; it still doesn't mean much from a technical point of view in this case.
Speaking of performance, if the tables are read-mostly and the join result is needed frequently, consider implementing an indexed view to eliminate the need to compute the entire join every time it's requested. Of course you trade off storage for that, but it may be well worth it for the performance improvement, particularly if 100% accuracy is required.
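An indexed view over the hash-equality join might be sketched like this (names are hypothetical; indexed views require SCHEMABINDING, two-part names, and a unique clustered index):

```sql
-- Materializes the candidate matches so the join isn't recomputed per query.
CREATE VIEW dbo.vDocumentMatches
WITH SCHEMABINDING
AS
SELECT a.Id AS IdA, b.Id AS IdB
FROM dbo.DocumentsA AS a
INNER JOIN dbo.DocumentsB AS b
    ON a.BodyHash = b.BodyHash;
GO

-- The unique clustered index is what actually persists the view's results.
CREATE UNIQUE CLUSTERED INDEX IX_vDocumentMatches
    ON dbo.vDocumentMatches (IdA, IdB);
```

If 100% accuracy is required, compare the base values when querying the view; the view only materializes the hash-narrowed candidate set.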
For further reading on indexing long string values, I published an article that walks through an example of how to do this for a single table, and presents things to consider when attempting the full scenario in this question.
Best Answer
Microsoft's documentation for HASHBYTES notes that the MD2, MD4, MD5, SHA, and SHA1 algorithms are deprecated, and recommends using one of the SHA2 algorithms instead.
Depending on how you're using MD5 and HASHBYTES, you may have other options available aside from moving to SHA2_256.
You may be interested in using this methodology I documented to determine which rows have changes. It uses a rowversion column in the source table to let you instantly see rows that are new or changed. This avoids using a deprecated feature, and is much faster than any hashing function will ever be.
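The rowversion approach, in rough outline (table and column names are hypothetical):

```sql
-- rowversion is maintained automatically: it increments database-wide
-- on every insert or update, so it doubles as a change marker.
ALTER TABLE dbo.Source ADD RowVer rowversion;
GO

-- @LastSync holds the highest RowVer processed by the previous run
-- (shown here as 0x0 for a first full pass).
DECLARE @LastSync binary(8) = 0x0;

SELECT Id, Body, RowVer
FROM dbo.Source
WHERE RowVer > @LastSync;
```

Persist the largest RowVer you processed and feed it back in as @LastSync on the next run; no hashing or full-column comparison is needed to find the changed rows.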