SQL Server – Avoid Using Complicated Strings in Joins

data-warehousedatabase-designjoin;sql servert-sql

I will create a couple of tables in SQL Server for a client. Of special interest for this question is the FactStoreSale table to be created and DimProduct in which the first table will include all store's sales and include for instance StoreKey, DateKey, TimeKey, ProductKey and sales data.

The DimProduct table will use complementary data in regards to Products, i.e. ProductName, ProductGroup, ProductCategory etc. This table will be a slowly changing dimension, and store Product history in terms of price etc. for each store. The table will include the columns FromDate and ToDate which describes when the data was entered, and when it was replaced (with default value year 2999 in ToDate until a new update of the Product is inserted) respectively.

I Believe this is very common. However, the ProductKeys from the underlying database are very complex strings, such as 140-xx4449CH-4.44,9.

My alternatives as I see it, is to use Checksum in SQL Server to convert the complex strings to integers, and store these in a separate column ProductKeyInt
and use these columns when doing JOINS. But checksum does not guarantee unique integer values, which would likely cause problems. When I tried Checksum on a Virtual PC and my own PC, doing Checksum on the same values returned different results which is a concern as well. I Believe these two together rules out Checksum, unless it can be manipulated somehow to make it more unique?

Another alternative is to use more complex functions to ensure that the string values are converted into integers, such as the one provided on a question of mine here. A solution such as this however also has issues, the values 100-xx3 and 1003 will for instance get the same result. It is also complicated in the sense that if someone who isn't that savvy in SQL need to try to find any issues with the Query that uses the function, it may be very hard to understand what is happening.

My third option seem to be to use the DimProduct table and update this table first from the SQL Agent, and using an Indexing Key on this table and use this index as the ProductKeyInt in the FactStoreSale table (where ProductKeyInt will be some kind of subquery in for instance FactStoreSale to fetch the value with the biggest ToDate in DimProduct corresponding to the ProductKey.

Does anyone have any input? Is there a simpler way? I don't want to have strings as JOIN keys because of for example increased CPU time

Best Answer

  1. You don't need to use the ProductKey column as the key column in DimProduct. You can create a surrogate key that is a composite key made up of DimProductID that is an IDENTITY column and increments per each row of the SCD, and a ProductID column that is unique per each ProductKey. Both would be INT (most likely, unless you will never have very many unique products, in which case I might opt for SMALLINT for ProductID which gives you 65k values to use).

    This approach allows you to have both the ability to reference any particular version of a product (using the IDENTITY column, or both as both will probably work with indexes better), and the ability to aggregate over all versions of a particular product using just the non-IDENTITY column.

    In either case, you generally don't need to be JOINing on the ProductKey column / value.

  2. When you do need to filter or JOIN on ProductKey, assuming that there is no functional difference between upper-case and lower-case letters, you should force everything to be upper-case and use a binary Collation (any Collation ending in _BIN2, not _BIN, though most likely Latin1_General_100_BIN2). Most people do not consider the Collation being used for string columns that contain alpha-numeric codes where there is no purpose, in any way, for having case (or Kana or width or accent insensitivity) and no purpose in evaluating complex linguistic rules, such as which character sequences equate to other characters in other languages. Non-binary Collations have A LOT of complex rules that they apply, and those take time. Using a binary Collation will be much faster as it just compares the bytes, which is why you need to force upper-case just in case someone enters in an alpha character in lower-case.

    So for this, make sure to set the column's Collation to use Latin1_General_100_BIN2 when creating the table. If the table has already been created, use ALTER TABLE ... ALTER COLUMN to set the Collation so that it will be used by default.