I will create a couple of tables in SQL Server for a client. Of special interest for this question is the FactStoreSale
table to be created and DimProduct
in which the first table will include all store's sales and include for instance StoreKey
, DateKey
, TimeKey
, ProductKey
and sales data.
The DimProduct
table will use complementary data in regards to Products, i.e. ProductName
, ProductGroup
, ProductCategory
etc. This table will be a slowly changing dimension, and store Product history in terms of price etc. for each store. The table will include the columns FromDate
and ToDate
which describes when the data was entered, and when it was replaced (with default value year 2999 in ToDate
until a new update of the Product is inserted) respectively.
I Believe this is very common. However, the ProductKeys
from the underlying database are very complex strings, such as 140-xx4449CH-4.44,9
.
My alternatives as I see it, is to use Checksum
in SQL Server to convert the complex strings to integers, and store these in a separate column ProductKeyInt
and use these columns when doing JOINS. But checksum does not guarantee unique integer values, which would likely cause problems. When I tried Checksum on a Virtual PC and my own PC, doing Checksum on the same values returned different results which is a concern as well. I Believe these two together rules out Checksum, unless it can be manipulated somehow to make it more unique?
Another alternative is to use more complex functions to ensure that the string values are converted into integers, such as the one provided on a question of mine here. A solution such as this however also has issues, the values 100-xx3
and 1003
will for instance get the same result. It is also complicated in the sense that if someone who isn't that savvy in SQL need to try to find any issues with the Query that uses the function, it may be very hard to understand what is happening.
My third option seem to be to use the DimProduct
table and update this table first from the SQL Agent, and using an Indexing Key on this table and use this index as the ProductKeyInt in the FactStoreSale
table (where ProductKeyInt will be some kind of subquery in for instance FactStoreSale
to fetch the value with the biggest ToDate
in DimProduct corresponding to the ProductKey.
Does anyone have any input? Is there a simpler way? I don't want to have strings as JOIN keys because of for example increased CPU time
Best Answer
You don't need to use the
ProductKey
column as the key column inDimProduct
. You can create a surrogate key that is a composite key made up ofDimProductID
that is anIDENTITY
column and increments per each row of the SCD, and aProductID
column that is unique per eachProductKey
. Both would beINT
(most likely, unless you will never have very many unique products, in which case I might opt forSMALLINT
forProductID
which gives you 65k values to use).This approach allows you to have both the ability to reference any particular version of a product (using the
IDENTITY
column, or both as both will probably work with indexes better), and the ability to aggregate over all versions of a particular product using just the non-IDENTITY
column.In either case, you generally don't need to be JOINing on the
ProductKey
column / value.When you do need to filter or JOIN on
ProductKey
, assuming that there is no functional difference between upper-case and lower-case letters, you should force everything to be upper-case and use a binary Collation (any Collation ending in_BIN2
, not_BIN
, though most likelyLatin1_General_100_BIN2
). Most people do not consider the Collation being used for string columns that contain alpha-numeric codes where there is no purpose, in any way, for having case (or Kana or width or accent insensitivity) and no purpose in evaluating complex linguistic rules, such as which character sequences equate to other characters in other languages. Non-binary Collations have A LOT of complex rules that they apply, and those take time. Using a binary Collation will be much faster as it just compares the bytes, which is why you need to force upper-case just in case someone enters in an alpha character in lower-case.So for this, make sure to set the column's Collation to use
Latin1_General_100_BIN2
when creating the table. If the table has already been created, useALTER TABLE ... ALTER COLUMN
to set the Collation so that it will be used by default.