SQL Server – How to index alphanumeric foreign keys on large fact tables

data-warehouse, etl, olap, sql-server

I have an alphanumeric field that is essentially an encrypted value serving as a non-unique identifier. It's used to relate other rather large fact tables in a many-to-many relationship. I do not have a related dimension for this field because there are no other attributes for this FK.

Example: Abcdefgh12345

This field is in a rather large and growing data warehouse where the Fact tables are clustered on time and non-clustered on keys like these.

The column is VARCHAR(50), and the values only vary between 45 and 50 characters. I would have to check, but I assume the collation is SQL_Latin1_General_CP1_CI_AS. I use no foreign key constraints, for optimization reasons; everything is controlled by the ETL.
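To give a rough picture of the layout (the table and column names below are illustrative placeholders, not my actual schema), the structure is along these lines:

    -- Hypothetical sketch of the current fact table layout; all names are placeholders
    CREATE TABLE dbo.FactSales
    (
        DateKey     INT NOT NULL,                                              -- time key
        AlphaNumKey VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL, -- 45-50 char encrypted identifier
        Amount      DECIMAL(18, 2) NOT NULL
    );

    -- Clustered on time, non-clustered on the alphanumeric key
    CREATE CLUSTERED INDEX CIX_FactSales_DateKey
        ON dbo.FactSales (DateKey);

    CREATE NONCLUSTERED INDEX IX_FactSales_AlphaNumKey
        ON dbo.FactSales (AlphaNumKey)
        WITH (FILLFACTOR = 75);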

Fragmentation

Due to the nature of the key, it is hard to index well. Its fragmentation is managed based on a series of tests I ran recently, which showed that a 75% fill factor at least keeps fragmentation manageable through the daily incremental loads for at least a week, at which point a full rebuild may be needed; a weekly rebuild is fine.
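For reference, the fragmentation checks behind those tests are just the standard DMV query, shown here against the placeholder table above:

    -- Check fragmentation of the indexes on the (placeholder) fact table
    SELECT  i.name AS index_name,
            ips.avg_fragmentation_in_percent,
            ips.page_count
    FROM    sys.dm_db_index_physical_stats(
                DB_ID(), OBJECT_ID(N'dbo.FactSales'), NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
        ON  i.object_id = ips.object_id
        AND i.index_id  = ips.index_id;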

Performance

After reducing the fill factor from 100% to 75%, inserts and reads have become slower, and the indexes have grown larger, as expected. Any indexes with INCLUDE columns pretty much drive insert performance into the ground, but of course they help the queries that need them by roughly 10x.
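As an illustration of what I mean by an index with includes (again using the placeholder names from above):

    -- Hypothetical covering index; the INCLUDE columns widen every leaf row,
    -- which hurts inserts while making the covered queries much faster
    CREATE NONCLUSTERED INDEX IX_FactSales_AlphaNumKey_Covering
        ON dbo.FactSales (AlphaNumKey)
        INCLUDE (DateKey, Amount)
        WITH (FILLFACTOR = 75);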

Question

Does anyone have good experience working with alphanumeric keys in a data warehousing environment? How it's handled and indexed now is fine, but I think it could be better. I have been dabbling with the idea of ripping out these keys, forming a new dimension, and adding a more manageable surrogate key during the ETL process.

Best Answer

Assuming you are correct that the VARCHAR(50) field uses the SQL_Latin1_General_CP1_CI_AS collation, you should consider altering those alphanumeric "code" fields, in each of the tables where they exist, to use the Latin1_General_BIN2 collation. Since the value is derived from an algorithm, the casing of any alpha characters should be consistent, so you don't need to worry about case-insensitive searching. A binary collation will perform better than a non-binary collation (case-insensitive or even case-sensitive) because it doesn't need to apply any culturally aware linguistic rules.
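A minimal sketch of that change, reusing the placeholder table and column names from the question (note that any index or constraint referencing the column has to be dropped and recreated around the ALTER):

    -- Drop the nonclustered index that references the column
    DROP INDEX IX_FactSales_AlphaNumKey ON dbo.FactSales;

    -- Switch the column to a binary collation: comparisons and seeks skip linguistic rules
    ALTER TABLE dbo.FactSales
        ALTER COLUMN AlphaNumKey VARCHAR(50) COLLATE Latin1_General_BIN2 NOT NULL;

    -- Recreate the index
    CREATE NONCLUSTERED INDEX IX_FactSales_AlphaNumKey
        ON dbo.FactSales (AlphaNumKey)
        WITH (FILLFACTOR = 75);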

Also, since you have tried FILLFACTOR settings of both 100 and 75 and see pros and cons in each case, you should try a setting of 90 to see if that helps.
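For example, again with the placeholder names:

    -- Rebuild the key index with an intermediate fill factor to compare against 100 and 75
    ALTER INDEX IX_FactSales_AlphaNumKey ON dbo.FactSales
        REBUILD WITH (FILLFACTOR = 90);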

I would try each of those changes separately (first one, then add the other) so that you can test their effects individually; that way you know how much impact each change had.

I think that, in general, the system would benefit from replacing the 45-50 byte key with a 4-byte key (assuming you use INT, since this is dimensional data, though even an 8-byte BIGINT would be an improvement). But adding a new dimension table and switching the VARCHAR(50) out for an INT (or BIGINT) requires changes to both the data model and the code, whereas the first two changes affect only the data model and are fairly minimal at that.
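If you do go that route, a minimal sketch of the new dimension and the ETL lookup (all names here are illustrative, including the staging table) might look like:

    -- New dimension holding each alphanumeric code once, keyed by a small surrogate
    CREATE TABLE dbo.DimAlphaKey
    (
        AlphaKeyID  INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        AlphaNumKey VARCHAR(50) COLLATE Latin1_General_BIN2 NOT NULL UNIQUE
    );

    -- During the ETL, add any codes not yet seen, then join to pick up the
    -- surrogate key for the incoming fact rows
    INSERT INTO dbo.DimAlphaKey (AlphaNumKey)
    SELECT DISTINCT s.AlphaNumKey
    FROM   staging.FactSales AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.DimAlphaKey AS d
                       WHERE  d.AlphaNumKey = s.AlphaNumKey);

The fact tables would then carry the small surrogate key instead of the 45-50 byte string, which shrinks the nonclustered indexes considerably.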