SQL Server – Choosing Prefixes for Normalized Data Store

Tags: database-design, foreign-key, naming-convention, natural-key, sql-server

I'm designing a Staging+NDS+DDS Data Warehouse system, where an ETL is going to normalize data from [Staging] and load it into [NDS], which will hold all history.

I've pretty much finished the T-SQL script that will create the tables and constraints in the [NDS] database, which contains Master and Transactional tables, that will respectively feed [DDS] Dimension and Fact tables in what I'm intending to be a star schema.

I've given myself the following rules to follow:

  • Tables sourcing [DDS] dimensions are prefixed with DWD_
  • Tables sourcing [DDS] facts are prefixed with DWF_
  • Foreign key columns are prefixed with DWK_
  • The surrogate key column uses the same prefix as the table, which means the surrogate key is always either:
    • DWD_Key for a DWD_ table, or
    • DWF_Key for a DWF_ table.
  • Control columns are prefixed with the same prefix as the table. For example…
    • The DWD_Customers table has control columns:
      • DWD_IsLastImage
      • DWD_EffectiveFrom
      • DWD_EffectiveTo
      • DWD_DateInserted
      • DWD_DateUpdated
      • DWD_DateDeleted
    • The DWF_InvoiceHeaders table has control columns:
      • DWF_DateInserted
      • DWF_DateUpdated
      • DWF_DateDeleted
  • Primary key constraints (on the surrogate key) are always named PK_ followed by the table name (including the table prefix) – e.g. PK_DWD_Customers and PK_DWF_InvoiceHeaders.
  • I also added a unique constraint on natural keys, and those are always prefixed with NK_ followed by the table name (including the table prefix) – e.g. NK_DWD_Customers and NK_DWF_InvoiceHeaders.
  • Foreign key columns are always prefixed with DWK_ followed by the name of the referenced table (without its prefix) and the word "Key" – e.g. DWK_CustomerKey.
  • Foreign key constraints are always named FK_[ParentTableNameWithPrefix]_[ChildTableNameWithPrefix].
  • When a table has multiple FKs to the same table, the name of the FK column is appended to the constraint's name, e.g. FK_DWD_FiscalCalendar_DWF_OrderDetails_DeliveryDate.
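As a rough sketch of how the last two rules might play out when a fact-sourcing table references the same table twice (the column names DWK_OrderDateKey, DWK_DeliveryDateKey, CalendarDate, OrderNumber and LineNumber are made up for illustration, and I'm assuming the appended suffix is the FK column name without its DWK_ prefix and "Key" suffix):

```sql
-- Hypothetical tables: only the prefixes and constraint naming follow the rules above.
create table DWD_FiscalCalendar (
     DWD_Key int not null identity(1,1)
    ,DWD_DateInserted datetime not null
    ,CalendarDate date not null
    ,constraint PK_DWD_FiscalCalendar primary key clustered (DWD_Key asc)
    ,constraint NK_DWD_FiscalCalendar unique (CalendarDate)
);

create table DWF_OrderDetails (
     DWF_Key int not null identity(1,1)
    ,DWF_DateInserted datetime not null
    ,DWK_OrderDateKey int not null      -- assumed column name
    ,DWK_DeliveryDateKey int not null   -- assumed column name
    ,OrderNumber nvarchar(20) not null
    ,LineNumber int not null
    ,constraint PK_DWF_OrderDetails primary key clustered (DWF_Key asc)
    ,constraint NK_DWF_OrderDetails unique (OrderNumber, LineNumber)
    -- Two FKs to the same referenced table, told apart by the appended suffix.
    ,constraint FK_DWD_FiscalCalendar_DWF_OrderDetails_OrderDate
        foreign key (DWK_OrderDateKey) references DWD_FiscalCalendar (DWD_Key)
    ,constraint FK_DWD_FiscalCalendar_DWF_OrderDetails_DeliveryDate
        foreign key (DWK_DeliveryDateKey) references DWD_FiscalCalendar (DWD_Key)
);
```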

All prefixed columns have no business meaning and should never appear in views; this leaves me with, I find, a pretty clean and consistent design, and create table scripts looking like this:

create table DWD_SubCategories (
     DWD_Key int not null identity(1,1)
    ,DWD_DateInserted datetime not null
    ,DWD_DateUpdated datetime null
    ,DWK_CategoryKey int not null
    ,Code nvarchar(5) not null
    ,Name nvarchar(50) not null
    ,constraint PK_DWD_SubCategories primary key clustered (DWD_Key asc)
    ,constraint NK_DWD_SubCategories unique (Code)
);

So, my question is, is there anything I should know (or unlearn) before I continue and implement the ETL to load data into this database? Would anyone inheriting this database want to chase me down and rip my head off in the future? What should I change to avoid this? The reason I'm asking about prefixes, is because I'm using DWD and DWF, but the tables are technically not "dimension" and "fact" tables. Is that confusing?

Also, I'm unsure about the concept of natural key – am I correct to presume it should be a unique combination of columns that the source system might consider its "key" columns, that I can use in the ETL process to locate, say, a specific record to update?

Best Answer

There is always something more you should know and, almost as often, something you should consciously unlearn. That is especially true of data warehousing, which is a relatively fledgling sector leveraging relatively new technologies.

In regards to what I've seen in the real world, walking into a company for the first time and seeing what I'm understanding about your design would be genuinely tear-inducing: Tears of joy and relief. From the outset, you are well on your way to beginning what appears to be a well thought-out ( well engineered ) ETL / data warehousing system. As with the implementation of any software product, your mileage may vary as the solution grows and is consumed by the business, but fundamentally, you are on The Right Track™ ( and yes, you know what a natural key is ).
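To make the natural key point a little more concrete, here is a minimal sketch of an ETL step that uses the natural key to decide between update and insert. The tables are assumptions on my part: DWD_Categories is modelled on the DWD_SubCategories script in the question, and Stg_Categories is a hypothetical staging table. It also overwrites values in place, which is a simplification; a table that keeps history through DWD_EffectiveFrom / DWD_EffectiveTo would close the old row and insert a new one instead.

```sql
-- Assumed target table, shaped like the DWD_SubCategories script in the question.
create table DWD_Categories (
     DWD_Key int not null identity(1,1)
    ,DWD_DateInserted datetime not null
    ,DWD_DateUpdated datetime null
    ,Code nvarchar(5) not null
    ,Name nvarchar(50) not null
    ,constraint PK_DWD_Categories primary key clustered (DWD_Key asc)
    ,constraint NK_DWD_Categories unique (Code)
);

-- Hypothetical staging table holding the latest extract.
create table Stg_Categories (
     Code nvarchar(5) not null
    ,Name nvarchar(50) not null
);

-- Locate existing rows by natural key (Code) and update those that changed.
update tgt
set  tgt.Name = src.Name
    ,tgt.DWD_DateUpdated = getdate()
from DWD_Categories as tgt
inner join Stg_Categories as src
    on src.Code = tgt.Code
where tgt.Name <> src.Name;

-- Insert rows whose natural key is not in the target yet.
insert into DWD_Categories (DWD_DateInserted, Code, Name)
select getdate(), src.Code, src.Name
from Stg_Categories as src
where not exists (
    select 1 from DWD_Categories as tgt where tgt.Code = src.Code
);
```

The surrogate key never appears in the matching logic; it is generated on insert and only used by the foreign keys of other tables.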

I've found there to be a number of challenges with these types of solutions, which I will touch upon to reinforce some of your decisions and perhaps lend some insight into the road ahead of you. Firstly, I have often found myself in a predicament because a developer ( or even a fellow database administrator / data professional ) misunderstood the context of a control column: for example, running a process against the DateInserted column, a mere time stamp of insertion, instead of the DateReceived or similarly named column that is intended to relate a row to a particular date of occurrence. So while I agree completely with the cautions @Aaron Bertrand raises, I feel that the prefixes on your control columns could actually be leveraged as a sort of flag to help prevent their misuse. Obvious should be obvious of course, but much like writing code in general, explicit is preferable. That said, I would almost certainly leave the table prefixes out of the names of indexes and such ( probably even keys - the PK-type prefixes can and should stay in my opinion, but unless there's a real threat of DWD_SubCategories and DWF_SubCategories existing in the same schema, the table prefixes really are just fluff there ). I think the concern about the DWD and DWF prefixes is valid, but the tables will be living in the [NDS] catalog and the prefixes serve to indicate intent, so it is completely fine to use the nomenclature in that manner.

The second ( and perhaps most infuriating ) challenge is one of cross-training your coworkers. All of the software engineering, usage flags and design practice rules are completely for naught if your striving-for-paycheque-over-excellence colleagues get involved and do their less than very best ( or to be fair, are even just simply having a bad day ). Do keep in mind that large projects generally have many fingers in the pot, so it is imperative that those fingers are behaving well.

The last thing I'll touch on here is to always keep in mind the actual value of any ETL system to a business. Of the Extract, Transform and Load paradigm, the first and final letters have absolutely no business value, so you will want to work on making the development and maintenance of both the Extract and Load processes as minimal as possible - the "real" work will be done in the Transform phase, so you will want to automate the E and L steps as much as possible so that you can focus on making ( and keeping ) your solution valuable to the business unit by actively working on the transforms.

All of that said, I've only had the opportunity to work on a handful of different warehousing solutions so perhaps a more knowledgeable user could step in and remove my foot from my mouth if I need correcting. As I said initially, this is one of those areas where one can always learn or unlearn something, and I am absolutely no exception.

Oh, one more thing ( and probably the most important ) - Unit Test! Once your E and L are working as intended and you've had the opportunity to put a few domains through your T solution, get somebody to vet the results. If they're good, save the result set somewhere, so that when you make changes ( and you will, without a doubt ) you can ensure you haven't broken something, somewhere else. Again, automate this process as much as you possibly can ( it's another 0-value process to the business, until they go without it at least ;) ). I generally set up a separate schema or catalog for this purpose.
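A minimal sketch of that kind of regression check, assuming a separate schema named test already exists and that dbo.DWD_SubCategories ( the table from the question ) has been loaded and vetted. The schema and the Expected_SubCategories table name are made up for illustration. Only business columns are compared, since control columns such as the date stamps will legitimately differ between runs.

```sql
-- 1) Run once, after somebody has vetted the results: snapshot the business columns.
select Code, Name
into test.Expected_SubCategories
from dbo.DWD_SubCategories;

-- 2) Run after every change to the transforms: any row returned is a difference.
select 'missing from current' as Issue, d.Code, d.Name
from (
    select Code, Name from test.Expected_SubCategories
    except
    select Code, Name from dbo.DWD_SubCategories
) as d
union all
select 'unexpected in current' as Issue, d.Code, d.Name
from (
    select Code, Name from dbo.DWD_SubCategories
    except
    select Code, Name from test.Expected_SubCategories
) as d;
```

An empty result from the second query means the current output still matches the vetted baseline, which makes it easy to wire into an automated job.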

Hopefully some of what I've said will be useful to you!

As an update, @Aaron Bertrand's schema separation seems like it would be quite a good way to avoid unnecessary prefixing as well, so certainly consider that ( I know I will haha ).
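For what it's worth, this is roughly how I picture schema separation replacing the table prefixes. The schema name dim and the shortened constraint names are my own assumptions for the sake of the sketch, not something taken from the other answer; the columns are copied from the script in the question, with the control column prefixes left in place as a misuse flag.

```sql
-- Hypothetical schema name; a second schema would hold the fact-sourcing tables.
create schema dim;
go

-- Same columns as the question's DWD_SubCategories, but the schema carries the
-- "dimension source" meaning instead of a table prefix.
create table dim.SubCategories (
     DWD_Key int not null identity(1,1)
    ,DWD_DateInserted datetime not null
    ,DWD_DateUpdated datetime null
    ,DWK_CategoryKey int not null
    ,Code nvarchar(5) not null
    ,Name nvarchar(50) not null
    ,constraint PK_SubCategories primary key clustered (DWD_Key asc)
    ,constraint NK_SubCategories unique (Code)
);
```

Note that go is a batch separator understood by tools such as SSMS and sqlcmd, needed here because create schema must be the first statement in its batch.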