There is always something else you should know and, almost as often, something else you should consciously stop doing. This is especially true in the context of data warehousing, a relatively young sector that leverages relatively new technologies.
Based on what I've seen in the real world, walking into a company for the first time and finding a design like yours would be genuinely tear-inducing: tears of joy and relief. From the outset, you are well on your way to what appears to be a well-thought-out ( well engineered ) ETL / data warehousing system. As with the implementation of any software product, your mileage may vary as the solution grows and is consumed by the business, but fundamentally, you are on The Right Track™ ( and yes, you know what a natural key is ).
I've found a number of challenges with these types of solutions, which I will touch upon to reinforce some of your decisions and perhaps lend some insight into the road ahead of you. First: I have often found myself in a predicament because a developer ( even a fellow database administrator / data professional ) misunderstood the context of a control column, for example running a process against the DateInserted column ( a mere timestamp of insertion ) instead of the DateReceived or similarly named column, which is intended to relate a row to a particular date of occurrence. So while I agree completely with the cautions @Aaron Bertrand raises, I feel that the prefixes for your control columns could actually be leveraged as a sort of flag to help prevent their misuse. Obvious should be obvious, of course, but much like writing code in general, explicit is preferable. That said, I would almost certainly leave such prefixes out of the indexes and such ( probably even keys; PK types can and should stay, in my opinion, but unless there's a real threat of DWD_SubCategories and DWF_SubCategories existing in the same schema, they really are just fluff ). I think the concern about the DWD and DWF prefixes is valid, but they'll be living in the [NDS] catalog and would serve to indicate intent, making it completely fine to use the nomenclature in that manner.
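To make the control-column pitfall concrete, here is a minimal sketch ( the DWF_Sales table name is invented for illustration ) of the misuse I have seen:

```sql
-- Wrong: ‘DateInserted’ is only a load timestamp, so this filters on
-- when the row happened to be loaded into the warehouse.
SELECT * FROM DWF_Sales WHERE DateInserted >= DATE '2020-01-01';

-- Right: ‘DateReceived’ relates the row to the business date it
-- actually pertains to.
SELECT * FROM DWF_Sales WHERE DateReceived >= DATE '2020-01-01';
```

Both queries run happily and return plausible-looking rows, which is exactly why the distinction is so easy to miss.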
The second ( and perhaps most infuriating ) challenge is cross-training your coworkers. All of the software engineering, usage flags and design practice rules are completely for naught if your striving-for-paycheque-over-excellence colleagues get involved and do less than their very best ( or, to be fair, are simply having a bad day ). Do keep in mind that large projects generally have many fingers in the pie, so it is imperative that those fingers behave well.
The last thing I'll touch on here is to always keep in mind the actual value of any ETL system to a business. Of the Extract, Transform and Load paradigm, the first and final letters have essentially no business value on their own, so you will want to keep the development and maintenance of the Extract and Load processes as minimal as possible. The "real" work happens in the Transform phase, so automate the E and L steps as much as you can, freeing yourself to make ( and keep ) your solution valuable to the business unit by actively working on the transforms.
All of that said, I've only had the opportunity to work on a handful of different warehousing solutions so perhaps a more knowledgeable user could step in and remove my foot from my mouth if I need correcting. As I said initially, this is one of those areas where one can always learn or unlearn something, and I am absolutely no exception.
Oh, one more thing ( and probably the most important ) - Unit Test! Once your E and L are working as intended and you've had the opportunity to put a few domains through your T solution, get somebody to vet the results. If they're good, save the result set somewhere, so that when you make changes ( and you will, without a doubt ) you can ensure you haven't broken something, somewhere else. Again, automate this process as much as you possibly can ( it's another 0-value process to the business, until they go without it at least ;) ). I generally set up a separate schema or catalog for this purpose.
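As one way to automate that regression check in SQL ( the qa schema and table names here are purely illustrative assumptions ), you can snapshot the vetted result set and diff it against the current transform output with EXCEPT:

```sql
-- A dedicated schema to hold vetted snapshots.
CREATE SCHEMA qa;

-- Snapshot the output of a transform once somebody has vetted it.
CREATE TABLE qa.dim_product_expected AS
SELECT * FROM nds.dim_product;

-- After any change to the transform, both of these queries returning
-- zero rows means the output still matches the vetted snapshot
-- (the diff must be checked in both directions).
SELECT * FROM nds.dim_product
EXCEPT
SELECT * FROM qa.dim_product_expected;

SELECT * FROM qa.dim_product_expected
EXCEPT
SELECT * FROM nds.dim_product;
```

Wrapping the two EXCEPT queries in a scheduled job that raises an alert on any returned row turns this into a cheap, automated unit test.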
Hopefully some of what I've said will be useful to you!
As an update, @Aaron Bertrand's schema separation seems like it would be quite a good way to avoid unnecessary prefixing as well, so certainly consider that ( I know I will haha ).
Provided that Sku and ItemNumber will always imply unique values
I consider that you found the answer already by discovering that, conceptually speaking, ItemNumber is an optional property; i.e., when you determined that it does not apply to each and every one of the occurrences ( represented by logical-level rows ) of the Product entity type. Therefore, the item_number column should not be declared as an ALTERNATE KEY (AK for brevity) in the product table, as you rightly pointed out.
In this respect, your Scenario B is quite reasonable, as the following conceptual-level formulation demonstrates:
- A product may or may not have an item number.
In other words, there is a one-to-zero-or-one (1:0/1) cardinality ratio between Product and ItemNumber.
Then, yes, you should introduce a new table to deal with the optional column, and I agree that product_item_number is a very descriptive name for it. This table should have sku constrained as its PRIMARY KEY (PK), so as to ensure that no more than one row with the same sku value is inserted into it, just like you did.
It is also important to mention that product_item_number.sku should also be constrained as a FOREIGN KEY (FK) making a reference to product.sku.
Here is a sample SQL-DDL logical-level design that illustrates the previous suggestions:
-- You should determine which are the most fitting
-- data types and sizes for all your table columns
-- depending on your business context characteristics.
-- Also, you should make accurate tests to define
-- the most convenient INDEXing strategies.
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
sku TEXT NOT NULL, -- To be constrained as PK and FK to ensure the 1:0/1 correspondence ratio between the relevant rows.
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (sku),
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- In this context, ‘item_number’ is an AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku)
REFERENCES product (sku)
);
Tested on PostgreSQL 11 in this db<>fiddle.
Moreover, there is another conceptual formulation that guides in shaping the database design presented above:
- If it exists, the ItemNumber of a Product must be unique.
So the place where the item_number column should actually be declared as an AK is right there, in the product_item_number table, because said column requires uniqueness protection only when the pertinent value is provided; hence the UNIQUE and NOT NULL constraints have to be configured accordingly.
Missing values and the “Closed World Interpretation”
The logical SQL-DDL arrangement previously described is an example of the relational approach to handling missing values, although it is not the most popular or usual one. This approach is related to the "Closed World Interpretation" ( or "Assumption" ). Adopting this position, (a) the information recorded in the database is always deemed true, and (b) the information that is not recorded in it is, at all times, deemed false. In this way, one exclusively retains facts that are known.
In the present business scenario, when a user supplies all the data points that are comprised in the product table, you INSERT the corresponding row; if, and only if, the user made the item_number datum available, you also INSERT the product_item_number counterpart. In case the item_number value is unknown or simply does not apply, you do not INSERT a product_item_number row, and that is it.
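Using the tables declared above, the manipulation would look like this ( the sample values are, of course, invented ):

```sql
-- A product whose item number is known: INSERT both rows.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-001', 'Widget', 9.99, 25);
INSERT INTO product_item_number (sku, item_number)
VALUES ('SKU-001', 'D57-RED');

-- A product whose item number is unknown or inapplicable:
-- INSERT only the ‘product’ row; no NULL mark is ever stored.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-002', 'Gadget', 4.50, 10);
```

Retrieving products together with their optional item numbers then becomes a LEFT JOIN between the two tables.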
With this method you avoid holding NULL marks ( or markers ) in your base tables, and the logical-level consequences that I will detail in the next section, but you should be aware that this is a "controversial" topic in the database administration ambit. On this point, you might find of value the answers to the Stack Overflow question on this very subject.
The popular course of action
I guess, however, that the popular ( or common ) approach would be to have a single product table that includes the item_number column which, in turn, would be set as NULLable and, at the same time, defined with a UNIQUE constraint. The way I see it, this approach would make your database and the applicable data manipulation operations less elegant (as shown, e.g., in this outstanding Stack Overflow answer), but it is a possibility.
See the successive DDL statements that exemplify this course of action:
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
item_number TEXT NULL, -- Accepting NULL marks.
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK1 UNIQUE (name), -- AK.
CONSTRAINT product_AK2 UNIQUE (item_number), -- Being ‘NULLable’, this is not an AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
Tested on PostgreSQL 11 in this db<>fiddle.
So, having established item_number as a column that can contain NULLs, it is not correct to say, logically speaking, that it is an AK. Furthermore, you would be storing ambiguous NULL marks ( which are not values, no matter that the PostgreSQL documentation labels them that way ), thus it can be argued that the table would not be a proper representation of an adapted mathematical relation, and normalization rules cannot be applied to it.
Since a NULL indicates that a column value is (1) unknown or (2) inapplicable, it cannot rightly be stated that said mark belongs to the valid domain of values of item_number. As you know, this kind of mark tells something about the "status" of a real value, but it is not a value itself and, naturally, it does not behave as such ( and, by the way, it is worth mentioning that NULLs behave differently across distinct SQL database management systems, even across distinct versions of the same system ).
Then, if the inclusion of NULLs leaves (i) the domain of values of a certain column and (ii) the meaning that said column carries not entirely clear, the trustworthiness of the table as a whole is put into question.
Despite both the theoretical and practical ( e.g., regarding data manipulation ) implications that concern the retention of NULL marks in a database, this is the approach to handling missing data that you will find in the vast majority of databases built on SQL platforms, since it permits attaching columns for optional values to the base tables of significance and, in effect, eludes the creation of (a) a complementary table and (b) the associated tasks.
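A quick demonstration of that peculiar behaviour, against the NULLable variant of the product table above ( sample values invented ):

```sql
-- NULL does not behave like a value: both comparisons yield UNKNOWN,
-- so neither query returns any row, not even the rows holding NULLs.
SELECT * FROM product WHERE item_number = NULL;
SELECT * FROM product WHERE item_number <> NULL;

-- And, in PostgreSQL, a plain UNIQUE constraint permits any number of
-- rows whose ‘item_number’ is NULL, because no two NULLs compare equal.
INSERT INTO product (sku, name, price, quantity, item_number)
VALUES ('SKU-010', 'Alpha', 1.00, 1, NULL),
       ('SKU-011', 'Beta',  2.00, 1, NULL); -- Both rows are accepted.
```

Testing for missing values therefore requires the dedicated IS NULL / IS NOT NULL predicates, one more way in which NULL marks complicate data manipulation.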
The decision
I have presented the two alternatives so that you can determine by yourself which one is more suitable to achieve your objectives.
Assuming that the Sku and ItemNumber values can eventually be duplicated
There are some points of your question that caught my attention in a particular way, so I have listed them:
Sometimes (maybe 3% to 5% of the time), the item_number is actually equal to the SKU. That is, one of my suppliers in particular affixes to their products what I suspect is not a globally unique SKU, fashioned after their item number.
[…] there may be cases where a supplier recycles a catalog number with a different sku (maybe?), or situations where two manufacturers both make a "d57-red" or something like that. In that case, I think I'd have to programmatically prefix offending item_numbers with manufacturer names or something like that.
A sku will always be unique in my domain (The small amount of non-globally unique supplier-provided SKUs are unlikely to ever collide).
Those points can have remarkable repercussions because they seem to suggest that:
- The ItemNumber values can eventually become duplicated and, when that happens, you might evaluate combining two different pieces of information that bear different meanings in the same column.
- It is probable that, after all, the Sku values might be repeated (even if only a small number of Sku instances are repeated).
In this regard, it is worth noting that two paramount objectives of a data modelling exercise are (1) determining each individual datum of significance and (2) preventing the retention of more than one of them in the same column. These factors, e.g., facilitate the delineation of a stable and versatile database structure and assist in the avoidance of duplicated information ( which helps to keep the data values consistent with the business rules, via the respective constraints ).
Alternative to handle Sku duplicates: Introducing a manufacturer table to the scenario
Consequently, on condition that the same Sku value can be shared across different Manufacturers, you could make use of a composite PK constraint in the product table, made up of (i) the manufacturer PK column and (ii) sku. E.g.:
CREATE TABLE manufacturer (
manufacturer_number INTEGER NOT NULL, -- This could be something more meaningful, e.g., ‘manufacturer_code’.
name TEXT NOT NULL,
--
CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number),
CONSTRAINT manufacturer_AK UNIQUE (name) -- AK.
);
CREATE TABLE product (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT product_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
REFERENCES manufacturer (manufacturer_number),
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
And, if the ItemNumber demands uniqueness preservation when it is applicable, then the product_item_number table can be structured as follows:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
REFERENCES product (manufacturer_number, sku)
);
Tested on PostgreSQL 11 in this db<>fiddle.
If ItemNumber does not require preventing duplicates, simply remove the UNIQUE constraint declared for that column, as shown in the next DDL statements:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL, -- In this case, ‘item_number’ does not require a UNIQUE constraint.
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
REFERENCES product (manufacturer_number, sku)
);
On the other hand, supposing that ItemNumber actually entails avoiding repeated values exclusively with regard to the associated Manufacturer, you can set up a composite UNIQUE constraint consisting of manufacturer_number and item_number, as demonstrated in the code lines below:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_AK UNIQUE (manufacturer_number, item_number), -- Composite AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku) -- Composite FK.
REFERENCES product (manufacturer_number, sku)
);
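To illustrate the effect of the composite AK ( assuming the referenced manufacturer and product rows already exist; all sample values are invented ):

```sql
-- The same ‘item_number’ may recur as long as ‘manufacturer_number’ differs.
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'SKU-001', 'D57-RED'),
       (2, 'SKU-002', 'D57-RED'); -- Accepted: different manufacturers.

-- A repeated (manufacturer_number, item_number) pair is rejected:
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'SKU-003', 'D57-RED'); -- Fails: violates product_item_number_AK.
```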
When Sku values are always unique but a specific ItemNumber value can be shared among distinct Manufacturers
If you can guarantee that Product.Sku will never contain duplicates but an ItemNumber might be used by distinct Manufacturers, you can configure your database as shown here:
CREATE TABLE manufacturer (
manufacturer_number INTEGER NOT NULL,
name TEXT NOT NULL,
--
CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number),
CONSTRAINT manufacturer_AK UNIQUE (name) -- AK.
);
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
sku TEXT NOT NULL,
manufacturer_number INTEGER NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (sku, manufacturer_number),
CONSTRAINT product_item_number_AK UNIQUE (manufacturer_number, item_number), -- In this context, ‘manufacturer_number’ and ‘item_number’ compose an AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku)
REFERENCES product (sku),
CONSTRAINT product_item_number_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
REFERENCES manufacturer (manufacturer_number)
);
Tested on PostgreSQL 11 in this db<>fiddle.
Physical-level considerations
We have not discussed the exact type and size of the product.sku column but, if it is "big" in terms of bytes, it may end up undermining the data retrieval speed of your system, due to aspects of the physical level of abstraction ( e.g., the sizes of the indexes and disk space usage ).
Accordingly, you might like to assess the incorporation of an INTEGER column, which can offer a faster response than a possibly "heavy" TEXT one ( though it all depends on the precise features of the compared columns ). It may well be a product_number that, as expected, would represent a numeric value in a sequence standing for the set of recorded products.
An expository arrangement that incorporates this new element is the one that follows:
CREATE TABLE product (
product_number INTEGER NOT NULL,
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK1 UNIQUE (product_number), -- Required so that ‘product_number’ can be the target of a FK.
CONSTRAINT product_AK2 UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
product_number INTEGER NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (product_number),
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (product_number)
REFERENCES product (product_number)
);
I highly recommend carrying out substantial testing sessions with a considerable data load in order to decide which keys are more convenient —physically speaking—, always taking into account the overall database features (the number of columns of all the tables, the types and sizes of the columns, the constraints and the underlying indexes, etc.).
Similar scenario
Your business environment of interest presents a certain resemblance to the scenario dealt with in these posts, so you might find some of the discussed points relevant.
It depends. There are cases where either design is satisfactory.
If you are doing a lot of queries only on the preferred or default entry, then the second option seems better.
If, however, you are normally doing lots of queries for all the entries, regardless of whether they are preferred/default or not, the first option is a little easier to set up and manage.
And if most of your queries want to retrieve all the entries except the preferred/default one, you'll find the first option a lot easier. I can't think of such a case off the top of my head.
And, in most cases, these tables are so tiny that performance considerations will be secondary.