forgive me if this has been asked and answered before.
I'm roughing out a schema for an inventory management system, to be implemented in PostgreSQL. All of our products and services have a sku. Most of our products come from the manufacturer or distributor with a separate "item number" (whether it be a distributor's catalog number, manufacturer's model number, whatever). However, not all of them have such a number. We have small assemblies that we make in-house that, generally, don't have item numbers. Our services don't have item numbers. For these reasons, the following CREATE TABLE makes sense to me.
Scenario A:
CREATE TABLE product (
sku text PRIMARY KEY,
name text UNIQUE NOT NULL, -- alternate key
price numeric NOT NULL CHECK (price > 0),
quantity numeric NOT NULL CHECK (quantity > 0),
item_number text -- hmmm...
);
However, I have two problems with this.
-
Sometimes (maybe 3% to 5% of the time), the item_number is actually equal to the SKU. That is, one of my suppliers in particular affixes to their products what I suspect is not a globally unique SKU, fashioned after their item number.
-
Whether equal to the SKU or not, the item_number (when existent) is in virtually every case sufficient to uniquely identify a product in the domain of my small store.
I'm worried about normalizing this to 3NF. If item_number is sometimes null, it obviously cannot be declared an alternate key. But, semantically, it is a unique identifier, where it exists, in every case I can think of. So does my above table, where every attribute is functionally dependent upon the non-prime attribute item_number whenever item_number exists, normalized? I'm thinking no, but I'm certainly not an expert. I thought of doing the following:
Scenario B
CREATE TABLE product (
sku text PRIMARY KEY REFERENCES product_item_number (sku),
name text UNIQUE NOT NULL, -- alternate key
price numeric NOT NULL CHECK (price > 0),
quantity numeric NOT NULL CHECK (quantity > 0),
);
CREATE TABLE product_item_number (
sku text PRIMARY KEY,
item_number text
);
Since it's really not a requirement that I preserve the functional dependency item_number -> price, item_number -> quantity, etc., scenario B kinda sorta seems reasonable to me. I won't have a non-prime attribute determining any other non-prime attributes.
My final idea was to simply use the sku as the item number in all cases where the item_number is otherwise non-existent, but I wonder whether that's a good practice.
Scenario C
CREATE TABLE product (
sku text PRIMARY KEY,
name text UNIQUE NOT NULL, -- alternate key
price numeric NOT NULL CHECK (price > 0),
quantity numeric NOT NULL CHECK (quantity > 0),
item_number text UNIQUE NOT NULL -- alternate key???
);
My concern with scenario C is that there may be cases where a supplier recycles a catalog number with a different sku (maybe?), or situations where two manufacturer's both make a "d57-red" or something like that. In that case, I think I'd have to programmatically prefix offending item_numbers with manufacturer names or something like that.
Of course, maybe I'm over thinking all this.
Thanks for reading.
A couple clarifications, as per MDCCL's comments:
- A sku will always be unique in my domain (The small amount of non-globablly unique supplier-provided SKUs are unlikely to ever collide).
- The item_number will be a public-facing attribute, used both by customers and sometimes myself to identify products. For example, say a customer skips my website and calls me to ask if I have xyz-white; the item_number helps remove ambiguity. The item numbers are unique in my experience (that is, there are no counter examples in my inventory), but that's not a rule, per se. I could have an item_number name space collision one day. Perhaps, if that happened, I would prefix the first three letters of the manufacturer's name to the item_number.
- item_numbers don't always exist. I suppose I could provide some sort of "surrogate item_number" for those without one, but an arbitrary item_number would be counter-productive. As explained immediately above, where an item_number exists, it should exist to help myself and my customers disambiguate between products. They might believe they're looking at the wrong product if the item_number is something I concocted myself. I'm not sure.
Best Answer
Provided that Sku and ItemNumber will always imply unique values
I consider that you found the answer already by discovering that, conceptually speaking, ItemNumber is an optional property; i.e., when you determined that it does not apply to each and every one of the occurrences —represented by logical-level rows— of the Product entity type. Therefore, the
item_number
column should not be declared as an ALTERNATE KEY (AK for brevity) in theproduct
table, as you rightly pointed out.In this respect, your Scenario B is quite reasonable, as the following conceptual-level formulation demonstrates:
In other words, there is a one to zero or one (1:0/1) cardinality ratio between Product and ItemNumber.
Then, yes, you should introduce a new table to deal with the optional column, and I agree that
product_item_number
is a very descriptive name for it. This table should havesku
constrained as its PRIMARY KEY (PK), so as to ensure that no more than one row with the samesku
value is inserted into it, just like you did.It is also important to mention that
product_item_number.sku
should as well be a constrained as a FOREIGN KEY (FK) making a reference toproduct.sku
.Here is a sample SQL-DDL logical-level design that illustrates the previous suggestions:
Tested on PostgreSQL 11 in this db<>fiddle.
Moreover, there is another conceptual formulation that guides in shaping the database design presented above:
So, where the
item_number
column should actually be declared as an AK is right there, in theproduct_item_number
table, because said column requires uniqueness protection only when the pertinent value is provided, hence the UNIQUE and NOT NULL constraints have to be configured accordingly.Missing values and the “Closed World Interpretation”
The logical SQL-DDL arrangement previously described is an example of the relational approach to handle missing values, although it is not the most popular —or usual—. This approach is related to the “Closed World Interpretation” —or “Assumption”—. Adopting this position, (a) the information recorded in the database is always deemed true, and (b) the information that is not recorded in it is, at all times, deemed false. In this way, one is exclusively retaining facts that are known.
In the present business scenario, when a user supplies all the data points that are comprised in the
product
table you have to INSERT the corresponding row and if, and only if, the user made theitem_number
datum available you also have to INSERT theproduct_item_number
counterpart. In case that theitem_number
value is unknown or it simply does not apply, you do not INSERT aproduct_item_number
row, and that is it.With this method you avoid holding NULL marks/markers in your base tables —and the logical-level consequences that I will detail in the next section—, but you should be aware that this is a “controversial” topic in the database administration ambit. On this point, you might find of value the answers for the Stack Overflow question entitled:
The popular course of action
I guess, however, that the popular —or common— proceeding would be to have a single
product
table that includes theitem_number
column which, in turn, would be set as NULLable and, at the same time, defined with a UNIQUE constraint. The way I see it, this approach would make your database and the applicable data manipulation operations less elegant (as shown, e.g., in this outstanding Stack Overflow answer), but it is a possibility.See the successive DDL statements that exemplify this course of action:
Tested on PostgreSQL 11 in this db<>fiddle.
So, having established
item_number
as a column that can contain NULLs, it is not correct to say, logically speaking, that it is an AK. Furthermore, you would be storing ambiguous NULL marks —which are not values, no matter if the PostgreSQL documentation labels them that way—, thus it can be argued that the table would not be a proper representation of an adapted mathematical relation and normalization rules cannot be applied to it.Since a NULL indicates that a column value is (1) unknown or (2) inapplicable, it cannot be rightly stated that said mark belongs to the
item_number
valid domain of values. As you know, this kind of mark tells something about the “status” of a real value, but it is not a value itself and, naturally, it does not behave as such —and, by the way, it is worth to mention that NULLs behave differently across the distinct SQL database management systems, even across distinct versions of the same database management system—.Then, if (i) the domain of values of a certain column and (ii) the meaning that said column carries is not entirely clear as a result of the inclusion of NULLs:
How could one evaluate and define the relevant functional dependencies?
How can it be identified and declared as PRIMARY or ALTERNATE KEY (as in the case of the
item_number
)?Despite both the theoretical and practical —e.g. regarding data manipulation— implications that concern to the retention of NULL marks in a database, this is the approach to handle missing data that you will find in the vast majority of the databases built on SQL platforms, since it permits attaching columns for optional values to the base tables of significance and, as an effect, eludes the creation of (a) a complementary table and (b) the associated tasks.
The decision
I have presented the two alternatives so that you can determine by yourself which one is more suitable to achieve your objectives.
Assuming that the Sku and ItemNumber values can eventually be duplicated
There are some points of your question that caught my attention in an particular way, so I listed them:
Those points can have remarkable repercussions because they seem to suggest that:
The ItemNumber values can eventually become duplicated and, when that happens, you might evaluate combining two different pieces of information that bear different meanings in the same column.
It is probable that, after all, the Sku values might be repeated (even if it is a small amount of repeated Sku instances).
In this regard, it is worth to note that two paramount objectives of a data modelling exercise are (1) determining each individual datum of significance and (2) preventing the retention of more than one of them in the same column. These factors, e.g., facilitate the delineation of a stable and versatile database structure and assist in the avoidance of duplicated information —which helps to maintain the data values consistent with the business rules, via the respective constraints—.
Alternative to handle Sku duplicates: Introducing a
manufacturer
table to the scenarioConsequently, on condition that the same Sku value can be shared across different Manufacturers, you could make use of a composite PK constraint in the
product
table, and it would be made up of (i) the manufacturer PK column and (ii)sku
. E.g.:And, if the ItemNumber demands uniqueness preservation when it is applicable, then the
product_item_number
table can be structured as follows:Tested on PostgreSQL 11 in this db<>fiddle.
In case that ItemNumber does not require preventing duplicates, you simply remove the UNIQUE constraint declared for such a column, as shown in the next DDL statements:
On the other hand, supposing that ItemNumber does actually entail avoiding repeated values exclusively with regards to the associated Manufacturer, you can set up a composite UNIQUE constraint which would consist of
manufacturer_number
anditem_number
, as demonstrated in the code lines below:When Sku values are always unique but a specific ItemNumber value can be shared among distinct Manufacturers
If you can guarantee that Product.Sku will never imply duplicates but an ItemNumber might be used by distinct Manufacturers, you can configure your database as exposed here:
Tested on PostgreSQL 11 in this db<>fiddle.
Physical-level considerations
We have not discussed the exact type and size of the
product.sku
column but, if it is “big” in terms of bytes, then it may end up undermining the data retrieval speed of your system —due to aspects of the physical level of abstraction, associated with, e.g., the sizes of the indexes and disk space usage—.In this manner, you might like to assess the incorporation of an INTEGER column which can offer a faster response than a possibly “heavy” TEXT one —but it all depends on the precise features of the compared columns—. It may well be a
product_number
that, as expected, would represent a numeric value in a sequence standing for the set of recordedproducts
.An expository arrangement that incorporates this new element is the one that follows:
I highly recommend carrying out substantial testing sessions with a considerable data load in order to decide which keys are more convenient —physically speaking—, always taking into account the overall database features (the number of columns of all the tables, the types and sizes of the columns, the constraints and the underlying indexes, etc.).
Similar scenario
You business environment of interest presents a certain resemblance to the scenario dealt with in these posts, so you might find of relevance some of the discussed points.