Postgresql – Normalizing a table with a field that generally uniquely identifies a row, but is sometimes null

database-design normalization postgresql

Forgive me if this has been asked and answered before.

I'm roughing out a schema for an inventory management system, to be implemented in PostgreSQL. All of our products and services have a sku. Most of our products come from the manufacturer or distributor with a separate "item number" (whether it be a distributor's catalog number, manufacturer's model number, whatever). However, not all of them have such a number. We have small assemblies that we make in-house that, generally, don't have item numbers. Our services don't have item numbers. For these reasons, the following CREATE TABLE makes sense to me.

Scenario A:

CREATE TABLE product (
   sku            text PRIMARY KEY,
   name           text UNIQUE NOT NULL, -- alternate key
   price          numeric NOT NULL CHECK (price > 0),
   quantity       numeric NOT NULL CHECK (quantity > 0),
   item_number   text -- hmmm...
);

However, I have two problems with this.

Sometimes (maybe 3% to 5% of the time), the item_number is actually equal to the SKU. That is, one of my suppliers in particular affixes to their products what I suspect is not a globally unique SKU, fashioned after their item number.
Whether equal to the SKU or not, the item_number (when existent) is in virtually every case sufficient to uniquely identify a product in the domain of my small store.

I'm worried about normalizing this to 3NF. If item_number is sometimes null, it obviously cannot be declared an alternate key. But, semantically, it is a unique identifier, where it exists, in every case I can think of. So is my above table, where every attribute is functionally dependent upon the non-prime attribute item_number whenever item_number exists, normalized? I'm thinking no, but I'm certainly not an expert. I thought of doing the following:

Scenario B:

CREATE TABLE product (
   sku            text PRIMARY KEY REFERENCES product_item_number (sku),
   name           text UNIQUE NOT NULL, -- alternate key
   price          numeric NOT NULL CHECK (price > 0),
   quantity       numeric NOT NULL CHECK (quantity > 0)
);

CREATE TABLE product_item_number (
   sku            text PRIMARY KEY,
   item_number    text
);

Since it's really not a requirement that I preserve the functional dependency item_number -> price, item_number -> quantity, etc., scenario B kinda sorta seems reasonable to me. I won't have a non-prime attribute determining any other non-prime attributes.

My final idea was to simply use the sku as the item number in all cases where the item_number is otherwise non-existent, but I wonder whether that's a good practice.

Scenario C:

CREATE TABLE product (
   sku            text PRIMARY KEY,
   name           text UNIQUE NOT NULL, -- alternate key
   price          numeric NOT NULL CHECK (price > 0),
   quantity       numeric NOT NULL CHECK (quantity > 0),
   item_number    text UNIQUE NOT NULL -- alternate key???
);

My concern with scenario C is that there may be cases where a supplier recycles a catalog number with a different sku (maybe?), or situations where two manufacturers both make a "d57-red" or something like that. In that case, I think I'd have to programmatically prefix offending item_numbers with manufacturer names or something like that.

Of course, maybe I'm overthinking all this.

Thanks for reading.
A couple clarifications, as per MDCCL's comments:

A sku will always be unique in my domain (the small number of non-globally unique supplier-provided SKUs are unlikely to ever collide).
The item_number will be a public-facing attribute, used both by customers and sometimes myself to identify products. For example, say a customer skips my website and calls me to ask if I have xyz-white; the item_number helps remove ambiguity. The item numbers are unique in my experience (that is, there are no counter examples in my inventory), but that's not a rule, per se. I could have an item_number name space collision one day. Perhaps, if that happened, I would prefix the first three letters of the manufacturer's name to the item_number.
item_numbers don't always exist. I suppose I could provide some sort of "surrogate item_number" for those without one, but an arbitrary item_number would be counter-productive. As explained immediately above, where an item_number exists, it should exist to help myself and my customers disambiguate between products. They might believe they're looking at the wrong product if the item_number is something I concocted myself. I'm not sure.

Best Answer

Provided that Sku and ItemNumber will always imply unique values

I consider that you found the answer already by discovering that, conceptually speaking, ItemNumber is an optional property; i.e., when you determined that it does not apply to each and every one of the occurrences —represented by logical-level rows— of the Product entity type. Therefore, the item_number column should not be declared as an ALTERNATE KEY (AK for brevity) in the product table, as you rightly pointed out.

In this respect, your Scenario B is quite reasonable, as the following conceptual-level formulation demonstrates:

A product may or may not have an item number.

In other words, there is a one to zero or one (1:0/1) cardinality ratio between Product and ItemNumber.

Then, yes, you should introduce a new table to deal with the optional column, and I agree that product_item_number is a very descriptive name for it. This table should have sku constrained as its PRIMARY KEY (PK), so as to ensure that no more than one row with the same sku value is inserted into it, just like you did.

It is also important to mention that product_item_number.sku should also be constrained as a FOREIGN KEY (FK) that references product.sku.

Here is a sample SQL-DDL logical-level design that illustrates the previous suggestions:

-- You should determine which are the most fitting 
-- data types and sizes for all your table columns 
-- depending on your business context characteristics.

-- Also, you should make accurate tests to define
-- the most convenient INDEXing strategies.

CREATE TABLE product ( 
    sku      TEXT    NOT NULL, 
    name     TEXT    NOT NULL, 
    price    NUMERIC NOT NULL, 
    quantity NUMERIC NOT NULL,
    --
    CONSTRAINT product_PK        PRIMARY KEY (sku), 
    CONSTRAINT product_AK        UNIQUE      (name), -- AK.
    CONSTRAINT valid_price_CK    CHECK       (price > 0),
    CONSTRAINT valid_quantity_CK CHECK       (quantity > 0)
); 

CREATE TABLE product_item_number ( 
    sku         TEXT NOT NULL, -- To be constrained as PK and FK to ensure the 1:0/1 correspondence ratio between the relevant rows.
    item_number TEXT NOT NULL, 
    --
    CONSTRAINT product_item_number_PK            PRIMARY KEY (sku),
    CONSTRAINT product_item_number_AK            UNIQUE      (item_number), -- In this context, ‘item_number’ is an AK. 
    CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku) 
        REFERENCES product (sku)  
);

Tested on PostgreSQL 11 in this db<>fiddle.

Moreover, there is another conceptual formulation that guides in shaping the database design presented above:

If it exists, the ItemNumber of a Product must be unique.

So, where the item_number column should actually be declared as an AK is right there, in the product_item_number table, because said column requires uniqueness protection only when the pertinent value is provided, hence the UNIQUE and NOT NULL constraints have to be configured accordingly.
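
As a rough sketch of how this two-table design might be queried, the statements below assume the product and product_item_number tables from the DDL above already exist. The literal 'xyz-white' is simply the example item number mentioned in the question, not real data.

```sql
-- Find the product a customer refers to by its item number.
SELECT p.sku, p.name, p.price, p.quantity
FROM   product AS p
JOIN   product_item_number AS pin ON pin.sku = p.sku
WHERE  pin.item_number = 'xyz-white';

-- List every product together with its item number, when it has one.
-- Products without a product_item_number row show NULL in the result set,
-- but no NULL is stored in either base table.
SELECT p.sku, p.name, pin.item_number
FROM   product AS p
LEFT JOIN product_item_number AS pin ON pin.sku = p.sku
ORDER  BY p.sku;
```

The UNIQUE constraint on item_number means the first query can return at most one row.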

Missing values and the “Closed World Interpretation”

The logical SQL-DDL arrangement previously described is an example of the relational approach to handle missing values, although it is not the most popular —or usual—. This approach is related to the “Closed World Interpretation” —or “Assumption”—. Adopting this position, (a) the information recorded in the database is always deemed true, and (b) the information that is not recorded in it is, at all times, deemed false. In this way, one is exclusively retaining facts that are known.

In the present business scenario, when a user supplies all the data points that are comprised in the product table you have to INSERT the corresponding row and if, and only if, the user made the item_number datum available you also have to INSERT the product_item_number counterpart. In case that the item_number value is unknown or it simply does not apply, you do not INSERT a product_item_number row, and that is it.
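
The insertion procedure just described could look roughly like the following, assuming the two-table design (product plus product_item_number) shown earlier. All SKUs, names, prices and quantities are made-up sample values.

```sql
BEGIN;

-- A purchased product that arrives with a supplier item number:
-- one row in each table.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-0001', 'Widget, white', 9.99, 25);

INSERT INTO product_item_number (sku, item_number)
VALUES ('SKU-0001', 'xyz-white');

-- An in-house assembly with no item number:
-- only the product row is recorded, and nothing else.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-0002', 'In-house assembly', 24.50, 3);

COMMIT;
```

Wrapping the statements in one transaction keeps a product and its item number from being recorded separately if one of the INSERTs fails.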

With this method you avoid holding NULL marks/markers in your base tables —and the logical-level consequences that I will detail in the next section—, but you should be aware that this is a “controversial” topic in the database administration ambit. On this point, you might find of value the answers for the Stack Overflow question entitled:

“How can I avoid NULLs in my database, while also representing missing data?”

The popular course of action

I guess, however, that the popular —or common— proceeding would be to have a single product table that includes the item_number column which, in turn, would be set as NULLable and, at the same time, defined with a UNIQUE constraint. The way I see it, this approach would make your database and the applicable data manipulation operations less elegant (as shown, e.g., in this outstanding Stack Overflow answer), but it is a possibility.

See the successive DDL statements that exemplify this course of action:

CREATE TABLE product ( 
    sku         TEXT    NOT NULL, 
    name        TEXT    NOT NULL, 
    price       NUMERIC NOT NULL, 
    quantity    NUMERIC NOT NULL, 
    item_number TEXT    NULL, -- Accepting NULL marks. 
    --
    CONSTRAINT product_PK        PRIMARY KEY (sku), 
    CONSTRAINT product_AK1       UNIQUE      (name), -- AK.
    CONSTRAINT product_AK2       UNIQUE      (item_number), -- Being ‘NULLable’, this is not an AK. 
    CONSTRAINT valid_price_CK    CHECK       (price > 0),
    CONSTRAINT valid_quantity_CK CHECK       (quantity > 0)
);

Tested on PostgreSQL 11 in this db<>fiddle.

So, having established item_number as a column that can contain NULLs, it is not correct to say, logically speaking, that it is an AK. Furthermore, you would be storing ambiguous NULL marks —which are not values, no matter if the PostgreSQL documentation labels them that way—, thus it can be argued that the table would not be a proper representation of an adapted mathematical relation and normalization rules cannot be applied to it.

Since a NULL indicates that a column value is (1) unknown or (2) inapplicable, it cannot be rightly stated that said mark belongs to the item_number valid domain of values. As you know, this kind of mark tells something about the “status” of a real value, but it is not a value itself and, naturally, it does not behave as such. It is also worth mentioning that NULLs behave differently across the distinct SQL database management systems, and even across distinct versions of the same database management system.

Then, if (i) the domain of values of a certain column and (ii) the meaning that said column carries are not entirely clear as a result of the inclusion of NULLs:
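
To make the point about NULLs not behaving like values concrete, here is a small sketch against the single-table design with the NULLable item_number column shown above. It assumes that table exists and is empty, that PostgreSQL runs with default settings, and that the sample rows are invented for illustration.

```sql
BEGIN;

-- Both rows are accepted: by default, PostgreSQL's UNIQUE constraint
-- treats NULLs as distinct from one another.
INSERT INTO product (sku, name, price, quantity, item_number)
VALUES ('SKU-0001', 'Service call',      50.00, 1, NULL),
       ('SKU-0002', 'In-house assembly', 24.50, 3, NULL);

-- A comparison with NULL evaluates to NULL (unknown), never to true,
-- so this query returns no rows.
SELECT sku FROM product WHERE item_number = NULL;

-- IS NULL is the predicate that finds the rows lacking an item number.
SELECT sku FROM product WHERE item_number IS NULL;

ROLLBACK;
```

The UNIQUE constraint therefore only protects the rows where item_number is actually present.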

How could one evaluate and define the relevant functional dependencies?
How can it be identified and declared as PRIMARY or ALTERNATE KEY (as in the case of the item_number)?

Despite the theoretical and practical (e.g., data manipulation) implications of retaining NULL marks in a database, this is the approach to handling missing data that you will find in the vast majority of the databases built on SQL platforms, since it permits attaching columns for optional values to the base tables of significance and thereby avoids the creation of (a) a complementary table and (b) the associated tasks.

The decision

I have presented the two alternatives so that you can determine by yourself which one is more suitable to achieve your objectives.

Assuming that the Sku and ItemNumber values can eventually be duplicated

There are some points of your question that caught my attention in a particular way, so I listed them:

Sometimes (maybe 3% to 5% of the time), the item_number is actually equal to the SKU. That is, one of my suppliers in particular affixes to their products what I suspect is not a globally unique SKU, fashioned after their item number.

[…] there may be cases where a supplier recycles a catalog number with a different sku (maybe?), or situations where two manufacturers both make a "d57-red" or something like that. In that case, I think I'd have to programmatically prefix offending item_numbers with manufacturer names or something like that.

A sku will always be unique in my domain (the small number of non-globally unique supplier-provided SKUs are unlikely to ever collide).

Those points can have remarkable repercussions because they seem to suggest that:

The ItemNumber values can eventually become duplicated and, when that happens, you might evaluate combining two different pieces of information that bear different meanings in the same column.
It is probable that, after all, the Sku values might be repeated (even if it is a small amount of repeated Sku instances).

In this regard, it is worth noting that two paramount objectives of a data modelling exercise are (1) determining each individual datum of significance and (2) preventing the retention of more than one of them in the same column. These factors, e.g., facilitate the delineation of a stable and versatile database structure and assist in the avoidance of duplicated information, which helps to keep the data values consistent with the business rules via the respective constraints.

Alternative to handle Sku duplicates: Introducing a manufacturer table to the scenario

Consequently, on condition that the same Sku value can be shared across different Manufacturers, you could make use of a composite PK constraint in the product table, and it would be made up of (i) the manufacturer PK column and (ii) sku. E.g.:

CREATE TABLE manufacturer (
    manufacturer_number INTEGER  NOT NULL, -- This could be something more meaningful, e.g., ‘manufacturer_code’.
    name                TEXT NOT NULL,
    --
    CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number), 
    CONSTRAINT manufacturer_AK UNIQUE      (name) -- AK.
);

CREATE TABLE product (
    manufacturer_number INTEGER NOT NULL, 
    sku                 TEXT    NOT NULL,
    name                TEXT    NOT NULL, 
    price               NUMERIC NOT NULL,
    quantity            NUMERIC NOT NULL,
    --
    CONSTRAINT product_PK                 PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
    CONSTRAINT product_AK                 UNIQUE      (name), -- AK.
    CONSTRAINT product_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
        REFERENCES manufacturer (manufacturer_number),
    CONSTRAINT valid_price_CK             CHECK       (price > 0),
    CONSTRAINT valid_quantity_CK          CHECK       (quantity > 0)
);

And, if the ItemNumber demands uniqueness preservation when it is applicable, then the product_item_number table can be structured as follows:

CREATE TABLE product_item_number (
    manufacturer_number INTEGER NOT NULL,  
    sku                 TEXT    NOT NULL,
    item_number         TEXT    NOT NULL,
    --
    CONSTRAINT product_item_number_PK            PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
    CONSTRAINT product_item_number_AK            UNIQUE      (item_number), -- AK.  
    CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
        REFERENCES product (manufacturer_number, sku)  
);

Tested on PostgreSQL 11 in this db<>fiddle.

In case that ItemNumber does not require preventing duplicates, you simply remove the UNIQUE constraint declared for such a column, as shown in the next DDL statements:

CREATE TABLE product_item_number (
    manufacturer_number INTEGER NOT NULL,  
    sku                 TEXT    NOT NULL,
    item_number         TEXT    NOT NULL, -- In this case, ‘item_number’ does not require a UNIQUE constraint.
    --
    CONSTRAINT product_item_number_PK            PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
    CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
        REFERENCES product (manufacturer_number, sku)  
);

On the other hand, supposing that ItemNumber does actually entail avoiding repeated values exclusively with regards to the associated Manufacturer, you can set up a composite UNIQUE constraint which would consist of manufacturer_number and item_number, as demonstrated in the code lines below:

CREATE TABLE product_item_number (
    manufacturer_number INTEGER NOT NULL,  
    sku                 TEXT    NOT NULL,
    item_number         TEXT    NOT NULL,
    --
    CONSTRAINT product_item_number_PK            PRIMARY KEY (manufacturer_number, sku),         -- Composite PK.
    CONSTRAINT product_item_number_AK            UNIQUE      (manufacturer_number, item_number), -- Composite AK.
    CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)          -- Composite FK.
        REFERENCES product (manufacturer_number, sku)  
);
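
A minimal sketch of what the composite UNIQUE constraint permits, assuming the manufacturer table, the product table with the composite PK, and the product_item_number variant directly above have all been created. The manufacturer names, SKUs and prices are hypothetical; 'd57-red' is the example from the question.

```sql
BEGIN;

INSERT INTO manufacturer (manufacturer_number, name)
VALUES (1, 'Acme'),
       (2, 'Globex');

INSERT INTO product (manufacturer_number, sku, name, price, quantity)
VALUES (1, 'A-100', 'Acme red widget',   5.00, 10),
       (2, 'G-200', 'Globex red widget', 6.00, 10);

-- Accepted: the same item_number appears under two different manufacturers,
-- so the pairs (1, 'd57-red') and (2, 'd57-red') do not collide.
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'A-100', 'd57-red'),
       (2, 'G-200', 'd57-red');

ROLLBACK;
```

A second 'd57-red' row for the same manufacturer_number would be rejected with a unique violation, which is the per-manufacturer rule this variant is meant to enforce.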

When Sku values are always unique but a specific ItemNumber value can be shared among distinct Manufacturers

If you can guarantee that Product.Sku will never imply duplicates but an ItemNumber might be used by distinct Manufacturers, you can configure your database as exposed here:

CREATE TABLE manufacturer (
    manufacturer_number INTEGER NOT NULL, 
    name                TEXT    NOT NULL,
    --
    CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number), 
    CONSTRAINT manufacturer_AK UNIQUE      (name) -- AK.
);

CREATE TABLE product ( 
    sku      TEXT    NOT NULL, 
    name     TEXT    NOT NULL, 
    price    NUMERIC NOT NULL, 
    quantity NUMERIC NOT NULL,
    --
    CONSTRAINT product_PK        PRIMARY KEY (sku), 
    CONSTRAINT product_AK        UNIQUE      (name), -- AK. 
    CONSTRAINT valid_price_CK    CHECK       (price > 0),
    CONSTRAINT valid_quantity_CK CHECK       (quantity > 0)
); 

CREATE TABLE product_item_number ( 
    sku                 TEXT    NOT NULL,
    manufacturer_number INTEGER NOT NULL,
    item_number         TEXT    NOT NULL,
    --
    CONSTRAINT product_item_number_PK                 PRIMARY KEY (sku, manufacturer_number),  
    CONSTRAINT product_item_number_AK                 UNIQUE      (manufacturer_number, item_number), -- In this context, ‘manufacturer_number’ and ‘item_number’ compose an AK. 
    CONSTRAINT product_item_number_TO_product_FK      FOREIGN KEY (sku)
        REFERENCES product (sku),  
    CONSTRAINT product_item_number_TO_manufacturer_FK FOREIGN KEY (manufacturer_number) 
        REFERENCES manufacturer (manufacturer_number)  
);

Tested on PostgreSQL 11 in this db<>fiddle.

Physical-level considerations

We have not discussed the exact type and size of the product.sku column but, if it is “big” in terms of bytes, then it may end up undermining the data retrieval speed of your system —due to aspects of the physical level of abstraction, associated with, e.g., the sizes of the indexes and disk space usage—.

In this manner, you might like to assess the incorporation of an INTEGER column which can offer a faster response than a possibly “heavy” TEXT one —but it all depends on the precise features of the compared columns—. It may well be a product_number that, as expected, would represent a numeric value in a sequence standing for the set of recorded products.

An expository arrangement that incorporates this new element is the one that follows:

CREATE TABLE product ( 
    product_number INTEGER NOT NULL,
    sku            TEXT    NOT NULL, 
    name           TEXT    NOT NULL, 
    price          NUMERIC NOT NULL, 
    quantity       NUMERIC NOT NULL,
    --
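    CONSTRAINT product_PK        PRIMARY KEY (product_number), -- Referenced by the FK in product_item_number.
    CONSTRAINT product_AK1       UNIQUE      (sku),  -- AK.
    CONSTRAINT product_AK2       UNIQUE      (name), -- AK. 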
    CONSTRAINT valid_price_CK    CHECK       (price > 0),
    CONSTRAINT valid_quantity_CK CHECK       (quantity > 0)
); 

CREATE TABLE product_item_number 
( 
    product_number INTEGER NOT NULL,
    item_number    TEXT    NOT NULL,
    --
    CONSTRAINT product_item_number_PK            PRIMARY KEY (product_number),  
    CONSTRAINT product_item_number_AK            UNIQUE      (item_number), -- AK.
    CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (product_number)
       REFERENCES product (product_number)   
);

I highly recommend carrying out substantial testing sessions with a considerable data load in order to decide which keys are more convenient —physically speaking—, always taking into account the overall database features (the number of columns of all the tables, the types and sizes of the columns, the constraints and the underlying indexes, etc.).
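
One possible way to obtain the sequential product_number values is an identity column, available in PostgreSQL 10 and later. The sketch below assumes the product table with the product_number column shown above already exists and is empty; the inserted values are invented.

```sql
-- Let PostgreSQL assign product_number from an implicit sequence.
ALTER TABLE product
    ALTER COLUMN product_number ADD GENERATED ALWAYS AS IDENTITY;

-- product_number is left out of the column list and generated automatically;
-- RETURNING hands the new number back to the caller.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-0001', 'Widget, white', 9.99, 25)
RETURNING product_number;
```

Whether this surrogate actually outperforms a TEXT sku key is something only the testing recommended above can confirm.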

Similar scenario

Your business environment of interest presents a certain resemblance to the scenario dealt with in these posts, so you might find some of the discussed points relevant.

Notes

The current version of the 9.3 major release is 9.3.6. The project recommends that ...

all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too.
It might pay to add the otherwise irrelevant price as the last item to the index to get index-only scans out of this. You'll have to test.
Since you have concurrent deletes it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you, this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
```
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
```
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.

Anomalous Updates in Normalized Database

Normalization is the formal process for removing redundancy from relations by taking projections which, when joined back, form the original relation and thus eliminate some redundancy without data loss. It is the science underlying database design. The first three normal forms, and BCNF, deal specifically with eliminating redundancy by ensuring that every non-trivial functional dependency is fully dependent only on candidate keys. Higher normal forms deal with other kinds of dependencies to further eliminate redundancies. Even when fully normalized (5NF is generally considered the "final" normal form although there are four others in the literature) redundancy can still remain as not all redundancies can be removed by taking projections.

Another tool to address eliminating redundancy is the principle of orthogonal design which states that two distinct relvars cannot have in them a tuple with the property that if it appears in the first relvar it must also appear in the second and vice versa. But this principle only addresses redundancy across relvars whereas normalization addresses redundancy within them so it doesn't help with your example.

Ultimately Date contends we just need more science to guide database design, as what we have today, as you show, isn't quite enough. One practical point to your example is that although there is redundancy, at least it can be controlled redundancy if a table is defined to hold the dancers, all key, with name and birth date. Then, name and birth date become a foreign key to the dances table, and that foreign key can be defined to cascade updates. Then, if a particular dancer's birthdate is found to be in error and corrected, the DBMS will automatically handle updating all the places in the dance table where that dancer was listed. Moving the control of the redundancy from the user to the system is a big step forward that you can get with today's SQL DBMSs.
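
The cascading foreign key described above might be sketched as follows. The original example's schema is not shown here, so the table names, column names and sample dates are all hypothetical.

```sql
CREATE TABLE dancer (
    name       TEXT NOT NULL,
    birth_date DATE NOT NULL,
    CONSTRAINT dancer_PK PRIMARY KEY (name, birth_date)  -- All-key table.
);

CREATE TABLE dance (
    dance_name        TEXT NOT NULL,
    dancer_name       TEXT NOT NULL,
    dancer_birth_date DATE NOT NULL,
    CONSTRAINT dance_PK PRIMARY KEY (dance_name, dancer_name, dancer_birth_date),
    CONSTRAINT dance_TO_dancer_FK FOREIGN KEY (dancer_name, dancer_birth_date)
        REFERENCES dancer (name, birth_date)
        ON UPDATE CASCADE  -- The DBMS propagates key corrections.
);

-- Correcting the birth date once in dancer updates every matching dance row.
UPDATE dancer
SET    birth_date = DATE '1990-05-17'
WHERE  name = 'Alice'
AND    birth_date = DATE '1990-05-07';
```

The redundancy is still present in dance, but the system, not the user, keeps the copies consistent.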

All of this information is paraphrased from Date's excellent book Database Design and Relational Theory which provides a significant amount of thinking and detail around just this issue. It is indeed the case that we stand on the shoulders of giants.
