There is always something else you should know and, almost as often, something else you should consciously stop doing. This is especially true in the context of data warehousing, a relatively young sector that leverages relatively new technologies.
Based on what I've seen in the real world, walking into a company for the first time and finding a design like yours would be genuinely tear-inducing: tears of joy and relief. From the outset, you are well on your way to what appears to be a well-thought-out ( well engineered ) ETL / data warehousing system. As with the implementation of any software product, your mileage may vary as the solution grows and is consumed by the business, but fundamentally, you are on The Right Track™ ( and yes, you know what a natural key is ).
I've found a number of challenges with these types of solutions, which I will touch upon to reinforce some of your decisions and perhaps lend some insight into the road ahead of you. First: I have often found myself in a predicament because a developer ( even a fellow database administrator / data professional ) misunderstood the context of a control column, for example running a process against the DateInserted column ( a mere timestamp of insertion ) instead of the DateReceived or similarly named column, which is intended to relate a row to a particular date of occurrence. So while I agree completely with the cautions @Aaron Bertrand raises, I feel that the prefixes for your control columns could actually be leveraged as a sort of flag to help prevent their misuse. Obvious should be obvious, of course, but much like writing code in general, explicit is preferable. That said, I would almost certainly leave such prefixes out of the indexes and such ( probably even keys; PK types can and should stay, in my opinion, but unless there's a real threat of DWD_SubCategories and DWF_SubCategories existing in the same schema, they really are just fluff ). I think the concern about the DWD and DWF prefixes is valid, but they'll be living in the [NDS] catalog and would serve to indicate intent, making it completely fine to use the nomenclature in that manner.
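To make the control-column pitfall concrete, here is a minimal sketch ( the DWF_Sales table name is invented for illustration ) of the misuse I have seen:

```sql
-- Wrong: ‘DateInserted’ is only a load timestamp, so this filters on
-- when the row happened to be loaded into the warehouse.
SELECT * FROM DWF_Sales WHERE DateInserted >= DATE '2020-01-01';

-- Right: ‘DateReceived’ relates the row to the business date it
-- actually pertains to.
SELECT * FROM DWF_Sales WHERE DateReceived >= DATE '2020-01-01';
```

Both queries run happily and return plausible-looking rows, which is exactly why the distinction is so easy to miss.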
The second ( and perhaps most infuriating ) challenge is cross-training your coworkers. All of the software engineering, usage flags and design practice rules are completely for naught if your striving-for-paycheque-over-excellence colleagues get involved and do less than their very best ( or, to be fair, are simply having a bad day ). Do keep in mind that large projects generally have many fingers in the pie, so it is imperative that those fingers behave well.
The last thing I'll touch on here is to always keep in mind the actual value of any ETL system to a business. Of the Extract, Transform and Load paradigm, the first and final letters have essentially no business value on their own, so you will want to keep the development and maintenance of the Extract and Load processes as minimal as possible. The "real" work happens in the Transform phase, so automate the E and L steps as much as you can, freeing yourself to make ( and keep ) your solution valuable to the business unit by actively working on the transforms.
All of that said, I've only had the opportunity to work on a handful of different warehousing solutions so perhaps a more knowledgeable user could step in and remove my foot from my mouth if I need correcting. As I said initially, this is one of those areas where one can always learn or unlearn something, and I am absolutely no exception.
Oh, one more thing ( and probably the most important ) - Unit Test! Once your E and L are working as intended and you've had the opportunity to put a few domains through your T solution, get somebody to vet the results. If they're good, save the result set somewhere, so that when you make changes ( and you will, without a doubt ) you can ensure you haven't broken something, somewhere else. Again, automate this process as much as you possibly can ( it's another 0-value process to the business, until they go without it at least ;) ). I generally set up a separate schema or catalog for this purpose.
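As one way to automate that regression check in SQL ( the qa schema and table names here are purely illustrative assumptions ), you can snapshot the vetted result set and diff it against the current transform output with EXCEPT:

```sql
-- A dedicated schema to hold vetted snapshots.
CREATE SCHEMA qa;

-- Snapshot the output of a transform once somebody has vetted it.
CREATE TABLE qa.dim_product_expected AS
SELECT * FROM nds.dim_product;

-- After any change to the transform, both of these queries returning
-- zero rows means the output still matches the vetted snapshot
-- (the diff must be checked in both directions).
SELECT * FROM nds.dim_product
EXCEPT
SELECT * FROM qa.dim_product_expected;

SELECT * FROM qa.dim_product_expected
EXCEPT
SELECT * FROM nds.dim_product;
```

Wrapping the two EXCEPT queries in a scheduled job that raises an alert on any returned row turns this into a cheap, automated unit test.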
Hopefully some of what I've said will be useful to you!
As an update, @Aaron Bertrand's schema separation seems like it would be quite a good way to avoid unnecessary prefixing as well, so certainly consider that ( I know I will haha ).
Provided that Sku and ItemNumber will always imply unique values
I consider that you found the answer already by discovering that, conceptually speaking, ItemNumber is an optional property; i.e., when you determined that it does not apply to each and every one of the occurrences ( represented by logical-level rows ) of the Product entity type. Therefore, the item_number column should not be declared as an ALTERNATE KEY (AK for brevity) in the product table, as you rightly pointed out.
In this respect, your Scenario B is quite reasonable, as the following conceptual-level formulation demonstrates:
- A product may or may not have an item number.
In other words, there is a one-to-zero-or-one (1:0/1) cardinality ratio between Product and ItemNumber.
Then, yes, you should introduce a new table to deal with the optional column, and I agree that product_item_number is a very descriptive name for it. This table should have sku constrained as its PRIMARY KEY (PK), so as to ensure that no more than one row with the same sku value is inserted into it, just like you did.
It is also important to mention that product_item_number.sku should also be constrained as a FOREIGN KEY (FK) making a reference to product.sku.
Here is a sample SQL-DDL logical-level design that illustrates the previous suggestions:
-- You should determine which are the most fitting
-- data types and sizes for all your table columns
-- depending on your business context characteristics.
-- Also, you should make accurate tests to define
-- the most convenient INDEXing strategies.
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
sku TEXT NOT NULL, -- To be constrained as PK and FK to ensure the 1:0/1 correspondence ratio between the relevant rows.
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (sku),
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- In this context, ‘item_number’ is an AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku)
REFERENCES product (sku)
);
Tested on PostgreSQL 11 in this db<>fiddle.
Moreover, there is another conceptual formulation that guides in shaping the database design presented above:
- If it exists, the ItemNumber of a Product must be unique.
So the place where the item_number column should actually be declared as an AK is right there, in the product_item_number table, because said column requires uniqueness protection only when the pertinent value is provided; hence the UNIQUE and NOT NULL constraints have to be configured accordingly.
Missing values and the “Closed World Interpretation”
The logical SQL-DDL arrangement previously described is an example of the relational approach to handling missing values, although it is not the most popular or usual one. This approach is related to the "Closed World Interpretation" ( or "Assumption" ). Adopting this position, (a) the information recorded in the database is always deemed true, and (b) the information that is not recorded in it is, at all times, deemed false. In this way, one exclusively retains facts that are known.
In the present business scenario, when a user supplies all the data points that are comprised in the product table, you INSERT the corresponding row; if, and only if, the user made the item_number datum available, you also INSERT the product_item_number counterpart. In case the item_number value is unknown or simply does not apply, you do not INSERT a product_item_number row, and that is it.
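Using the tables declared above, the manipulation would look like this ( the sample values are, of course, invented ):

```sql
-- A product whose item number is known: INSERT both rows.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-001', 'Widget', 9.99, 25);
INSERT INTO product_item_number (sku, item_number)
VALUES ('SKU-001', 'D57-RED');

-- A product whose item number is unknown or inapplicable:
-- INSERT only the ‘product’ row; no NULL mark is ever stored.
INSERT INTO product (sku, name, price, quantity)
VALUES ('SKU-002', 'Gadget', 4.50, 10);
```

Retrieving products together with their optional item numbers then becomes a LEFT JOIN between the two tables.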
With this method you avoid holding NULL marks ( or markers ) in your base tables, and the logical-level consequences that I will detail in the next section, but you should be aware that this is a "controversial" topic in the database administration ambit. On this point, you might find of value the answers to the Stack Overflow question on this very subject.
The popular course of action
I guess, however, that the popular ( or common ) approach would be to have a single product table that includes the item_number column which, in turn, would be set as NULLable and, at the same time, defined with a UNIQUE constraint. The way I see it, this approach would make your database and the applicable data manipulation operations less elegant (as shown, e.g., in this outstanding Stack Overflow answer), but it is a possibility.
See the successive DDL statements that exemplify this course of action:
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
item_number TEXT NULL, -- Accepting NULL marks.
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK1 UNIQUE (name), -- AK.
CONSTRAINT product_AK2 UNIQUE (item_number), -- Being ‘NULLable’, this is not an AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
Tested on PostgreSQL 11 in this db<>fiddle.
So, having established item_number as a column that can contain NULLs, it is not correct to say, logically speaking, that it is an AK. Furthermore, you would be storing ambiguous NULL marks ( which are not values, no matter that the PostgreSQL documentation labels them that way ), thus it can be argued that the table would not be a proper representation of an adapted mathematical relation, and normalization rules cannot be applied to it.
Since a NULL indicates that a column value is (1) unknown or (2) inapplicable, it cannot rightly be stated that said mark belongs to the valid domain of values of item_number. As you know, this kind of mark tells something about the "status" of a real value, but it is not a value itself and, naturally, it does not behave as such ( and, by the way, it is worth mentioning that NULLs behave differently across distinct SQL database management systems, even across distinct versions of the same system ).
Then, if the inclusion of NULLs leaves (i) the domain of values of a certain column and (ii) the meaning that said column carries not entirely clear, the trustworthiness of the table as a whole is put into question.
Despite both the theoretical and practical ( e.g., regarding data manipulation ) implications that concern the retention of NULL marks in a database, this is the approach to handling missing data that you will find in the vast majority of databases built on SQL platforms, since it permits attaching columns for optional values to the base tables of significance and, in effect, eludes the creation of (a) a complementary table and (b) the associated tasks.
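A quick demonstration of that peculiar behaviour, against the NULLable variant of the product table above ( sample values invented ):

```sql
-- NULL does not behave like a value: both comparisons yield UNKNOWN,
-- so neither query returns any row, not even the rows holding NULLs.
SELECT * FROM product WHERE item_number = NULL;
SELECT * FROM product WHERE item_number <> NULL;

-- And, in PostgreSQL, a plain UNIQUE constraint permits any number of
-- rows whose ‘item_number’ is NULL, because no two NULLs compare equal.
INSERT INTO product (sku, name, price, quantity, item_number)
VALUES ('SKU-010', 'Alpha', 1.00, 1, NULL),
       ('SKU-011', 'Beta',  2.00, 1, NULL); -- Both rows are accepted.
```

Testing for missing values therefore requires the dedicated IS NULL / IS NOT NULL predicates, one more way in which NULL marks complicate data manipulation.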
The decision
I have presented the two alternatives so that you can determine by yourself which one is more suitable to achieve your objectives.
Assuming that the Sku and ItemNumber values can eventually be duplicated
There are some points of your question that caught my attention in a particular way, so I have listed them:
Sometimes (maybe 3% to 5% of the time), the item_number is actually equal to the SKU. That is, one of my suppliers in particular affixes to their products what I suspect is not a globally unique SKU, fashioned after their item number.
[…] there may be cases where a supplier recycles a catalog number with a different sku (maybe?), or situations where two manufacturers both make a "d57-red" or something like that. In that case, I think I'd have to programmatically prefix offending item_numbers with manufacturer names or something like that.
A sku will always be unique in my domain (The small amount of non-globally unique supplier-provided SKUs are unlikely to ever collide).
Those points can have remarkable repercussions because they seem to suggest that:
- The ItemNumber values can eventually become duplicated and, when that happens, you might evaluate combining two different pieces of information that bear different meanings in the same column.
- It is probable that, after all, the Sku values might be repeated (even if only a small number of Sku instances are repeated).
In this regard, it is worth noting that two paramount objectives of a data modelling exercise are (1) determining each individual datum of significance and (2) preventing the retention of more than one of them in the same column. These factors, e.g., facilitate the delineation of a stable and versatile database structure and assist in the avoidance of duplicated information ( which helps to keep the data values consistent with the business rules, via the respective constraints ).
Alternative to handle Sku duplicates: Introducing a manufacturer table to the scenario
Consequently, on condition that the same Sku value can be shared across different Manufacturers, you could make use of a composite PK constraint in the product table, made up of (i) the manufacturer PK column and (ii) sku. E.g.:
CREATE TABLE manufacturer (
manufacturer_number INTEGER NOT NULL, -- This could be something more meaningful, e.g., ‘manufacturer_code’.
name TEXT NOT NULL,
--
CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number),
CONSTRAINT manufacturer_AK UNIQUE (name) -- AK.
);
CREATE TABLE product (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT product_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
REFERENCES manufacturer (manufacturer_number),
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
And, if the ItemNumber demands uniqueness preservation when it is applicable, then the product_item_number table can be structured as follows:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
REFERENCES product (manufacturer_number, sku)
);
Tested on PostgreSQL 11 in this db<>fiddle.
If ItemNumber does not require preventing duplicates, simply remove the UNIQUE constraint declared for that column, as shown in the next DDL statements:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL, -- In this case, ‘item_number’ does not require a UNIQUE constraint.
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku)
REFERENCES product (manufacturer_number, sku)
);
On the other hand, supposing that ItemNumber actually entails avoiding repeated values exclusively with regard to the associated Manufacturer, you can set up a composite UNIQUE constraint consisting of manufacturer_number and item_number, as demonstrated in the code lines below:
CREATE TABLE product_item_number (
manufacturer_number INTEGER NOT NULL,
sku TEXT NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (manufacturer_number, sku), -- Composite PK.
CONSTRAINT product_item_number_AK UNIQUE (manufacturer_number, item_number), -- Composite AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (manufacturer_number, sku) -- Composite FK.
REFERENCES product (manufacturer_number, sku)
);
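To illustrate the effect of the composite AK ( assuming the referenced manufacturer and product rows already exist; all sample values are invented ):

```sql
-- The same ‘item_number’ may recur as long as ‘manufacturer_number’ differs.
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'SKU-001', 'D57-RED'),
       (2, 'SKU-002', 'D57-RED'); -- Accepted: different manufacturers.

-- A repeated (manufacturer_number, item_number) pair is rejected:
INSERT INTO product_item_number (manufacturer_number, sku, item_number)
VALUES (1, 'SKU-003', 'D57-RED'); -- Fails: violates product_item_number_AK.
```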
When Sku values are always unique but a specific ItemNumber value can be shared among distinct Manufacturers
If you can guarantee that Product.Sku will never contain duplicates but an ItemNumber might be used by distinct Manufacturers, you can configure your database as shown here:
CREATE TABLE manufacturer (
manufacturer_number INTEGER NOT NULL,
name TEXT NOT NULL,
--
CONSTRAINT manufacturer_PK PRIMARY KEY (manufacturer_number),
CONSTRAINT manufacturer_AK UNIQUE (name) -- AK.
);
CREATE TABLE product (
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
sku TEXT NOT NULL,
manufacturer_number INTEGER NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (sku, manufacturer_number),
CONSTRAINT product_item_number_AK UNIQUE (manufacturer_number, item_number), -- In this context, ‘manufacturer_number’ and ‘item_number’ compose an AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (sku)
REFERENCES product (sku),
CONSTRAINT product_item_number_TO_manufacturer_FK FOREIGN KEY (manufacturer_number)
REFERENCES manufacturer (manufacturer_number)
);
Tested on PostgreSQL 11 in this db<>fiddle.
Physical-level considerations
We have not discussed the exact type and size of the product.sku column but, if it is "big" in terms of bytes, it may end up undermining the data retrieval speed of your system, due to aspects of the physical level of abstraction ( e.g., the sizes of the indexes and disk space usage ).
Accordingly, you might like to assess the incorporation of an INTEGER column, which can offer a faster response than a possibly "heavy" TEXT one ( though it all depends on the precise features of the compared columns ). It may well be a product_number that, as expected, would represent a numeric value in a sequence standing for the set of recorded products.
An expository arrangement that incorporates this new element is the one that follows:
CREATE TABLE product (
product_number INTEGER NOT NULL,
sku TEXT NOT NULL,
name TEXT NOT NULL,
price NUMERIC NOT NULL,
quantity NUMERIC NOT NULL,
--
CONSTRAINT product_PK PRIMARY KEY (sku),
CONSTRAINT product_AK1 UNIQUE (product_number), -- Required so that ‘product_number’ can be the target of a FK.
CONSTRAINT product_AK2 UNIQUE (name), -- AK.
CONSTRAINT valid_price_CK CHECK (price > 0),
CONSTRAINT valid_quantity_CK CHECK (quantity > 0)
);
CREATE TABLE product_item_number (
product_number INTEGER NOT NULL,
item_number TEXT NOT NULL,
--
CONSTRAINT product_item_number_PK PRIMARY KEY (product_number),
CONSTRAINT product_item_number_AK UNIQUE (item_number), -- AK.
CONSTRAINT product_item_number_TO_product_FK FOREIGN KEY (product_number)
REFERENCES product (product_number)
);
I highly recommend carrying out substantial testing sessions with a considerable data load in order to decide which keys are more convenient —physically speaking—, always taking into account the overall database features (the number of columns of all the tables, the types and sizes of the columns, the constraints and the underlying indexes, etc.).
Similar scenario
Your business environment of interest presents a certain resemblance to the scenario dealt with in these posts, so you might find some of the discussed points relevant.
It depends. There are cases where either design is satisfactory.
If you are doing a lot of queries only on the preferred or default entry, then the second option seems better.
If, however, you are normally doing lots of queries for all the entries, regardless of whether they are preferred/default or not, the first option is a little easier to set up and manage.
And if most of your queries want to retrieve all the entries except the preferred/default one, you'll find the first option a lot easier. I can't think of such a case off the top of my head.
And, in most cases, these tables are so tiny that performance considerations will be secondary.