Sql-server – Definition of approach to store temporarily unknown data

database-designsql server

What structure would be more reasonable to store temporarily unknown data? Initially we do not know some aspects of the data we are working with, but such aspects will be filled in at some point.

Problem description

I am developing a database for product refurbishment, in which I have to store data such as customer, model, serial number, etc. (this part is working well so far).

Said database has to handle shipment of products (incomingoutgoing). There is no problem for the outgoing part because we already have the data, the problem we are facing is with the incoming aspect, since some customers send products as a bulk or lot, and we only receive a summary of the shipment (i.e., model and quantity), other customers do send the complete information of the products.

Alternatives I have considered

So far, I have a Shipment table, and a linking table that stores the relationship between Products and Shipments, the options that I have come up with are the following:

  • Option A. Insert products in the Product table with NULL marks (except Model & Customer) and relate them to a Shipment; later, when we receive the products we fill in the columns that were marked as NULL. This option will allow the products to be treated as the rest and will add minimal changes to the system.

  • Option B. Have a separate table that stores the ProductSummaries to be received, and then when the shipment is received they are properly stored in the Product table. This option will add more tables and complexity, but I will not have to deal with NULLable columns.

Questions

  • Which of my two described approaches will be more suitable?

  • Is there any other design or structure that I can implement or follow?

Best Answer

Option A is similar to what BI developers have dealt with in terms of incomplete data. In our case it's called a late arriving dimension and accommodates the fact that we often times get incomplete data, that is nevertheless important for immediate storage and analysis.

You can implement some form of the following: Instead of leaving NULLs create a default record in related tables so you can make inserts immediately with as much information as you have and later execute minimal updates without breaking your referential integrity design. Product would have a default or unknown row, and this would be used whenever you have no idea what product is coming in. Shipment maintains an FK to Products and once you know the product you can update Shipment with a new FK to Products. You may wish to add date columns to Shipment so you have insight into when these updates occur.

If you have incomplete product information, but enough to tell you've never seen it before (e.g. UPC but no color) do not use the default/unknown row. Insert a new product as you've described using defaults in the Product table, but generate a new PK. Later when your information arrives you should only need to update the product table - since that record has already been created and referenced elsewhere, you can perform minimal updates when your information finally arrives.