PostgreSQL – Best Practices for Storing Many Iterations of a Product

change-trackingdatabase-designpostgresqlrdbmsschema

We are building a tool to track the prices of products over time, and using Postgres as our RDBMS. It is important that product attributes can be changed, and that the history of an product's attributes be preserved forever. Here is a schema we designed based on OpenStreetMap's internal schema:
Our schema so far

We have a 'products' table on the left storing every version of every product, and a 'current_products' table on the right storing only the most recent version of each product. Every time we want to change a store, we:

  1. create an entry in changesets
  2. read the latest entry of the product in 'products', increment version by one, and create another entry with the changes
  3. delete the corresponding entry in 'current_products' and create a new one with the changes and the latest version number from 'products'

We want to enforce as many business rules in the database engine as possible rather than relying on our software to keep things consistent, and this schema feels pretty "off", so we welcome any suggestions. Thanks in advance!


Edit:
Revised the schema based a response from @bbaird . Also decided to include versioning of stores and users. Tied products, stores, and users together with price table.schema_v2

Best Answer

You're right to feel the schema is off, because it is - the way it is designed now will not guarantee the minimum criteria required for consistency: As of a point in time, only one value can exist for a given attribute.

There are two ways to handle this, depending on the use case:

  1. Different versions of the attributes need to be accessed by the application
  2. Changes must be tracked for audit reasons only

Solution: Case 1

You would have a Product table and a Product_Version to store the necessary information. You will need a view/function to return the proper value.

Since you are dealing with food (and a standard source), I'm going to make certain assumptions about keys/datatypes. Feel free to comment to clarify.

CREATE TABLE Product
(
  Barcode  VARCHAR(13)  NOT NULL
  /* Store all invariant attributes in this table */
 ,CONSTRAINT PK_Product PRIMARY KEY (Barcode) /* This uniquely defines a product and is compact enough - no other key is necessary */
)
;

CREATE TABLE Product_Version
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL /* Adjust as necessary */
 ,Currency_Cd    CHAR(3)        NOT NULL /* Should reference a Currency table with ISO codes (USD, EUR, GBP, etc) */
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,CONSTRAINT FK_Product_Version_Version_Of_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
 ,CONSTRAINT PK_Product_Version PRIMARY KEY (Barcode, Change_Dtm)
 ,CONSTRAINT CK_Product_Version_Price_GT_Zero CHECK (Price > 0)
 ,CONSTRAINT CK_Product_Version_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

To get the values for a specific product as of a point in time, you would use the following query:

SELECT
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
WHERE
  PV.Barcode = '8076809513388'
    AND PV.Change_Dtm =
      (
        SELECT
          MAX(Change_Dtm)
        FROM
          Product_Version
        WHERE
          Barcode = PV.Barcode
            AND Change_Dtm <= '2020-10-29 12:30:00.000000'
      )

You can also make a view to mimic the function of a table with static values:

CREATE VIEW v_Product AS
SELECT
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
WHERE
  PV.Change_Dtm =
    (
      SELECT
        MAX(Change_Dtm)
      FROM
        Product_Version
      WHERE
        Barcode = PV.Barcode
    )

For one-to-many relationships (let's use Ingredient for this example) you would follow a pattern like so:

CREATE TABLE Product_Ingredient
(
  Barcode     VARCHAR(13)   NOT NULL
 ,Ingredient  VARCHAR(50)   NOT NULL  /* Should reference an Ingredient table */
 ,Rank        SMALLINT      NOT NULL  /* Uniqueness of this value needs to be handled through transaction logic */
 ,Change_Dtm  TIMESTAMP(6)  NOT NULL
 ,Delete_Ind  CHAR(1)       NOT NULL
 ,CONSTRAINT FK_Product_Ingredient_Used_In_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
 ,CONSTRAINT PK_Product_Ingredient PRIMARY KEY (Barcode, Change_Dtm)
 ,CONSTRAINT CK_Product_Ingredient_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

Then to get a list of Ingredients for a Product at a point in time, you would use the following query:

SELECT
  PI.Barcode
 ,PI.Ingredient
 ,PI.Rank
FROM
  Product_Ingredient PI
WHERE
  PI.Barcode = '8076809513388'
    AND PI.Change_Dtm =
      (
        SELECT
          MAX(Change_Dtm)
        FROM
          Product_Ingredient
        WHERE
          Barcode = PI.Barcode
            AND Ingredient = PI.Ingredient
            AND Change_Dtm <= '2020-10-29 12:30:00.000000' /* Or whatever */
      )
    AND PI.Delete_Ind = 'N'

Similar to the prior example, you can create a view to provide the current values for each of the one-to-many relations.

Solution: Case 2

If you merely need to store history, you simply make a small modification to the structure:

CREATE TABLE Product
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL
 ,Currency_Cd    CHAR(3)        NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,CONSTRAINT PK_Product PRIMARY KEY (Barcode)
 ,CONSTRAINT CK_Product_Price_GT_Zero CHECK (Price > 0)
 ,CONSTRAINT CK_Product_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

CREATE TABLE Product_Audit
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL
 ,Currency_Cd    CHAR(3)        NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,CONSTRAINT PK_Product_Audit PRIMARY KEY (Barcode, Change_Dtm)
)
;

In this case, whenever an update or delete is called for a Product, the following operations are followed:

  1. Insert into the audit table the current row from Product
  2. Update the Product table with the new values

Notes:

  1. What's implicit in this discussion is that new data is written only if the data changes. You can enforce this either through transaction/ETL logic, or triggers to rollback attempts to insert data that is exactly the same as the prior values. This won't effect the data returned for a given query, but it goes a long way to making sure your table sizes don't explode unnecessarily.
  2. If you have a lot of attributes, and some change frequently (such as Price), but others do not (Name, Description), you can always split things into more tables (Product_Price, Product_Name, etc.) and just create a view that incorporates all of those elements. This level of effort generally isn't necessary unless the entities have a lot of attributes or you will have a lot of ad-hoc queries that are asking time-specific questions that rely on knowing the prior value was actually different, such as "Which products increased price during this time frame?"
  3. It's crucial you do not follow the pattern of just sticking an Id on every table and thinking that provides any sort of value. Time-variant data always requires composite keys and only returns consistent results if the data is properly normalized to at least 3NF. Do not use any sort of ORM that does not support composite keys.