PostgreSQL – Best Practices for Storing Many Iterations of a Product

Tags: change-tracking, database-design, postgresql, rdbms, schema

We are building a tool to track the prices of products over time, using Postgres as our RDBMS. It is important that product attributes can be changed, and that the history of a product's attributes be preserved forever. Here is a schema we designed based on OpenStreetMap's internal schema:

We have a 'products' table on the left storing every version of every product, and a 'current_products' table on the right storing only the most recent version of each product. Every time we want to change a product, we:

1. Create an entry in 'changesets'.
2. Read the latest entry of the product in 'products', increment version by one, and create another entry with the changes.
3. Delete the corresponding entry in 'current_products' and create a new one with the changes and the latest version number from 'products'.

We want to enforce as many business rules in the database engine as possible rather than relying on our software to keep things consistent, and this schema feels pretty "off", so we welcome any suggestions. Thanks in advance!

Edit:
Revised the schema based on a response from @bbaird. Also decided to include versioning of stores and users. Tied products, stores, and users together with a price table.

Best Answer

You're right to feel the schema is off, because it is: as designed, it cannot guarantee the minimum criterion required for consistency, which is that as of any point in time, only one value can exist for a given attribute.

There are two ways to handle this, depending on the use case:

1. Different versions of the attributes need to be accessed by the application
2. Changes must be tracked for audit reasons only

Solution: Case 1

You would have a Product table and a Product_Version table to store the necessary information. You will need a view or function to return the proper value as of a given time.

Since you are dealing with food (and a standard source), I'm going to make certain assumptions about keys/datatypes. Feel free to comment to clarify.

CREATE TABLE Product
(
  Barcode  VARCHAR(13)  NOT NULL
  /* Store all invariant attributes in this table */
 ,CONSTRAINT PK_Product PRIMARY KEY (Barcode) /* This uniquely defines a product and is compact enough - no other key is necessary */
)
;

CREATE TABLE Product_Version
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL /* Adjust as necessary */
 ,Currency_Cd    CHAR(3)        NOT NULL /* Should reference a Currency table with ISO codes (USD, EUR, GBP, etc) */
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,CONSTRAINT FK_Product_Version_Version_Of_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
 ,CONSTRAINT PK_Product_Version PRIMARY KEY (Barcode, Change_Dtm)
 ,CONSTRAINT CK_Product_Version_Price_GT_Zero CHECK (Price > 0)
 ,CONSTRAINT CK_Product_Version_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;
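
As a rough illustration of how changes are recorded under this design, each change is a new row in Product_Version rather than an update of an existing one. The product name, prices, timestamps, and user ID below are made-up sample values:

```sql
/* Register the product once; the barcode never changes */
INSERT INTO Product (Barcode) VALUES ('8076809513388');

/* Initial version */
INSERT INTO Product_Version
  (Barcode, Change_Dtm, Name, Price, Currency_Cd, Delete_Ind, Change_UserId)
VALUES
  ('8076809513388', '2020-10-01 09:00:00', 'Spaghetti No. 5', 1.29, 'EUR', 'N', 'alice');

/* A price change is a new row with a later Change_Dtm; the old row is untouched */
INSERT INTO Product_Version
  (Barcode, Change_Dtm, Name, Price, Currency_Cd, Delete_Ind, Change_UserId)
VALUES
  ('8076809513388', '2020-10-20 14:15:00', 'Spaghetti No. 5', 1.49, 'EUR', 'N', 'alice');
```

The primary key on (Barcode, Change_Dtm) is what prevents two competing values for the same product at the same instant.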

To get the values for a specific product as of a point in time, you would use the following query:

SELECT
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
WHERE
  PV.Barcode = '8076809513388'
    AND PV.Change_Dtm =
      (
        SELECT
          MAX(Change_Dtm)
        FROM
          Product_Version
        WHERE
          Barcode = PV.Barcode
            AND Change_Dtm <= '2020-10-29 12:30:00.000000'
      )
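
If the application asks this question often, the query can be wrapped in a function. The following is only a sketch: the function name product_as_of and its parameter names are hypothetical, and it uses ORDER BY ... LIMIT 1 in place of the MAX subquery, which should be equivalent here because (Barcode, Change_Dtm) is the primary key.

```sql
/* Hypothetical helper: the version of one product in effect at p_as_of */
CREATE FUNCTION product_as_of(p_barcode VARCHAR, p_as_of TIMESTAMP)
RETURNS SETOF Product_Version
LANGUAGE sql
STABLE
AS $$
  SELECT PV.*
  FROM Product_Version PV
  WHERE PV.Barcode = p_barcode
    AND PV.Change_Dtm <= p_as_of
  ORDER BY PV.Change_Dtm DESC
  LIMIT 1;
$$;

/* Usage: returns no rows if the product did not exist yet at that time */
SELECT Name, Price, Currency_Cd
FROM product_as_of('8076809513388', '2020-10-29 12:30:00');
```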

You can also make a view to mimic the function of a table with static values:

CREATE VIEW v_Product AS
SELECT
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
WHERE
  PV.Change_Dtm =
    (
      SELECT
        MAX(Change_Dtm)
      FROM
        Product_Version
      WHERE
        Barcode = PV.Barcode
    )
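
In PostgreSQL specifically, the same "latest row per product" result can also be expressed with DISTINCT ON. This is an alternative sketch, not part of the design above, and the view name v_Product_Latest is hypothetical:

```sql
CREATE VIEW v_Product_Latest AS
SELECT DISTINCT ON (PV.Barcode)
  PV.Barcode
 ,PV.Name
 ,PV.Price
 ,PV.Currency_Cd
FROM
  Product_Version PV
ORDER BY
  PV.Barcode
 ,PV.Change_Dtm DESC  /* Newest version first, so DISTINCT ON keeps it */
;
```

DISTINCT ON is a PostgreSQL extension, so the MAX subquery form is the more portable of the two.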

For one-to-many relationships (let's use Ingredient for this example) you would follow a pattern like so:

CREATE TABLE Product_Ingredient
(
  Barcode     VARCHAR(13)   NOT NULL
 ,Ingredient  VARCHAR(50)   NOT NULL  /* Should reference an Ingredient table */
 ,Rank        SMALLINT      NOT NULL  /* Uniqueness of this value needs to be handled through transaction logic */
 ,Change_Dtm  TIMESTAMP(6)  NOT NULL
 ,Delete_Ind  CHAR(1)       NOT NULL
 ,CONSTRAINT FK_Product_Ingredient_Used_In_Product FOREIGN KEY (Barcode) REFERENCES Product (Barcode)
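 ,CONSTRAINT PK_Product_Ingredient PRIMARY KEY (Barcode, Ingredient, Change_Dtm) /* Ingredient must be part of the key, otherwise a product could hold only one ingredient per timestamp */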
 ,CONSTRAINT CK_Product_Ingredient_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

Then to get a list of Ingredients for a Product at a point in time, you would use the following query:

SELECT
  PI.Barcode
 ,PI.Ingredient
 ,PI.Rank
FROM
  Product_Ingredient PI
WHERE
  PI.Barcode = '8076809513388'
    AND PI.Change_Dtm =
      (
        SELECT
          MAX(Change_Dtm)
        FROM
          Product_Ingredient
        WHERE
          Barcode = PI.Barcode
            AND Ingredient = PI.Ingredient
            AND Change_Dtm <= '2020-10-29 12:30:00.000000' /* Or whatever */
      )
    AND PI.Delete_Ind = 'N'

Similar to the prior example, you can create a view to provide the current values for each of the one-to-many relations.
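
A sketch of such a view for Product_Ingredient, assuming the hypothetical name v_Product_Ingredient. It is the query above without the point-in-time filter:

```sql
CREATE VIEW v_Product_Ingredient AS
SELECT
  PI.Barcode
 ,PI.Ingredient
 ,PI.Rank
FROM
  Product_Ingredient PI
WHERE
  PI.Change_Dtm =
    (
      SELECT
        MAX(Change_Dtm)
      FROM
        Product_Ingredient
      WHERE
        Barcode = PI.Barcode
          AND Ingredient = PI.Ingredient
    )
  AND PI.Delete_Ind = 'N'  /* Hide ingredients whose latest row marks them as removed */
;
```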

Solution: Case 2

If you merely need to store history, you simply make a small modification to the structure:

CREATE TABLE Product
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL
 ,Currency_Cd    CHAR(3)        NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,CONSTRAINT PK_Product PRIMARY KEY (Barcode)
 ,CONSTRAINT CK_Product_Price_GT_Zero CHECK (Price > 0)
 ,CONSTRAINT CK_Product_Delete_Ind_IsValid CHECK (Delete_Ind IN ('Y','N'))
)
;

CREATE TABLE Product_Audit
(
  Barcode        VARCHAR(13)    NOT NULL
 ,Name           VARCHAR(50)    NOT NULL
 ,Price          DECIMAL(8,2)   NOT NULL
 ,Currency_Cd    CHAR(3)        NOT NULL
 ,Change_Dtm     TIMESTAMP(6)   NOT NULL
 ,Change_UserId  VARCHAR(32)    NOT NULL
 ,Delete_Ind     CHAR(1)        NOT NULL
 ,CONSTRAINT PK_Product_Audit PRIMARY KEY (Barcode, Change_Dtm)
)
;

In this case, whenever an update or delete is called for a Product, the following operations are performed in the same transaction:

1. Insert into the audit table the current row from Product
2. Update the Product table with the new values
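
One way to have the database perform these two steps itself is a row-level trigger. The sketch below assumes the Product and Product_Audit tables defined above; the function and trigger names are hypothetical. It also assumes every update sets a new Change_Dtm, since otherwise a second update would collide with the audit table's primary key.

```sql
/* Hypothetical trigger function: copy the outgoing row into Product_Audit */
CREATE FUNCTION product_audit_trg() RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
  INSERT INTO Product_Audit
    (Barcode, Name, Price, Currency_Cd, Change_Dtm, Change_UserId, Delete_Ind)
  VALUES
    (OLD.Barcode, OLD.Name, OLD.Price, OLD.Currency_Cd,
     OLD.Change_Dtm, OLD.Change_UserId, OLD.Delete_Ind);

  IF TG_OP = 'DELETE' THEN
    RETURN OLD;  /* Let the delete proceed */
  END IF;
  RETURN NEW;    /* Let the update proceed with the new values */
END;
$$;

/* EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE */
CREATE TRIGGER trg_product_audit
BEFORE UPDATE OR DELETE ON Product
FOR EACH ROW
EXECUTE FUNCTION product_audit_trg();
```

Because the trigger runs inside the statement's transaction, the audit row and the change to Product either both commit or both roll back.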

Notes:

What's implicit in this discussion is that new data is written only if the data changes. You can enforce this either through transaction/ETL logic, or through triggers that roll back attempts to insert data that is exactly the same as the prior values. This won't affect the data returned for a given query, but it goes a long way toward making sure your table sizes don't explode unnecessarily.
If you have a lot of attributes, and some change frequently (such as Price), but others do not (Name, Description), you can always split things into more tables (Product_Price, Product_Name, etc.) and just create a view that incorporates all of those elements. This level of effort generally isn't necessary unless the entities have a lot of attributes or you will have a lot of ad-hoc queries that are asking time-specific questions that rely on knowing the prior value was actually different, such as "Which products increased price during this time frame?"
It's crucial you do not follow the pattern of just sticking an Id on every table and thinking that provides any sort of value. Time-variant data always requires composite keys and only returns consistent results if the data is properly normalized to at least 3NF. Do not use any sort of ORM that does not support composite keys.
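
As a minimal sketch of the "write only if something changed" rule from the first note, applied to the Product_Version table from Case 1: the trigger below rejects an insert whose attributes match the most recent prior version. The function and trigger names are hypothetical, and which columns count as a change is an assumption (Change_UserId is deliberately ignored).

```sql
CREATE FUNCTION product_version_reject_noop() RETURNS trigger
LANGUAGE plpgsql
AS $$
DECLARE
  prev_row Product_Version%ROWTYPE;
BEGIN
  /* Most recent version recorded before the incoming row */
  SELECT * INTO prev_row
  FROM Product_Version
  WHERE Barcode = NEW.Barcode
    AND Change_Dtm < NEW.Change_Dtm
  ORDER BY Change_Dtm DESC
  LIMIT 1;

  /* All compared columns are NOT NULL, so plain equality is safe */
  IF FOUND
     AND prev_row.Name        = NEW.Name
     AND prev_row.Price       = NEW.Price
     AND prev_row.Currency_Cd = NEW.Currency_Cd
     AND prev_row.Delete_Ind  = NEW.Delete_Ind
  THEN
    RAISE EXCEPTION 'No attribute changed for product %', NEW.Barcode;
  END IF;

  RETURN NEW;
END;
$$;

/* EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE */
CREATE TRIGGER trg_product_version_reject_noop
BEFORE INSERT ON Product_Version
FOR EACH ROW
EXECUTE FUNCTION product_version_reject_noop();
```

Raising an exception aborts the surrounding transaction. Returning NULL from the trigger instead would silently skip the row, which may suit bulk loads better.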

Related Solutions

Sql-server – Is adding rows better then adding columns in case their value is not consistent

First, it's about normalization. Today you have max 30 special products, tomorrow (or after 10 years for that matter) you might need another extra 5. You will need to write more code, but you will have greater flexibility after all. At least you won't need to code (or design) 30 columns! You can have a schema like this:

CREATE TABLE Product (ProductId...)
CREATE TABLE Customer (CustomerId...)
CREATE TABLE CustomerProductPrice (ProductId, CustomerId, Price...)

In the third table you can store the price for each product/customer pair, but remember to create a unique constraint on the foreign key columns (ProductId, CustomerId) so that each pair has at most one price.
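
Filled out, the third table might look like the sketch below. The column types and constraint names are assumptions, and the Product and Customer tables are reduced to their keys for brevity:

```sql
CREATE TABLE Product
(
  ProductId   INTEGER  NOT NULL
 ,CONSTRAINT PK_Product PRIMARY KEY (ProductId)
)
;

CREATE TABLE Customer
(
  CustomerId  INTEGER  NOT NULL
 ,CONSTRAINT PK_Customer PRIMARY KEY (CustomerId)
)
;

CREATE TABLE CustomerProductPrice
(
  ProductId   INTEGER       NOT NULL
 ,CustomerId  INTEGER       NOT NULL
 ,Price       DECIMAL(8,2)  NOT NULL
 ,CONSTRAINT FK_CustomerProductPrice_Product FOREIGN KEY (ProductId) REFERENCES Product (ProductId)
 ,CONSTRAINT FK_CustomerProductPrice_Customer FOREIGN KEY (CustomerId) REFERENCES Customer (CustomerId)
 ,CONSTRAINT UQ_CustomerProductPrice UNIQUE (ProductId, CustomerId)  /* One price per product/customer pair */
)
;
```

A new special product then becomes a new row rather than a new column.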

Postgresql – Storing data in PostgreSQL: One table or two

If you mix the history data in with the current like that in order to speed up queries over a time period, you do so at the expense of slowing down queries for current data. You can add an extra column to explicitly mark the relevant rows as the current prices (and have id+flag as the PK) but that adds extra work to your business logic to both keep it maintained and respect it in all reports.
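
If you do go the flag route in PostgreSQL, a partial unique index is one way to have the database guarantee that only one row per product is marked current. Note that this is a different technique from putting id+flag in the primary key, and the table and column names below are hypothetical:

```sql
/* Hypothetical table mixing current and historical prices */
CREATE TABLE Product_Price
(
  Product_Id  INTEGER       NOT NULL
 ,Valid_From  TIMESTAMP(6)  NOT NULL
 ,Price       DECIMAL(8,2)  NOT NULL
 ,Is_Current  BOOLEAN       NOT NULL DEFAULT FALSE
 ,CONSTRAINT PK_Product_Price PRIMARY KEY (Product_Id, Valid_From)
)
;

/* At most one row per product may be flagged as current */
CREATE UNIQUE INDEX UX_Product_Price_Current
  ON Product_Price (Product_Id)
  WHERE Is_Current
;

/* Current prices only; the partial index covers just these rows */
SELECT Product_Id, Price
FROM Product_Price
WHERE Is_Current
;
```

The business logic still has to clear the old flag before setting the new one, but a mistake now fails loudly instead of producing two "current" prices.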

Of course, if you move the price completely out of the main table, you have a similar problem: finding the latest price becomes more expensive unless you have a "latest" flag, or de-normalise slightly and keep a copy of the current price in the main table as well as in the price history table. Personally I would do the latter, and use a trigger on the product table to automatically update the price history table when a new product is added or a price is updated. I'm assuming the product table does not see massive write activity most of the time, so the performance impact of the trigger should be minimal. This removes the auditing task from your other logic and avoids bugs caused by new code that forgets to update the history/audit table.
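
A minimal PostgreSQL sketch of that arrangement follows. All table, column, function, and trigger names are hypothetical, and as with the rest of this answer it has not been benchmarked:

```sql
CREATE TABLE Item
(
  Item_Id  INTEGER       NOT NULL
 ,Name     VARCHAR(50)   NOT NULL
 ,Price    DECIMAL(8,2)  NOT NULL  /* De-normalised copy of the current price */
 ,CONSTRAINT PK_Item PRIMARY KEY (Item_Id)
)
;

CREATE TABLE Item_Price_History
(
  Item_Id     INTEGER       NOT NULL
 ,Change_Dtm  TIMESTAMP(6)  NOT NULL
 ,Price       DECIMAL(8,2)  NOT NULL
 ,CONSTRAINT PK_Item_Price_History PRIMARY KEY (Item_Id, Change_Dtm)
 ,CONSTRAINT FK_Item_Price_History_Item FOREIGN KEY (Item_Id) REFERENCES Item (Item_Id)
)
;

CREATE FUNCTION item_price_history_trg() RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
  IF TG_OP = 'UPDATE' THEN
    IF NEW.Price = OLD.Price THEN
      RETURN NULL;  /* Price unchanged: nothing to record */
    END IF;
  END IF;

  /* clock_timestamp() advances within a transaction, unlike CURRENT_TIMESTAMP */
  INSERT INTO Item_Price_History (Item_Id, Change_Dtm, Price)
  VALUES (NEW.Item_Id, clock_timestamp(), NEW.Price);

  RETURN NULL;  /* Return value is ignored for AFTER row triggers */
END;
$$;

/* EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE */
CREATE TRIGGER trg_item_price_history
AFTER INSERT OR UPDATE OF Price ON Item
FOR EACH ROW
EXECUTE FUNCTION item_price_history_trg();
```

With this in place, application code only ever writes to Item, and the history table is maintained as a side effect.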

Caveat: I'm answering this from a general PoV as I've not used postgres much in anger, so do some benchmarks before taking anything I've said regarding performance as fact in that system!