Storing Procedure Fields That Were Used

Tags: best-practices, database-design, denormalization, normalization

I am creating a database for production work (specifically lab testing).

Most Work is for production and is therefore performed strictly according to the Procedure for that Product. By itself, this is easy to model: the Work references the Procedure, because the Procedure defines how the work is done:

Example Schema
Work:      Work_id, Procedure_id, {other non-relevant fields}
Procedure: Procedure_id, Product_id, Machine_id, Material_id, RunMinutes

Two exceptions (overrides and special testing) add much complexity to the design.

Question: Given the two exceptions below, how should I store the Procedure fields that were actually used for each Work?

Exception – Overrides:

Sometimes the required equipment or components are not available. In these cases, the manager can approve a one-time override for equivalent substitutes. Examples:

Machine X was broken. Perform the Work by hand.
We ran out of Material Y. Use Material Z instead.
Keep runtime at 45 minutes

The database must capture how the Work was actually performed.

I see three possible options:

Option 1: Store Locally: The Work references the original Procedure. Each Work also locally stores the Procedure fields used, including any modifications. This creates many duplicates, but you have a local "snapshot" for each Work.

Example Schema
Work_id | Procedure_id    | Machine | Material | RunMinutes
1       | 1               | By-Hand | Z        | 45  

Procedure_id | Product_id | Machine | Material | RunMinutes
1            | 1          | X       | Y        | 45

Option 2: Single Use Procedure: The original Procedure is copied to a new Procedure, marked inactive, and modified with the overrides. The Work then references the new Procedure. This maintains the Work.Procedure_id for how the Work was performed.

Example Schema
Work_id | Procedure_id
1       | 2

Procedure_id | Product_id| Active| Machine | Material | RunMinutes
1            | 1         | Y     | X       | Y        | 45
2            | 1         | N     | By-Hand | Z        | 45
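
A minimal sketch of Option 2, assuming SQLite (through Python's sqlite3 module) as a stand-in for the real database. The helper name create_single_use_procedure is hypothetical, and the column names follow the example schema above. "Procedure" is quoted because it is a reserved word in some engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Procedure" (
    Procedure_id INTEGER PRIMARY KEY,
    Product_id   INTEGER NOT NULL,
    Active       TEXT    NOT NULL DEFAULT 'Y',
    Machine      TEXT,
    Material     TEXT,
    RunMinutes   INTEGER
);
CREATE TABLE Work (
    Work_id      INTEGER PRIMARY KEY,
    Procedure_id INTEGER NOT NULL REFERENCES "Procedure"(Procedure_id)
);
INSERT INTO "Procedure" VALUES (1, 1, 'Y', 'X', 'Y', 45);
""")

COPIED_COLUMNS = ("Product_id", "Machine", "Material", "RunMinutes")
OVERRIDABLE = {"Machine", "Material", "RunMinutes"}


def create_single_use_procedure(conn, procedure_id, **overrides):
    """Copy a Procedure as an inactive row, apply overrides, return the new id."""
    unknown = set(overrides) - OVERRIDABLE
    if unknown:
        raise ValueError(f"cannot override: {sorted(unknown)}")
    row = conn.execute(
        'SELECT Product_id, Machine, Material, RunMinutes '
        'FROM "Procedure" WHERE Procedure_id = ?',
        (procedure_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no Procedure with id {procedure_id}")
    values = dict(zip(COPIED_COLUMNS, row))
    values.update(overrides)  # fields without an override keep the original value
    cur = conn.execute(
        'INSERT INTO "Procedure" (Product_id, Active, Machine, Material, RunMinutes) '
        "VALUES (?, 'N', ?, ?, ?)",
        tuple(values[c] for c in COPIED_COLUMNS),
    )
    return cur.lastrowid


# Machine X is broken and Material Y ran out; runtime stays at 45 minutes.
new_id = create_single_use_procedure(conn, 1, Machine="By-Hand", Material="Z")
conn.execute("INSERT INTO Work (Work_id, Procedure_id) VALUES (1, ?)", (new_id,))
copied = conn.execute(
    'SELECT Active, Machine, Material, RunMinutes FROM "Procedure" WHERE Procedure_id = ?',
    (new_id,),
).fetchone()
```

The original row stays untouched, so other Work that references Procedure 1 is unaffected by the one-time override.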

Option 3: Store as Overrides: The Work points to the Procedure and optionally points to a ProcedureOverride table. For each field in Procedure, if there is an override, use it; otherwise, use the Procedure value.

Example Schema 
Work_id| Procedure_id| Override_id
1      | 1           | 1

Procedure_id| Product_id| Machine | Material | RunMinutes
1           | 1         | X       | Y        | 45

Override_id             | Machine | Material | RunMinutes
1                       | By-Hand | Z        | NULL  

Query: ActualWork
Work_id   | Procedure_id | Machine | Material | RunMinutes
1         | 1            | By-Hand | Z        | 45
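
The ActualWork query for Option 3 could be sketched with COALESCE, which returns the override value when one exists and falls back to the Procedure value otherwise. This is only an illustration, assuming SQLite via Python's sqlite3 module and the table name ProcedureOverride mentioned above. A second Work row without an override is added to show the fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Procedure" (
    Procedure_id INTEGER PRIMARY KEY,
    Product_id   INTEGER,
    Machine      TEXT,
    Material     TEXT,
    RunMinutes   INTEGER
);
CREATE TABLE ProcedureOverride (
    Override_id INTEGER PRIMARY KEY,
    Machine     TEXT,     -- NULL means "no override for this field"
    Material    TEXT,
    RunMinutes  INTEGER
);
CREATE TABLE Work (
    Work_id      INTEGER PRIMARY KEY,
    Procedure_id INTEGER NOT NULL REFERENCES "Procedure"(Procedure_id),
    Override_id  INTEGER REFERENCES ProcedureOverride(Override_id)
);
INSERT INTO "Procedure" VALUES (1, 1, 'X', 'Y', 45);
INSERT INTO ProcedureOverride VALUES (1, 'By-Hand', 'Z', NULL);
INSERT INTO Work VALUES (1, 1, 1), (2, 1, NULL);
""")

actual_work = conn.execute("""
    SELECT w.Work_id,
           w.Procedure_id,
           COALESCE(o.Machine,    p.Machine)    AS Machine,
           COALESCE(o.Material,   p.Material)   AS Material,
           COALESCE(o.RunMinutes, p.RunMinutes) AS RunMinutes
    FROM Work w
    JOIN "Procedure" p ON p.Procedure_id = w.Procedure_id
    LEFT JOIN ProcedureOverride o ON o.Override_id = w.Override_id
    ORDER BY w.Work_id
""").fetchall()

for row in actual_work:
    print(row)
```

One limitation of this layout is that NULL cannot be used as an override value, since NULL already means "use the Procedure value".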

Exception – Special Testing:

For non-standard work (such as research and development), there is no specific Procedure. Again, the database must capture how the Work was actually performed.

I see two options (equivalent to the respective options above):

Option 1: Store Locally: Each Work locally stores all Procedure fields used. The user must input values for each field.

Option 2: Single Use Procedure: A new Procedure is created, marked inactive, and populated by the user. The Work then references the new Procedure. This maintains the Work.Procedure_id for how the Work was performed.

Keep in mind however, there is no actual (real world) Procedure for the non-standard Work.

Best Answer

I'm assuming you could have multiple exceptions for a given work/procedure pair, with each exception potentially modifying one or more different procedure fields.

While you could certainly store each modification in its own row (an 'override'), any queries attempting to get the 'current' settings, or a historical view, are going to get complicated (especially since the current design, as presented, doesn't appear to have any means of showing the order in which changes were applied).

Based solely on the info provided, I would probably opt for a typical history/audit solution:

maintain the 1-1 relationship between Work and Procedure
when an exception is required you update Procedure with the new value(s)
a trigger on Procedure will write the old record to an audit/history table (eg, Procedure_hist) with the same columns plus one additional column to designate an ordering (eg, seq_no, modification_datetime, etc)
when you want to see the current Work/Procedure config you join Work and Procedure, eg:
```
select ...
from   Work w
join   Procedure p
on     w.Procedure_id = p.Procedure_id
```

if you need to see a historical/audit view then you can pull in the Procedure_hist table, perhaps as an outer join, eg:

```
select ...
from   Work w
join   Procedure p
on     w.Procedure_id = p.Procedure_id
left join Procedure_hist h
on     w.Procedure_id = h.Procedure_id
```

This is a slight variation on your Option #2. Yes, it means some duplication of data but it also allows for easier coding (updating, retrieving), which in turn will likely make it a bit easier to maintain down the road (especially if someone comes along behind you to maintain the system). [The K.I.S.S. principle comes to mind.]
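
The history/audit approach described above could be sketched as follows. This assumes SQLite (via Python's sqlite3 module) purely for illustration; trigger syntax differs between engines, and the trigger name trg_procedure_hist and the column modified_at are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Procedure" (
    Procedure_id INTEGER PRIMARY KEY,
    Product_id   INTEGER,
    Machine      TEXT,
    Material     TEXT,
    RunMinutes   INTEGER
);
-- Same columns as Procedure, plus seq_no/modified_at to order the changes.
CREATE TABLE Procedure_hist (
    seq_no       INTEGER PRIMARY KEY AUTOINCREMENT,
    Procedure_id INTEGER,
    Product_id   INTEGER,
    Machine      TEXT,
    Material     TEXT,
    RunMinutes   INTEGER,
    modified_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Before-image of every updated row is copied into the history table.
CREATE TRIGGER trg_procedure_hist
AFTER UPDATE ON "Procedure"
FOR EACH ROW
BEGIN
    INSERT INTO Procedure_hist (Procedure_id, Product_id, Machine, Material, RunMinutes)
    VALUES (OLD.Procedure_id, OLD.Product_id, OLD.Machine, OLD.Material, OLD.RunMinutes);
END;
INSERT INTO "Procedure" VALUES (1, 1, 'X', 'Y', 45);
""")

# Apply an exception: the old values land in Procedure_hist automatically.
conn.execute(
    'UPDATE "Procedure" SET Machine = ?, Material = ? WHERE Procedure_id = ?',
    ("By-Hand", "Z", 1),
)

current = conn.execute(
    'SELECT Machine, Material, RunMinutes FROM "Procedure" WHERE Procedure_id = 1'
).fetchone()
history = conn.execute(
    "SELECT seq_no, Machine, Material, RunMinutes FROM Procedure_hist ORDER BY seq_no"
).fetchall()
```

After the update, the current row holds the override values and the history table holds the original row with seq_no 1.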

Without knowing more about the use cases, upstream/downstream requirements, etc ... I'd probably want to see if it made sense to maintain the special testing with the same Work-Procedure-Procedure_hist relationship.

You could add a flag to designate if the Work (or Procedure?) is a standard or special case.

Other considerations that could affect the model ... is Work-Procedure a strictly one-to-one relationship or could there be a many-to-one/many-to-many relationship, eg:

could a single Procedure be (re)used by different Work efforts?
could a single Work effort be broken into multiple Procedures?

Related Solutions

SQL Server – Best Way to Store Immutable Read-Only Data for Logging

You could treat your reference tables as slowly changing dimensions and perform type 2 maintenance, i.e., add a new reference row when any reference data changes. Over time you may choose to horizontally partition your reference tables into "active" and "archive" parts, depending on the data churn you experience. This can be achieved using SQL Server's built-in partitioning functionality or a roll-your-own approach with two separate tables. Your needs at the time will dictate which.
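
As a rough sketch of type 2 maintenance, assuming SQLite via Python's sqlite3 module rather than SQL Server: every name here (MachineRef, TaskLog, ValidFrom, ValidTo, change_reference) is hypothetical. Each change closes the current version and inserts a new one, and log rows keep pointing at the version that was current when they were written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MachineRef (
    MachineSK   INTEGER PRIMARY KEY,  -- surrogate key, one per version
    MachineNK   TEXT NOT NULL,        -- natural/business key
    Description TEXT NOT NULL,
    ValidFrom   TEXT NOT NULL,
    ValidTo     TEXT                  -- NULL marks the current version
);
CREATE TABLE TaskLog (
    TaskLogId INTEGER PRIMARY KEY,
    MachineSK INTEGER NOT NULL REFERENCES MachineRef(MachineSK)
);
""")


def change_reference(conn, machine_nk, description, as_of):
    """Close the current version (if any) and add a new one; return its key."""
    conn.execute(
        "UPDATE MachineRef SET ValidTo = ? WHERE MachineNK = ? AND ValidTo IS NULL",
        (as_of, machine_nk),
    )
    cur = conn.execute(
        "INSERT INTO MachineRef (MachineNK, Description, ValidFrom) VALUES (?, ?, ?)",
        (machine_nk, description, as_of),
    )
    return cur.lastrowid


sk_v1 = change_reference(conn, "M-1", "Press, 10 ton", "2015-01-01")
conn.execute("INSERT INTO TaskLog (MachineSK) VALUES (?)", (sk_v1,))
sk_v2 = change_reference(conn, "M-1", "Press, 12 ton", "2015-06-01")

# The log row still resolves to the description that applied at the time.
logged = conn.execute("""
    SELECT m.Description
    FROM TaskLog t
    JOIN MachineRef m ON m.MachineSK = t.MachineSK
""").fetchone()
current_versions = conn.execute(
    "SELECT COUNT(*) FROM MachineRef WHERE MachineNK = 'M-1' AND ValidTo IS NULL"
).fetchone()[0]
```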

There is no good reason why your active OLTP table and your logging table should look the same. They perform different roles and have different read and write requirements. If the logging table needs to be wide and sparse then that's what it needs to be. Create the objects to solve the problem you have.

And one last suggestion, which is definitely from the "clutching at straws" bucket: define TASK_DONE (id int, reference_values xml). Extract your pertinent values at the point in time and save them away. This allows for changes in the reference data schema without having to bring existing log records up to the new schema. That will make historical searching more complicated, of course.

Sql-server – Efficiently storing sets of key-value pairs with wildly different keys

So what I am looking for is a way to store a large number of activities that have almost no fields in common in a way that makes reporting easier.

Not enough rep to comment first, so here we go!

If the primary purpose is reporting and you have a DW (even if it isn't star schema) I'd recommend attempting to get this into a star schema. The benefits are fast, simple queries. The downside is ETL, but you're already considering moving the data to a new design and ETL to star schema is likely simpler to build and maintain than an XML wrapper solution (and SSIS is included in your SQL Server licensing). Plus it starts the process of a recognized reporting/analytics design.

So how to do that... It sounds like you have what is known as a Factless Fact. This is an intersection of attributes that define an event with no associated measure (such as a sales price). You have dates available for some or all of your activities? Likely you should really have an intersection of an Activity, Site, and Date(s).

DimActivity - I'm guessing there is a pattern, something that can allow you to break these down into at least relatively shared columns. If so, you may have three? five? dimensions for classes of activities. At worst you have a couple consistent columns, such as activity name, you can filter on, and you leave general headings such as "Attribute1" etc. for the remaining random details.

You don't need everything in the dimension - there (likely) shouldn't be any dates in the Activity dimension - they should all be in the fact, as Surrogate Key references to the Date dimension. As an example, a Date that would stay in a person dimension would be a date of birth because it's an attribute of a person. A hospital visit date would reside in a fact, as it is a point in time event associated with a person, among other things, but it is not an attribute of the person visiting the hospital. More date discussion in the fact.

DimSite - seems straightforward, so we'll describe Surrogate Keys here. Essentially this is just an incrementing, unique ID. An integer Identity column is common. This allows separation of DW and source systems and ensures optimal joins in the data warehouse. Your Natural Key or Business Key is usually kept, but for maintenance/design, not for analysis and joins. Example schema:

```
CREATE TABLE [DIM].[Site]
(
 SiteSK INT NOT NULL IDENTITY PRIMARY KEY
,SiteNK INT NOT NULL --source system key
,SiteName VARCHAR(500) NOT NULL
)
```

DimDate - date attributes. Make a "smart key" instead of an Identity. This means you can type a meaningful integer that relates to a date for queries such as WHERE DateSK = 20150708. There are lots of free scripts to load DimDate and most have this smart key included. (one option)
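
For illustration, the smart key is just the date written as a yyyymmdd integer. A small sketch in Python (the function names are made up for this example) shows the conversion in both directions:

```python
from datetime import date


def date_to_smart_key(d: date) -> int:
    """Encode a date as a yyyymmdd integer, e.g. for a DateSK column."""
    return d.year * 10000 + d.month * 100 + d.day


def smart_key_to_date(key: int) -> date:
    """Decode a yyyymmdd integer back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


key = date_to_smart_key(date(2015, 7, 8))
print(key)  # 20150708
```

Because the digits run from most to least significant, these keys sort in the same order as the dates they represent.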

DimEmployee - your XML included this. If it is more general, change it to DimPerson and fill it with relevant person attributes as they are available and pertinent to reporting.

And your fact is:

FactActivitySite
DimSiteSK - FK to DimSite
DimActivitySK - FK to DimActivity
DimEmployeeSK - FK to DimEmployee
DimDateSK - FK to DimDate

You can rename these in the Fact, and you can have multiple date keys per event. Facts are typically very large, so avoiding updates is typically good. If you have multiple date updates to a single event, you may want to try a Delete/Insert design: add an SK to the fact, which allows the "update" rows to be selected and deleted, then insert the latest data.

Expand your Fact dates to whatever you need: StartDateSK, EndDateSK, ScheduledStartDateSK.

All dimensions should have an Unknown row typically with a hardcoded -1 SK. When you load the fact, and an activity doesn't have any of the included Dates it should simply load a -1.
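
A minimal sketch of the Unknown-row convention during a fact load, assuming SQLite via Python's sqlite3 module instead of SQL Server and a cut-down schema (only the site and one date key; the FullDate column and the sample data are invented for this example). Lookups that find no dimension row fall back to -1 through COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimSite (
    SiteSK   INTEGER PRIMARY KEY,
    SiteNK   INTEGER,               -- source system key
    SiteName TEXT NOT NULL
);
CREATE TABLE DimDate (
    DateSK   INTEGER PRIMARY KEY,   -- yyyymmdd smart key
    FullDate TEXT
);
CREATE TABLE FactActivitySite (
    SiteSK      INTEGER NOT NULL REFERENCES DimSite(SiteSK),
    StartDateSK INTEGER NOT NULL REFERENCES DimDate(DateSK)
);
-- Every dimension gets a hardcoded Unknown row with SK -1.
INSERT INTO DimSite VALUES (-1, NULL, 'Unknown'), (1, 501, 'North Plant');
INSERT INTO DimDate VALUES (-1, NULL), (20150708, '2015-07-08');
""")

# Staged source rows: (site natural key, start date). One has no date,
# one references a site that is not in the dimension.
staged = [(501, "2015-07-08"), (501, None), (999, "2015-07-08")]

for site_nk, start_date in staged:
    conn.execute(
        """
        INSERT INTO FactActivitySite (SiteSK, StartDateSK)
        VALUES (
            COALESCE((SELECT SiteSK FROM DimSite WHERE SiteNK = ?), -1),
            COALESCE((SELECT DateSK FROM DimDate WHERE FullDate = ?), -1)
        )
        """,
        (site_nk, start_date),
    )

facts = conn.execute("""
    SELECT s.SiteName, f.StartDateSK
    FROM FactActivitySite f
    JOIN DimSite s ON s.SiteSK = f.SiteSK
    ORDER BY f.rowid
""").fetchall()
```

Since every fact row points at a real dimension row, reports can use plain inner joins without losing events that have missing attributes.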

The fact is a collection of integer references to your attributes stored in the dimensions. Join them together and you get all your details in a very clean join pattern, and the fact, due to its data types, is exceptionally small and fast. Since you are in SQL Server, add a columnstore index to increase performance further. You can just drop it and rebuild during ETL. Once you get to SQL 2014+ you can write to columnstore indexes.


If you go this route, research Dimensional Modelling. I'd recommend the Kimball methodology. There are lots of free guides out there too, but if this will be anything other than a one-off solution, the investment is likely worth it.
