SQL Server 2008 – Table Design for Tracking Significant Events

best-practices | database-design | sql-server-2008

An Orders table I'm currently designing has 10 or more columns.

Each OrderID in the Orders table will have certain significant events or milestones. The number of Orders could go into the millions. At the moment, I'm aware of about 20 important events to track for each Order, but that number may increase or decrease.

Of these 20 or so events, only a handful will always take place; the rest are just potential events. Also, many – but not all – of these events have at least two dates that need to be stored:

  1. the Date the event is supposed to happen (if it is no longer just a "potential" event)
  2. the Date the event did happen

The event data will be loaded into the database from external sources in the form of .xls or .csv files. Each event can happen at most once per order.

What would be the best design for storing this milestone event data?
Would you try to use a somewhat normalized design, where there is an Events table designed similar to this…

CREATE TABLE dbo.Events (
    EventID             INT
    , OrderID           INT
    , EventExpectedDate DATETIME2(7)
    , EventActualDate   DATETIME2(7)
    , EventTypeID       INT
    , EventSkip         BIT
    ...
    )

Or would you try a less normalized approach and flatten all the possible events out into one very wide events table that mimics what the .xls/.csv update files will look like?

CREATE TABLE dbo.Events (
    EventID             INT
    , OrderID           INT
    , Event_1_ExpDate   DATETIME2(7)  /* No, the actual names would not include 1, 2, 3 */
    , Event_1_ActDate   DATETIME2(7) /* The actual names would name type of Event */
    , Event_2_ExpDate   DATETIME2(7)
    , Event_2_ActDate   DATETIME2(7)
    ...
    , Event_20_ExpDate   DATETIME2(7)
    , Event_20_ActDate   DATETIME2(7)
    )

Or is there another pattern you would use?

Best Answer

Normalize always unless you have a very good reason not to. It shouldn't come down to a coin flip.

Neither of your proposed designs seems optimal to me for a variable or unknown number of events: since "that number may increase or decrease," both designs force you to change the schema and the code every time it does.

I also don't believe that you have to base your core table design around the format of the input files. This is why we have staging tables, ETL processes, SSIS, C#, etc. There are many ways to transform data from flat files into a format that is more appropriate for a relational database. Don't design your database because of what your flat files look like.

So you could have staging tables that look like the input files, but then you would load the data into real tables that look perhaps like this:

CREATE TABLE dbo.Events -- lookup table describing events
(
  EventID INT PRIMARY KEY,
  Description VARCHAR(32) NOT NULL UNIQUE
);

CREATE TABLE dbo.OrderEventLog -- actual event data
(
  EventID INT FOREIGN KEY REFERENCES dbo.Events(EventID),
  OrderID INT FOREIGN KEY REFERENCES dbo.Orders(OrderID), -- guess
  ExpectedDate DATETIME2(7),
  ActualDate DATETIME2(7),
  PRIMARY KEY (OrderID, EventID) -- each event happens at most once per order
);

Now when you have a new type of event, you just insert a row into the dbo.Events table; you can then insert related events into dbo.OrderEventLog without changing the schema of either table, and without changing the interface to your stored procedures or the bulk inserts into the staging tables. This also keeps the log table (and all of its indexes) nice and narrow, and lets you build indexes that favor searches for specific types of events.
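As a rough sketch of that workflow (the event name, the dbo.EventStaging table, and its column names are hypothetical, chosen to mirror the tables above):

```sql
-- Register a new event type: a row insert, no schema change.
INSERT dbo.Events (EventID, Description)
VALUES (21, 'Customs Cleared');

-- Load staged rows (dbo.EventStaging is a hypothetical staging table
-- shaped like the .csv file) into the narrow log table, skipping any
-- order/event pair that has already been recorded.
INSERT dbo.OrderEventLog (EventID, OrderID, ExpectedDate, ActualDate)
SELECT s.EventID, s.OrderID, s.ExpectedDate, s.ActualDate
FROM dbo.EventStaging AS s
WHERE NOT EXISTS
(
  SELECT 1 FROM dbo.OrderEventLog AS oel
  WHERE oel.OrderID = s.OrderID
    AND oel.EventID = s.EventID
);

-- An index that favors searches for a specific event type:
CREATE INDEX IX_OrderEventLog_EventID
  ON dbo.OrderEventLog (EventID)
  INCLUDE (OrderID, ActualDate);
```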

You can always flatten the data later - which is typically what you do when you get to a point where you are building a data warehouse around the data.
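For instance, a later warehouse load could flatten the log into one row per order with PIVOT (the EventID values and column aliases here are purely illustrative):

```sql
-- Pivot the narrow log into one wide row per order,
-- one actual-date column per event type.
SELECT OrderID,
       [1] AS Event_1_ActDate,  -- aliases would use real event names
       [2] AS Event_2_ActDate
FROM
(
  SELECT OrderID, EventID, ActualDate
  FROM dbo.OrderEventLog
) AS src
PIVOT
(
  MAX(ActualDate) FOR EventID IN ([1], [2])
) AS p;
```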