Sql-server – Efficiently storing sets of key-value pairs with wildly different keys

database-designreportingschemasql-server-2012

I inherited an application that associates many different types of activities with a site. There are roughly 100 different activity types, and each one has differing set of 3-10 fields. However, all activities have at least one date field (could be any combination of date, start date, end date, scheduled start date, etc.), and one responsible person field. All other fields vary widely and a start date field will not necessarily be called "Start Date".

Making one subtype table for each activity type would result in a schema with 100 different subtype tables, which would be too unwieldly to deal with. The current solution to this problem is to store the activity values as key-value pairs. This is a greatly simplified schema of the current system to get the point across.

enter image description here

Each Activity has multiple ActivityFields; each Site has multiple Activities, and the SiteActivityData table stores the KVPs for each SiteActivity.

This makes the (web-based) application very easy to code because all you really need to do is loop over the records in SiteActivityData for a given activity and add a label and input control for each row to a form. But there are lots of problems:

  • Integrity is bad; it's possible to put a field in SiteActivityData that does not belong to the activity type, and DataValue is a varchar field so numbers and dates need to be constantly cast.
  • Reporting and ad-hoc querying of this data is difficult, error prone, and slow. For example, getting a list of all activities of a certain type that have an End Date within a specified range requires pivots and casting varchars to dates. The report writers HATE this schema, and I don't blame them.

So what I am looking for is a way to store a large number of activities that have almost no fields in common in a way that makes reporting easier. What I have come up with so far is to use XML to store the activity data in a pseudo-noSQL format:

enter image description here

The Activity table would contain the XSD for each activity, eliminating the need for the ActivityField table. SiteActivity would contain the key-value XML so each activity for a site would now be in a single row.

An activity would look something like this (but I haven't fleshed it out fully):

<SomeActivityType>
  <SomeDateField type="StartDate">2000-01-01</SomeDateField>
  <AnotherDateField type="EndDate">2011-01-01</AnotherDateField>
  <EmployeeId type="ResponsiblePerson">1234</EmployeeId>
  <SomeTextField>blah blah</SomeTextField>
  ...

Advantages:

  • The XSD would validate the XML, catching errors like putting a string in a number field at the database level, something that was impossible with the old schema that stored everything in varchar.
  • The recordset of KVPs that is used to build the web forms could easily be reproduced using select ... from ActivityXML.nodes('/SomeActivityType/*') as T(r)
  • An xpath subquery of the XML could be used to produce a result set that has columns for start date, end date, etc without using a pivot, something like select ActivityXML.value('.[@type=StartDate]', 'datetime') as StartDate, ActivityXML.value('.[@type=EndDate]', 'datetime') as EndDate from SiteActivity where...

Does this seem like a good idea? I can't think of other ways to store such a large number of differing sets of properties. Another thought I had was keep the existing schema and translate it into something more easily queryable in a data warehouse, but I have never designed a star schema before and would have no idea where to begin.

Additional question: If I define a tag as having a date data type in the XSD using xs:date, is SQL Server going to index it as a date value? I'm concerned if I query by date it will need to cast the date string to a date value and blow any chance of using an index.

Best Answer

So what I am looking for is a way to store a large number of activities that have almost no fields in common in a way that makes reporting easier.

Not enough rep to comment first, so here we go!

If the primary purpose is reporting and you have a DW (even if it isn't star schema) I'd recommend attempting to get this into a star schema. The benefits are fast, simple queries. The downside is ETL, but you're already considering moving the data to a new design and ETL to star schema is likely simpler to build and maintain than an XML wrapper solution (and SSIS is included in your SQL Server licensing). Plus it starts the process of a recognized reporting/analytics design.

So how to do that... It sounds like you have what is known as a Factless Fact. This is an intersection of attributes that define an event with no associated measure (such as a sales price). You have dates available for some or all of your activities? Likely you should really have an intersection of an Activity, Site, and Date(s).

DimActivity - I'm guessing there is a pattern, something that can allow you to break these down into at least relatively shared columns. If so, you may have three? five? dimensions for classes of activities. At worst you have a couple consistent columns, such as activity name, you can filter on, and you leave general headings such as "Attribute1" etc. for the remaining random details.

You don't need everything in the dimension - there (likely) shouldn't be any dates in the Activity dimension - they should all be in the fact, as Surrogate Key references to the Date dimension. As an example, a Date that would stay in a person dimension would be a date of birth because it's an attribute of a person. A hospital visit date would reside in a fact, as it is a point in time event associated with a person, among other things, but it is not an attribute of the person visiting the hospital. More date discussion in the fact.

DimSite - seems straight forward, so we'll describe Surrogate Keys here. Essentially this is just an incrementing, unique ID. Integer Identity column is common. This allows separation of DW and source systems and ensures optimal joins in the data warehouse. Your Natural Key or Business Key is usually kept, but for maintenance/design not analysis and joins. Example schema:

CREATE TABLE [DIM].[Site]
(
 SiteSK INT NOT NULL IDENTITY PRIMARY KEY
,SiteNK INT NOT NULL --source system key
,SiteName VARCHAR(500) NOT NULL
)

DimDate - date attributes. Make a "smart key" instead of an Identity. This means you can type a meaningful integer that relates to a date for queries such as WHERE DateSK = 20150708. There are lots of free scripts to load DimDate and most have this smart key included. (one option)

DimEmployee - your XML included this, if it is more general change to DimPerson, and fill with relevant person attributes as they are available and pertinent to reporting.

And your fact is:

FactActivitySite
DimSiteSK - FK to DimSite
DimActivitySK - FK to DimActivity
DimEmployee - FK to DimEmployee
DimDateSK - FK to DimDate

You can Rename these in the Fact, and you can have multiple date keys per event. Facts are typically very large so avoiding updates is typically good... if you have multiple date updates to a single event you may want to try a Delete/Insert design by adding a SK to the fact which allows selection of "update" rows to be deleted then inserting latest data.

Expand your Fact dates to whatever you need: StartDateSK, EndDateSK, ScheduledStartDateSK.

All dimensions should have an Unknown row typically with a hardcoded -1 SK. When you load the fact, and an activity doesn't have any of the included Dates it should simply load a -1.

The fact is a collection of integer references to your attributes stored in the dimensions, join them together and you get all your details, in a very clean join pattern, and the fact, due to it's data types, is exceptionally small and fast. Since you are in SQL Server, add a columnstore index to increase performance further. You can just drop it and rebuild during ETL. Once you get to SQL 2014+ you can write to columnstore indexes.

enter image description here

If you go this route research Dimensional Modelling. I'd recommend Kimball methodology. There are lots of free guides out there too, but if this will be anything other than a one off solution, the investment is likely worth it.

Related Question