How to add a time dimension to this forecasting database without having to duplicate every data point at each reforecast

data-warehouse database-design

I have a database designed to facilitate revenue forecasting. Revenue forecasting consists of tracking Orders, which are recognized into Revenue at multiple points in time (each, a revenueDate). There are two tables (keys in italics):

  • Order: *id*, orderDate
  • Revenue: *ForeignKey(Order.id)*, *revenueDate*, amount

At any point in time, querying the Revenue table will give me the total revenue for each revenueDate.

I'd like to add a time component so that I can track what the revenue forecast looked like at any given point in time. Something like the following:

  • Order: id, forecastDate, orderDate
  • Revenue: ForeignKey(Order.id), ForeignKey(Order.forecastDate), revenueDate, amount

Designing the table this way would allow me to query on (forecastDate, revenueDate) to get my answer. Unfortunately, it would also require me to duplicate every data point at every reforecast, even those that don't change.

Is there a more efficient way for me to design my tables so that each new forecast only requires appending changes or the changed data points, while still allowing me to view a forecast at any historical point in time?

===Update===

I actually don't expect there to be a definitive answer on this, as it seems like luminaries in the field like Snodgrass and Darwen have fundamental disagreements about how best to model time within a database (see ISO's rejection of temporal elements back in 1996).

Best Answer

Snapshotting vs. Running Sums

Any approach to point-in-time reporting is effectively equivalent to snapshotting or calculating a running sum. Your options are variations on:

  1. Calculate a running sum from the beginning of time, or from some periodically checkpointed position. This requires you to record changes rather than totals.

  2. Calculate periodic snapshots and do your historical statistics based on the snapshots.

  3. Hybrid running-sum with periodic snapshots to baseline positions and reduce the data set over which the running sum needs to be calculated.
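As a minimal sketch of option 1 (using SQLite via Python; the table and column names are hypothetical, not from your schema): record each reforecast as delta rows tagged with a forecast date, and reconstruct the forecast as at any date by summing every delta recorded on or before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE revenue_delta (
    order_id      INTEGER,
    forecast_date TEXT,  -- when this change was recorded
    revenue_date  TEXT,  -- the period the revenue falls into
    delta         REAL   -- change relative to the previous forecast
)""")
conn.executemany("INSERT INTO revenue_delta VALUES (?, ?, ?, ?)", [
    (1, "2024-01-01", "2024-03-01", 100.0),  # initial forecast: 100 in March
    (1, "2024-02-01", "2024-03-01", -20.0),  # reforecast: March cut to 80
])

def forecast_as_at(as_at):
    """Forecast as at a date = sum of all deltas recorded on or before it."""
    return conn.execute(
        """SELECT revenue_date, SUM(delta)
           FROM revenue_delta
           WHERE forecast_date <= ?
           GROUP BY revenue_date""",
        (as_at,),
    ).fetchall()

print(forecast_as_at("2024-01-15"))  # [('2024-03-01', 100.0)]
print(forecast_as_at("2024-02-15"))  # [('2024-03-01', 80.0)]
```

Note that an unchanged data point costs nothing here: a reforecast only inserts rows for the revenue dates that actually moved.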

This is essentially a speed-for-space tradeoff, and temporal databases are really just abstractions over a data structure that is semantically equivalent to one of the schemes described above.

If you want speed, calculate and persist periodic snapshots of your position. This will use more space, but if your DBMS platform supports partitioning then the snapshots can be managed and disposed of quite efficiently. The snapshot date forms the partition key, so any queries that specify dates will give the optimiser enough information to ignore partitions that are not relevant to the query.
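A sketch of the pure-snapshot shape (SQLite via Python, hypothetical names again): each reforecast writes the full position keyed by its forecast date, and a point-in-time query just filters on the latest snapshot date on or before the as-at date. On a platform with partitioning, that snapshot-date column is the one you would partition on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE forecast_snapshot (
    forecast_date TEXT,  -- would be the partition key on a partitioned platform
    revenue_date  TEXT,
    amount        REAL
)""")
conn.executemany("INSERT INTO forecast_snapshot VALUES (?, ?, ?)", [
    ("2024-01-01", "2024-03-01", 100.0),
    ("2024-02-01", "2024-03-01", 80.0),  # full reforecast, rows repeated even if unchanged
])

def forecast_as_at(as_at):
    # Find the latest snapshot on or before the as-at date...
    (snap,) = conn.execute(
        "SELECT MAX(forecast_date) FROM forecast_snapshot WHERE forecast_date <= ?",
        (as_at,),
    ).fetchone()
    # ...and read it back directly; no summation needed.
    return conn.execute(
        "SELECT revenue_date, amount FROM forecast_snapshot WHERE forecast_date = ?",
        (snap,),
    ).fetchall()

print(forecast_as_at("2024-02-15"))  # [('2024-03-01', 80.0)]
```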

Frequent snapshots will use lots of disk space, so you trade off granularity against disk space usage.

If you want to save space then store changes and calculate running sums for your as-at positions. You can also use periodic (though less frequent) snapshots in combination with the deltas for a period to optimise the query by starting with balance figures from the appropriate snapshot and adding the deltas to your as-at date. This technique is commonly used with accounting systems such as Oracle Financials.
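The hybrid query described above can be sketched like this (SQLite via Python; table names are made up): start from the latest snapshot on or before the as-at date, then roll forward only the deltas recorded after that snapshot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue_delta    (order_id INT, forecast_date TEXT,
                                   revenue_date TEXT, delta REAL);
    CREATE TABLE revenue_snapshot (snapshot_date TEXT,
                                   revenue_date TEXT, amount REAL);
""")
conn.executemany("INSERT INTO revenue_delta VALUES (?, ?, ?, ?)", [
    (1, "2024-01-01", "2024-03-01", 100.0),
    (1, "2024-02-10", "2024-03-01", -20.0),
    (1, "2024-03-05", "2024-03-01", 15.0),
])
# Month-end snapshot baselines the position as at 2024-02-28 (100 - 20 = 80)
conn.execute(
    "INSERT INTO revenue_snapshot VALUES ('2024-02-28', '2024-03-01', 80.0)"
)

def forecast_as_at(as_at):
    # Latest snapshot on or before the as-at date (assumes one exists)
    (snap,) = conn.execute(
        "SELECT MAX(snapshot_date) FROM revenue_snapshot WHERE snapshot_date <= ?",
        (as_at,),
    ).fetchone()
    # Snapshot balances plus only the deltas recorded after the snapshot
    return conn.execute(
        """SELECT revenue_date, SUM(amount) AS amount
           FROM (SELECT revenue_date, amount FROM revenue_snapshot
                 WHERE snapshot_date = ?
                 UNION ALL
                 SELECT revenue_date, delta FROM revenue_delta
                 WHERE forecast_date > ? AND forecast_date <= ?) AS u
           GROUP BY revenue_date""",
        (snap, snap, as_at),
    ).fetchall()

print(forecast_as_at("2024-03-31"))  # [('2024-03-01', 95.0)]
```

The running sum now only scans deltas since the last snapshot, so its cost stays bounded regardless of how much history accumulates.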

Hybrid approaches

Essentially you have a continuum from a pure running sum to a pure snapshot. Snapshotting is faster to query, and can be used to speed up running sum calculations by allowing you to start the running sum calculation at the snapshot date. A hybrid approach is more complex as you have to implement both mechanisms, but it does allow you to tune your speed/disk usage tradeoff to whatever is most appropriate to your application.

Some thoughts on what might be applicable when

It's far easier to retrofit snapshots to a base transactional model than to go the other way around. However, some models - things where you have time-dependent state - only make sense when viewed as a snapshot that shows the state at a given point in time. An example of this is aged debt reporting where you want to see the invoices with money that's been owing for more than (for example) 180 days. If you want to retain a reconcilable record of aged debt positions then you may wish to snapshot this periodically.
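A minimal sketch of that aged-debt snapshot idea (SQLite via Python; the tables and the 180-day threshold are illustrative): a periodic job persists which invoices were aged past the threshold on the run date, so the position stays reconcilable even after invoices are later paid or amended.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INT, due_date TEXT, outstanding REAL)")
conn.executemany("INSERT INTO invoice VALUES (?, ?, ?)", [
    (1, "2023-06-01", 500.0),  # long overdue
    (2, "2024-02-01", 250.0),  # recent
])

def snapshot_aged_debt(run_date, age_days=180):
    """Persist the invoices more than age_days overdue as at run_date."""
    cutoff = (date.fromisoformat(run_date) - timedelta(days=age_days)).isoformat()
    conn.execute("""CREATE TABLE IF NOT EXISTS aged_debt_snapshot
                    (snapshot_date TEXT, invoice_id INT, outstanding REAL)""")
    conn.execute("""INSERT INTO aged_debt_snapshot
                    SELECT ?, id, outstanding FROM invoice
                    WHERE due_date <= ? AND outstanding > 0""",
                 (run_date, cutoff))

snapshot_aged_debt("2024-03-04")  # the weekly operational run
print(conn.execute(
    "SELECT invoice_id, outstanding FROM aged_debt_snapshot"
).fetchall())  # [(1, 500.0)]
```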

In the first instance a transactional fact table with the base movements (deltas) gives you the raw data from which you can calculate your snapshotted positions. It's best to have that available if at all possible, unless recording the data in that format is made infeasible by some limitation of the upstream data source.

I've done an aged debt reporting system and a claims reporting system based on snapshots before. The aged debt report was an operational report run at the beginning of each week, and the requirement for the claims snapshot table was for a monthly report where the business wished to see counts and outstanding values of claims that were open at the time.

Some things (e.g. claim reserving) are done on a periodic basis, so (for example) ultimates and IBNR1 are only calculated on a monthly or quarterly basis and it only makes sense to report those on a periodic basis. In this case, snapshotting the figures makes sense due to the nature of the underlying process that generates the data.

1Ultimates are insurance-speak for forecasts of the total claims expected on policies incepting in a given year. IBNR (Incurred but not Reported) is an estimate of the figures on claims that have not occurred or been reported yet.