SQL Server – Approach to Handling Errors / Data Corrections in Aggregated Data

Tags: aggregate, architecture, azure-sql-database, database-design, sql-server

We're in the process of building a vNext Data/Reporting Platform for capturing Web & Sales Traffic.

At a very high level we're capturing Impressions, Clicks & Sales Data which then needs to be reported on.

  • Impressions: millions per day
  • Clicks: hundreds of thousands per day
  • Sales: thousands per day

My first-pass design for the platform essentially aggregates data into Hourly, Daily, Weekly & Monthly structures, each grouped on a number of other dimensions (Client, Brand, Product, Country, Retailer, etc.) with Impression Counts, Click Counts, Total Value, etc.
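
As a rough sketch, one of these aggregate structures (the hourly one, say) could look something like the following; every table name, column name and type here is purely illustrative, not part of the actual platform:

    -- Hypothetical hourly aggregate table; names, dimensions and types are illustrative only
    CREATE TABLE dbo.AggregateHourly
    (
        HourStart       datetime2(0)  NOT NULL,  -- start of the hour bucket
        ClientId        int           NOT NULL,
        BrandId         int           NOT NULL,
        ProductId       int           NOT NULL,
        CountryId       int           NOT NULL,
        RetailerId      int           NOT NULL,
        ImpressionCount bigint        NOT NULL,
        ClickCount      bigint        NOT NULL,
        TotalValue      decimal(18,4) NOT NULL,
        CONSTRAINT PK_AggregateHourly PRIMARY KEY
            (HourStart, ClientId, BrandId, ProductId, CountryId, RetailerId)
    );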

So each hour, source data gets aggregated -> Hourly
And then each day:

  • the prev 24h of Hourly Data -> New Daily
  • the prev 24h of Hourly Data -> Added to existing Weekly or create new Weekly
  • the prev 24h of Hourly Data -> Added to existing Monthly or create new Monthly

This gives us a lot of flexibility for reporting efficiently as well as monitoring trends and identifying potential problems.
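
A minimal sketch of the daily rollup step, assuming the hypothetical AggregateHourly table above plus an analogous AggregateDaily table (the weekly and monthly steps would follow the same pattern):

    -- Roll the previous day's hourly rows into new daily rows, one per dimension combination
    INSERT INTO dbo.AggregateDaily
        (DayStart, ClientId, BrandId, ProductId, CountryId, RetailerId,
         ImpressionCount, ClickCount, TotalValue)
    SELECT
        CAST(h.HourStart AS date),
        h.ClientId, h.BrandId, h.ProductId, h.CountryId, h.RetailerId,
        SUM(h.ImpressionCount), SUM(h.ClickCount), SUM(h.TotalValue)
    FROM dbo.AggregateHourly AS h
    WHERE h.HourStart >= DATEADD(day, -1, CAST(SYSUTCDATETIME() AS date))
      AND h.HourStart <  CAST(SYSUTCDATETIME() AS date)
    GROUP BY CAST(h.HourStart AS date),
             h.ClientId, h.BrandId, h.ProductId, h.CountryId, h.RetailerId;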

One thing I'm not sure on is how to "Fix" data in the event a "discrepancy" gets into our aggregated data.

e.g. Every once in a while a customer will provide incorrect pricing information. When that occurs, the "Potential Value" on each click and impression will be wrong.

So that source data ends up in the Hourly aggregates.

If unnoticed for even one day, that hourly data pollutes the daily, weekly & monthly aggregates. And because those aggregates are built and added to cumulatively, you can't easily undo it. (You can't pour the blue water out of the jug of purple water.)

The naive solution here would seem to be to throw away the subset of aggregated data that's been polluted, fix the source data, and then regenerate each of the aggregates that was removed, which one day later might mean throwing away 24 Hourlies, 1 Daily, 1 Weekly & 1 Monthly.
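
A minimal sketch of that throw-away-and-regenerate approach for the hourly level, again using the hypothetical names from above and assuming the affected client and time window are known (dbo.SourceEvents and its columns are also illustrative):

    -- Example parameters; values are illustrative
    DECLARE @AffectedClientId int          = 42;
    DECLARE @WindowStart      datetime2(0) = '2024-01-01';
    DECLARE @WindowEnd        datetime2(0) = '2024-01-02';

    -- 1. Throw away the polluted hourly rows for that client and window
    DELETE FROM dbo.AggregateHourly
    WHERE ClientId  = @AffectedClientId
      AND HourStart >= @WindowStart
      AND HourStart <  @WindowEnd;

    -- 2. Re-aggregate the window from the corrected source data
    INSERT INTO dbo.AggregateHourly
        (HourStart, ClientId, BrandId, ProductId, CountryId, RetailerId,
         ImpressionCount, ClickCount, TotalValue)
    SELECT
        DATEADD(hour, DATEDIFF(hour, 0, s.EventTime), 0),   -- truncate to the hour bucket
        s.ClientId, s.BrandId, s.ProductId, s.CountryId, s.RetailerId,
        SUM(CASE WHEN s.EventType = 'Impression' THEN 1 ELSE 0 END),
        SUM(CASE WHEN s.EventType = 'Click'      THEN 1 ELSE 0 END),
        SUM(s.PotentialValue)
    FROM dbo.SourceEvents AS s
    WHERE s.ClientId  = @AffectedClientId
      AND s.EventTime >= @WindowStart
      AND s.EventTime <  @WindowEnd
    GROUP BY DATEADD(hour, DATEDIFF(hour, 0, s.EventTime), 0),
             s.ClientId, s.BrandId, s.ProductId, s.CountryId, s.RetailerId;

    -- 3. Repeat the same delete-and-rebuild for the daily, weekly and monthly tables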

Or is there another approach here that I'm not thinking of?

Best Answer

In accounting practice, values are never altered once written. Instead, secondary compensating transactions are created. In your circumstances this would mean one new row whose value is the negative of the old, incorrect value, and a second new row with the new, correct value.

For this to work, each data row will need its own datetime column so that these compensating transactions can be post-processed into the correct buckets.
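
A minimal sketch of that compensating pattern, reusing the hypothetical dbo.SourceEvents table sketched earlier. The row's own EventTime is what lets the pair be post-processed into the original hour/day/week/month buckets, and the 'Adjustment' event type (an illustrative choice, not something prescribed here) keeps the pair out of the impression/click counts while still contributing to the summed value:

    -- A click was recorded with PotentialValue = 12.00 but the correct price is 10.00.
    -- Insert a compensating pair instead of updating the original row
    -- (all identifiers and values are illustrative).
    INSERT INTO dbo.SourceEvents
        (EventTime, EventType, ClientId, BrandId, ProductId, CountryId, RetailerId, PotentialValue)
    VALUES
        ('2024-01-01T10:15:00', 'Adjustment', 42, 1, 7, 826, 3, -12.00),  -- negate the incorrect value
        ('2024-01-01T10:15:00', 'Adjustment', 42, 1, 7, 826, 3,  10.00);  -- record the correct value

Any bucket that already contains the incorrect value would then need its aggregation re-run, or a post-processing step applied, so the compensating rows are folded in.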

Averages cannot be corrected in post-processing. Instead, store the sum and the count, and calculate the average at run time.
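
For example, with the hypothetical hourly table from earlier, an average value per click would be derived at query time from the stored sum and count rather than stored as its own column:

    -- Average value per click, computed at run time from the stored sum and count
    SELECT
        ClientId,
        SUM(TotalValue) / NULLIF(SUM(ClickCount), 0) AS AvgValuePerClick
    FROM dbo.AggregateHourly
    WHERE HourStart >= '2024-01-01' AND HourStart < '2024-01-08'   -- illustrative week
    GROUP BY ClientId;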