Postgresql – Mixed granularity weeks/months time series data in a data warehouse

data-warehousepostgresqltime

I'm not sure that "anti-aliasing of time series data" is the correct terminology to use, so let me explain:

I have some sources of data that are aligned quarterly, mostly to do with quarterly running costs. I have some other sources of data, some of which are aligned weekly, and some monthly, and these are mostly related to transactions that took place. By the term 'anti-aliasing', I am referring to the problem of how to represent this data on a common time granularity, so that it can all be integrated together into a data warehouse.

How do I get this data which does not align through time, into a data warehouse? The alignment problem mostly stems from the fact that weeks do not exactly fit into months or quarters. The months against quarters issue is less of a problem, since 3 months make a quarter.

For example, if I know the quarterly running cost, should I divide that up, and attribute an equal share of it to each day in the quarter. Then if I know the number of transactions completed on a weekly basis, should I divide them up and attribute them equally to each day of the week. That would artificially break down the grain of the data into days, and allow me to roll it up by week, month or quarter.

(Is there a more correct terminology in data-warehousing for this problem?)

Best Answer

I don't know if there is a better term for fixing the problem, but the name for the problem in the first place is "mixed granularity" or "overlapping granularity". It is most common in calendar related data but can occur in data split by geography (different sources using different boundaries when aggregating their results) especially when reporting over time (as official boundaries can change in significant ways some times deliberately (https://en.wikipedia.org/wiki/Gerrymandering)). Fixing the issue so you have some useful merging of the data in your warehouse depends on how you need to report upon the data.

If you are reporting on calendar quarters then it is fairly easy: you know the three months that go in each quarter, and 13 weeks go into each too. As weeks and quarters don't align perfectly you are going to have some data at each end of the quarter that has the possibility of not quite being counted right and there are one or two extra days (two in a leap year). One suggestion would be to count January the 1st as day 1 of week 1, and count weeks 1 through 13 quarter 1 and so on. For the one or two days that make up the partial week 53, add them to quarter 4. There is going to be some "error" here but that is unavoidable and should be more than small enough (given large enough samples/figures) to no significantly affect the trends you are trying to detect and report upon.

The other that springs to mind you have already suggested: average out all the data sources, breaking them down to a common such as a day during your ETL process - so record of £123,000 sales in June becomes £4,100 counted against each day, and £35,000 in the week beginning averages to £5,000 on each of the seven days. You can then aggregate this back up to whatever larger granularity you desire. Again you are losing some definition in your data, but the results will be less "lumpy" than, for example, mapping whole weeks to months and having some with four and some with five.

Of course the ideal solution is to try get hold of the original data before it was rolled up at all so you have it all at the finest granularity to start with, but that is often not practical or not even possible because only the aggregated data remains.