Postgresql – Mixed granularity weeks/months time series data in a data warehouse

data-warehousepostgresqltime

I'm not sure that "anti-aliasing of time series data" is the correct terminology to use, so let me explain:

I have some sources of data that are aligned quarterly, mostly to do with quarterly running costs. I have some other sources of data, some of which are aligned weekly, and some monthly, and these are mostly related to transactions that took place. By the term 'anti-aliasing', I am referring to the problem of how to represent this data on a common time granularity, so that it can all be integrated together into a data warehouse.

How do I get this data which does not align through time, into a data warehouse? The alignment problem mostly stems from the fact that weeks do not exactly fit into months or quarters. The months against quarters issue is less of a problem, since 3 months make a quarter.

For example, if I know the quarterly running cost, should I divide that up, and attribute an equal share of it to each day in the quarter. Then if I know the number of transactions completed on a weekly basis, should I divide them up and attribute them equally to each day of the week. That would artificially break down the grain of the data into days, and allow me to roll it up by week, month or quarter.

(Is there a more correct terminology in data-warehousing for this problem?)

Best Answer

I don't know if there is a better term for fixing the problem, but the name for the problem in the first place is "mixed granularity" or "overlapping granularity". It is most common in calendar related data but can occur in data split by geography (different sources using different boundaries when aggregating their results) especially when reporting over time (as official boundaries can change in significant ways some times deliberately (https://en.wikipedia.org/wiki/Gerrymandering)). Fixing the issue so you have some useful merging of the data in your warehouse depends on how you need to report upon the data.

If you are reporting on calendar quarters then it is fairly easy: you know the three months that go in each quarter, and 13 weeks go into each too. As weeks and quarters don't align perfectly you are going to have some data at each end of the quarter that has the possibility of not quite being counted right and there are one or two extra days (two in a leap year). One suggestion would be to count January the 1st as day 1 of week 1, and count weeks 1 through 13 quarter 1 and so on. For the one or two days that make up the partial week 53, add them to quarter 4. There is going to be some "error" here but that is unavoidable and should be more than small enough (given large enough samples/figures) to no significantly affect the trends you are trying to detect and report upon.

The other that springs to mind you have already suggested: average out all the data sources, breaking them down to a common such as a day during your ETL process - so record of £123,000 sales in June becomes £4,100 counted against each day, and £35,000 in the week beginning averages to £5,000 on each of the seven days. You can then aggregate this back up to whatever larger granularity you desire. Again you are losing some definition in your data, but the results will be less "lumpy" than, for example, mapping whole weeks to months and having some with four and some with five.

Of course the ideal solution is to try get hold of the original data before it was rolled up at all so you have it all at the finest granularity to start with, but that is often not practical or not even possible because only the aggregated data remains.

Related Solutions

Sql-server – Data Warehouse Creation

I would recommend Building a Data Warehouse With Examples in SQL Server by Vincent Rainardi, as this covers SQL Server specifics. The Data Warehouse Toolkit is also an excellent and practical guide, but isn't platform specific.

Sql-server – Database structure, relational or data warehouse

It seems you want to aggregate location based statistics over time for rainfall. A database structure like the one below would let you do that. The 'data source' could be just a filename, or some indication as to where it came from.

create table DimDataSource (
       DataSourceID      int identity (1,1) not null
       DataSourceDesc    nvarchar (100)  -- May need unicode for file names
)
go

alter table DimDataSource
  add constraint PK_DataSource
      primary key clustered (DataSourceID)
go

create table DimLocation (
       LocationID        int identity (1,1) not null
       LocationDesc      varchar (50)
)
go

alter table DimLocation
  add constraint PK_Location
      primary key clusterd (LocationID)
go

create table DimDate (
       DateID           smalldatetime not null  -- 'Date' is a reserved word
      ,MonthID          int not null
      ,MonthDesc        varchar (15)
      ,QuarterID        int not null
      ,QuarterDesc      varchar (15)
      ,YearID
)
go

alter table DimDate
  add constraint PK_Date
      primary key clustered (DateID)
go

create table DimTime (
       TimeID           time not null  -- 'Time' is a reserved word
      ,Hour             int not null
)
go

alter table DimTime
  add constraint PK_Time
      primary key clustered (TimeID)
go


-- If the table is <50GB, don't bother with partitioning, but put a clustered
-- index on DateID or LocationID and DateID, depending on how you normally expect
-- to query the data.

create table FactRainfall (
       RainfallID        int identity (1,1) not null -- May need a wider type if >4B rows.
                                                     -- SSAS likes an identity column for
                                                     -- incremental loads
      ,DataSourceID      int not null
      ,LocationID        int not null
      ,DateID            smalldatetime not null
      ,TimeID            time not null
      ,Rainfall          float
)
go

-- Add foreign keys as necessary

Populate the dimensions with the appropriate list of locations, date ranges, time of day to the right grain and one data source record per file. This table will also allow you to put a cube over the top, or can be flattened with a view, which will help people using tools like Excel or stats packages to get and use the data.

Best Answer

Related Solutions

Sql-server – Data Warehouse Creation

Sql-server – Database structure, relational or data warehouse

Related Question