Sql-server – “Avoid partitioning dimension tables” on SQL Server

data-warehousedatabase-designpartitioningsql server

SQL Cat have a list of tips titled Top 10 Best Practices for Building a Large Scale Relational Data Warehouse.

Under section 4 - Design dimension tables appropriately they state:

Avoid partitioning dimension tables.

They don't mention why this should not be done, nor can I find anything on the web that explicilty points out why it's something to be avoided.

Why should I avoid partitioning dimension tables?

A more concrete example is provided below to help facilitate an answer, and hold a discussion about why partitioning should not be done in large relational data warehouses. I'm not looking for advice on improving the data model that is specific to the concrete example. If the example doesn't help in providing any extra insight into why partitioning dimensions should not be done, then please ignore it.


Example: you can use to reference in your answer as to why partitioned dimensions are a bad/suboptimal idea (if it helps you) …

In our environment we have an Account dimension, this is partitioned on DateEffective and is loaded monthly. Some of our queries involve WHERE DateEffective >= @ReportDate, which seems to be a good candidate for partition elimination. Also if we need to reload the month's data we will delete an entire month's worth of data, which also seems it would benefit from table partitioning.


Update about our environment since posting question …

The table mentioned above has non-aligned non-clustered indexes (investigated with the following Brent Ozar code).

select
    [db_name]               = isnull(db_name(s.database_id),db_name())
    ,[schema_name]          = object_schema_name(i.object_id,db_id())
    ,[object_name]          = o.name
    ,index_name             = i.name
    ,index_type_desc        = i.type_desc
    ,data_space_name        = ds.name
    ,data_space_type_desc   = ds.type_desc
    ,s.user_seeks
    ,s.user_scans
    ,s.user_lookups
    ,s.user_updates
    ,s.last_user_seek
    ,s.last_user_update
from
    sys.objects as o

    inner join sys.indexes as i 
        on o.object_id = i.object_id

    inner join sys.data_spaces as ds 
        on ds.data_space_id = i.data_space_id

    left join sys.dm_db_index_usage_stats as s 
        on  i.object_id     = s.object_id 
        and i.index_id      = s.index_id
        and s.database_id   = db_id()
where 
        o.type      = 'u'
    and i.type      in (1, 2)
    and o.object_id in
    (
         select filter.object_id 
         from
         (
             select ob.object_id, ds.type_desc 
             from 
                sys.objects ob 
                inner join sys.indexes ind on ind.object_id = ob.object_id 
                inner join sys.data_spaces ds on ds.data_space_id = ind.data_space_id
             group by ob.object_id, ds.type_desc 
         ) as filter 
         group by filter.object_id 
         having count(*) > 1
     )
order by
    [object_name] desc
;

This showed:

  • the clustered index on the partition scheme
  • 5 of 8 non-clustered indexes on the partition scheme
  • 3 of 8 non-clustered indexes on primary, the rows_filegroup
    • 1 of these was a unique, non-clustered index (for the sake of completeness: defined as a primary key non-clustered in the create table script in source control)

Another update

I found this answer by Remus Rusanu which shed some light on the complications with partitioned tables that would be relevant for dimensions.

His statements are block quoted with my interpretation using my example above


non-aligned indexes prevent efficient partition switch operations

So, we should attempt to align indexes when a table is partitioned. Partition switching is not even used (?possibly prevented?) to load the table in my example, as there are non-aligned indexes.

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model

This would certainly seem to be the case with the example I've provided, and some changes would be required to implement aligned indexes.

Due to dimensions typically using surrogate keys as the primary key (a unique clustered index), this provides an ever increasing narrow key (i.e. small data size on disk). This is important because the B-tree seeks that occur when joining between dimensions and facts can occur faster. Further, the clustered index will be part of any non-clustered indexes created, which also prevents bloating the non-clustered index, creating more efficient index seeks/scans here also.

Why is this important?

aligned indexes mean unique constrains can no longer be created/enforced (except for the partitioning column)

and

all foreign keys referencing the partitioned table must include the
partitioning key

and

this in turn requires that all tables referencing the partitioned
table contain partitioning key column value … in order to properly declare
the foreign key constraint.

The impacts are …

  • A DateEffective column would need to be added to every table referencing the account dimension in our environment. Implementing a DateEffective column on the fact tables we have is redundant, as this lookup is taken care of by our ETL process that loads the correct AccountID key value. Further, some facts are declared at a grain that is more selective than a date data type, of which DateEffective clearly is, making it more nonsensical to include this column in the fact tables (data model ripple effects).
  • A number of the non-clustered indexes would need to be changed to include the DateEffective column

However …

  • Data warehouses typically don't have foreign key constraints implemented. A good answer on SO covers this.
  • Also, since 2008 edition, Sql Server has parallel bitmap filtered hash-joins available to optimize star joins (see: Optimizing Data Warehouse Query Performance Through Bitmap Filtering), and foreign keys aren't required for this optimization.
  • This would seem to point to it being ok to partition a dimension table, as the changes required are now "only" having to include the partition key into the non-aligned indexes, because the foreign key constraints issue is non existent in our environment (our ETL process manages this integrity).

Best Answer

I suspect the advice is predicated on the likely utility of partitioning a dimension table. In a data warehouse, fact tables are good example of the adage, big data is medium data, plus time. Dimension tables don't have time (not really), and as a rule don't have useful partitioning properties.

Yours seems like a good example. Why is Accounts partitioned on DateEffective? "Because some reports select on that column" is not a sufficient answer. An index on that column would be the conventional solution, and has the advantage of not biasing the physical data structure.

However many accounts you have, your fact tables are at least 1-3 orders of magnitude larger. Your server is scaled to that proportion. Looking up accounts is a relatively trivial operation. On the face of it, it doesn't seem like a candidate for partitioning.