Sql-server – “Avoid partitioning dimension tables” on SQL Server

data-warehousedatabase-designpartitioningsql server

SQL Cat have a list of tips titled Top 10 Best Practices for Building a Large Scale Relational Data Warehouse.

Under section 4 - Design dimension tables appropriately they state:

Avoid partitioning dimension tables.

They don't mention why this should not be done, nor can I find anything on the web that explicilty points out why it's something to be avoided.

Why should I avoid partitioning dimension tables?

A more concrete example is provided below to help facilitate an answer, and hold a discussion about why partitioning should not be done in large relational data warehouses. I'm not looking for advice on improving the data model that is specific to the concrete example. If the example doesn't help in providing any extra insight into why partitioning dimensions should not be done, then please ignore it.

Example: you can use to reference in your answer as to why partitioned dimensions are a bad/suboptimal idea (if it helps you) …

In our environment we have an Account dimension, this is partitioned on DateEffective and is loaded monthly. Some of our queries involve WHERE DateEffective >= @ReportDate, which seems to be a good candidate for partition elimination. Also if we need to reload the month's data we will delete an entire month's worth of data, which also seems it would benefit from table partitioning.

Update about our environment since posting question …

The table mentioned above has non-aligned non-clustered indexes (investigated with the following Brent Ozar code).

select
    [db_name]               = isnull(db_name(s.database_id),db_name())
    ,[schema_name]          = object_schema_name(i.object_id,db_id())
    ,[object_name]          = o.name
    ,index_name             = i.name
    ,index_type_desc        = i.type_desc
    ,data_space_name        = ds.name
    ,data_space_type_desc   = ds.type_desc
    ,s.user_seeks
    ,s.user_scans
    ,s.user_lookups
    ,s.user_updates
    ,s.last_user_seek
    ,s.last_user_update
from
    sys.objects as o

    inner join sys.indexes as i 
        on o.object_id = i.object_id

    inner join sys.data_spaces as ds 
        on ds.data_space_id = i.data_space_id

    left join sys.dm_db_index_usage_stats as s 
        on  i.object_id     = s.object_id 
        and i.index_id      = s.index_id
        and s.database_id   = db_id()
where 
        o.type      = 'u'
    and i.type      in (1, 2)
    and o.object_id in
    (
         select filter.object_id 
         from
         (
             select ob.object_id, ds.type_desc 
             from 
                sys.objects ob 
                inner join sys.indexes ind on ind.object_id = ob.object_id 
                inner join sys.data_spaces ds on ds.data_space_id = ind.data_space_id
             group by ob.object_id, ds.type_desc 
         ) as filter 
         group by filter.object_id 
         having count(*) > 1
     )
order by
    [object_name] desc
;

This showed:

the clustered index on the partition scheme
5 of 8 non-clustered indexes on the partition scheme
3 of 8 non-clustered indexes on primary, the rows_filegroup
- 1 of these was a unique, non-clustered index (for the sake of completeness: defined as a primary key non-clustered in the create table script in source control)

Another update

I found this answer by Remus Rusanu which shed some light on the complications with partitioned tables that would be relevant for dimensions.

His statements are block quoted with my interpretation using my example above

non-aligned indexes prevent efficient partition switch operations

So, we should attempt to align indexes when a table is partitioned. Partition switching is not even used (?possibly prevented?) to load the table in my example, as there are non-aligned indexes.

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model

This would certainly seem to be the case with the example I've provided, and some changes would be required to implement aligned indexes.

Due to dimensions typically using surrogate keys as the primary key (a unique clustered index), this provides an ever increasing narrow key (i.e. small data size on disk). This is important because the B-tree seeks that occur when joining between dimensions and facts can occur faster. Further, the clustered index will be part of any non-clustered indexes created, which also prevents bloating the non-clustered index, creating more efficient index seeks/scans here also.

Why is this important?

aligned indexes mean unique constrains can no longer be created/enforced (except for the partitioning column)

and

all foreign keys referencing the partitioned table must include the
partitioning key

and

this in turn requires that all tables referencing the partitioned
table contain partitioning key column value … in order to properly declare
the foreign key constraint.

The impacts are …

A DateEffective column would need to be added to every table referencing the account dimension in our environment. Implementing a DateEffective column on the fact tables we have is redundant, as this lookup is taken care of by our ETL process that loads the correct AccountID key value. Further, some facts are declared at a grain that is more selective than a date data type, of which DateEffective clearly is, making it more nonsensical to include this column in the fact tables (data model ripple effects).
A number of the non-clustered indexes would need to be changed to include the DateEffective column

However …

Data warehouses typically don't have foreign key constraints implemented. A good answer on SO covers this.
Also, since 2008 edition, Sql Server has parallel bitmap filtered hash-joins available to optimize star joins (see: Optimizing Data Warehouse Query Performance Through Bitmap Filtering), and foreign keys aren't required for this optimization.
This would seem to point to it being ok to partition a dimension table, as the changes required are now "only" having to include the partition key into the non-aligned indexes, because the foreign key constraints issue is non existent in our environment (our ETL process manages this integrity).

Best Answer

I suspect the advice is predicated on the likely utility of partitioning a dimension table. In a data warehouse, fact tables are good example of the adage, big data is medium data, plus time. Dimension tables don't have time (not really), and as a rule don't have useful partitioning properties.

Yours seems like a good example. Why is Accounts partitioned on DateEffective? "Because some reports select on that column" is not a sufficient answer. An index on that column would be the conventional solution, and has the advantage of not biasing the physical data structure.

However many accounts you have, your fact tables are at least 1-3 orders of magnitude larger. Your server is scaled to that proportion. Looking up accounts is a relatively trivial operation. On the face of it, it doesn't seem like a candidate for partitioning.

Related Solutions

Sql-server – SQL Server 2008 – Partitioning and Clustered Indexes

A partitioned table is really more like a collection of individual tables stitched together. So your in example of clustering by IncidentKey and partition by IncidentDate, say that the partitioning function splits the tables into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is partition two. The data will be layed out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two, distinct rowsets. Is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in what partition (rowset) of the clustered index to search for this value (in this case, in partition 2).

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

non-aligned indexes require large amounts of memory for certain query plans
non-aligned indexes prevent efficient partition switch operations

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model:

aligned indexes mean unique constrains can no longer be created/enforced (except for the partitioning column)
all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain partitioning key column value. Think Orders->OrderDetails, if Orders have OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.

These effects I found seldom called out at the beginning of a project that deploys partitioning, but they exists and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.

Oh, one more thing: all my argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.

Sql-server – Partition Key questions in SQL Server 2008

Assuming that you have the primary key on a clustered index then the partitioning key needs to be part of the primary key.

You will not loose the benefit of partitioning by joining to non-partitioned tables, providing that the queries are designed to make use of the partitioned table, for example the following query WILL benefit from partitioning

SELECT F.Col1, F.Col2, D.Col3
FROM Fact_Partitioned F
    INNER JOIN Dim_MyDim D ON F.Col1 = D.Col1
WHERE F.Col1 = 5

But the following query WILL NOT benefit from partition elimination

SELECT F.Col1, F.Col2, D.Col3
FROM Fact_Partitioned F
    INNER JOIN Dim_MyDim D ON F.Col1 = D.Col1
WHERE D.Col1 = 5

It is a subtle difference, but in the first query, the join key is filtered in the partitioned table, taking advantage of elimination and then joined to the dimension. In the second query, the key is filtered in the dimension and then joined against the whole of the fact table, rather than just required partitions.

It goes without saying that the partitioning key needs to be in the WHERE clause for elimination to work, otherwise SQL Server does not know which partition(s) the data is in.

Adding a filter criteria on the JOIN clause will not help you. It needs to be in the WHERE clause to benefit from elimination.

The Partition Key does not need to be part of a non-clustered index (NCI) but if the NCI is unique, then it needs to contain the partitioning key in order to align the index. This is where the NCI is built on the same partition scheme as the table. NCIs should also be partition aligned unless there is an exceedingly good reason not to. I have never come across a good enough reason!