Sql-server – How to structure Master/Detail tables for performance without denormalizing

partitioning, sql-server

Some Background Info

We have a set of tables that holds all the transactions for our system: TransactionHeaders and TransactionDetails. There are roughly 80k transactions a day, which translates to about 80k rows in TransactionHeaders and 900k rows in TransactionDetails added daily. Currently, TransactionHeaders has about 10 million rows and TransactionDetails has about 110 million rows. We use the data for general reporting.

dbo.TransactionHeaders (
TransactionHeaderId            int                            identity,
TransactionHeaderTypeId        int                            not null,
StoreId                        int                            not null,
TransactionDate                date                           not null,
TransactionTime                time                           not null,
...)  

dbo.TransactionDetails (
TransactionDetailId            int                            identity,
TransactionHeaderId            int                            not null,
ItemUPC                        NCHAR(14)                      not null,
Price                          NUMERIC(16,2)                  not null,
ReplicationDate                datetime                       not null
...)

The Issue

Querying has become cumbersome. It takes a very long time to access the sales of a given store or a given item for any period.

What I've Tried

I have tried to bring TransactionDate down to the TransactionDetails table in order to partition it on the date with one partition per day. This worked great for finding the sales of an item. The problem is that many of the reports require the StoreID in addition to being over a specific date range.

Adding more information from TransactionHeaders to TransactionDetails breaks the master/detail pattern, and I'm hesitant to denormalize the two tables into one because of storage concerns.

I've had the idea to partition TransactionHeaders on TransactionDate and partition TransactionDetails on TransactionHeaderId. In theory this makes the queried data significantly smaller and reinforces the pattern by making the details reasonably accessible only via the header information.

The Question (TL;DR)

Is there a preferred, correct, standard, etc. method for dealing with tables in the Master/Detail pattern in order to increase performance? Partitioning one or both tables? I'd like to avoid denormalizing if at all possible.

Best Answer

You've got a few different questions in here:

Q: It takes a very long time to access the sales of a given store or a given item for any period.

A: To troubleshoot that, we would need to see the execution plans of the queries involved, plus know a little about the query runtime and the hardware involved. 10 million rows in a header table and 110 million rows in a detail table isn't much at all for SQL Server, so this should be a solvable problem.
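As a rough sketch of how that information could be gathered, the batch below runs one report-style query with I/O and timing statistics switched on. The query itself, the store id, and the date range are assumptions made for illustration; only the table and column names come from the question. In SSMS, enabling "Include Actual Execution Plan" before running it would also capture the plan.

```sql
-- Report logical reads and CPU/elapsed time for the statements that follow.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Hypothetical report: sales per item for one store over one week.
SELECT d.ItemUPC,
       SUM(d.Price) AS TotalSales
FROM dbo.TransactionHeaders AS h
JOIN dbo.TransactionDetails AS d
    ON d.TransactionHeaderId = h.TransactionHeaderId
WHERE h.StoreId = 42                     -- assumed store id
  AND h.TransactionDate >= '20150101'    -- assumed date range (half-open)
  AND h.TransactionDate <  '20150108'
GROUP BY d.ItemUPC;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The Messages tab output (reads per table, CPU time, elapsed time) together with the actual plan is the kind of detail needed to diagnose a slow query like this.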

Q: (Partitioning) worked great for finding the sales of an item. The problem is that many of the reports require the StoreID in addition to being over a specific date range.

A: Correct, partitioning rarely makes SELECT queries faster. It's more about improving performance of bulk loads, specifically partition switching. I wouldn't think of partitioning as a solution to this problem, and indeed, it will actually make most queries worse.

Q: Is there a preferred, correct, standard, etc. method for dealing with tables in the Master/Detail pattern in order to increase performance?

A: Absolutely - archive older data. Figure out what you're going to let users query online at high speed, and then beyond that, move the data into a separate set of archive tables. You can use a partitioned view over the old and new tables in order to give them a single seamless view into the data for easier reporting too.

There are a lot of advantages to this approach. For example, when you want to add additional fields to the current table, you can do that quickly without having to deal with a large amount of archive data. If you want to add lots of indexes to the old archive data, you can, because it's not getting tons of inserts/updates/deletes anymore. If you split the old and new data into different databases, you can even use different backup/recovery strategies with them, even while the view is in place, and users don't know the data is split.
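A minimal sketch of the archive-plus-view idea for the header table might look like the following. The archive table name, the view name, the constraint names, and the cutoff date are all assumptions; the column list is limited to the columns shown in the question, and the detail table would need the same treatment.

```sql
-- Archive table (hypothetical name); no IDENTITY so original ids can be copied over.
CREATE TABLE dbo.TransactionHeaders_Archive (
    TransactionHeaderId      int  NOT NULL,
    TransactionHeaderTypeId  int  NOT NULL,
    StoreId                  int  NOT NULL,
    TransactionDate          date NOT NULL,
    TransactionTime          time NOT NULL,
    CONSTRAINT PK_TransactionHeaders_Archive PRIMARY KEY (TransactionHeaderId),
    -- Assumed cutoff: everything before 2015 lives in the archive.
    CONSTRAINT CK_TransactionHeaders_Archive_Date CHECK (TransactionDate < '20150101')
);

-- Matching constraint on the current table. This fails until the older rows
-- have actually been moved into the archive table.
ALTER TABLE dbo.TransactionHeaders WITH CHECK
    ADD CONSTRAINT CK_TransactionHeaders_Date CHECK (TransactionDate >= '20150101');
GO  -- batch separator (SSMS/sqlcmd); CREATE VIEW must start its own batch

-- Single reporting view over current and archived rows.
CREATE VIEW dbo.AllTransactionHeaders
AS
SELECT TransactionHeaderId, TransactionHeaderTypeId, StoreId, TransactionDate, TransactionTime
FROM dbo.TransactionHeaders
UNION ALL
SELECT TransactionHeaderId, TransactionHeaderTypeId, StoreId, TransactionDate, TransactionTime
FROM dbo.TransactionHeaders_Archive;
GO
```

With trusted CHECK constraints on TransactionDate, a query against the view that filters on a date range can let the optimizer skip whichever table cannot contain matching rows.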

Related Solutions

MySQL – Partition by year and sub-partition by month

I had to do the same thing and solved it slightly differently. As far as I understand from reading the docs, MySQL subpartitioning does not support partition types other than HASH and KEY.

In MySQL 5.6, it is possible to subpartition tables that are partitioned by RANGE or LIST. Subpartitions may use either HASH or KEY partitioning. This is also known as composite partitioning.

This means that we can't determine which subpartition a record will end up in. It's up to MySQL. So I don't think it's wise to give your subpartitions names such as january, because there is no way to know whether rows created in that month will end up there. This is because you can't subpartition by MONTH(nav_date), only by HASH(MONTH(nav_date)) or KEY(MONTH(nav_date)).

So to solve the problem I decided to create a new month column in my table and added an index to it. Then I subpartitioned by KEY(month) without caring about the subpartition names. This way, in MySQL 5.6, I could select the main partition in the FROM clause and the subpartition by specifying the month in the WHERE clause.

The full example follows:

CREATE TABLE `my_example` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `value` double DEFAULT NULL,
    `is_deleted` tinyint(1) NOT NULL DEFAULT '0',
    `timestamp` datetime NOT NULL,
    `last_modified` datetime NOT NULL,
    `month` tinyint(1) NOT NULL,
    PRIMARY KEY (`id`,`timestamp`, `month`),
    KEY `in_is_deleted` (`is_deleted`),
    KEY `in_last_modified` (`last_modified`),
    KEY `in_timestamp` (`timestamp`),
    KEY `in_month` (`month`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY RANGE (YEAR(`timestamp`))
SUBPARTITION BY KEY (`month`) 
SUBPARTITIONS 12 (
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION p2016 VALUES LESS THAN (2017),
    PARTITION p2017 VALUES LESS THAN (2018),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

EXPLAIN PARTITIONS SELECT * FROM my_example PARTITION (p2015); -- will go through all twelve subpartitions
EXPLAIN PARTITIONS SELECT * FROM my_example PARTITION (p2015) WHERE month = 9; -- will look only into one subpartition

Sql-server – SQL Server Database Table Partitioning Consideration

The real use case for table partitioning is fast load and unload of data.

If your warehouse tables and staging tables are in the same database, and have the same schema, you will be able to swap a partition from staging to warehouse very quickly. Similarly, when data has reached its retention expiry date and must be purged, it is very quick to delete a whole partition of data. This page and its ilk will give some pointers.
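The load and purge steps described above could be sketched roughly as follows. All object names here (dbo.FactSales, dbo.FactSales_Staging, dbo.FactSales_Purge, and the partition function pfSalesDate) are hypothetical. The sketch assumes the staging and purge tables have the same columns and indexes as the warehouse table, sit on the same filegroup as the partition involved, and that the staging table carries a CHECK constraint limiting its rows to the target partition's range.

```sql
-- Work out which partition number holds the day being loaded.
DECLARE @loadPartition int = $PARTITION.pfSalesDate('20150601');

-- Load: switch the fully loaded staging table in as that partition.
-- The target partition must be empty; this is a metadata-only operation.
ALTER TABLE dbo.FactSales_Staging
    SWITCH TO dbo.FactSales PARTITION @loadPartition;

-- Purge: switch an expired partition out to an empty table, then discard it.
DECLARE @expiredPartition int = $PARTITION.pfSalesDate('20140601');

ALTER TABLE dbo.FactSales
    SWITCH PARTITION @expiredPartition TO dbo.FactSales_Purge;

TRUNCATE TABLE dbo.FactSales_Purge;
```

Because a switch only changes metadata, neither step has to copy or delete rows one at a time, which is where the speed comes from.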

As each partition can be directed to a different filegroup your backup & recovery cycles could be shortened at the cost of increased complexity. For example, load one company's data, process it, then take a backup of the file(s) that hold that company. Perhaps several companies could be processed in parallel, knowing that file contention will be reduced through partitioning?

Indexes can be partitioned as well as the tables. Index maintenance can be performed one partition at a time. This could reduce contention and overall load on your system.
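As an illustration of per-partition maintenance, the statements below rebuild or reorganize a single partition of an index. The table name, index name, and partition number are assumptions for the sake of the example.

```sql
-- Rebuild only partition 5 of a hypothetical partitioned (aligned) index.
ALTER INDEX IX_FactSales_SaleDate
    ON dbo.FactSales
    REBUILD PARTITION = 5;

-- Or, as a lighter-weight alternative, reorganize just that partition.
ALTER INDEX IX_FactSales_SaleDate
    ON dbo.FactSales
    REORGANIZE PARTITION = 5;
```

Older partitions that no longer receive writes can then be left alone, so maintenance windows only need to cover the partitions that are still changing.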

Using partitions solely for performance enhancement can be problematic. Every query will have to have the partition key in the predicate, as a minimum. Sometimes the optimiser chooses not to perform partition elimination even then.
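To make the point about the partition key concrete, here is a hedged sketch using a hypothetical table dbo.FactSales partitioned on SaleDate by a hypothetical partition function pfSalesDate, with an assumed Amount column. Whether elimination actually happens should be confirmed in the actual execution plan.

```sql
-- Which partition would a given date land in?
SELECT $PARTITION.pfSalesDate('20150601') AS PartitionNumber;

-- Predicate directly on the partition key: a candidate for partition elimination.
SELECT SUM(Amount) AS TotalSales
FROM dbo.FactSales
WHERE SaleDate >= '20150601'
  AND SaleDate <  '20150602';

-- No predicate on the partition key: every partition has to be considered.
SELECT SUM(Amount) AS TotalSales
FROM dbo.FactSales
WHERE StoreId = 42;   -- assumed store id
```

The second query is the situation from the original question: a filter on StoreId alone gives the optimizer nothing with which to rule out date-based partitions.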

Partitioning is no free lunch. It has costs as well as benefits. You must consider both sides and test well. As so often in DB design, "it depends."