Some Background Info
We have a set of tables that holds all the transactions for our system, TransactionHeaders
and TransactionDetails
. There are roughly 80k Transactions a day which translates to 80k rows in TransactionHeaders
and 900k rows in TransactionDetails
being burned daily. At current, the TransactionHeaders
has about 10 million rows and TransactionDetails
has about 110 million rows. We use the data for general reporting.
dbo.TransactionHeaders (
TransactionHeaderId int identity,
TransactionHeaderTypeId int not null,
StoreId int not null,
TransactionDate date not null,
TransactionTime time not null,
...)
dbo.TransactionDetails (
TransactionDetailId int identity,
TransactionHeaderId int not null,
ItemUPC NCHAR(14) not null,
Price NUMERIC(16,2) not null,
ReplicationDate datetime not null
...)
The Issue
Querying has become cumbersome. It takes a very long time to access the sales of a given store or a given item for any period.
What I've Tried
I have tried to bring TransactionDate
down to the TransactionDetails
table in order to partition it on the date with one partition per day. This worked great for finding the sales of an item. The problem is that many of the reports require the StoreID in addition to being over a specific date range.
Given that adding more information from the TransactionHeader
table to the TransactionDetail
table breaks the pattern, I'm hesitant to denormalize the tables into one table due to storage concerns.
I've had the idea to partition TransactionHeader
on TransactionDate
and partition TransactionDetail
on TransactionHeaderID
. In theory this makes the queried data significantly smaller and reinforces the pattern by making the details only reasonably accessed via the header information.
The Question(TL;DR)
Is there a preferred, correct, standard, etc. method for dealing with tables in the Master/Detail pattern in order to increase performance? Partitioning one or both tables? I'd like to avoid denormalizing if at all possible.
Best Answer
You've got a few different questions in here:
Q: It takes a very long time to access the sales of a given store or a given item for any period.
A: To troubleshoot that, we would need to see the execution plans of the queries involved, plus know a little about the query runtime and the hardware involved. 10mm rows in a header table and 110mm rows in a detail table isn't much at all for SQL Server, so this should be a solvable problem.
Q: (Partitioning) worked great for finding the sales of an item. The problem is that many of the reports require the StoreID in addition to being over a specific date range.
A: Correct, partitioning rarely makes SELECT queries faster. It's more about improving performance of bulk loads, specifically partition switching. I wouldn't think of partitioning as a solution to this problem, and indeed, it will actually make most queries worse.
Q: Is there a preferred, correct, standard, etc. method for dealing with tables in the Master/Detail pattern in order to increase performance?
A: Absolutely - archive older data. Figure out what you're going to let users query online at high speed, and then beyond that, move the data into a separate set of archive tables. You can use a partitioned view over the old and new tables in order to give them a single seamless view into the data for easier reporting too.
There's a lot of advantages to this approach. For example, when you want to add additional fields to the current table, you can do that quickly without having to deal with a large amount of archive data. If you want to add lots of indexes to the old archive data, you can - because it's not getting tons of inserts/updates/deletes anymore. If you split the old and new data into different databases, you can even use different backup/recovery strategies with them - even while the view is in place, and users don't know the data is split.