Use a `CROSS APPLY` (or an `OUTER APPLY`, if rows with no matching mask must be kept) to replace the `CASE` expressions in your aggregates, like this:
```sql
WITH f AS
(
    SELECT * FROM (VALUES
        (1, 4, 1, 0, 0),
        (1, 5, 0, 1, 0),
        (1, 6, 0, 0, 1)
    ) AS f (IncludeColumn, SecondInclude, Column1, Column2, Column3)
)
SELECT
    orgid = BoardName,
    'B' AS OrgType,
    Lang,
    BoardName,
    CAST(NULL AS varchar) AS SchoolName,
    SUM(f.Column1) AS Column1,
    SUM(f.Column2) AS Column2,
    SUM(f.Column3) AS Column3
FROM TableB
CROSS APPLY f
WHERE TableB.[Include_Board] = 1
    AND TableB.IncludeColumn = f.IncludeColumn
    AND TableB.SecondInclude = f.SecondInclude
GROUP BY [Lang], BoardName;
```
where the table `f` can, as appropriate, be:
- a permanent table in your database; or
- a temporary table defined by the application; or
- a table variable created within the stored procedure
This approach has the additional advantage of being data-driven: if the masking needs to be amended, only a data change is required rather than a code change.
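For example, the permanent-table variant might look like the following sketch (the name `dbo.ColumnMask` is illustrative; amending the masking then becomes a plain `INSERT`, `UPDATE`, or `DELETE` against this table):

```sql
-- Permanent, data-driven version of the mapping table f from the CTE above.
CREATE TABLE dbo.ColumnMask
(
    IncludeColumn int NOT NULL,
    SecondInclude int NOT NULL,
    Column1       int NOT NULL,
    Column2       int NOT NULL,
    Column3       int NOT NULL,
    CONSTRAINT PK_ColumnMask
        PRIMARY KEY (IncludeColumn, SecondInclude)
);

-- Same rows as the VALUES constructor in the CTE.
INSERT dbo.ColumnMask
    (IncludeColumn, SecondInclude, Column1, Column2, Column3)
VALUES
    (1, 4, 1, 0, 0),
    (1, 5, 0, 1, 0),
    (1, 6, 0, 0, 1);
```

The query then references `dbo.ColumnMask` in place of the CTE `f`, and no code change is needed when the mask rows change.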
You're right that the SQL Server optimizer prefers not to generate parallel `MERGE` join plans (it costs this alternative quite high). A parallel `MERGE` join always requires repartitioning exchanges on both join inputs and, more importantly, requires that row order be preserved across those exchanges. Parallelism is most efficient when each thread can run independently; order preservation often leads to frequent synchronization waits, and may ultimately cause the exchanges to spill to tempdb to resolve an intra-query deadlock condition.
These problems can be circumvented by running multiple instances of the whole query on one thread each, with each thread processing an exclusive range of data. This is not a strategy that the optimizer considers natively, however. As it is, the original SQL Server model for parallelism breaks the query at exchanges, and runs the plan segments formed by those splits on multiple threads.
There are ways to achieve running whole query plans on multiple threads over exclusive data set ranges, but they require trickery that not everyone will be happy with (and that will not be supported by Microsoft or guaranteed to work in the future). One such approach is to iterate over the partitions of a partitioned table and give each thread the task of producing a subtotal. The result is the `SUM` of the row counts returned by each independent thread:
Obtaining partition numbers is easy enough from metadata:
```sql
DECLARE @P AS table
(
    partition_number integer PRIMARY KEY
);

INSERT @P (partition_number)
SELECT
    p.partition_number
FROM sys.partitions AS p
WHERE
    p.[object_id] = OBJECT_ID(N'test_transaction_properties', N'U')
    AND p.index_id = 1;
```
We then use these numbers to drive a correlated join (`APPLY`), and the `$PARTITION` function to limit each thread to the current partition number:
```sql
SELECT
    row_count = SUM(Subtotals.cnt)
FROM @P AS p
CROSS APPLY
(
    SELECT
        cnt = COUNT_BIG(*)
    FROM dbo.test_transaction_item_detail AS i
    JOIN dbo.test_transaction_properties AS t ON
        t.transactionID = i.transactionID
    WHERE
        $PARTITION.pf_test_transactionId(t.transactionID) = p.partition_number
        AND $PARTITION.pf_test_transactionId(i.transactionID) = p.partition_number
) AS Subtotals;
```
The query plan shows a `MERGE` join being performed for each row in table `@P`. The clustered index scan properties confirm that only a single partition is processed on each iteration:
![Apply serial plan](https://i.stack.imgur.com/S1tIi.jpg)
Unfortunately, this only results in sequential serial processing of partitions. On the data set you provided, my 4-core (hyperthreaded to 8) laptop returns the correct result in 7 seconds with all data in memory.
To get the `MERGE` sub-plans to run concurrently, we need a parallel plan where partition numbers are distributed over the available threads (`MAXDOP`) and each `MERGE` sub-plan runs on a single thread using the data in one partition. Unfortunately, the optimizer frequently decides against a parallel `MERGE` on cost grounds, and there is no documented way to force a parallel plan. There is an undocumented (and unsupported) way, using trace flag 8649:
```sql
SELECT
    row_count = SUM(Subtotals.cnt)
FROM @P AS p
CROSS APPLY
(
    SELECT
        cnt = COUNT_BIG(*)
    FROM dbo.test_transaction_item_detail AS i
    JOIN dbo.test_transaction_properties AS t ON
        t.transactionID = i.transactionID
    WHERE
        $PARTITION.pf_test_transactionId(t.transactionID) = p.partition_number
        AND $PARTITION.pf_test_transactionId(i.transactionID) = p.partition_number
) AS Subtotals
OPTION (QUERYTRACEON 8649);
```
Now the query plan shows partition numbers from `@P` being distributed among threads on a round-robin basis. Each thread runs the inner side of the nested loops join for a single partition, achieving our goal of processing disjoint data concurrently. The same result is now returned in 3 seconds on my 8 hyper-cores, with all eight at 100% utilization.
![Parallel APPLY](https://i.stack.imgur.com/hIdYp.jpg)
I am not recommending you use this technique necessarily - see my earlier warnings - but it does address your question.
See my article Improving Partitioned Table Join Performance for more details.
Columnstore
Seeing as you are using SQL Server 2012 (and assuming it is Enterprise) you also have the option of using a columnstore index. This shows the potential of batch mode hash joins where sufficient memory is available:
```sql
CREATE NONCLUSTERED COLUMNSTORE INDEX cs
ON dbo.test_transaction_properties (transactionID);

CREATE NONCLUSTERED COLUMNSTORE INDEX cs
ON dbo.test_transaction_item_detail (transactionID);
```
With these indexes in place the query...
```sql
SELECT
    COUNT_BIG(*)
FROM dbo.test_transaction_properties AS ttp
JOIN dbo.test_transaction_item_detail AS ttid ON
    ttid.transactionID = ttp.transactionID;
```
...results in the following execution plan from the optimizer without any trickery:
![Columnstore plan 1](https://i.stack.imgur.com/JeIWj.jpg)
Correct results in 2 seconds, but eliminating the row-mode processing for the scalar aggregate helps even more:
```sql
SELECT
    COUNT_BIG(*)
FROM dbo.test_transaction_properties AS ttp
JOIN dbo.test_transaction_item_detail AS ttid ON
    ttid.transactionID = ttp.transactionID
GROUP BY
    ttp.transactionID % 1;
```
![Optimized columnstore](https://i.stack.imgur.com/RRXdb.jpg)
The optimized columnstore query runs in 851 ms.
Geoff Patterson created the bug report Partition Wise Joins but it was closed as Won't Fix.
Hashing is a common partitioning mechanism for your use case, but it is not available natively in SQL Server 2008: http://www.sqlservercentral.com/articles/partition/64740/.
The problem with hash partitioning is that if you add partitions, you need to redistribute the data. Logically speaking, any partitioning scheme that dynamically determines which partition a row goes into must redistribute data after partitions are added (unless, perhaps, you hash on the insert time and keep track of when each partition was added). This assumes partitioning (distributing data via a predefined scheme to reduce subsequent lookup time) as opposed to sharding (distributing data across multiple boxes to reduce processing time over the entire data set).
You can probably simulate a hash partition by generating the hash yourself on insert/update and using a range-based partition over the hashed value. For instance, let's say you want 10 partitions: take the hash of your key value modulo 10 and range-partition on the remainder, with each range covering one of the 10 possible values.
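A minimal sketch of that idea, assuming illustrative names (`pf_hash10`, `ps_hash10`, `dbo.Orders`) and `CHECKSUM` as the hash function: a persisted computed column holds the bucket, and a range partition function covers the 10 possible remainders:

```sql
-- Range partition function over the 10 possible remainders 0..9.
-- Nine boundary values with RANGE LEFT yield ten partitions.
CREATE PARTITION FUNCTION pf_hash10 (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8);

CREATE PARTITION SCHEME ps_hash10
AS PARTITION pf_hash10 ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders
(
    OrderID    bigint NOT NULL,
    -- Bucket computed from the key; must be PERSISTED to be a
    -- partitioning column, and must appear in the aligned PK.
    HashBucket AS CONVERT(tinyint, ABS(CHECKSUM(OrderID)) % 10) PERSISTED NOT NULL,
    -- other columns ...
    CONSTRAINT PK_Orders PRIMARY KEY (OrderID, HashBucket)
) ON ps_hash10 (HashBucket);
```

Note that each row's bucket is fixed at insert time: adding an eleventh partition would change the modulus, so the table would need rebuilding, which is exactly the redistribution problem described above.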