How to index cube-style queries with changing aggregations

azure-sql-data-warehouse execution-plan performance query-performance

I've got a query that emulates the CUBE function.

It consists of many different aggregations unioned together:

select x,y,z, sum(a) from tbl group by x,y,z
union all
select x,'all',z, sum(a) from tbl group by x,z
union all
select 'all',y,z, sum(a) from tbl group by y,z
union all
select x,'all','all', sum(a) from tbl group by x
etc.
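
For comparison, here is a sketch of the single statement this union is standing in for, on an engine that does support CUBE (SQL Server, for example). It assumes x, y and z are character columns, so that the literal 'all' can stand in for the NULL that CUBE produces for a rolled-up column. The alias total_a is illustrative only.

```sql
-- Sketch only: assumes x, y, z are character columns.
-- GROUPING(col) returns 1 when col has been rolled up in that output row.
SELECT CASE WHEN GROUPING(x) = 1 THEN 'all' ELSE x END AS x,
       CASE WHEN GROUPING(y) = 1 THEN 'all' ELSE y END AS y,
       CASE WHEN GROUPING(z) = 1 THEN 'all' ELSE z END AS z,
       SUM(a) AS total_a
FROM tbl
GROUP BY CUBE (x, y, z);
```

CUBE over three columns yields all eight grouping sets, including the grand total, which is the full list that the "etc." above abbreviates.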

Unfortunately, this is really slow. Everything is distribution aligned. Analysing the estimated query plans shows that a columnstore index always uses a hash match to perform this operation. A clustered index allows one of the queries to use a stream aggregate, although I was expecting a few more (see plan below).

I was hoping that the below index:

clustered index (x,y,z)

would enable a stream aggregate for the below groupings

x,y,z

x,y

x
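
Spelled out as a sketch, the expectation is that each grouping on a leading prefix of the index key could read rows already in key order. The index name cix_tbl is made up for illustration. Whether the optimiser actually picks a stream aggregate for each branch is exactly what is in question here.

```sql
-- Hypothetical index name; key order matches the question.
create clustered index cix_tbl on tbl (x, y, z);

-- Groupings on a leading prefix of (x, y, z): the hoped-for stream aggregate cases.
select x, y, z, sum(a) from tbl group by x, y, z
union all
select x, y, 'all', sum(a) from tbl group by x, y
union all
select x, 'all', 'all', sum(a) from tbl group by x;

-- Not a leading prefix: the index order does not help, so a hash match
-- (or a sort feeding a stream aggregate) would be expected here.
select 'all', y, z, sum(a) from tbl group by y, z;
```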

Is there something else I'm overlooking?

Best Answer

I am an Oracle developer. Based on how clustered indexes (CI) are implemented in Oracle, here are a few pitfalls to avoid.

Ensure that the table tbl is small enough to avoid overflow segments.
Ideally, the table tbl should be used only for bulk inserts and selects. Updates, deletes and interspersed inserts will cause page splits.
If there are other nonclustered indexes on tbl, ensure that those indexes do not end up with a huge ClusteringFactor after converting tbl to a CI.
Sharding tbl might help provided it does not affect other indexes and SQL performance.

Related Solutions

Sql-server – Optimising join on large table

Your ix_hugetable looks quite useless because:

it is the clustered index (PK)
the INCLUDE makes no difference because a clustered index INCLUDEs all non-key columns (non-key values at lowest leaf = INCLUDEd = what a clustered index is)

In addition:

added or fk should be first
ID is first = not much use

Try changing the clustered key to (added, fk, id) and drop ix_hugetable. You've already tried (fk, added, id). If nothing else, you'll save a lot of disk space and index maintenance.
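
A rough sketch of that change, under assumptions: the primary key constraint is called pk_hugetable here (a made-up name, substitute the real one), the columns added and fk are NOT NULL, and nothing else references the primary key. Rebuilding the clustered index on a table this size is a heavy operation, so treat this as an outline rather than a script to run as-is.

```sql
-- Drop the redundant nonclustered index first so it is not rebuilt along the way.
DROP INDEX ix_hugetable ON dbo.hugetable;

-- pk_hugetable is a hypothetical constraint name.
ALTER TABLE dbo.hugetable DROP CONSTRAINT pk_hugetable;

-- Recreate the primary key as the clustered index with the new key order.
ALTER TABLE dbo.hugetable
    ADD CONSTRAINT pk_hugetable PRIMARY KEY CLUSTERED (added, fk, id);
```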

Another option might be to try the FORCE ORDER hint with the table order both ways and no JOIN/INDEX hints. I try not to use JOIN/INDEX hints personally because you remove options for the optimiser. Many years ago I was told (at a seminar with a SQL guru) that the FORCE ORDER hint can help when you have a huge table joined to a small table: YMMV 7 years later...
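
As a sketch of what "both ways" means: FORCE ORDER makes the optimiser keep the join order as written in the FROM clause, so the two variants differ only in which table comes first. The original query is not shown here, so dbo.smalltable and the select list are placeholders.

```sql
-- Variant 1: small table first (dbo.smalltable is a hypothetical name).
SELECT h.fk, COUNT(h.value) AS value_count
FROM dbo.smalltable AS s
JOIN dbo.hugetable AS h ON h.fk = s.fk
GROUP BY h.fk
OPTION (FORCE ORDER);

-- Variant 2: huge table first; same query otherwise.
SELECT h.fk, COUNT(h.value) AS value_count
FROM dbo.hugetable AS h
JOIN dbo.smalltable AS s ON s.fk = h.fk
GROUP BY h.fk
OPTION (FORCE ORDER);
```

Compare the plans and reads of the two variants against the unhinted query before keeping either hint.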

Oh, and let us know where the DBA lives so we can arrange for some percussion adjustment

Edit, after 02 Jun update

The 4th column is not part of the non-clustered index so it uses the clustered index.

Try changing the NC index to INCLUDE the value column so it doesn't have to fetch the value column from the clustered index

create nonclustered index ix_hugetable on dbo.hugetable (
    fk asc, added asc
) include(value)

Note: If value is not nullable then COUNT(value) is semantically the same as COUNT(*). But SUM needs the actual value, not just its existence.

As an example, if you change COUNT(value) to COUNT(DISTINCT value) without changing the index it should break the query again because it has to process value as a value, not as existence.
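
A small self-contained illustration of the three COUNT forms, using a throwaway table variable (not the real table) with one NULL and one duplicate:

```sql
DECLARE @t TABLE (value int NULL);

INSERT INTO @t (value) VALUES (1), (1), (NULL);

SELECT COUNT(*)              AS all_rows,         -- counts rows: 3
       COUNT(value)          AS non_null_values,  -- skips the NULL: 2
       COUNT(DISTINCT value) AS distinct_values   -- must compare values: 1
FROM @t;
```

COUNT(*) and COUNT(value) only differ when NULLs are possible, while COUNT(DISTINCT value) always has to read the values themselves.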

The query needs 3 columns: added, fk, value. The first 2 are filtered/joined so are key columns. value is just used so can be included. Classic use of a covering index.

Sql-server – Comparing two queries in SQL Server 2012

I love your careful approach to query tuning and to reviewing options and plans. I wish more developers did this. One caution: always test with a lot of rows and look at the logical reads; this is a smallish table. Try to generate a sample load and run the queries again. One small issue: your top query does not ask for an ORDER BY, but your bottom query does. You should compare and contrast them with the same ordering on each.

I just quickly created a SalesOrders table with 200,000 sales orders in it - still not huge by any stretch of the imagination. And ran the queries with the ORDER BY in each. I also played with indexes a bit.

With no clustered index on OrderID, just a non-clustered index on CustID: the second query outperformed the first, especially with the ORDER BY included in each. The first query did twice as many reads as the second, and the cost percentages were 67% / 33% between the queries.

With a clustered index on OrderID and a non-clustered index just on CustID: they performed at a similar speed with exactly the same number of reads.
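
A test bed along these lines can be sketched as follows. This is not the exact script used for the numbers above: the column types, the 5,000-customer spread and the index names are all assumptions, and the row generator relies on sys.all_objects cross-joined with itself returning at least 200,000 rows, which is normally the case.

```sql
CREATE TABLE dbo.SalesOrders (
    OrderID int NOT NULL,
    CustID  int NOT NULL
);

-- Generate 200,000 orders spread randomly over 5,000 hypothetical customers.
WITH n AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.SalesOrders (OrderID, CustID)
SELECT rn,
       ABS(CHECKSUM(NEWID()) % 5000) + 1   -- random CustID between 1 and 5000
FROM n
WHERE rn <= 200000;

-- Configuration 1: heap plus a nonclustered index on CustID.
CREATE NONCLUSTERED INDEX ix_SalesOrders_CustID
    ON dbo.SalesOrders (CustID);

-- Configuration 2: add the clustered index on OrderID, then rerun both queries.
CREATE UNIQUE CLUSTERED INDEX cix_SalesOrders_OrderID
    ON dbo.SalesOrders (OrderID);
```

Run both queries after creating the first index, record the reads, then create the second index and run them again.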

So I would suggest you increase the number of rows and do some more testing. But my final analysis on your queries -

You may find them behaving more similarly than you realize when you increase the rows, so keep that caveat in mind and test that way.

If all you ever want to return is the maximum OrderID for each customer, and you determine that by the OrderID being the greatest, then the second of these two queries is the best way to go, to my mind. It is a bit simpler and, while ever so slightly more expensive based on subtree cost, it is a quicker and easier statement to decipher. If you intend to add other columns to your result set someday, then the first query allows you to do that.

Updated: One of your comments under your question was:

Please keep in mind, that finding the best query in this question is a means of refining the techniques used for comparing them.

But the best takeaway for doing that: test with more data. Always make sure you have data consistent with production and expected future production. Query plans start looking different when you give the tables more rows, so try to keep the distribution close to what you'd expect in production. And pay attention to things like including ORDER BY or not; here I don't think it makes a terrible bit of difference in the end, but it is still worth digging into.

Your approach of comparing this level of detail and data is a good one. Subtree costs are arbitrary and meaningless mostly, but still worth at least looking at for comparison between edits/changes or even between queries. Looking at the time statistics and the IO are quite important, as is looking at the plan for anything that feels out of place for the size of the data you are working with and what you are trying to do.
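
A minimal sketch of that kind of comparison. The two queries from the question are not reproduced here, so the forms below are assumed stand-ins for "greatest OrderID per customer": a ROW_NUMBER version that can carry extra columns, and a plain MAX with GROUP BY. Both get the same ORDER BY so the comparison is like for like.

```sql
SET STATISTICS IO ON;    -- logical and physical reads per table
SET STATISTICS TIME ON;  -- CPU and elapsed time per statement

-- Assumed form A: ranking, easy to extend with more columns later.
SELECT CustID, OrderID
FROM (
    SELECT CustID, OrderID,
           ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY OrderID DESC) AS rn
    FROM dbo.SalesOrders
) AS ranked
WHERE rn = 1
ORDER BY CustID;

-- Assumed form B: simple aggregate.
SELECT CustID, MAX(OrderID) AS OrderID
FROM dbo.SalesOrders
GROUP BY CustID
ORDER BY CustID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The Messages tab then shows reads and timings for each statement, which can be set side by side with the actual execution plans.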
