Sql-server – Most efficient way to aggregate data to new table

Tags: aggregate, reporting, sql-server

Say I have a large table A of around 151 million rows:

CREATE TABLE dbo.TableA (
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [MetaDataProp1] [int] NULL,
    [MetaDataProp2] [int] NULL,
    [MetaDataProp3] [int] NULL,
    [MetaDataProp4] [int] NULL,
    [MetaDataProp5] [int] NULL,
    [MetaDataProp6] [int] NULL,
    [MetaDataProp7] [int] NULL,
    [MetaDataProp8] [int] NULL,
    [MetaDataProp9] [int] NULL,
    [Date] [datetime] NOT NULL,
    [Cost] [decimal](20,10) NOT NULL
);

I need to aggregate my large table in 'n' ways, depending on what is needed by our end users.

For instance: group by Prop1-3 and day, SUM(Cost), for a given date range and Prop5 IN (X, Y, Z).
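A minimal sketch of that particular aggregation against the schema above (the date range and Prop5 values are placeholders, not real parameters):

DECLARE @StartDate datetime = '20170101',
        @EndDate   datetime = '20170201';

SELECT
    MetaDataProp1,
    MetaDataProp2,
    MetaDataProp3,
    CAST([Date] AS date) AS [Day],   -- collapse datetime to day granularity
    SUM(Cost)            AS TotalCost
FROM dbo.TableA
WHERE [Date] >= @StartDate
  AND [Date] <  @EndDate
  AND MetaDataProp5 IN (1, 2, 3)     -- placeholder filter values
GROUP BY
    MetaDataProp1,
    MetaDataProp2,
    MetaDataProp3,
    CAST([Date] AS date);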

I also sometimes have to join to other tables for some of these aggregations on top of that. Maintaining indexes on 20+ views seems like a bad idea and would create massive write overhead, so indexed views are out of the question.
That leaves me with one option as far as I know: to push the query results to a new table on my end. But this has some drawbacks:

An INSERT ... SELECT is pretty intensive, as it blocks other queries while it does its work. Even at just 20 aggregate result sets this would be a disaster.

My idea is this:

  • Fetch the Max Id.
  • Select from table A in paginated batches of 100k rows using OFFSET-FETCH, filtering on Id < the fetched max Id. This ensures we won't be endlessly re-fetching as new data is pushed in, and that the data won't fluctuate while the selects run (see the sketch after this list).
  • Push aggregate batch to aggregate table.
  • Keep going until nothing is found in the paginated select.
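A rough T-SQL sketch of that loop, assuming an aggregate table dbo.TableA_Agg with matching columns already exists (the table name and the 100k batch size are assumptions for illustration):

DECLARE @MaxId     bigint = (SELECT MAX(Id) FROM dbo.TableA);
DECLARE @BatchSize int    = 100000;
DECLARE @Offset    bigint = 0;
DECLARE @Rows      int    = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.TableA_Agg (MetaDataProp1, MetaDataProp2, MetaDataProp3, [Day], TotalCost)
    SELECT MetaDataProp1, MetaDataProp2, MetaDataProp3,
           CAST([Date] AS date), SUM(Cost)
    FROM (
        SELECT MetaDataProp1, MetaDataProp2, MetaDataProp3, [Date], Cost
        FROM dbo.TableA
        WHERE Id < @MaxId    -- freeze the upper bound so newly inserted rows can't shift the pages
        ORDER BY Id
        OFFSET @Offset ROWS FETCH NEXT @BatchSize ROWS ONLY
    ) AS batch
    GROUP BY MetaDataProp1, MetaDataProp2, MetaDataProp3, CAST([Date] AS date);

    SET @Rows   = @@ROWCOUNT;
    SET @Offset = @Offset + @BatchSize;
END;

Note that each batch is aggregated independently, so the same group can appear in several batches; the aggregate table would need either a final roll-up pass or a MERGE that adds each batch into existing rows instead of a plain INSERT.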

Advantage to this:

  • Since it's now batched, it allows other queries to flow through in between the batch select and batch insert.

Disadvantages to this:

  • More read I/O as each paginated fetch must order the grouped results.
  • Longer overall aggregation time.

Another approach could be to fetch the whole data set to a separate machine, hold it all in memory, and then bulk-save it back in batches. This would reduce read I/O, but it seems like a bad idea since the result set can be 40 GB+ in size; the cost of running a machine with that much memory is not worth it just for aggregated data.

What do you think of this approach? What's the recommended solution for this type of problem?

The data is updated about 3 times a day, during which we tear out old data and push in new data: a few million deletes and re-inserts each time.

I've heard whispers of columnstore indexes in SQL Server, and I'm in a position where I can use them, but I've never read much about them.
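For what it's worth, a nonclustered columnstore index can be added alongside the existing rowstore table with a single statement; the index name and column list below are just an assumed example:

-- Covers the columns the reporting aggregations touch; large scans and
-- aggregates over it are typically much cheaper than over the rowstore.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_TableA_Reporting
    ON dbo.TableA (MetaDataProp1, MetaDataProp2, MetaDataProp3,
                   MetaDataProp5, [Date], Cost);

One caveat: on SQL Server 2012/2014 a nonclustered columnstore index makes the table read-only, which matters given the thrice-daily delete/re-insert cycle; from 2016 onward it stays updatable.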

Best Answer

This is what SSAS was made for, and there are a whole host of client tools that can connect to it, including Excel.

Elaborating

SSAS (SQL Server Analysis Services) is a component included with SQL Server. It builds a multidimensional version of your data (a cube) and pre-builds all the aggregations you require/define.

The design tool, SQL Server Data Tools, is a free download.

Once you have set up the data source and data source view based on your table, the easiest way to go is to create a Dimension and a Fact from it: the dimension holds all the metadata descriptors, and the fact holds just the numerics/measures.
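As a sketch of that split, assuming the TableA schema from the question (the view names are made up), the data source view could be fed from two views like these:

-- Dimension source: the distinct metadata descriptor combinations
-- (in practice you would likely add a surrogate key here).
CREATE VIEW dbo.vDimMetaData AS
SELECT DISTINCT
    MetaDataProp1, MetaDataProp2, MetaDataProp3,
    MetaDataProp4, MetaDataProp5, MetaDataProp6,
    MetaDataProp7, MetaDataProp8, MetaDataProp9
FROM dbo.TableA;
GO

-- Fact source: the Cost measure plus the columns used to relate rows
-- back to the dimension and to the date.
CREATE VIEW dbo.vFactCost AS
SELECT
    Id,
    MetaDataProp1, MetaDataProp2, MetaDataProp3,
    MetaDataProp4, MetaDataProp5, MetaDataProp6,
    MetaDataProp7, MetaDataProp8, MetaDataProp9,
    [Date],
    Cost
FROM dbo.TableA;
GO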

You then join the two in the cube. Once it's published/deployed, SSRS and other client tools can connect to it.

Some links

https://www.codeproject.com/Articles/658912/Create-First-OLAP-Cube-in-SQL-Server-Analysis-Serv

https://technet.microsoft.com/en-us/library/ee677579.aspx