Sql-server – Most efficient way to aggregate data to new table

Tags: aggregate, reporting, sql-server

Say I have a large table A of around 151 million rows:

CREATE TABLE dbo.TableA (
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [MetaDataProp1] [int] NULL,
    [MetaDataProp2] [int] NULL,
    [MetaDataProp3] [int] NULL,
    [MetaDataProp4] [int] NULL,
    [MetaDataProp5] [int] NULL,
    [MetaDataProp6] [int] NULL,
    [MetaDataProp7] [int] NULL,
    [MetaDataProp8] [int] NULL,
    [MetaDataProp9] [int] NULL,
    [Date] [datetime] NOT NULL,
    [Cost] [decimal](20,10) NOT NULL
);

I need to aggregate my large table in 'n' ways, depending on what is needed by our end users.

For instance: group by Prop1-3 and day, SUM(Cost), for a given date range and Prop5 IN (X, Y, Z).
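A minimal sketch of that particular aggregation against the schema above (the date range and Prop5 values are placeholders, not real parameters):

DECLARE @StartDate datetime = '20170101',
        @EndDate   datetime = '20170201';

SELECT
    MetaDataProp1,
    MetaDataProp2,
    MetaDataProp3,
    CAST([Date] AS date) AS [Day],   -- collapse datetime to day granularity
    SUM(Cost)            AS TotalCost
FROM dbo.TableA
WHERE [Date] >= @StartDate
  AND [Date] <  @EndDate
  AND MetaDataProp5 IN (1, 2, 3)     -- placeholder filter values
GROUP BY
    MetaDataProp1,
    MetaDataProp2,
    MetaDataProp3,
    CAST([Date] AS date);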

I also sometimes have to join to other tables for some of these aggregations on top of that. Maintaining indexes on 20+ views seems like a bad idea and would create massive write overhead, so indexed views are out of the question.
That leaves me with one option as far as I know: to push the query results to a new table on my end. But this has some drawbacks:

An INSERT ... SELECT is pretty intensive, as it blocks other queries while it does its work. Even at just 20 aggregate result sets this would be a disaster.

My idea is this:

  • Fetch the Max Id.
  • Select from table A in paginated batches of 100k rows using OFFSET-FETCH, filtering on Id < the fetched max Id. This ensures we won't be endlessly re-fetching as new data is pushed in, and that the data won't fluctuate while the selects run (see the sketch after this list).
  • Push aggregate batch to aggregate table.
  • Keep going until nothing is found in the paginated select.
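A rough T-SQL sketch of that loop, assuming an aggregate table dbo.TableA_Agg with matching columns already exists (the table name and the 100k batch size are assumptions for illustration):

DECLARE @MaxId     bigint = (SELECT MAX(Id) FROM dbo.TableA);
DECLARE @BatchSize int    = 100000;
DECLARE @Offset    bigint = 0;
DECLARE @Rows      int    = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.TableA_Agg (MetaDataProp1, MetaDataProp2, MetaDataProp3, [Day], TotalCost)
    SELECT MetaDataProp1, MetaDataProp2, MetaDataProp3,
           CAST([Date] AS date), SUM(Cost)
    FROM (
        SELECT MetaDataProp1, MetaDataProp2, MetaDataProp3, [Date], Cost
        FROM dbo.TableA
        WHERE Id < @MaxId    -- freeze the upper bound so newly inserted rows can't shift the pages
        ORDER BY Id
        OFFSET @Offset ROWS FETCH NEXT @BatchSize ROWS ONLY
    ) AS batch
    GROUP BY MetaDataProp1, MetaDataProp2, MetaDataProp3, CAST([Date] AS date);

    SET @Rows   = @@ROWCOUNT;
    SET @Offset = @Offset + @BatchSize;
END;

Note that each batch is aggregated independently, so the same group can appear in several batches; the aggregate table would need either a final roll-up pass or a MERGE that adds each batch into existing rows instead of a plain INSERT.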

Advantage to this:

  • Since it's now batched, it allows other queries to flow through in between the batch select and batch insert.

Disadvantages to this:

  • More read I/O as each paginated fetch must order the grouped results.
  • Longer overall aggregation time.

Another approach could be to fetch the whole data set to a separate machine, hold it all in memory, and then bulk-save it back in batches. This would reduce read I/O, but it seems like a bad idea since the result set can be 40 GB+ in size; the cost of running a machine with that much memory is not worth it just for aggregated data.

What do you think of this approach? What's the recommended solution for this type of problem?

The data is updated about 3 times a day, during which we tear out old data and push in new data: a few million deletes and re-inserts each time.

I've heard whispers of columnstore indexes in SQL Server, and I'm in a position where I can use them, but I've never read much about them.
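For what it's worth, a nonclustered columnstore index can be added alongside the existing rowstore table with a single statement; the index name and column list below are just an assumed example:

-- Covers the columns the reporting aggregations touch; large scans and
-- aggregates over it are typically much cheaper than over the rowstore.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_TableA_Reporting
    ON dbo.TableA (MetaDataProp1, MetaDataProp2, MetaDataProp3,
                   MetaDataProp5, [Date], Cost);

One caveat: on SQL Server 2012/2014 a nonclustered columnstore index makes the table read-only, which matters given the thrice-daily delete/re-insert cycle; from 2016 onward it stays updatable.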

Best Answer

This is what SSAS was made for, and there are a whole host of client tools that can connect to it, including Excel.

Elaborating

SSAS (SQL Server Analysis Services) is a component included with SQL Server. It builds a multidimensional version of your data (a cube) and pre-builds all the aggregations you require/define.

The design tool, SQL Server Data Tools, is a free download.

Once you have set up the data source and data source view based on your table, the easiest way to go is to create a Dimension and a Fact from it: the dimension holds all the metadata descriptors, and the fact holds just the numerics/measures.
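As a sketch of that split, assuming the TableA schema from the question (the view names are made up), the data source view could be fed from two views like these:

-- Dimension source: the distinct metadata descriptor combinations
-- (in practice you would likely add a surrogate key here).
CREATE VIEW dbo.vDimMetaData AS
SELECT DISTINCT
    MetaDataProp1, MetaDataProp2, MetaDataProp3,
    MetaDataProp4, MetaDataProp5, MetaDataProp6,
    MetaDataProp7, MetaDataProp8, MetaDataProp9
FROM dbo.TableA;
GO

-- Fact source: the Cost measure plus the columns used to relate rows
-- back to the dimension and to the date.
CREATE VIEW dbo.vFactCost AS
SELECT
    Id,
    MetaDataProp1, MetaDataProp2, MetaDataProp3,
    MetaDataProp4, MetaDataProp5, MetaDataProp6,
    MetaDataProp7, MetaDataProp8, MetaDataProp9,
    [Date],
    Cost
FROM dbo.TableA;
GO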

You then join the two in the cube. Once it's published/deployed, SSRS and other client tools can connect to it.

Some links

https://www.codeproject.com/Articles/658912/Create-First-OLAP-Cube-in-SQL-Server-Analysis-Serv

https://technet.microsoft.com/en-us/library/ee677579.aspx