Sql-server – T-SQL Optimizing a Join on TOP Value from another table

Tags: performance, query-performance, sql-server, sql-server-2008

I've got a data warehouse that goes through a full refresh each night. The refresh can take about an hour to process 16 million rows (25 GB of data), and we're looking for ways to reduce this time without moving to an incremental approach.

The basic format of our queries is as below, only I've stripped out about 20 more joins and 30+ more columns that would also be included. The stripped out columns and joins are very straightforward with no aggregation, subqueries, or other types of calculation involved. What's left is the main fact table (First_Source_Table) and the most problematic datapoint to collect. Second_Source_Table consists of many records for each Account_ID, but we only want to include the first record for each Account_ID.

Now my constraints. This is in a replicated environment on SQL Server 2008. Unfortunately I have no control over the source tables, and while I can add new indexes on them, they will be lost the next day. I've tried calculating an intermediate table from Second_Source_Table before I build the full table, but as that would need to be recalculated each night, it didn't have a material impact on the overall calculation time.

The code below works, but if you look at the execution plan and IO Stats, the logic associated with Second_Source_Table constitutes about 80% of all resources used, but changing this field to NULL only cuts execution time in half. I'll also point out again that being a replicated environment, there are no issues to worry about with locking or other users writing to the tables we're in.

INSERT INTO
    New_Table
SELECT
    First_Source_Table.Account_ID,
    (
        select
            top 1
            Second_Source_Table.Code
        FROM
            Second_Source_Table
        WHERE
            Second_Source_Table.Account_ID = First_Source_Table.Account_ID
        ORDER BY
            Second_Source_Table.ID
    ) as Code
FROM
    First_Source_Table

Best Answer

You may want to consider a windowed ROW_NUMBER() partitioned by Account_ID instead of a correlated scalar subquery.

So something like

insert into New_Table
    select
        [fst].Account_ID,
        [sst].Code
    from
        First_Source_Table as [fst]
            -- left join keeps accounts with no match (Code is NULL), like the scalar subquery did
            left join (select
                            row_number()    over(
                                partition by Account_ID
                                order by ID ) as [topN], -- same ordering as TOP 1 ... ORDER BY ID
                            Account_ID,
                            Code
                        from
                            Second_Source_Table) as [sst]
            on     ( [sst].Account_ID = [fst].Account_ID )
               and ( [sst].[topN] = 1 ) --This is your topN filter
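
Another way to express the same first-row-per-account idea is to find the lowest ID per Account_ID once and then join back for the Code. This is only a sketch: it assumes Second_Source_Table.ID is unique (the question orders by it, but does not say so), and the aliases firsts and First_ID are made up for illustration. Whether it beats ROW_NUMBER() depends on the indexes available, so compare the plans and IO stats before adopting it.

```sql
-- Sketch only. Assumes Second_Source_Table.ID is unique;
-- "firsts" and "First_ID" are illustrative names, not from the question.
INSERT INTO New_Table
SELECT
    fst.Account_ID,
    sst.Code
FROM
    First_Source_Table AS fst
    LEFT JOIN (
        -- One row per account: the lowest ID marks the "first" record.
        SELECT Account_ID, MIN(ID) AS First_ID
        FROM Second_Source_Table
        GROUP BY Account_ID
    ) AS firsts
        ON firsts.Account_ID = fst.Account_ID
    -- Fetch the Code of that first record; accounts without any record get NULL.
    LEFT JOIN Second_Source_Table AS sst
        ON sst.ID = firsts.First_ID;
```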

Related Solutions

Sql-server – Optimising join on large table

Your ix_hugetable looks quite useless because:

it is the clustered index (PK)
the INCLUDE makes no difference because a clustered index INCLUDEs all non-key columns (non-key values at lowest leaf = INCLUDEd = what a clustered index is)

In addition:

added or fk should be first
ID is first = not much use

Try changing the clustered key to (added, fk, id) and drop ix_hugetable. You've already tried (fk, added, id). If nothing else, you'll save a lot of disk space and index maintenance

Another option might be to try the FORCE ORDER hint with the tables written in both orders and no JOIN/INDEX hints. I try not to use JOIN/INDEX hints personally because they remove options from the optimiser. Many years ago I was told (at a seminar with a SQL guru) that the FORCE ORDER hint can help when you join a huge table to a small table: YMMV 7 years later...
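
As a rough illustration of the hint, the sketch below applies OPTION (FORCE ORDER) so that the join order written in the FROM clause is preserved. The original query is not shown here, so dbo.smalltable, its id column, and the date literal are placeholders; only dbo.hugetable and its fk, added and value columns come from the discussion above.

```sql
-- Hypothetical query shape: dbo.smalltable, s.id and the date are placeholders.
SELECT
    s.id,
    COUNT(h.value) AS value_count
FROM dbo.smalltable AS s          -- written first, so joined first under FORCE ORDER
    INNER JOIN dbo.hugetable AS h
        ON h.fk = s.id
WHERE h.added >= '20110101'
GROUP BY s.id
OPTION (FORCE ORDER);             -- keep the join order exactly as written
```

To test the other direction, swap the two tables in the FROM clause and compare the plans and timings.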

Oh, and let us know where the DBA lives so we can arrange for some percussion adjustment

Edit, after 02 Jun update

The 4th column is not part of the non-clustered index so it uses the clustered index.

Try changing the NC index to INCLUDE the value column so the query doesn't have to fetch value from the clustered index

create nonclustered index ix_hugetable on dbo.hugetable (
    fk asc, added asc
) include(value)

Note: If value is not nullable then COUNT(value) is semantically the same as COUNT(*). But SUM needs the actual value, not just its existence.

As an example, if you change COUNT(value) to COUNT(DISTINCT value) without changing the index it should break the query again because it has to process value as a value, not as existence.

The query needs 3 columns: added, fk, value. The first 2 are filtered/joined so are key columns. value is just used so can be included. Classic use of a covering index.

Sql-server – When a previously-fast SQL query starts running slow, where do I look to find the source of the issue

When a query that used to run fast suddenly starts running slowly in the middle of the night and nothing else is affected except for this one query, how do I troubleshoot it...?

You can start by checking if the execution plan is still in the cache. Check sys.dm_exec_query_stats, sys.dm_exec_procedure_stats and sys.dm_exec_cached_plans. If the bad execution plan is still cached you can analyze it, and you can also check the execution stats. The execution stats contain information such as logical reads, CPU time and execution time. These can give strong indications of what the problem is (e.g. large scan vs. blocking). See Identifying problem queries for an explanation of how to interpret the data.
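
A minimal sketch of that check, assuming you have VIEW SERVER STATE permission: it lists the cached statements with the highest total elapsed time, together with their text and plan. The TOP (20) cut-off and the sort column are arbitrary choices for illustration.

```sql
-- Cached statements ordered by total elapsed time (worker/elapsed times are in microseconds).
SELECT TOP (20)
    qs.creation_time,             -- when the plan was compiled
    qs.last_execution_time,
    qs.execution_count,
    qs.total_logical_reads,
    qs.total_worker_time  AS total_cpu_microseconds,
    qs.total_elapsed_time AS total_elapsed_microseconds,
    st.text               AS batch_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)    AS st
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_elapsed_time DESC;
```

Elapsed time far above CPU time suggests waiting (blocking, memory grants, IO), while high logical reads point toward a large scan.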

Also, this is not a problem with parameter sniffing. I've seen that before, and this is not it, since even when I hard-code the variables in SSMS, I still get slow performance.

I'm not convinced. Hard-coding variables in SSMS does not prove that the past bad execution plan was not compiled against a skewed input. Please read Parameter Sniffing, Embedding, and the RECOMPILE Options for a very good article on the topic. Slow in the Application, Fast in SSMS? Understanding Performance Mysteries is another excellent reference.

I've concluded (perhaps incorrectly) from these little experiments that the reason for the slow-down is due to how SQL's cached execution plan is set up -- when the query is a little different, it has to create a new execution plan.

This can be easily tested. SET STATISTICS TIME ON will show you the compile vs. execution time. SQL Server:Statistics performance counters will also reveal whether compilation is an issue (frankly, I find it unlikely).
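
For example, a session-level test could look like the sketch below. The procedure name and parameter are hypothetical stand-ins for whatever statement is slow.

```sql
SET STATISTICS TIME ON;

-- Placeholder: substitute the slow statement or procedure call here.
EXEC dbo.usp_SlowProcedure @SomeParameter = 42;  -- hypothetical name and parameter

SET STATISTICS TIME OFF;
```

In SSMS, the Messages tab then reports "SQL Server parse and compile time" separately from "SQL Server Execution Times", which makes it easy to see whether compilation is a meaningful share of the total.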

However, there is something similar that you may hit: the query grant gate. Read Understanding SQL server memory grant for details. If your query requests a large grant at a moment when no memory is available, it will have to wait, and to the application it will all look like 'slow execution'. Analyzing wait statistics will reveal if this is the case.
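
One way to look for this, sketched under the assumption that you can query the server while the slowdown is happening, is to list requests still waiting for a memory grant and to check the cumulative RESOURCE_SEMAPHORE waits:

```sql
-- Requests currently queued for a memory grant (grant_time is NULL until granted).
SELECT
    session_id,
    request_time,
    requested_memory_kb,
    granted_memory_kb,
    wait_time_ms,
    queue_id
FROM sys.dm_exec_query_memory_grants
WHERE grant_time IS NULL;

-- Cumulative memory-grant waits since the counters were last reset.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'RESOURCE_SEMAPHORE';
```

The second query is cumulative, so compare snapshots taken before and after the slow period rather than reading a single value.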

For a more general discussion about what to measure and what to look for, see How to analyse SQL Server performance
