Sql-server – T-SQL Optimizing a Join on TOP Value from another table

performancequery-performancesql serversql-server-2008

I've got a data warehouse that goes through a full refresh each night that can take about an hour to process 16 million rows/25gigs of data and we're looking for ways to reduce this time without going the incremental approach.

The basic format of our queries is as below, only I've stripped out about 20 more joins and 30+ more columns that would also be included. The stripped out columns and joins are very straightforward with no aggregation, subqueries, or other types of calculation involved. What's left is the main fact table (First_Source_Table) and the most problematic datapoint to collect. Second_Source_Table consists of many records for each Account_ID, but we only want to include the first record for each Account_ID.

Now my constraints. This in a replicated environment on SQL Server 2008. Unfortunately I have no control over the source tables, and while I can add new indexes on them, they will be lost the next day. I've tried calculating an in-between table off of Second_Source_Table before I do the full-table, but as that would need to be re-calculated each night, it didn't have a material impact on the overall calculation time.

The code below works, but if you look at the execution plan and IO Stats, the logic associated with Second_Source_Table constitutes about 80% of all resources used, but changing this field to NULL only cuts execution time in half. I'll also point out again that being a replicated environment, there are no issues to worry about with locking or other users writing to the tables we're in.

INSERT INTO
    New_Table
SELECT
    First_Source_Table.Account_ID,
    (
        select
            top 1
            Second_Source_Table.Code
        FROM
            Second_Source_Table
        WHERE
            Second_Source_Table.Account_ID = First_Source_Table.Account_ID
        ORDER BY
            Second_Source_Table.ID
    ) as Code
FROM
    First_Source_Table

Best Answer

You may want to consider partitioning instead of a scalar query.

So something like

insert into New_Table
    select
        [fst].Account_ID,
        [sst].Code
    from
        First_Source_Table as [fst]
            inner join (select
                            row_number()    over(
                                partition by Account_ID
                                order by Account_ID ) as [topN],
                            Account_ID,
                            Code
                        from
                            Second_Source_Table) as [sst]
            on     ( [sst].Account_ID = [fst].Account_ID )
    where
        ( [topN] = 1 ) --This is your topN query