Sql-server – Query optimization: Which is better in the case – CURSORs or a Set-based approach

querysql serversql-server-2008

First of all, I can describe the schema of the table, but my table itself is around 80GB so I am hoping some expert eyes can look at any obvious optimizations that can be performed. I am carrying out some activity-based analysis to understand a set of users from an in-house virtual gaming environment.

I have two base tables:

EventTable: Moderately big at 10GB
Activity_ID UserName ActivityStart ActivityEnd Badge

ActivityTable: Super big at 100GB
UserName ActivityTime Game1Points Game2Points

In short, I am trying the following:

- For each activity in the EventTable:
   - Compute the median of Game1Points, Game2Points 
     for two cases: for all entries in the ActivityTable 
     that were present since 1 hour before the activity 
     and for all entries stored during the activity itself

Therefore, the final table I am expecting is this:

ResultTable:
Activity_ID Badge Game1_Before Game1_During Game2_Before Game2_During

This clearly looks like a job for CURSORs so I first wrote my query using that approach. It was running and I could see the progress but after reading so many horror stories, I decided to re-write my 200-line CURSOR-based approach with a set-based approach. This works for small tables perfectly. However, for tables of my size, it seems to be stalling (or at least I am not sure how long I should run it). I ran the Display Estimated Plan and added the relevant indices as well. Now my problem is that I am not even sure how long this query will take to run (unlike the CURSOR approach where I could print some messages and see the progress).

Can someone tell me if there is a better way to do this? My current query is given below. While the query itself looks long, the structure is quite simple: First, I define a CTE that gets all the points associated with a particular event. Using this CTE, I calculate two separate medians.

WITH ST AS
(
    -- Fetching the required data: Get entries that exist 
    -- since 1 hour before the activity until the activity ended
    SELECT    ROW_NUMBER() OVER(ORDER BY Activity_ID) AS ID
            , Activity_ID
            , Badge
            , (CASE WHEN
                    Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityStart)
                    THEN 'Game1ActivityBefore'
                    WHEN
                    Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
                    THEN 'Game1ActivityDuring'
              END) Game1Type
            , (CASE WHEN
                    Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart)) AND dbo.ROUNDTIME(X.ActivityStart)
                    THEN 'Game2ActivityBefore'
                    WHEN
                    Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
                    THEN 'Game2ActivityDuring'
              END) Game2Type
            , CAST(Game1Points AS bigint) Game1Points
            , CAST(Game2Poins AS bigint) Game2Points
    FROM
    dbo.EventTable X
    INNER JOIN dbo.ActivityTable Y
    ON X.UserName = Y.UserName
    WHERE
    ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart))
                         AND dbo.ROUNDTIME(X.ActivityFinish)
    AND 
    (
            Badge LIKE 'GREEN%'
        OR Badge LIKE 'RED%'
    )
)
-- Eliminating any NULL values that resulted from the median computation step
SELECT Activity_ID
        , Badge
        , MAX(CAST([Game1ActivityBefore] AS DECIMAL)) AS 'Game1_Before'
        , MAX(CAST([Game1ActivityDuring] AS DECIMAL)) AS 'Game1_During'
        , MAX(CAST([Game2ActivityBefore] AS DECIMAL)) AS 'Game2_Before'
        , MAX(CAST([Game2ActivityDuring] AS DECIMAL)) AS 'Game2_During'
FROM
(
    -- Median computation for the two columns - Aggregation Step
    SELECT Activity_ID
            , Badge
            , Game1Type
            , Game2Type
            , AVG(Game1Points) Game1Median
            , AVG(Game2Points) Game2Median
    FROM
    (
            -- Median computation for the two columns - Inner step
        SELECT Activity_ID
                , Badge
                , Game1Type
                , Game2Type
                , Game1Points
                , ROW_NUMBER() OVER (
                    PARTITION BY Activity_ID, Badge, Game1Type
                    ORDER BY Game1Points ASC, ID ASC) AS Game1RowAsc
                , ROW_NUMBER() OVER (
                    PARTITION BY Activity_ID, Badge, Game1Type
                    ORDER BY Game1Points DESC, ID DESC) AS Game1RowDesc
                , Game2Points
                , ROW_NUMBER() OVER (
                    PARTITION BY Activity_ID, Badge, Game2Type
                    ORDER BY Game2Points ASC, ID ASC) AS Game2RowAsc
                , ROW_NUMBER() OVER (
                    PARTITION BY Activity_ID, Badge, Game2Type
                    ORDER BY Game2Points DESC, ID DESC) AS Game2RowDesc
        FROM ST TS
    ) X
    WHERE Game1RowAsc IN (Game1RowDesc, Game1RowDesc - 1, Game1RowDesc + 1)
    OR Game2RowAsc IN (Game2RowDesc, Game2RowDesc - 1, Game2RowDesc + 1)
    GROUP BY Activity_ID, Badge, Game1Type, Game2Type
) Y
PIVOT
(
    MAX(Game1Median)
    FOR Game1Type IN ([Game1ActivityBefore], [Game1ActivityDuring])
) AS PivotTable
PIVOT
(
    MAX(Game2Median)
    FOR Game2Type IN ([Game2ActivityBefore], [Game2ActivityDuring])
) AS PivotTable2
GROUP BY Activity_ID, Badge

UPDATE: After the suggestion. Take 1

CREATE TABLE #ST(ID INT
                    , Activity_ID INT
                    , Game1Type CHAR(13)
                    , Game2Type CHAR(13)
                    , Game1Points DECIMAL
                    , Game2Points DECIMAL
                )

INSERT INTO #ST
SELECT    ROW_NUMBER() OVER(ORDER BY Activity_ID) AS ID
        , Activity_ID
        , Badge
        , (CASE WHEN
                Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityStart)
                THEN 'Game1ActivityBefore'
                WHEN
                Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
                THEN 'Game1ActivityDuring'
          END) Game1Type
        , (CASE WHEN
                Y.ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart)) AND dbo.ROUNDTIME(X.ActivityStart)
                THEN 'Game2ActivityBefore'
                WHEN
                Y.ActivityTime BETWEEN dbo.ROUNDTIME(X.ActivityStart) AND dbo.ROUNDTIME(X.ActivityFinish)
                THEN 'Game2ActivityDuring'
          END) Game2Type
        , CAST(Game1Points AS bigint) Game1Points
        , CAST(Game2Poins AS bigint) Game2Points
FROM
dbo.EventTable X
INNER JOIN dbo.ActivityTable Y
ON X.UserName = Y.UserName
WHERE
ActivityTime BETWEEN DATEADD(HOUR, -1, dbo.ROUNDTIME(X.ActivityStart))
                     AND dbo.ROUNDTIME(X.ActivityFinish)
AND 
(
        Badge LIKE 'GREEN%'
    OR Badge LIKE 'RED%'
)

CREATE TABLE #INTERMEDIATE (Activity_ID INT
                            , Badge VARCHAR(255)
                            , Game1Type CHAR(13)
                            , Game2Type CHAR(13)
                            , Game1Points DECIMAL
                            , Game1RowAsc INT
                            , Game1RowDesc INT
                            , Game2Points DECIMAL
                            , Game2RowAsc INT
                            , Game2RowDesc INT 
                            )

INSERT INTO #INTERMEDIATE
SELECT Activity_ID
        , Badge
        , Game1Type
        , Game2Type
        , Game1Points
        , ROW_NUMBER() OVER (
            PARTITION BY Activity_ID, Badge, Game1Type
            ORDER BY Game1Points ASC, ID ASC) AS Game1RowAsc
        , ROW_NUMBER() OVER (
            PARTITION BY Activity_ID, Badge, Game1Type
            ORDER BY Game1Points DESC, ID DESC) AS Game1RowDesc
        , Game2Points
        , ROW_NUMBER() OVER (
            PARTITION BY Activity_ID, Badge, Game2Type
            ORDER BY Game2Points ASC, ID ASC) AS Game2RowAsc
        , ROW_NUMBER() OVER (
            PARTITION BY Activity_ID, Badge, Game2Type
            ORDER BY Game2Points DESC, ID DESC) AS Game2RowDesc
FROM #ST

CREATE CLUSTERED INDEX [TT1] ON #INTERMEDIATE (Acitivity_ID
                                             , Badge)

CREATE NONCLUSTERED INDEX [TT2] ON #INTERMEDIATE (Game1RowAsc
                                                , Game1RowDesc
                                                , Game2RowAsc
                                                , Game2RowDesc)

SELECT Activity_ID
        , Badge
        , MAX(CAST([Game1ActivityBefore] AS DECIMAL)) AS 'Game1_Before'
        , MAX(CAST([Game1ActivityDuring] AS DECIMAL)) AS 'Game1_During'
        , MAX(CAST([Game2ActivityBefore] AS DECIMAL)) AS 'Game2_Before'
        , MAX(CAST([Game2ActivityDuring] AS DECIMAL)) AS 'Game2_During'
FROM
(
    SELECT Activity_ID
            , Badge
            , Game1Type
            , Game2Type
            , AVG(Game1Points) Game1Median
            , AVG(Game2Points) Game2Median
    FROM #ST
    WHERE Game1RowAsc IN (Game1RowDesc, Game1RowDesc - 1, Game1RowDesc + 1)
    OR Game2RowAsc IN (Game2RowDesc, Game2RowDesc - 1, Game2RowDesc + 1)
    GROUP BY Activity_ID, Badge, Game1Type, Game2Type
) Y
PIVOT
(
    MAX(Game1Median)
    FOR Game1Type IN ([Game1ActivityBefore], [Game1ActivityDuring])
) AS PivotTable
PIVOT
(
    MAX(Game2Median)
    FOR Game2Type IN ([Game2ActivityBefore], [Game2ActivityDuring])
) AS PivotTable2
GROUP BY Activity_ID, Badge

DROP TABLE #INTERMEDIATE
DROP TABLE #ST

UPDATE 2:: Execution Plan

May be the Display Execution Plan option does not show the indices being used?

Best Answer

1 - Don't use CTEs for performance. CTEs are basically single-use views. They convey absolutely NO performance benefit. If performance is important and you have enough data, indexed #temp tables will probably work a lot better.

2- Definitely use a set-based approach. I can't think of any scenario where a CURSOR would be faster than a set-based query.

I'm not going to completely rewrite your query, but I do see some things that should make a huge difference:

  • Put your initial selection (outermost CTE) AND your subquery into a #temp table and index it. Right now your filters are created dynamically and not seekable!

You are filtering on the ROW_NUMBER output, 4 times, so that's a table scan over every row.