Sql-server – Query hint? for deduplicated CTE in Merge

clustered-indexctemergesql server

Referring to this link: Optimizing MERGE Statement Performance

The following criteria are given for ideal performance:

  • Create an index on the join columns in the source table that is unique and covering.
  • Create a unique clustered index on the join columns in the target table.

I have many large tables (100+ million rows) that I merge 1-5 million rows into on a regular basis. There is a unique clustered index on the target table which matches the keys on the source table, satisfying the second criteria.

There is a bulk import process that populates the source table over time. I then place a non-unique clustered index on it in preparation for merging – this is because there is no way to ensure that duplicate values will not be inserted during the bulk process. This only sorta-half satisfies the first criteria.

I use a CTE with deduplication logic as the source for my merge –

...
USING
    ( SELECT
        cte.Key1,
        cte.Key2,
        cte.Key3,
        cte.RestOfTheColumns
      FROM
        ( SELECT 
           Key1,
           Key2,
           Key3,
           RestOfTheColumns,
          ,ROW_NUMBER() OVER (PARTITION BY Key1, Key2, Key3 ORDER BY Key1, Key2, Key3) AS RowNumber
        ) cte
      WHERE cte.RowNumber = 1
    ) 
...

Ideally, I believe, this would already exist in a uniquely indexed table.

In summary:

  • I put a clustered index on the source then deduplicate it in-query.
  • In the query analyzer the Clustered Index Merge takes 91.9% of the plan cost.

Does it matter that the CTE/underlying table does not have an explicit unique constraint? There are no duplicates, but SQL doesn't know that, so I imagine it would check for every line in the upsert.

Best Answer

If you have a clustered (or covering) index starting with (Key1, Key2, Key3, ...) on the table that the CTE is querying, this should be a well-performing query.

Can you add a "Key4", preferably an identity column, in the source table to your non-unique clustered index, so you perhaps can make the index unique? That way, you could also set "Key4" as your ORDER BY in the ROW_NUMBER(). This might improve performance, but it's just a guess.

The fact that the Clustered Index Merge operator is over 90% of your plan doesn't by itself indicate a performance problem, it just says that there's not that much else to do in the query plan apart from the join. In fact, it may indicate that you have a well-optimized plan - when you have the proper indexes, a ROW_NUMBER() can be very efficient.