Sql-server – Updating a big replicated Dimension (SQL Server PDW)

data-warehousesql serversql-server-pdw

We use a SQL Server PDW appliance for our data warehouse. One of the tables in our warehouse is a replicated table with about 20 million rows. As part of our ETL process we need to expire old records from this dimension; however, we are seeing that updating a handful of records (<100) takes over 1 hour to complete. This is what I would like to improve if I can.

Naturally, one option that I thought about was changing this Dimension from Replicated to Distributed. My testing shows that it would fix the issue with the ETL process taking long (from 1.5 hours came down to 30 secs) but all the joins against the Distributed version of this dimension would be affected since the joins are almost never based on the same distribution column. When I look at the execution plan of some of these queries I usually see either a ShuffleMove or a BroadcastMove operation.

So my question to the PDW guru's here is:

Is there anything else that can be done in order to improve the performance of updating records in the replicated version of this Dimension?

Again, moving to a Distributed table doesn't seem to be the best solution since it will affect hundreds of already written SQL queries and reports developed by other people.

Best Answer

A few questions. 20 million rows isn't necessarily that large.

What process are you using to perform your updates and deletes right now?

Is the dimension a CLUSTERED COLUMNSTORE INDEX, CLUSTERED INDEX or HEAP?

Are you saying that there is movement whilst you update and delete this table or did you just see movement when you changed the table from replicated to distributed?

If it is the latter that is not surprising. You are unlikely to be join and aggregation compatible. If you are doing something to trigger the movement through your update / delete then we could look at that - although a concrete example would be helpful.

In general terms I would start by trying to keep the ETL simple.

Use CTAS against the dimension selecting only the rows you want to keep, union in any new rows and use CASE to pick up any changes (converting the UPDATE into a transform within the CTAS). Once complete you can then use a pair of RENAME OBJECT commands to switch from the current table to the new table. This gives you the added benefit of having a historic view of your table - which you can drop at your leisure.