Best practice Azure Datawarehouse for less data-movements

azure-sql-data-warehouse

I have a Table with a HASH on a Employee_Key, it also Hold a Date_From and Date_To.

To build the facts I need to change it on a by_date column.

But what is the best practice in Azure DataWarehousing with the less DataMoving as possible.

I've tried with a INNER JOIN to a Dates table that was in ROUND_ROBIN, or tried to put all the dates in a View (with a trick that only sys.objects is called), .. All working but … not sure what is the better way on performance due to the Data-Movements between the 60 nodes …

Any suggestions ?

Here some parts of the tables …

CREATE TABLE [edw_hr_temp].[DIM_EMPLOYEE]
(
    [EMP_UID] [BIGINT] NULL,
    [EMP_BK] [CHAR](8) NULL,
    [EMP_DATE_FROM] [DATE] NULL,
    [EMP_DATE_TO] [DATE] NULL,
)
WITH
(
    DISTRIBUTION = HASH ( [EMP_BK] ),
    CLUSTERED COLUMNSTORE INDEX
)

1st Option a Real Table for Dates

CREATE TABLE [bi_msft].[ALL_DATES]
(
    [ADATE] [DATE] NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)

2nd option is in the Stored Procedure user this as a subquery of store it in a #tempTable

  SELECT d
    FROM (SELECT d = DATEADD(DAY, rn - 1, @FROM_DATE)
            FROM (SELECT TOP (DATEDIFF(DAY, @FROM_DATE, @TO_DATE)) 
                         rn = ROW_NUMBER() OVER (ORDER BY s1.[object_id])
                    FROM sys.all_objects AS s1
                         CROSS JOIN sys.all_objects AS s2
                   ORDER BY s1.[object_id]
                 ) AS x
         ) AS y;

So basically my question is, when using the 2nd variant, would this
query run on each node and so no data movement needed of will it run
on one node and will all be moved….

Best Answer

For Azure SQL Data Warehouse, unless you have have millions of records, then it's unlikely you will really benefit from clustered columnstore indexes. In fact this could even be detrimental to performance. For smaller volume tables, you would probably be better off with either heap (for fast load) or clustered index. ROUND_ROBIN will at least ensure even distribution. If you do plan to hash distribute your table, ensure your chosen hash column will give even distribution, ie has a large number of distinct values. What is EMP_BK for example?

I would generally recommend having a table to store dates, rather than some trick construct. The table is likely to scale better. As the datatype is date, it again seems unlikely there will be enough rows in this table to justify clustered columnstore. Try this with clustered index instead.

Have a read through the 'Hash distribute large tables' section in this article:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices

Just as an observation, all columns in your tables are nullable, which would be bad for a dimensional model (you cannot build cubes on NULLs) and generally does not represent good data modelling.

If you have a genuine performance problem, consider posting an EXPLAIN plan and give an idea of the number of rows in your table and the type of data. As mentioned, for your larger tables, to avoid data movement, hash distribute your tables on the same column.