Sql-server – Clustered Index with nightly ETL process

clustered-indexetlsql server

We have a nightly ETL process that involves data from many different sources being flattened, and loaded into a completely denormalized table on SQL Server 2008 for reporting. The clustered index on the table consists of a composite key of four columns.

Due to where the data is coming from, there really isn't any way to load the data in the same order as the clustered index. Keeping the index active during load or dropping/rebuilding it after the load both take more time than the load itself without an index. On the other hand no clustered with proper non-clustered indexes can still result in some reporting queries taking hours to process.

The only things I can think of to speed this up would be partitioning and throwing more hardware at the problem. Am I missing anything else that would help?

Best Answer

There are few ways to work around this problem:

As you mention, partitioning will help. If you partition on the leading column in the index and make sure all partitions are equally sized, you can gain super scale. However, the total CPU required to build the index wont be much lower if you partition. But it will save you memory (if you have a slow I/O system, this can be useful)

If this is a traditional data warehouse fact table, you might be better off with a columnstore index

Instead of doing a mega update - which is slow. Consider this pattern:

Load the changes you want to make into a staging table with the same schema and cluster index as the main table
Join the staging table with the old table and use INSERT INTO a new table with a cluster index. Enabled Trace Flag 610 while doing this.

This way, you can keep the benefit of the old table being sorted and avoid resorting all the rows (only the ones that have changed).

Related Solutions

Loading multiValued fields into Solr via flat file, and possibly value position preservation in those fields

Getting multiValued data into Solr via CSV:

The solr documentation describes a "split" function in UpdateCSV. Essentially, it parses a field value using a second CSV parser. See Solr - UpdateCSV - split. The parameters look like so (adjust field name, separator, and encapsulator as necessary):

f.fieldA.split=true&f.fieldA.separator=%2C&f.fieldA.encapsulator='

Getting multiValued data from separate fields to maintain position:

Since asking this question, I've done some reading about dimensional models. It seems that what I was trying to do is poor design, because it places too many expectations on the application, to much complexity in the warehouse, or both.

When trying to preserve the relationships between two field values on a single record, it's better to store them separately as well as together. Here's a comparison of my former input to the new input:

Former CSV input:

name|licenseState|licenseType
Josh|MA,CA|123,456
Fred|MD,OH|789,123

Transformed CSV input:

name|licenseState|licenseType|licenseStateType
Josh|MA,CA|123,456|MA123,CA456
Fred|MD,OH|789,123|MD789,OH123

This way your application can use the licenseState and licenseType dimension values independently, or it can use the licenseStateType dimension values, all without requiring complicated app or warehouse logic.

Sql-server – Does Clustered Index on GUID create more fragmentation than Non Clustered Index

Since you are indicating insert performance is the primary concern, I'd take the recommendation of the DBA and make the clustering key the identity column since it is a unique, monotonically ascending number, which is guaranteed to (almost) never cause page-splits on the table.

Also, don't store the GUID in an nvarchar(100) column, use the data type designed for them, which is uniqueidentifier.

Look carefully at the reporting requirements prior to adding indexes to this table. You may find out you don't need an index on the GUID column at all. If you determine that having an index on the GUID column is required, you may find it desirable to have the leading column of the index not be the GUID, perhaps it will be the Group or TestID column instead?

If you've determined, through a rigorous design process, that the majority of reporting queries against this table will be made using the session id as the primary component of joins or where clauses, you may actually benefit from the table being clustered on the session id. You'd likely want to set the page fill factor at some point below 100%, perhaps start at 93% and monitor fragmentation levels. Also, it's worth pointing out that fragmentation may not be such a massive problem if you store this table on SSD. Will the session id be propagated throughout the other related tables? If so, you'll likely need non-clustered indexes on the column in those other tables as well, compounding the fragmentation issue. It may be worthwhile maintaining a single table where you insert the GUIDs to obtain a surrogate integer key to use in the other tables, something like:

IF OBJECT_ID('dbo.SessionSurrogate') IS NOT NULL
DROP TABLE dbo.SessionSurrogate;
CREATE TABLE dbo.SessionSurrogate
(
    SessionID int NOT NULL
        CONSTRAINT PK_SessionSurrogate
        PRIMARY KEY NONCLUSTERED
        IDENTITY(1,1)
    , SessionGUID uniqueidentifier NOT NULL
);
CREATE UNIQUE INDEX CI_SessionSurrogate --could be non-unique if required
ON dbo.SessionSurrogate(SessionGUID);

IF OBJECT_ID('dbo.UserTestGroups') IS NOT NULL
DROP TABLE dbo.UserTestGroups;
CREATE TABLE dbo.UserTestGroups
(
    UserTestGroupsID int NOT NULL
        CONSTRAINT PK_UserTestGroups
        PRIMARY KEY CLUSTERED
        IDENTITY(1,1)
    , SessionID int NOT NULL
    , TestID bigint NOT NULL
    , [Group] tinyint NOT NULL
    , InsertDate datetime NOT NULL
);

GO
CREATE PROCEDURE dbo.InsertSession
(
    @SessionGUID uniqueidentifier
    , @TestID bigint
    , @Group tinyint
)
AS
BEGIN
    INSERT INTO dbo.SessionSurrogate (SessionGUID)
    OUTPUT INSERTED.SessionID, @TestID, @Group, GETDATE()
    INTO dbo.UserTestGroups (SessionID, TestID, [Group], InsertDate)
    OUTPUT INSERTED.SessionID
    VALUES (@SessionGUID);
END;
GO

You'd then insert data like:

DECLARE @SessGUID uniqueidentifier = NEWID();
DECLARE @TestID bigint = 42;
DECLARE @Group tinyint = 6;
EXEC dbo.InsertSession @SessGUID, @TestID, @Group;

The benefit here is you're only causing fragmentation on a single table with a very narrow row width (20 bytes), offering somewhere around 400 rows per page. The stored procedure above is very compact and outputs the generated int for you to use later, if necessary.

Best Answer

Related Solutions

Loading multiValued fields into Solr via flat file, and possibly value position preservation in those fields

Sql-server – Does Clustered Index on GUID create more fragmentation than Non Clustered Index

Related Question