SQL Server – Is Columnstore Good for UPDATES with WHERE Clauses

I am developing a database project on SQL Server and I am wondering whether using a columnstore index is a good idea.

The project consists of a table (A) that will hold a large number of rows, with many repeated values in one column. Every day, a batch of new rows will be added to the table, with a "DateId" identifying each batch.

After that, I will need to update a different table (B) by joining it with A and filtering A on "DateId" and other columns.

Example in SQL:

CREATE TABLE A (
  [Id] [BIGINT] IDENTITY(1,1) NOT NULL,
  [DateId] [INT] NOT NULL,
  [B_Id] [BIGINT] NOT NULL,
  -- other columns...
  INDEX cci_A CLUSTERED COLUMNSTORE
)

CREATE TABLE B (
  [Id] [BIGINT] IDENTITY(1,1) NOT NULL,
  -- other columns...
  INDEX cci_B CLUSTERED COLUMNSTORE
)

UPDATE B
SET ...
FROM A
INNER JOIN B ON A.B_Id = B.Id
WHERE A.DateId = @myDateId

Is columnstore a good choice in this case?

Best Answer

Modifying a row causes the old row to be flagged as "deleted" (but it's still in the columnstore index) and the new row to be added to the deltastore (row-based storage that is compressed once it reaches about 1 million rows). So, as you can imagine, many updates will, to some extent, degrade your columnstore index over time. You can of course do index maintenance, but a columnstore index on B might not be the best choice...
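To see how much of this has accumulated, the rowgroup metadata can be inspected. The following is a minimal sketch, assuming SQL Server 2016 or later (where sys.dm_db_column_store_row_group_physical_stats is available), the table and index names B and cci_B from the question, and the dbo schema (an assumption, the question does not name one):

```sql
-- Per-rowgroup state for table B: delta store (OPEN/CLOSED) vs. COMPRESSED,
-- and how many compressed rows are only flagged as deleted.
-- Assumption: B lives in the dbo schema.
SELECT
    rg.row_group_id,
    rg.state_desc,
    rg.total_rows,
    rg.deleted_rows,
    CAST(100.0 * rg.deleted_rows / NULLIF(rg.total_rows, 0) AS DECIMAL(5, 2)) AS deleted_pct
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
WHERE rg.object_id = OBJECT_ID(N'dbo.B')
ORDER BY rg.row_group_id;

-- Index maintenance: force delta rowgroups into compressed form and let
-- the engine clean up rowgroups that carry many deleted rows.
ALTER INDEX cci_B ON dbo.B REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
```

If deleted_pct keeps climbing between maintenance runs, that is a hint the update pattern on B does not suit a columnstore index well.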

Related Solutions

Sql-server – Clustered columnstore indexes and foreign keys

You've got lots of questions in here:

Q: (The lack of foreign keys) confuses me a lot! It is a good practice (not mandatory) to have Fk's in the DWH for a variety of reasons (data integrity, relations visible for semantic layer, ....)

A: Correct, it's normally a good practice to have foreign keys in a data warehouse. However, clustered columnstore indexes don't support that yet (as of SQL Server 2014, the version discussed here).

Q: So MS advocates clustered columnstore indexes for DWH scenarios, but they cannot handle FK relationships?!

A: Microsoft gives you tools. It's up to you how you use those tools.

If your biggest challenge is a lack of data integrity in your data warehouse, then the tool you want is conventional tables with foreign keys.

If your biggest challenge is query performance, and you're willing to check your own data integrity as part of the loading process, then the tool you want is clustered columnstore indexes.
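Checking your own data integrity during the load can be as simple as looking for orphaned rows after each batch. A minimal sketch follows; the tables dbo.DimCustomer and dbo.FactSales and their columns are hypothetical and stand in for whatever dimension and fact tables you load:

```sql
-- Hypothetical dimension table (conventional rowstore).
CREATE TABLE dbo.DimCustomer (
    CustomerId INT NOT NULL,
    CONSTRAINT pk_DimCustomer PRIMARY KEY CLUSTERED (CustomerId)
);

-- Hypothetical fact table with a clustered columnstore index and
-- no declared foreign key to dbo.DimCustomer.
CREATE TABLE dbo.FactSales (
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL,
    INDEX cci_FactSales CLUSTERED COLUMNSTORE
);

-- Run after each load: every row returned is a CustomerId in the fact
-- table that has no matching dimension row (an orphan).
SELECT f.CustomerId, COUNT_BIG(*) AS orphan_rows
FROM dbo.FactSales AS f
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.DimCustomer AS d
    WHERE d.CustomerId = f.CustomerId
)
GROUP BY f.CustomerId;
```

An empty result means the load preserved the relationship that a foreign key would otherwise have enforced.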

Q: However SQL 2014 then adds no real new value for DWH??

A: Thankfully, clustered columnstore wasn't the only new feature in SQL Server 2014. For example, check out the new cardinality estimator.

Q: Why am I so angry and bitter about the way my favorite feature was implemented?

A: You caught me - you didn't really ask that question - but I'll answer it anyway. Welcome to the world of third party software where not everything is built according to your exact specifications. If you feel passionately about a change you'd like to see in a Microsoft product, check out Connect.Microsoft.com. It's their feedback process where you can submit a change, other people can vote it up, and then the product team reads it and tells you why they won't implement it. Sometimes. Most of the time they just mark it as "won't fix, works on my machine" but hey, sometimes you do get some answers.

Sql-server – Should the index on an identity column be nonclustered

By default the PK is clustered, and in most cases this is fine. However, two questions should be asked:

should my PK be clustered?
which column(s) will be the best key for my clustered index?

The PK and the clustered index are two different things:

The PK is a constraint. It is used to uniquely identify rows, but it carries no notion of storage. However, by default (in SSMS), it is enforced by a unique clustered index if a clustered index is not yet present.
A clustered index is a special type of index that stores the row data at the leaf level, meaning it is always covering. All columns, whether they are part of the key or not, are stored at the leaf level. It does not have to be unique, in which case a uniquifier (4 bytes) is added to the clustered key.

Now we end up with 2 questions:

How do I want to uniquely identify rows in my table (PK)?
How do I want to store it at the leaf level of an index (clustered index)?

It depends on how:

you design your data model
you query your data and you write your queries
you insert or update your data
...

First, do you need a clustered index? If you bulk insert, it is more efficient to store unordered data in a heap (versus ordered data in a clustered index). A heap uses the RID (Row Identifier, 8 bytes) to uniquely identify rows and locate them on pages.

The clustered index key should not be a random value. The data at the leaf level is stored and ordered by the index key, so the key should grow continuously in order to avoid fragmentation and page splits. If this cannot be achieved with the PK, you should consider another key as the clustered candidate. A clustered index on an identity column, a sequential GUID or even something like the insertion date is fine from a sequential point of view, since all rows will be added to the last leaf page. On the other hand, while a uniqueidentifier may suit your business needs as a PK, it should not be clustered when it is randomly generated (for example with NEWID()).

If, after some data and query analysis, you find that you mostly use the same nonclustered index to get your data before doing a key lookup in the clustered PK, you may consider making that index the clustered index instead, even though it may not uniquely identify your data.

The clustered index key is composed of all the columns you want to index. A uniquifier column (4 bytes) is added if there is no unique constraint on it (an incremental value for duplicates, null otherwise). This index key is then stored once for each row at the leaf level of all your nonclustered indexes. Some of them will also be stored several times at the intermediate (branch) levels between the root and the leaf level of the index tree (B-tree). If the key is too large, all the nonclustered indexes get larger and require more storage, IO, CPU, memory, ... If you have a PK on name+birthdate+country, it is very likely that this key is not a good candidate: it is too large for a clustered index. A uniqueidentifier using NEWSEQUENTIALID() is usually not considered a narrow key (16 bytes), although it is sequential.

Then, once you have figured out how to uniquely identify rows in your table, you can add a PK. If you think you won't use it in your queries, don't create it clustered. You can still create another nonclustered index if you sometimes need to query it. Note that the PK will automatically create a unique index.

The nonclustered indexes always contain the clustered key. However, if the indexed columns (+ key columns) are covering, there won't be any key lookup in the clustered index. Don't forget you can also add INCLUDE and WHERE clauses to a nonclustered index (use them wisely).
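As an illustration of INCLUDE and WHERE on a nonclustered index, here is a minimal sketch. The table dbo.Invoices, its columns and the 'Open' status value are hypothetical and only serve the example:

```sql
-- Hypothetical table, clustered on its identity PK.
CREATE TABLE dbo.Invoices (
    InvoiceId   BIGINT IDENTITY(1,1) NOT NULL,
    CustomerId  INT            NOT NULL,
    InvoiceDate DATE           NOT NULL,
    Status      VARCHAR(10)    NOT NULL,
    TotalAmount DECIMAL(18, 2) NOT NULL,
    CONSTRAINT pk_Invoices PRIMARY KEY CLUSTERED (InvoiceId)
);

-- Covering index: CustomerId is the index key; the INCLUDE columns are
-- stored only at the leaf level. A query filtering on CustomerId and
-- reading InvoiceDate and TotalAmount needs no key lookup.
CREATE NONCLUSTERED INDEX ix_Invoices_CustomerId
    ON dbo.Invoices (CustomerId)
    INCLUDE (InvoiceDate, TotalAmount);

-- Filtered index: only rows matching the WHERE predicate are indexed,
-- which keeps the index small when few rows qualify.
CREATE NONCLUSTERED INDEX ix_Invoices_Open
    ON dbo.Invoices (InvoiceDate)
    WHERE Status = 'Open';
```

Both indexes also carry InvoiceId implicitly, because it is the clustered key.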

The clustered index should be unique and as narrow as possible. Its key should not change over time, and rows should be inserted incrementally.

It is now time to write some SQL which will create the table, clustered and nonclustered indexes and constraints.

This is all theoretical because we don't know your data model and the data types used (A and B).
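Without knowing the real model, the separation between the PK and the clustered index can still be sketched. The table dbo.Orders and all of its columns below are hypothetical; the inline INDEX syntax assumes SQL Server 2014 or later, as in the question:

```sql
-- Hypothetical table: the PK identifies rows, while the clustered index
-- decides how the rows are stored at the leaf level.
CREATE TABLE dbo.Orders (
    OrderId     BIGINT IDENTITY(1,1) NOT NULL,
    OrderDate   DATETIME2(0)   NOT NULL,
    CustomerId  INT            NOT NULL,
    TotalAmount DECIMAL(18, 2) NOT NULL,

    -- The PK is a constraint, enforced here by a unique NONCLUSTERED index.
    CONSTRAINT pk_Orders PRIMARY KEY NONCLUSTERED (OrderId),

    -- Clustered index on a narrow, ever-increasing, non-unique key;
    -- duplicate OrderDate values receive a uniquifier.
    INDEX cix_Orders_OrderDate CLUSTERED (OrderDate)
);
```

Whether OrderDate or OrderId is the better clustered key depends on how the table is queried; this only shows that the two choices can be made independently.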
