Let me try to summarise the answers given by the commenters above.
First of all, to achieve higher deletion speeds you can do three things:
- Delete a higher number of rows at a time
- Use partitioning
- Run multiple deletes
Ad 1) Deleting a large number of rows in a single statement hits the transaction log hard and causes locking, so you likely want each DELETE statement to be "small". I find that around 10K rows per batch is a good number (and it helps to have table lock escalation turned off with ALTER TABLE Foo SET (LOCK_ESCALATION = DISABLE)).
Ad 2) If your table is not already partitioned, this is not an option.
Ad 3) DELETE in SQL Server is single threaded, so you need to run more than one at a time to get max speed. To run multiple deletes concurrently, you need a way to partition them so each parallel delete runs on its own set of keys and doesn't block with the others. Typically, you can use the primary key of the table to partition the delete.
For example, if you have an IDENTITY column called key on the table, you can first run:
SELECT MAX(key) - MIN(key), MIN(key) FROM Foo WHERE <rows that must be deleted>
Keep the Max - Min and the Min somewhere you can quickly read from (for example a table in tempdb), and load them into the variables @IntervalSize and @MinKey respectively.
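As a sketch, this could look like the following (the scratch table name is illustrative, and <rows that must be deleted> is the same placeholder predicate as above):

```sql
-- Compute the key interval once and stash it where every session can read it.
SELECT MAX([key]) - MIN([key]) AS IntervalSize,
       MIN([key])              AS MinKey
INTO tempdb.dbo.DeleteInterval
FROM Foo
WHERE <rows that must be deleted>;

-- Then, at the start of each parallel session:
DECLARE @IntervalSize BIGINT, @MinKey BIGINT;
SELECT @IntervalSize = IntervalSize, @MinKey = MinKey
FROM tempdb.dbo.DeleteInterval;
```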
Let us say you decide that you can allocate 4 cores for running deletion. You now run 4 queries, either from 4 x SSMS new query or four command prompt via SQLCMD, each doing this:
DECLARE @NumDone INT = 1;
WHILE @NumDone > 0
BEGIN
    -- DELETE TOP avoids the deprecated SET ROWCOUNT for DML
    DELETE TOP (10000) FROM Foo
    WHERE [Key] >= @MinKey + @IntervalSize / 4 * @n
      AND [Key] <  @MinKey + @IntervalSize / 4 * (@n + 1)  -- for @n = 3, use <= so the maximum key is included
      AND <rows that must be deleted>;
    SET @NumDone = @@ROWCOUNT;
END
Pick @n = 0, 1, 2 or 3 for each of the four concurrently running DELETEs. Because each session works on its own half-open key range, the parallel deletes do not block each other.
You've got lots of questions in here:
Q: (The lack of foreign keys) confuses me a lot! It is a good practice (not mandatory) to have Fk's in the DWH for a variety of reasons (data integrity, relations visible for semantic layer, ....)
A: Correct, it's normally a good practice to have foreign keys in a data warehouse. However, clustered columnstore indexes don't support that yet.
Q: So MS advocates Clustered Column store indexes for DWH scenarios, However it can not handle FK relationships?!
A: Microsoft gives you tools. It's up to you how you use those tools.
If your biggest challenge is a lack of data integrity in your data warehouse, then the tool you want is conventional tables with foreign keys.
If your biggest challenge is query performance, and you're willing to check your own data integrity as part of the loading process, then the tool you want is clustered columnstore indexes.
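For instance, a load-time orphan check can stand in for the missing foreign key. This is only a sketch; the table and column names (FactSales, DimCustomer, and so on) are illustrative:

```sql
-- Flag staged fact rows whose DimCustomerKey has no match in the dimension
-- table, before loading them into the clustered columnstore fact table.
SELECT f.DimCustomerKey, COUNT(*) AS OrphanRows
FROM staging.FactSales AS f
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimCustomer AS d
                  WHERE d.CustomerKey = f.DimCustomerKey)
GROUP BY f.DimCustomerKey;
-- If this returns any rows, fail the load (or divert those rows to a
-- quarantine table) instead of inserting them.
```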
Q: So SQL 2014 then adds no real new value for DWH??
A: Thankfully, clustered columnstore wasn't the only new feature in SQL Server 2014. For example, check out the new cardinality estimator.
Q: Why am I so angry and bitter about the way my favorite feature was implemented?
A: You caught me - you didn't really ask that question - but I'll answer it anyway. Welcome to the world of third party software where not everything is built according to your exact specifications. If you feel passionately about a change you'd like to see in a Microsoft product, check out Connect.Microsoft.com. It's their feedback process where you can submit a change, other people can vote it up, and then the product team reads it and tells you why they won't implement it. Sometimes. Most of the time they just mark it as "won't fix, works on my machine" but hey, sometimes you do get some answers.
Best Answer
For question 2.
I think I will break the large table up into many smaller tables and then use a view to union them back together. This will give me something like table partitioning without the table partitioning feature, as outlined in this article:
https://www.simple-talk.com/sql/sql-tools/sql-server-partitioning-without-enterprise-edition/
The view is created with schemabinding and unions all the tables together. I can then insert data into and select data from this view as if it were the primary fact table. I would only need to ensure all my SELECT queries against the view include the column I chose to partition the tables on, to get the full benefit.
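A minimal sketch of such a partitioned view (the table names, the partitioning column SaleYear, and the CHECK constraints are all illustrative; the CHECK constraints are what let the optimizer skip member tables when a query filters on the partitioning column):

```sql
-- Illustrative member tables, partitioned by year
CREATE TABLE dbo.FactSales2013 (
    SaleYear INT   NOT NULL CHECK (SaleYear = 2013),
    Amount   MONEY NOT NULL
);
CREATE TABLE dbo.FactSales2014 (
    SaleYear INT   NOT NULL CHECK (SaleYear = 2014),
    Amount   MONEY NOT NULL
);
GO
CREATE VIEW dbo.FactSales
WITH SCHEMABINDING
AS
SELECT SaleYear, Amount FROM dbo.FactSales2013
UNION ALL
SELECT SaleYear, Amount FROM dbo.FactSales2014;
GO
-- A query that filters on SaleYear only touches the matching table:
-- SELECT SUM(Amount) FROM dbo.FactSales WHERE SaleYear = 2014;
```

Note that inserting through the view (rather than directly into the member tables) has extra requirements for an updatable partitioned view, such as the partitioning column being part of each member table's primary key.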