Sql-server – SQL Server online maintenance – False publicity

dbcc-checkdbindex-maintenancemaintenancesql server

I'm being challenged since we have customers that are seeing different types of errors while running what is supposed to be online maintenance. errors like timeouts, scan stopped due to data movement and more.

Some of our customers purchased SQL server enterprise edition to be able to get the rebuild index online feature.

To replicate the issues, I've been testing on a big database the dbcc checkdb, reorganize and rebuild of indexes and the update of statistics while I bombard the server with transactions on a database that is almost 1TB.

My first test was with the checkdb with maxdop=1 while the sqlServer was processing 124000 small transactions… I received a timeout from my application which has a the timeout set to 5 minutes.
Research on the way the check db works , is that is creates a snapshot, uses tempdb intensively and nolocks are created to create the snapshot…
So how can one of my small transactions be blocked if it doesn't lock tables?

My second test was to reorganize all indexes (which is also supposed to be online) while processing 124000 transactions, This time I got a deadlock…

My third test was to update all statistics with maxdop=1 while running 124000 transactions. The error that was received in this case is: Could not continue scan with no lock due to data movement

my forth test to rebuild all indexes online while running 124000 transactions will be running soon and I will update my post with the results.

The maxdop=1 that I used where it can be used was to eliminate contention of resources.

I've read multiple articles inclusing Paul Randal's explanation of online vs offline of the rebuild of indexes and understand the difference… but as I said, I'm being challenged to explain why it isn't really online although Microsoft claims it's online.

Any input would be much appreciated.

What am I missing

Best Answer

A lot of questions here. I have a feeling the answers won't make you happy, though.

DBCC CheckDB

To replicate the issues, I've been testing on a big database the dbcc checkdb, reorganize and rebuild of indexes and the update of statistics while I bombard the server with transactions on a database that is almost 1TB.

I don't know that I'd lumped DBCC CHECKDB with "online maintenance", but it certainly isn't an offline activity by nature. That being said, it can still be resource intensive which could manifest as an application "not working" although the database is technically still online and available.

I received a timeout from my application which has a the timeout set to 5 minutes...So how can one of my small transactions be blocked if it doesn't lock tables?

That doesn't indicate any blocking, especially since DBCC CHECKDB doesn't cause blocking unless you specify it should. You should take a closer look at monitoring the queries to see what the cause may be for the timeout - it could be a lot of things honestly.

Online Index Builds

My second test was to reorganize all indexes (which is also supposed to be online) while processing 124000 transactions, This time I got a deadlock...

Was the deadlock with the index reorganize statement, or another application statement? I can't recall a reorganize causing this issue ever for me, but deadlocks are a fact of life and the application should be able to gracefully retry when it encounters them. The operation being online in no way guarantees you won't get deadlocks.

Most people schedule index maintenance overnights/weekends to help avoid related issues around locking, blocking, and resource contention. Testing against a normal business hour workload is bound to give you problems like you're seeing.

Update Statistics

My third test was to update all statistics with maxdop=1 while running 124000 transactions. The error that was received in this case is: Could not continue scan with no lock due to data movement

This one's on you (or rather, the application): don't use NOLOCK and you won't get NOLOCK related errors. Again, not sure I'd recommend updating all stats on a busy system during high volume, so this seems like a test destined to have issues as well.

Technical Definitions

I've read multiple articles inclusing Paul Randal's explanation of online vs offline of the rebuild of indexes and understand the difference... but as I said, I'm being challenged to explain why it isn't really online although Microsoft claims it's online.

If you'd read this then you know that online doesn't mean no locks are taken. The definition of the word is highly contextual. You can point whoever you need to to the documentation on online rebuilds:

Although not common, the online index operation can cause a deadlock when it interacts with database updates because of user or application activities. In these rare cases, the SQL Server Database Engine will select the user or application activity as a deadlock victim.

Conclusion

All this is to say that regardless of maintenance operations being able to be "online", most people don't test them during full workloads nor expect them to work well during normal transactional volume.

Most of the time you really won't need to rebuild/reorg all indexes or update all stats at the same time, even during maintenance windows. If you have a VLDB, split out that DBCC CHECKDB into different windows (like Paul has covered so well).

I'd focus more on how/when these things will actually run. Use something like Ola's maintenance solution so you're not making SQL Server do more work than it has to.

And have your customer scale back their expectations!

See also