SQL Server – CHECKDB issue – Consistency errors on two tables in a critical database

clustering, dbcc-checkdb, sql-server, sql-server-2012

There was network maintenance last night: the server switch was being upgraded and the whole network was down. As DBAs we had prepared by disabling all replication and backup jobs on the DB server, but during the activity the WSFC (Windows Server Failover Cluster) initiated a failover that apparently did not complete. This left two nodes up and running with the databases and all the drives online on both servers, whereas the drives and SQL services were supposed to be on only one of them.

This resulted in corruption in several databases, and I had a very tough time trying to clear it. It started with two user databases, and later tempdb and msdb were also corrupted. I had to restart the service to fix tempdb; msdb was restored from the last successful backup, and everything seemed back to business as usual.

Afterwards, I executed DBCC CHECKDB on all databases, system and user. The system databases did not have any issues, but one of the user databases (a critical one) reports the errors below:

Command: DBCC CHECKDB ([User_DB_Critical]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY, MAXDOP = 2
Msg 8914, Level 16, State 1, Server DB_Cluster_Name, Line 1
Incorrect PFS free space information for page (1:1439286) in object ID 526624919, index ID 0, partition ID 72057594055753728, alloc unit ID 72057594056933376 (type In-row data). Expected value  95_PCT_FULL, actual value  80_PCT_FULL.
Msg 8951, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: table 'Job_Execution_Log_Table' (ID 526624919). Data row does not have a matching index row in the index 'PK289' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Server DB_Cluster_Name, Line 1
Data row (1:2224:6) identified by (HEAP RID = (1:2224:6)) with index values 'JOB_NAME = 'populate_Tran_details' and START_TIME = '2019-07-03 03:42:00.323' and HEAP RID = (1:2224:6)'.
Msg 8951, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: table 'Job_Execution_Log_Table' (ID 526624919). Data row does not have a matching index row in the index 'PK289' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Server DB_Cluster_Name, Line 1
Data row (1:1395530:49) identified by (HEAP RID = (1:1395530:49)) with index values 'JOB_NAME = 'populate_Tran_details' and START_TIME = '2019-07-03 03:41:13.480' and HEAP RID = (1:1395530:49)'.
Msg 8951, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: table 'Job_Execution_Log_Table' (ID 526624919). Data row does not have a matching index row in the index 'PK289' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Server DB_Cluster_Name, Line 1
Data row (1:1439286:43) identified by (HEAP RID = (1:1439286:43)) with index values 'JOB_NAME = 'populate_Tran_details' and START_TIME = '2019-07-03 03:45:00.890' and HEAP RID = (1:1439286:43)'.
Msg 8951, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: table 'Job_Execution_Log_Table' (ID 526624919). Data row does not have a matching index row in the index 'PK289' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Server DB_Cluster_Name, Line 1
Data row (1:1439286:44) identified by (HEAP RID = (1:1439286:44)) with index values 'JOB_NAME = 'populate_Tran_details' and START_TIME = '2019-07-03 03:48:00.473' and HEAP RID = (1:1439286:44)'.
Msg 8935, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: Object ID 1374679995, index ID 1, partition ID 72057594120962048, alloc unit ID 72057596467675136 (type In-row data). The previous link (1:1685287) on page (1:491016) does not match the previous page (1:1445099) that the parent (1:232830), slot 129 expects for this page.
Msg 8937, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: Object ID 1374679995, index ID 1, partition ID 72057594120962048, alloc unit ID 72057596467675136 (type In-row data). B-tree page (1:491016) has two parent nodes (0:1), slot 0 and (1:1591622), slot 138.
Msg 8977, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: Object ID 1374679995, index ID 17, partition ID 72057594121093120, alloc unit ID 72057596467806208 (type In-row data). Parent node for page (1:692096) was not encountered.
Msg 8979, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: Object ID 1374679995, index ID 17, partition ID 72057594121093120, alloc unit ID 72057596467806208 (type In-row data). Page (1:692097) is missing references from parent (unknown) and previous (page (1:1548068)) nodes. Possible bad root entry in system catalog.
Msg 8978, Level 16, State 1, Server DB_Cluster_Name, Line 1
Table error: Object ID 1374679995, index ID 1, partition ID 72057594120962048, alloc unit ID 72057596467675136 (type In-row data). Page (1:1623651) is missing a reference from previous page (1:491016). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 5 consistency errors in table 'Job_Execution_Log_Table' (object ID 526624919).
CHECKDB found 0 allocation errors and 5 consistency errors in table 'Tran_details_Table' (object ID 1374679995).
CHECKDB found 0 allocation errors and 10 consistency errors in database 'User_DB_Critical'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (User_DB_Critical).

Size of the tables: (screenshot showing the table sizes)

I suggested the following approach to my manager (a rough T-SQL sketch of the corresponding commands follows the list):

  1. Try to find the rows that were inserted at that time and, if possible, delete them from the two tables above.
  2. If step 1 is not possible, rebuild all the indexes on the table. A rebuild requires exclusive access to the table.
  3. If the rebuild doesn't work, we will need to drop and recreate the index. This also requires exclusive access to the table.
  4. If step 3 doesn't work, we will have to go for the REPAIR_REBUILD option. This requires the whole database to be in single-user mode, meaning nobody can access the database while the operation is running.
  5. If step 4 doesn't work, we will need to go for the REPAIR_ALLOW_DATA_LOSS option, which is time-consuming and can lose the data that has the consistency issues. This again requires the database to be in single-user mode with nobody accessing it.
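
For reference, steps 2 to 5 would look roughly like the sketch below. The schema and the key columns of PK289 are assumptions (only the CHECKDB messages are available, not the index definition), and if PK289 is actually a primary key constraint it would have to be dropped and recreated through ALTER TABLE instead.

-- Step 2: rebuild every index on the affected table (needs exclusive access)
ALTER INDEX ALL ON dbo.Job_Execution_Log_Table REBUILD;

-- Step 3: drop and recreate the suspect non-clustered index
-- (key columns guessed from the CHECKDB output)
DROP INDEX PK289 ON dbo.Job_Execution_Log_Table;
CREATE NONCLUSTERED INDEX PK289
    ON dbo.Job_Execution_Log_Table (JOB_NAME, START_TIME);

-- Steps 4/5: repair options, database must be in single-user mode
ALTER DATABASE [User_DB_Critical] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'User_DB_Critical', REPAIR_REBUILD);
-- DBCC CHECKDB (N'User_DB_Critical', REPAIR_ALLOW_DATA_LOSS);  -- step 5, last resort only
ALTER DATABASE [User_DB_Critical] SET MULTI_USER;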

I have a full database backup taken just before the activity. The activity was planned for the morning of 3rd July, and because of the issues across databases it was 6:30 AM before everything was corruption-free and business was running as usual. For msdb and one user database I restored from the previous backup only. I ran CHECKDB after business hours on 3rd July, so the database contains the whole day's data; if we restore the 3rd July backup taken before the activity, we will lose all of that day's data, which is not acceptable to the business.

A bit more detail on backups: I am currently using Ola Hallengren's script to take backups, and last night's backup ran successfully. These are the parameters I am using:

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d DBA_Maintenance -Q "EXECUTE [dbo].[DatabaseBackup] @Databases = 'USER_DATABASES, -One_Heavy_Database', @Directory = N'DB_Backup_Path', @BackupType = 'FULL', @Verify = 'Y', @CleanupTime = 24, @CheckSum = 'Y', @Compress = 'Y',  @LogToTable = 'Y'" -b

I use the verify and checksum options to check the backups. A differential backup is scheduled every 2 hours and a log backup runs every 15 minutes (log shipping is configured but stopped for now), and so far no backup has failed or reported any issue.
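
As far as I know, @Verify = 'Y' together with @CheckSum = 'Y' in Ola Hallengren's procedure means the backup is taken WITH CHECKSUM and then verified, which is roughly equivalent to running the following against the backup file (the file name below is a placeholder):

RESTORE VERIFYONLY
FROM DISK = N'DB_Backup_Path\User_DB_Critical_FULL.bak'
WITH CHECKSUM;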

On the heavy table, 3 consistency errors are on the clustered index and 2 are on a non-clustered index. The first table, Job_Execution_Log_Table, has all of its inconsistencies on a non-clustered index.

I need advice on how to proceed and on the most effective, least time-consuming way to get these consistency errors fixed.

Currently I am going through Paul Randal's link and trying to see whether that would be the best bet.

EDIT: I restored the backup from the primary to a secondary server and ran CHECKDB there; it reported the same consistency errors as on the primary. I dropped and recreated the non-clustered index, which cleared 4 of the consistency errors, and only one remains:

Incorrect PFS free space information for page (1:1439286) in object ID 526624919, index ID 0, partition ID 72057594055753728, alloc unit ID 72057594056933376 (type In-row data). Expected value  95_PCT_FULL, actual value  80_PCT_FULL.

I haven't touched the big table yet, since its errors are in the clustered index, and I don't know how to fix this PFS issue.

Appreciate your advice.

Version: Microsoft SQL Server 2014 (SP3) (KB4022619) – 12.0.6024.0
(X64) Sep 7 2018 01:37:51 Copyright (c) Microsoft Corporation
Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.3
(Build 9600: ) (Hypervisor)

Best Answer

This is not a direct answer, just some suggestions.

I need advice on how to proceed and on the most effective, least time-consuming way to get these consistency errors fixed.

You did not talk about restoring from backup. Why not? Do you not have a clean backup? Did you not take complete user and system database backups before the activity? If not, that was a blunder.

If you have backups, start restoring from backup on the server (under a different name) and in parallel see if you succeed with steps 1/2/3. If those steps fail, by the end of the day you would still have the restored database ready and could point the application to it, just to avoid the management noise.
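
A rough sketch of that side-by-side restore, in case it helps; the logical file names and paths below are placeholders and should be taken from RESTORE FILELISTONLY:

-- Find the logical file names inside the backup
RESTORE FILELISTONLY FROM DISK = N'DB_Backup_Path\User_DB_Critical_FULL.bak';

-- Restore under a different name, staying in NORECOVERY to apply further backups
RESTORE DATABASE [User_DB_Critical_Restored]
FROM DISK = N'DB_Backup_Path\User_DB_Critical_FULL.bak'
WITH MOVE N'User_DB_Critical_Data' TO N'D:\Data\User_DB_Critical_Restored.mdf',
     MOVE N'User_DB_Critical_Log' TO N'E:\Log\User_DB_Critical_Restored.ldf',
     NORECOVERY;

-- Apply the latest differential and the log backups the same way, then:
RESTORE DATABASE [User_DB_Critical_Restored] WITH RECOVERY;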

CHECKDB suggested repair_allow_data_loss as the minimum repair level. I would rarely run this on a production database, knowing it may delete anything it likes and remove business constraints, leaving me with a database that is basically useless as far as business rules go. So only use repair if you do not have a backup and all the above steps fail, and if you are using it, God be with you.

EDIT: (from the chat)

In this particular case, only CHECKDB has reported these consistency errors; otherwise there is no ongoing issue. I hope it's clear now.

Thanks for explaining. So the database is currently functioning, but CHECKDB reported issues. You have to inform management that there is corruption and that sooner or later they will start seeing exceptions. The reason I think they have not encountered it yet is that the corrupted pages have not been read into memory yet.

What you should do

  1. Inform stakeholders that there is corruption; show them the error messages.

  2. Start restoring from whatever backup you have, and you may also need to put the application into read-only mode. See how much difference there is.

  3. Also try to take a backup of the current corrupt database, using CONTINUE_AFTER_ERROR if it fails. It is quite likely that the backup will fail if there is corruption. If you succeed, restore it with CONTINUE_AFTER_ERROR and run REPAIR_ALLOW_DATA_LOSS on that copy to see how much data you would lose (see the sketch after this list).

  4. You have identified 2 tables causing the problem. Try moving the data out of these 2 tables into other tables (for example, create a new table like old_table_temp) and see how much data you can move. If you can move most of the data out of these 2 tables, and you are sure they are all that the corruption affected, then drop the tables and recreate them from the data you just moved out (also sketched after this list).

  5. After you have dropped and recreated the tables and populated them with fresh data, run CHECKDB again and see if it comes back clean.

  6. From the restored backup, see if you can get back the deleted data.

  7. Dropping and recreating an index will most likely fix the issue if the index is non-clustered; for a clustered index it will not.
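
For items 3 and 4, the commands would look roughly like this; the backup path and the _temp table names are placeholders, and the restore of that corrupt backup would likewise need CONTINUE_AFTER_ERROR:

-- Item 3: back up the corrupt database even if checksum errors are found
BACKUP DATABASE [User_DB_Critical]
TO DISK = N'DB_Backup_Path\User_DB_Critical_corrupt.bak'
WITH CHECKSUM, CONTINUE_AFTER_ERROR;

-- Item 4: salvage whatever is still readable from the damaged tables
SELECT * INTO dbo.Job_Execution_Log_Table_temp FROM dbo.Job_Execution_Log_Table;
SELECT * INTO dbo.Tran_details_Table_temp FROM dbo.Tran_details_Table;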

EDIT:

A differential backup is scheduled every 2 hours and a log backup runs every 15 minutes (log shipping is configured but stopped for now), and so far no backup has failed or reported any issue.

Wow, you have log shipping. I hope the secondary database is in read-only/standby mode; if so, that is very good. Immediately stop all LS jobs. If it is in restoring mode and you have Enterprise Edition, create a snapshot and run CHECKDB on that. Take application downtime, go ahead and run CHECKDB on the secondary database, and if it comes back clean, bring it online and point the application to it; you might get away with this and get zero data loss.
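
If the snapshot route is an option on your secondary, the syntax is roughly the following (Enterprise Edition; the logical file name and the snapshot path are placeholders, the logical names come from sys.master_files):

-- Create a snapshot and run CHECKDB against it instead of the live database
CREATE DATABASE [User_DB_Critical_Snapshot]
ON ( NAME = N'User_DB_Critical_Data',
     FILENAME = N'D:\Snapshots\User_DB_Critical_Data.ss' )
AS SNAPSHOT OF [User_DB_Critical];

DBCC CHECKDB ([User_DB_Critical_Snapshot]) WITH NO_INFOMSGS, ALL_ERRORMSGS;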

On the heavy table, 3 consistency errors are on the clustered index and 2 are on a non-clustered index. The first table, Job_Execution_Log_Table, has all of its inconsistencies on a non-clustered index.

You may try with the NCI and see if that solves anything, but I would rely on the secondary database and hope the corruption has not propagated there.

I know it is a lot of work, but this is what I believe would give you the least possible data loss.

Good Luck