Sql-server – Potential corruption in sys.sysdercv, latch errors from ghost cleanup process

ghost-cleanupsql serversql-server-2016

We're running SQL Enterprise 2016 (13.0.4411). I have a database that was using service broker to handle some workflow, long story short, it was designed incorrectly and none of the conversations were closed correctly. The sys.sysdercv system table blew up to about 270,000,000 rows and 140GB in size.

I created an agent job that grabbed the top 50,000 conversation handles from sys.conversation_endpoints in DI or DO state, and closed them with cleanup. This agent job ran for a couple of weeks, and finally got the majority of the table cleaned out. However, now it appears that there may be a corrupted page in sys.sysdercv.

The ghost cleanup process (spid 596 in this case) has been chugging along for about 37 hours now. Checking it with sp_whoisactive shows that it's trying to get a lock on sys.sysdercv. I'm also seeing the following error in the SQL log:

A time-out occurred while waiting for buffer latch — type 3, bp 000000FA1CA98B80, page 1:21335899, stat 0x10b, database id: 13, allocation unit Id: 281474981494784/4784128, task 0x000000AF954AACA8 : 0, waittime 300 seconds, flags 0x100000001a, owning task 0x000001B98D8D04E8. Not continuing to wait.

The allocation unit ID corresponds to the clustered index on sys.sysdercv. The owning task address corresponds to the ghost cleanup process. I tried to run the following:

DBCC PAGE (13, 1, 21335899, 0) WITH TABLERESULTS;

This just hangs and doesn't return anything (sp_whoisactive shows that it's blocked by the ghost cleanup process). I also tried to run this:

DBCC CHECKTABLE ('sys.sysdercv') WITH TABLOCK, ALL_ERRORMSGS;

Same deal, it hangs, then eventually times out with a latch error complaining about the same page.

At this point, I'm unsure of what to do. There is no data I currently care about in sys.sysdercv or sys.sysdesend (no messages queued at the moment), so I was hoping I could just truncate these tables, which would (ideally) get rid of the offending page. I can't truncate system tables though. I can't kill the ghost process in order to do DBCC CHECKTABLE.

I thought about stopping the SQL service, enabling trace flag 661 on startup to prevent the ghost cleanup process from running, and then doing the table check. However, I'm also seeing this error message:

One or more recovery units belonging to database 'xxx' failed to generate a checkpoint. This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

So I don't know if the database will even come back online if I restart the SQL service. My differential backups have been failing due to this issue, and DBCC OPENTRAN shows that the ghost cleanup is preventing the log space from being reused.

So…all that said, is there anything at all I can do to fix this problem? Is there some magical trace flag that would allow me to truncate system tables? Can I safely restart SQL without the ghost cleanup process in order to run DBCC CHECKTABLE (as in, will my database come back online since it hasn't been able to checkpoint in the past day and a half)?

Any help would be greatly appreciated.

Best Answer

UPDATE: The problem has been resolved. Took a lot of screwing around, but we were able to take the instance down, bring it back up, put the database in emergency mode, and ran a DBCC CHECKTABLE with REPAIR_ALLOW_DATA_LOSS (we were able to fix it in a test restore with REPAIR_REBUILD, but emergency mode won't let you use that option, apparently). That fixed the corruption in sys.sysdercv, and everything is back to normal. Still not sure what ended up causing it, but backups are working again, DBCC CHECKDB has been coming back clean. Chalking it up to "stuff happens", I guess.