SQL Server Monitoring – How to Track Disappearing Rows in Production

sql serversql-server-2008-r2

Scenario:

We write online banking software and for the time being (due to legacy design decisions) are using Quartz 2.2 right now as a queue system to perform a handful of operatoins long term. [Let's sidebar the discussion on if it's the right tool, it's worked well for many years until we updated to 2.2]

Part of that quartz.net has the following tables (related) for schema: https://gist.github.com/jcolebrand/8695603

So the process is we're inserting records in this table, to the tune of say 80k records. (I have three replications of quartz.net serving three different configurations, one has 80k records, one 50k, one 280k, so it varies). We insert them from some other table that is the table of record, so rebuilding this table isn't a terrible loss. The issue is described below.

I have a tool which can bulk schedule these tasks (most of which are one-time and will happen anywhere from two weeks to twenty years in the future, again, this is about the db and not the architecture choices) and I can debug step through, see the rows get inserted into the table with no concern. I can monitor the table and see that they get inserted. I then come back 30 minutes later (well, this part varies. It's not deterministic on when they disappear) and check and the records are now missing.

Out of 80k on the one instance, I'm missing approximately 2700 records. And when I run my tool, I see them all synced up, and after some matter of time, they have evaporated again.

Here's things I've tried:

  • checking the all transactions report in SSMS
  • checking the all blocking transactions report in SSMS
  • leaving my scheduling application open for long duration (in case, by some freak magic, there was an open, uncommitted transaction)
  • restarting the Quartz.net windows service application maintaining the database
  • inserting records with the quartz.net windows service disabled

Things I haven't tried:

  • restarting SQL Server
  • my queries don't use "with(nolock)" [I've heard that's bad juju for day-to-day]

Things I think I should do and don't know how:

  • run SQL Profiler against the instance and monitor for "delete" statements
  • magic to determine if there is an uncommitted transaction
  • sacrifice unicorns
  • telephone a friend
  • daily double

@@version:

Microsoft SQL Server 2008 R2 (SP2) – 10.50.4000.0 (X64)

Jun 28 2012 08:36:30

Copyright (c) Microsoft Corporation

Standard Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1) (Hypervisor)

EDIT 2014-01-30

It is imperative for internal reasons that I can demonstrate precisely why the records are disappearing, including but not limited to:

  • deletion by a user
  • uncommitted transaction
  • deletion by Quartz itself
  • some other unexplained phenomena
  • butterfly effect
  • trigger effect
  • unicorns

Where possible, I need the specific statement involved in the delete.

TL;DR

So, basically, the TL;DR is: records are inserted. For some duration of time up to at least 30 minutes record definitively show in queries against the table. After some duration, records no longer show in queries against the table.

What gives? What am I overlooking? How would you profile this IN PRODUCTION to see what's happening here?

Best Answer

I would debug this using a server-side trace. Essentially, the reason Profiler is problematic in production is because it is a client application. If you attempt to trace to much and SQL Server is trying to manage that communication between the server and the client, it can bog things down. So instead, you can script out the trace as a T-SQL script and run it on the server itself. It will capture the information to a trace file, then you can take that trace file elsewhere and review it.

To script the trace, open Profiler and select your events. Since you are concerned that statements may not be committed/completed, I would recommend

  • RPC: Starting/Completed
  • SP: Starting/Completed
  • SQL: BatchStarting/BatchCompleted

Included relevant columns. Filter as necessary (such as database name, maybe duration)

Then click 'Run', but immediately stop. At this point, go to your toolbar and select: File->Export->Script Trace Definition->For SQL Server... This will ask you to save a script somewhere. Once you've saved it, open it.

Before you run it, specify a file name and path (replace InsertFileNameHere). Then execute to start the trace. The trace will continue to run until you stop it. To stop it, make sure you know the trace ID, which is returned when you run the script. If you can't remember it, look at sys.traces. Once you have the trace ID, you can stop it by using sp_settracestatus to set the status to 0 to stop it. Once it is set to 0, you can set it to 2 to delete it.

After stopping it, you should have a trace file to review and identify what's going on with your processes. The trace file is viewable through the Profiler application.