Sql-server – SQL Server 2008 R2 returning wrong record

sql-server-2008-r2

I have a SQL Server 2008 R2 hosting a database with which my software works through the ODBC API. The database has a table, called Users, which has a number of columns, but there are 3 columns (3 first ones in the Table Designer, if it matters) instrumental to understanding the problem:

  1. UserID – INT, PK, identity, incremented by 1, starts from 10
  2. NetworkUserID – BIGINT
  3. NetworkID – INT

Fields 2 and 3 are indexed.

There is a stored procedure, which receives NetworkUserID and NetworkID through the input parameters and returns UserID through an out parameter, here is a semi-pseudocode of the procedure:

PROCEDURE [dbo].[Procedure]
@NetworkID INT,
@NetworkUserID BIGINT,

@UserID INT OUT
AS
BEGIN
SET @UserID = 0

-- check whether the user exists
IF NOT EXISTS
(
    SELECT [UserID]
    FROM [Users]
    WHERE [NetworkID] = @NetworkID AND [NetworkUserID] = @NetworkUserID
)
BEGIN

    -- user does not exist, create a new record for him
    INSERT INTO [Users] 
        ([NetworkUserID], [NetworkID], [some_other_fields])
    VALUES
        (@NetworkUserID, @NetworkID, @some_other_values)

END
ELSE
BEGIN

    -- user has already registered, update his profile
    UPDATE [Users] SET [some_irrelevant_fields_are_being_updated_here, UserID_is_not_touched] WHERE [NetworkID] = @NetworkID AND [NetworkUserID] = @NetworkUserID

END

-- get the internal user ID
SELECT @UserID = [UserID] FROM [Users] WHERE [NetworkID] = @NetworkID AND [NetworkUserID] = @NetworkUserID

RETURN
END

Here is the problem – when the physical server running this SQL server starts to get low on memory, the SQL server starts to slam the disk due to massive swapping.

This results in overall, pretty significant, slowing down of query execution and here the problem pops up:

occasionally the SQL server returns the wrong UserID for the specified pair of NetworkID and NetworkUserID. This happens very randomly, so I cannot track it down. But it only happens when the SQL server slows down due to swapping.

I have instructed the SQL server not to consume more than 3/4 of available RAM, but it doesn't help, because the remaining memory is eventually eaten up by a leaky system service, which eventually leads to swapping anyway. From time to time I see the following messages in the logs of my software produced by the ODBC API:

SQLFetch: state = 40001, native error = 1205, message = [Microsoft][SQL Server Native Client 10.0][SQL Server]Transaction (Process ID 63) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

SQLDriverConnect: state = 08001, native error = 258, message = [Microsoft][SQL Server Native Client 10.0]Shared Memory Provider: Timeout error [258].

Apparently, these messages have something to do with the described above error.

Did anyone experience anything like that?

How can the SQL server return a totally wrong record by a uniqie pair of values NetworkID + NetworkUserID?

Best Answer

First, make sure your SQL Server is fully patched and up to date. For 2008 R2, at this time I'd say the hard minimum is build 10.50.6220 (SP3 + MS15-058), and you really should be at 10.50.6537 (SP3 + MS15-058 + TLS 1.2 capability).

Second, make sure you have a unique index on the combination of NetworkUserID, NetworkID (with the highest cardinality one of those two first).

Third, patch your host OS and move that leaky service to another machine.
Or, if you absolutely cannot, then reboot the server often enough that it doesn't run out of memory before the next reboot. Or buy more RAM for it.

My expectation is that EITHER the patching will fix it, OR putting the unique index on will fail because you've got duplicates in the table already.

The other thing that might be an issue is that you have an actual hardware fault, and may have corrupt data - this is actually very rare.

  • make sure you have SQL Alerts set up and running; Glen Berry has a script on sqlskills.com to do this.

  • run DBCC CHECKDB with DATA_PURITY first, on ALL databases on the instance.

  • make sure you have CHECKSUM checking on every database.

  • rebuild all your indexes (ALTER INDEX ALL on REBUILD).

    • just in case CHECKSUM was added late, and some of the pages don't have it yet because they've never been touched.

    • and make sure every table has a clustered index, while you're at it.

  • check for existing records of disk errors (event log, hardware logs, etc.)

  • run chkdsk in Windows on all your drives.

  • if you're on RAID, try a consistency check and/or patrol read or other validation.

  • run underlying hardware diagnostics.