SQL Server – Dirty read or phantom row

concurrency, isolation-level, sql-server

We are seeing dirty reads or phantom rows in our data warehouse. My question is not how to solve it (that is a long story and I am currently working on it); I want to know what exactly happens, so that I better understand the behaviour and use the correct terminology. Are these dirty reads or phantom rows?

Let's say we have SourceTableA in the ERP database with a CLUSTERED UNIQUE index on Col1 and Col2. The DWH extraction process performs the following query:

SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA WITH(NOLOCK);

This is not recommended, because it allows dirty reads among other things, but the ERP vendor forces us to do it. The vendor knows the implications, uses READ UNCOMMITTED for all their own transactions, and requires us to use the NOLOCK hint.
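For reference, the NOLOCK table hint and running the statement under the session-level READ UNCOMMITTED isolation level give the same behaviour for this SELECT. A minimal sketch of the two equivalent forms, using the table from the example:

-- Form 1: per-table hint, as the vendor requires
SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA WITH(NOLOCK);

-- Form 2: session-level isolation, which the vendor's own transactions use
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA;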

Say we have RowA in SourceTableA:

('GreatCompany','WonderfullItem','ColorRed')

Now (I think…) a user session in the ERP system performs the following update on RowA while the previous query is running:

UPDATE SourceTableA
SET Col3 = 'ColorBlue'
WHERE Col1 = 'GreatCompany'
AND Col2 = 'WonderfullItem'

Whether I am correct about that or not, what I see happening is that 2 rows are read for this key by the extraction process. That does not surprise me, since we use NOLOCK (although I am not sure I understand the exact cause, hence this question), but I have 2 questions about it:

  1. Since the row is not moved around in the CLUSTERED index, because the clustered index key does not change, why is it read a second time? My thinking is that the extract process is already past that row, so it should not encounter it again. Or is that not how it works? The only explanation I can think of is that the extract query is not reading the data through the CLUSTERED index, but through a NONCLUSTERED index in which the row IS moved around (a sketch of such an index follows after these two questions). Or what else am I missing? Or is it just a dirty read? Or is it simply 'SQL Server does not guarantee row order unless you use an ORDER BY, DISTINCT, or do not use NOLOCK'? Then I am curious why exactly. 🙂

  2. Regarding dirty reads, I always thought the behaviour I see was a phantom row, but, to my understanding, for a phantom row to occur we actually need 2 statements within the same transaction that read the same set of data. That is not the case here, since it is only 1 statement, so by elimination I think this should be a dirty read? On the other hand, for a dirty read to occur we need an update that is later rolled back by another transaction, and that is not what I assume is happening. Or is that actually the only explanation?
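For illustration, this is the kind of hypothetical NONCLUSTERED index meant in question 1. It does not necessarily exist on the real table; it is only a sketch of an index whose key changes when Col3 is updated, so its index row would move:

-- Hypothetical index, for illustration only. Its key is Col3, so the
-- UPDATE above moves its index row from the 'ColorRed' position to the
-- 'ColorBlue' position. Every nonclustered index row also carries the
-- clustered key columns (Col1, Col2), so this index alone could satisfy
-- the extraction query, and a NOLOCK scan of it could read the moved row
-- twice or miss it entirely, depending on where the scan point is
-- relative to the old and new positions.
CREATE NONCLUSTERED INDEX IX_SourceTableA_Col3
    ON SourceTableA (Col3);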

Best Answer

  1. You are correct in your initial assumptions here. If the row doesn't move (including as a result of a page split), then a scan/seek won't read it twice. But, as you say, perhaps the query is done through an NC index in which the index row was moved, hence the double read.

  2. Regarding terminology, there are some standard phenomena defined in ANSI SQL which are used to describe the isolation levels: isolation level A eliminates phenomena X and Y but not Z. Such phenomena include the dirty read (reading something that has been modified but might later be rolled back) and the phantom (reading some data, then reading it again within the same transaction and finding a new row in that set); a two-session sketch of both follows below. The double read isn't among the phenomena defined in ANSI SQL, perhaps because the standard is too posh to bother about physical implementations (like indexes), and in such a world encountering the same row twice isn't even on the table. I.e., what you describe here is what we in the SQL Server community tend to call a "double read", but don't expect to see the terminology formalized for this phenomenon.

(I realize that I probably over/misuse the word phenomena here, my bad in that case!)
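To make the two terms concrete, here is a minimal two-session sketch using the question's table (assumed to be run in the ERP database; the inserted row in the phantom part is invented for the example):

-- Dirty read: session 2 reads a change that session 1 later rolls back.

-- Session 1:
BEGIN TRANSACTION;
UPDATE SourceTableA
SET Col3 = 'ColorBlue'
WHERE Col1 = 'GreatCompany' AND Col2 = 'WonderfullItem';

-- Session 2 (returns 'ColorBlue' even though it is never committed):
SELECT Col3
FROM SourceTableA WITH(NOLOCK)
WHERE Col1 = 'GreatCompany' AND Col2 = 'WonderfullItem';

-- Session 1:
ROLLBACK TRANSACTION;

-- Phantom: the same statement, repeated inside one transaction, returns
-- a row the second time that was not there the first time.

-- Session 2 (REPEATABLE READ allows phantoms but not dirty reads):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT Col1, Col2, Col3
FROM SourceTableA
WHERE Col1 = 'GreatCompany';    -- first read of the set

-- Session 1 (row invented for the example):
INSERT INTO SourceTableA (Col1, Col2, Col3)
VALUES ('GreatCompany', 'AnotherItem', 'ColorGreen');

-- Session 2 (second read of the same set now includes the new row):
SELECT Col1, Col2, Col3
FROM SourceTableA
WHERE Col1 = 'GreatCompany';
COMMIT TRANSACTION;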

Missing committed rows, or encountering them multiple times, due to an allocation-ordered scan over changing data is specific to using read uncommitted isolation (enabled by the NOLOCK hint here). See also "Previously committed rows might be missed if NOLOCK hint is used" by Microsoft's Lubor Kollar.
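One rough way to check whether an extract run was hit by this: because the clustered unique index guarantees that (Col1, Col2) is unique, any key that comes back more than once from a single NOLOCK scan must have been read twice. A sketch (note that this check runs its own plan, which may differ from the extraction query's plan, so a clean result here doesn't prove the extract is safe):

-- Any (Col1, Col2) pair counted more than once in one NOLOCK pass was
-- double-read; the clustered unique index rules out real duplicates.
SELECT Col1, Col2, COUNT(*) AS times_read
FROM ERPdatabase..SourceTableA WITH(NOLOCK)
GROUP BY Col1, Col2
HAVING COUNT(*) > 1;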

As David Browne suggested, you could consider reporting off of a database snapshot, if you can't use an isolation level that produces correct results.
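A minimal sketch of that, where the snapshot name, logical file name, and path are placeholders to adjust for the real ERP database (one file entry per data file in the source):

-- Placeholders: ERPdatabase_Snap, the logical data file name, and the .ss
-- path must match the actual ERP database layout.
CREATE DATABASE ERPdatabase_Snap
ON ( NAME = ERPdatabase_Data,
     FILENAME = 'D:\Snapshots\ERPdatabase_Data.ss' )
AS SNAPSHOT OF ERPdatabase;

-- The extraction then reads a transactionally consistent, static view:
SELECT Col1, Col2, Col3
FROM ERPdatabase_Snap..SourceTableA;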

One can also consider ultra-low-tech solutions. I'm thinking of a backup and restore, and then reporting off the restored database. You can combine full backups with log backups. This is more feasible when the database isn't too large, of course.
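A sketch of that approach; database names, paths, and the logical file names in the MOVE clauses are placeholders (check them with RESTORE FILELISTONLY), and the log backup part assumes the source database uses the full recovery model:

-- Full backup of the source, restored under a different name for reporting.
BACKUP DATABASE ERPdatabase
TO DISK = 'D:\Backup\ERPdatabase_full.bak'
WITH INIT;

RESTORE DATABASE ERPdatabase_Report
FROM DISK = 'D:\Backup\ERPdatabase_full.bak'
WITH MOVE 'ERPdatabase_Data' TO 'D:\Data\ERPdatabase_Report.mdf',
     MOVE 'ERPdatabase_Log'  TO 'D:\Data\ERPdatabase_Report.ldf',
     NORECOVERY;

-- Optionally roll the copy forward with a log backup, then bring it online.
BACKUP LOG ERPdatabase
TO DISK = 'D:\Backup\ERPdatabase_log.trn'
WITH INIT;

RESTORE LOG ERPdatabase_Report
FROM DISK = 'D:\Backup\ERPdatabase_log.trn'
WITH RECOVERY;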