SQL Server – Dirty read or phantom row

concurrency, isolation-level, sql-server

We are seeing dirty reads or phantom rows in our data warehouse. My question is not how to solve it (that is a long story and I am currently working on it); I want to know what exactly happens, so that I better understand the behaviour and use the correct terminology. Are these dirty reads or phantom rows?

Let's say we have SourceTableA in the ERP database with a CLUSTERED UNIQUE index on Col1 and Col2. The DWH extraction process performs the following query:

SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA WITH(NOLOCK);

This is not recommended, because it allows dirty reads among other things, but the ERP vendor forces us to do it. The vendor knows the implications, uses READ UNCOMMITTED for all their own transactions, and requires us to use the NOLOCK hint.
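For reference, the NOLOCK table hint and running the statement under the session-level READ UNCOMMITTED isolation level give the same behaviour for this SELECT. A minimal sketch of the two equivalent forms, using the table from the example:

-- Form 1: per-table hint, as the vendor requires
SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA WITH(NOLOCK);

-- Form 2: session-level isolation, which the vendor's own transactions use
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA;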

Say we have RowA in SourceTableA:

('GreatCompany','WonderfullItem','ColorRed')

Now (I think…) a user session in the ERP system performs the following update on RowA while the previous query is running:

UPDATE SourceTableA
SET Col3 = 'ColorBlue'
WHERE Col1 = 'GreatCompany'
AND Col2 = 'WonderfullItem'

Whether I am correct about that or not, what I see happening is that 2 rows are read for this key by the extraction process. That does not surprise me, since we use NOLOCK (although I am not sure I understand the exact cause, hence this question), but I have 2 questions about it:

  1. Since the row is not moved around in the CLUSTERED index, because the clustered index key does not change, why is it read a second time? My thinking is that the extract process is already past that row, so it should not encounter it again. Or is that not how it works? The only explanation I can think of is that the extract query is not reading the data through the CLUSTERED index, but through a NONCLUSTERED index in which the row IS moved around (a sketch of such an index follows after these two questions). Or what else am I missing? Or is it just a dirty read? Or is it simply 'SQL Server does not guarantee row order unless you use an ORDER BY, DISTINCT, or do not use NOLOCK'? Then I am curious why exactly. 🙂

  2. Regarding dirty reads, I always thought the behaviour I see was a phantom row, but, to my understanding, for a phantom row to occur we actually need 2 statements within the same transaction that read the same set of data. That is not the case here, since it is only 1 statement, so by elimination I think this should be a dirty read? On the other hand, for a dirty read to occur we need an update that is later rolled back by another transaction, and that is not what I assume is happening. Or is that actually the only explanation?
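For illustration, this is the kind of hypothetical NONCLUSTERED index meant in question 1. It does not necessarily exist on the real table; it is only a sketch of an index whose key changes when Col3 is updated, so its index row would move:

-- Hypothetical index, for illustration only. Its key is Col3, so the
-- UPDATE above moves its index row from the 'ColorRed' position to the
-- 'ColorBlue' position. Every nonclustered index row also carries the
-- clustered key columns (Col1, Col2), so this index alone could satisfy
-- the extraction query, and a NOLOCK scan of it could read the moved row
-- twice or miss it entirely, depending on where the scan point is
-- relative to the old and new positions.
CREATE NONCLUSTERED INDEX IX_SourceTableA_Col3
    ON SourceTableA (Col3);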

Best Answer

  1. You are correct in your initial assumptions here. If the row doesn't move (including as a result of a page split), then a scan/seek won't read it twice. But, as you say, perhaps the query is done through an NC index in which the index row was moved, hence the double read.

  2. Regarding terminology, there are some standard phenomena defined in ANSI SQL which are used to describe the isolation levels: isolation level A eliminates phenomena X and Y but not Z. Such phenomena include the dirty read (reading something that has been modified but might later be rolled back) and the phantom (reading some data, then reading it again within the same transaction and finding a new row in that set); a two-session sketch of both follows below. The double read isn't among the phenomena defined in ANSI SQL, perhaps because the standard is too posh to bother about physical implementations (like indexes), and in such a world encountering the same row twice isn't even on the table. I.e., what you describe here is what we in the SQL Server community tend to call a "double read", but don't expect to see the terminology formalized for this phenomenon.

(I realize that I probably over/misuse the word phenomena here, my bad in that case!)
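To make the two terms concrete, here is a minimal two-session sketch using the question's table (assumed to be run in the ERP database; the inserted row in the phantom part is invented for the example):

-- Dirty read: session 2 reads a change that session 1 later rolls back.

-- Session 1:
BEGIN TRANSACTION;
UPDATE SourceTableA
SET Col3 = 'ColorBlue'
WHERE Col1 = 'GreatCompany' AND Col2 = 'WonderfullItem';

-- Session 2 (returns 'ColorBlue' even though it is never committed):
SELECT Col3
FROM SourceTableA WITH(NOLOCK)
WHERE Col1 = 'GreatCompany' AND Col2 = 'WonderfullItem';

-- Session 1:
ROLLBACK TRANSACTION;

-- Phantom: the same statement, repeated inside one transaction, returns
-- a row the second time that was not there the first time.

-- Session 2 (REPEATABLE READ allows phantoms but not dirty reads):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
SELECT Col1, Col2, Col3
FROM SourceTableA
WHERE Col1 = 'GreatCompany';    -- first read of the set

-- Session 1 (row invented for the example):
INSERT INTO SourceTableA (Col1, Col2, Col3)
VALUES ('GreatCompany', 'AnotherItem', 'ColorGreen');

-- Session 2 (second read of the same set now includes the new row):
SELECT Col1, Col2, Col3
FROM SourceTableA
WHERE Col1 = 'GreatCompany';
COMMIT TRANSACTION;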

Missing committed rows, or encountering them multiple times, due to an allocation-ordered scan over changing data is specific to using read uncommitted isolation (enabled by the NOLOCK hint here). See also "Previously committed rows might be missed if NOLOCK hint is used" by Microsoft's Lubor Kollar.
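One rough way to check whether an extract run was hit by this: because the clustered unique index guarantees that (Col1, Col2) is unique, any key that comes back more than once from a single NOLOCK scan must have been read twice. A sketch (note that this check runs its own plan, which may differ from the extraction query's plan, so a clean result here doesn't prove the extract is safe):

-- Any (Col1, Col2) pair counted more than once in one NOLOCK pass was
-- double-read; the clustered unique index rules out real duplicates.
SELECT Col1, Col2, COUNT(*) AS times_read
FROM ERPdatabase..SourceTableA WITH(NOLOCK)
GROUP BY Col1, Col2
HAVING COUNT(*) > 1;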

As David Browne suggested, you could consider reporting off of a database snapshot, if you can't use an isolation level that produces correct results.
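A minimal sketch of that, where the snapshot name, logical file name, and path are placeholders to adjust for the real ERP database (one file entry per data file in the source):

-- Placeholders: ERPdatabase_Snap, the logical data file name, and the .ss
-- path must match the actual ERP database layout.
CREATE DATABASE ERPdatabase_Snap
ON ( NAME = ERPdatabase_Data,
     FILENAME = 'D:\Snapshots\ERPdatabase_Data.ss' )
AS SNAPSHOT OF ERPdatabase;

-- The extraction then reads a transactionally consistent, static view:
SELECT Col1, Col2, Col3
FROM ERPdatabase_Snap..SourceTableA;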

One can also consider ultra-low-tech solutions. I'm thinking of a backup and restore, and then reporting off the restored database. You can combine full backups with log backups. This is more feasible when the database isn't too large, of course.
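A sketch of that approach; database names, paths, and the logical file names in the MOVE clauses are placeholders (check them with RESTORE FILELISTONLY), and the log backup part assumes the source database uses the full recovery model:

-- Full backup of the source, restored under a different name for reporting.
BACKUP DATABASE ERPdatabase
TO DISK = 'D:\Backup\ERPdatabase_full.bak'
WITH INIT;

RESTORE DATABASE ERPdatabase_Report
FROM DISK = 'D:\Backup\ERPdatabase_full.bak'
WITH MOVE 'ERPdatabase_Data' TO 'D:\Data\ERPdatabase_Report.mdf',
     MOVE 'ERPdatabase_Log'  TO 'D:\Data\ERPdatabase_Report.ldf',
     NORECOVERY;

-- Optionally roll the copy forward with a log backup, then bring it online.
BACKUP LOG ERPdatabase
TO DISK = 'D:\Backup\ERPdatabase_log.trn'
WITH INIT;

RESTORE LOG ERPdatabase_Report
FROM DISK = 'D:\Backup\ERPdatabase_log.trn'
WITH RECOVERY;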