Understanding repeatable read isolation level

concurrency, database-theory, isolation-level

Wikipedia defines the following terms:

  • With a lock-based concurrency control DBMS implementation, the serializable isolation level requires read and write locks (acquired on selected data) to be released at the end of the transaction. Range locks must also be acquired when a SELECT query uses a ranged WHERE clause, especially to avoid the phantom reads phenomenon.
  • In the repeatable reads isolation level, a lock-based concurrency control DBMS implementation keeps read and write locks (acquired on selected data) until the end of the transaction. However, range locks are not managed, so phantom reads can occur.

If I interpret this correctly, both the serializable and repeatable read isolation levels correspond to rigorous two-phase locking (in which both shared- and exclusive-mode locks are held until the transaction commits or aborts), under which transactions are serializable in their commit order, according to the book by Korth et al.

However, the book by Korth et al. also says the following about the repeatable read isolation level:

  • Repeatable read allows only committed data to be read and further requires
    that, between two reads of a data item by a transaction, no other transaction
    is allowed to update it. However, the transaction may not be serializable
    with respect to other transactions. For instance, when it is searching for data
    satisfying some conditions, a transaction may find some of the data inserted
    by a committed transaction, but may not find other data inserted by the same
    transaction.

My doubt is: if rigorous 2PL schedules are serializable in the commit order of their transactions, why does the book by Korth et al. say that the repeatable read isolation level may not ensure serializability?

Best Answer

Serializable means that there is some order in which the transactions could be run one after another, without overlapping, that produces the same answers and the same final state of the database as running them in parallel under the serializable isolation level.

Given two transactions, A and B, the only valid states of the system are

  1. All of A followed by all of B, OR
  2. All of B followed by all of A.

That's it. If the system can end up in any other state, the transactions are not serializable. Conversely, if we can show that two transactions running in parallel produce an outcome matching neither of these states, then that execution was not serializable.

Think about a table with 4 rows, ID values 2, 4, 6 and 8. There are two transactions A and B. A counts the number of rows. B inserts two rows, ID values 3 and 7.

If they run A->B, then A returns 4. If they run B->A, then A returns 6. Those are the only possible answers if we are to guarantee serializability.
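To make the two valid outcomes concrete, here is a small Python sketch (a toy model, not any real DBMS) that runs both serial orders and records what A's count would be in each:

```python
# Toy model of the two serial schedules: transactions A (count rows)
# and B (insert IDs 3 and 7) run one after the other, never overlapping.
def run_serial(order):
    table = {2, 4, 6, 8}          # committed rows at the start
    count = None
    for tx in order:
        if tx == "A":
            count = len(table)    # A counts the rows it can see
        else:
            table.update({3, 7})  # B inserts rows 3 and 7
    return count

print(run_serial("AB"))  # 4: A ran before B's inserts
print(run_serial("BA"))  # 6: A ran after B's inserts
```

Any parallel execution that claims to be serializable must leave A with one of exactly these two counts.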

So A starts under Repeatable Read isolation. It will perform a table scan. A reads row 2 and takes a lock, then reads row 4 and takes a lock.

Now B starts in parallel with A. B tries to insert row 3. Nothing prevents it; A has never read a row with ID 3 to take a lock on it. Then B inserts row 7 and commits, releasing its locks.

Transaction A continues reading. It has just finished with 4 so the next row is 6, then comes 7 (Tx B has committed so its lock on 7 has been released) and finally 8. So A has seen rows 2, 4, 6, 7 & 8 - five rows! This workload is not serializable.
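The interleaving above can be traced with a toy simulation (assumed, simplified lock bookkeeping; not a real DBMS API), in which A locks each row as its scan passes it and B inserts into the unlocked gaps:

```python
table = [2, 4, 6, 8]    # committed rows when A starts its scan
a_locks = set()         # shared locks A holds until it ends
seen_by_a = []          # rows A's scan has returned
cursor = float("-inf")  # the highest ID A's scan has passed

def a_reads_next():
    # A's table scan: lock and return the lowest ID beyond the cursor.
    global cursor
    remaining = [r for r in table if r > cursor]
    if not remaining:
        return None
    row = min(remaining)
    cursor = row
    a_locks.add(row)      # held until A ends (Repeatable Read)
    seen_by_a.append(row)
    return row

a_reads_next()            # A reads and locks 2
a_reads_next()            # A reads and locks 4

# B runs now. Rows 3 and 7 don't exist yet, so no row lock covers them,
# and Repeatable Read takes no range locks: both inserts succeed.
table += [3, 7]           # B inserts and commits, releasing its locks

while a_reads_next() is not None:
    pass                  # A scans on: 6, then 7 (the phantom), then 8

print(seen_by_a)          # [2, 4, 6, 7, 8] -- five rows, not 4 or 6
```

Note that row 3 lands behind A's scan cursor, so A never sees it, while row 7 lands ahead of the cursor and becomes the phantom: A's count of 5 matches neither serial order.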

This is the scenario in the second bullet point in the quote from Korth.

The "problem" is the phantom rows produced by B. They fall inside the range of data to be read by A. But Repeatable Read does not issue range locks, so B is free to do this. Under Serializable isolation, either B's insertion locks would have blocked A from taking its range locks, or A's range locks would have blocked B's insertions, depending on which transaction got there first.
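A minimal sketch of how a range lock changes the outcome (a hypothetical lock-manager check, assuming for simplicity that A's scan range-locks the entire key space it reads):

```python
class RangeLock:
    """A range lock over a key interval, held by a scanning transaction."""
    def __init__(self, low, high):
        self.low, self.high = low, high
    def conflicts(self, key):
        return self.low <= key <= self.high

# Under Serializable isolation, A's table scan takes a range lock covering
# every key it might read -- here the whole ID space, for simplicity.
a_range = RangeLock(float("-inf"), float("inf"))

def b_insert(key, held_range_locks):
    # B's insert must first check range locks held by other transactions.
    if any(lock.conflicts(key) for lock in held_range_locks):
        return "blocked"      # B waits until A commits or aborts
    return "inserted"

print(b_insert(3, [a_range]))  # blocked
print(b_insert(7, [a_range]))  # blocked
```

With B blocked until A finishes, the only possible outcomes are again the two serial orders, and A's count is 4 or 6, never 5.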