Understanding repeatable read isolation level

concurrency, database-theory, isolation-level

Wikipedia defines the following terms:

  • With a lock-based concurrency control DBMS implementation, the serializable isolation level requires read and write locks (acquired on selected data) to be released at the end of the transaction. Range locks must also be acquired when a SELECT query uses a ranged WHERE clause, especially to avoid the phantom reads phenomenon.
  • In the repeatable reads isolation level, a lock-based concurrency control DBMS implementation keeps read and write locks (acquired on selected data) until the end of the transaction. However, range locks are not managed, so phantom reads can occur.

If I interpret this correctly, both the serializable and repeatable read isolation levels correspond to rigorous two-phase locking (in which both shared- and exclusive-mode locks are held until the transaction commits or aborts), under which transactions are serializable in their commit order, according to the book by Korth et al.

However, the book by Korth et al. also says the following about the repeatable read isolation level:

  • Repeatable read allows only committed data to be read and further requires
    that, between two reads of a data item by a transaction, no other transaction
    is allowed to update it. However, the transaction may not be serializable
    with respect to other transactions. For instance, when it is searching for data
    satisfying some conditions, a transaction may find some of the data inserted
    by a committed transaction, but may not find other data inserted by the same
    transaction.

My doubt is: if rigorous 2PL schedules are serializable in the commit order of their transactions, why does the book by Korth et al. say that the repeatable read isolation level may not ensure serializability?

Best Answer

Serializable means that there is some order in which the transactions could be run one after another, without overlapping, that produces the same answers and the same final state of the database as running them in parallel under the serializable isolation level.

Given two transactions, A and B, the only valid states of the system are

  1. All of A followed by all of B, OR
  2. All of B followed by all of A.

That's it. If the system can end up in any other state, the transactions are not serializable. Conversely, if we can show that two transactions running in parallel produce an outcome matching neither of these states, then that execution was not serializable.

Think about a table with 4 rows, ID values 2, 4, 6 and 8. There are two transactions A and B. A counts the number of rows. B inserts two rows, ID values 3 and 7.

If they run A->B, then A returns 4. If they run B->A, then A returns 6. Those are the only possible answers if we are to guarantee serializability.
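To make the two valid outcomes concrete, here is a small Python sketch (a toy model, not any real DBMS) that runs both serial orders and records what A's count would be in each:

```python
# Toy model of the two serial schedules: transactions A (count rows)
# and B (insert IDs 3 and 7) run one after the other, never overlapping.
def run_serial(order):
    table = {2, 4, 6, 8}          # committed rows at the start
    count = None
    for tx in order:
        if tx == "A":
            count = len(table)    # A counts the rows it can see
        else:
            table.update({3, 7})  # B inserts rows 3 and 7
    return count

print(run_serial("AB"))  # 4: A ran before B's inserts
print(run_serial("BA"))  # 6: A ran after B's inserts
```

Any parallel execution that claims to be serializable must leave A with one of exactly these two counts.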

So A starts under Repeatable Read isolation. It will perform a table scan. A reads row 2 and takes a lock, then reads row 4 and takes a lock.

Now B starts in parallel with A. B tries to insert row 3. Nothing prevents it; A has never read a row with ID 3 to take a lock on it. Then B inserts row 7 and commits, releasing its locks.

Transaction A continues reading. It has just finished with 4 so the next row is 6, then comes 7 (Tx B has committed so its lock on 7 has been released) and finally 8. So A has seen rows 2, 4, 6, 7 & 8 - five rows! This workload is not serializable.
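The interleaving above can be traced with a toy simulation (assumed, simplified lock bookkeeping; not a real DBMS API), in which A locks each row as its scan passes it and B inserts into the unlocked gaps:

```python
table = [2, 4, 6, 8]    # committed rows when A starts its scan
a_locks = set()         # shared locks A holds until it ends
seen_by_a = []          # rows A's scan has returned
cursor = float("-inf")  # the highest ID A's scan has passed

def a_reads_next():
    # A's table scan: lock and return the lowest ID beyond the cursor.
    global cursor
    remaining = [r for r in table if r > cursor]
    if not remaining:
        return None
    row = min(remaining)
    cursor = row
    a_locks.add(row)      # held until A ends (Repeatable Read)
    seen_by_a.append(row)
    return row

a_reads_next()            # A reads and locks 2
a_reads_next()            # A reads and locks 4

# B runs now. Rows 3 and 7 don't exist yet, so no row lock covers them,
# and Repeatable Read takes no range locks: both inserts succeed.
table += [3, 7]           # B inserts and commits, releasing its locks

while a_reads_next() is not None:
    pass                  # A scans on: 6, then 7 (the phantom), then 8

print(seen_by_a)          # [2, 4, 6, 7, 8] -- five rows, not 4 or 6
```

Note that row 3 lands behind A's scan cursor, so A never sees it, while row 7 lands ahead of the cursor and becomes the phantom: A's count of 5 matches neither serial order.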

This is the scenario in the second bullet point in the quote from Korth.

The "problem" is the phantom rows produced by B. They fall inside the range of data to be read by A. But Repeatable Read does not issue range locks, so B is free to do this. Under Serializable isolation, either B's insertion locks would have blocked A from taking its range locks, or A's range locks would have blocked B's insertions, depending on which transaction got there first.
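A minimal sketch of how a range lock changes the outcome (a hypothetical lock-manager check, assuming for simplicity that A's scan range-locks the entire key space it reads):

```python
class RangeLock:
    """A range lock over a key interval, held by a scanning transaction."""
    def __init__(self, low, high):
        self.low, self.high = low, high
    def conflicts(self, key):
        return self.low <= key <= self.high

# Under Serializable isolation, A's table scan takes a range lock covering
# every key it might read -- here the whole ID space, for simplicity.
a_range = RangeLock(float("-inf"), float("inf"))

def b_insert(key, held_range_locks):
    # B's insert must first check range locks held by other transactions.
    if any(lock.conflicts(key) for lock in held_range_locks):
        return "blocked"      # B waits until A commits or aborts
    return "inserted"

print(b_insert(3, [a_range]))  # blocked
print(b_insert(7, [a_range]))  # blocked
```

With B blocked until A finishes, the only possible outcomes are again the two serial orders, and A's count is 4 or 6, never 5.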