Timestamp ordering (T/O) is an optimistic class of concurrency control protocols where the DBMS assumes that transaction conflicts are rare. Instead of requiring transactions to acquire locks before they are allowed to read/write to a database object, ==the DBMS instead uses timestamps to determine the serializability order of transactions.==
Each transaction Ti is assigned a unique fixed timestamp TS(Ti ) that is ==monotonically increasing==. Different schemes assign timestamps at different times during the transaction. Some advanced schemes even assign multiple timestamps per transaction.
If TS(Ti ) < TS(Tj ), then the DBMS must ensure that the execution schedule is equivalent to a serial schedule where Ti appears before Tj .
There are multiple timestamp allocation implementation strategies.
- System Clock: The DBMS can use the system clock as a timestamp, but issues arise with edge cases like daylight savings.
- Logical Counter: Another option is to use a logical counter. However, this has issues with overflow and with maintaining the counter across a distributed system with multiple machines.
- Hybrid: There are also hybrid approaches that use a combination of both methods.
The basic timestamp ordering protocol (BASIC T/O) allows reads and writes on database objects without using locks. Instead, ==every database object X is tagged with timestamp of the last transaction that successfully performed a read (denoted as R-TS(X)) or write (denoted as W-TS(X)) on that object.== The DBMS then checks these timestamps for every operation. ==If a transaction tries to access an object in a way which violates the timestamp ordering, the transaction is aborted and restarted.== The underlying assumption is that violations will be rare and thus these restarts will also be rare.
- W-TS(X) – Write timestamp on X
- R-TS(X) – Read timestamp on X
If ==TS(Ti) < W-TS(X)==, this violates timestamp order of Ti with regard to the writer of X.
- Abort Ti and restart it with a new TS.
Else:
- Allow Ti to read X.
- Update R-TS(X) to ==max(R-TS(X), TS(T
i))== - ==Make a local copy of X== to ensure repeatable reads for T
i.
Tips:
为什么不允许 TS(T
i) < W-TS(X)?因为 DBMS 使用 timestamps 作为串行化的依据,我们假设 Tj> Ti,且写了 X 的事务就是 Tj,那么 Ti在串行化调度中是在 Tj前执行的,Ti当然不可能读到 Tj写的 X。
If ==TS(Ti) < R-TS(X)== or ==TS(Ti) < W-TS(X)==
- Abort and restart T
i.
Else:
-
Allow T
ito write X and ==update W-TS(X)== -
Also ==make a local copy of X== to ensure repeatable reads.
Tips:
- 为什么不允许 TS(T
i) < R-TS(X)?我们假设 Tj> Ti,且读了 X 的事务就是 Tj,那么 Ti在串行化调度中是在 Tj前执行的,Ti写 X 在 Tj读 X 之前,Tj应该读到 Ti写的 X,而不是老的 X。- 为什么不允许 TS(T
i) < W-TS(X)?我们假设 Tj> Ti,且写了 X 的事务就是 Tj,那么 Ti在串行化调度中是在 Tj前执行的,Ti先写了 X,Tj在 Ti之后覆盖写了 X,最后执行的结果是 X 的值应该是 Tj写的值。(因为这里 Ti写 X 的结果最终会被 Tj写 X 的结果覆盖,所以可以优化,见 Thomas Write Rule)
Tips:
对于读操作来说 TS(T
1) < R-TS(A) 并没有问题,所以 T1可以正常读取 A,并且因为 TS(T1) < R-TS(A),所以不需要更新 R-TS(A)。
If ==TS(Ti) < R-TS(X)==:
- Abort and restart T
i.
If ==TS(Ti) < W-TS(X)==:
- Thomas Write Rule: ==Ignore the write== to allow the txn to continue executing without aborting.
- This violates timestamp order of T
i.
Else:
- Allow Ti to write X and update W-TS(X)
Tips:
TS(T
i) < W-TS(X) 表示 X 已经被一个 timestamp 比 Ti大的事务(我们设为 Tj)写了,因为 DBMS 以 timestamp 作为串行化调度的依据,因此最终结果必须跟 Ti先执行,Tj后执行等价,即 X 最终的结果是 Tj写的值,我们忽略 Ti写的 X 最终的结果也与 Ti-> Tj这个调度等价。
-
==The Basic T/O protocol generates a schedule that is conflict serializable if it does not use Thomas Write Rule.==
-
It ==cannot have deadlocks== because no transaction ever waits.
-
However, there is a possibility of ==starvation== for long transactions if short transactions keep causing conflicts.
It also permits schedules that are ==not recoverable==. A schedule is recoverable if transactions commit only after all transactions whose changes they read, commit. Otherwise, the DBMS cannot guarantee that transactions read data that will be restored after recovering from a crash.
- High overhead from copying data to transaction’s workspace and from updating timestamps.
- Long running transactions can get starved. The likelihood that a transaction will read something from a newer transaction increases.
- Suffers from the timestamp allocation bottleneck on highly concurrent systems.
Optimistic concurrency control (OCC) is another optimistic concurrency control protocol which also uses timestamps to validate transactions. ==OCC works best when the number of conflicts is low.== This is when either all of the transactions are ==read-only== or when transactions ==access disjoint subsets of data==. If the database is large and the workload is not skewed, then there is a low probability of conflict, making OCC a good choice.
==In OCC, the DBMS creates a private workspace for each transaction==. ==All modifications of the transaction are applied to this workspace==. Any object read is copied into workspace and any object written is copied to the workspace and modified there. No other transaction can read the changes made by another transaction in its private workspace.
When a transaction commits, the DBMS compares the transaction’s workspace ==write set== to see whether it conflicts with other transactions. If there are no conflicts, the write set is installed into the “global” database.
OCC consists of three phases:
- Read Phase: Here, the DBMS tracks the read/write sets of transactions and ==stores their writes in a private workspace==.
- Validation Phase: When a transaction commits, the DBMS checks whether it conflicts with other transactions.
- Write Phase: If validation succeeds, the DBMS applies the private workspace changes to the database. Otherwise, it aborts and restarts the transaction.
==The DBMS assigns transactions timestamps when they enter the validation phase==. To ensure only serializable schedules are permitted, the DBMS checks Ti against other transactions for RW and WW conflicts and makes sure that all conflicts go one way (from older transactions to younger transactions). The DBMS checks the timestamp ordering of the committing transaction with all other running transactions. ==Transactions that have not yet entered the validation phase are assigned a timestamp of ∞==.
Tips:
transactions timestamp 是在进入 validation phase 之后才分配,之前都是无穷大。
Two methods for this phase:
- Backward Validation
- Forward Validation
If TS(Ti ) < TS(Tj ), then one of the following three conditions must hold:
-
T
icompletes all three phases before Tjbegins -
==T
icompletes before Tjstarts its Write phase==, and ==Tidoes not write to any object read by Tj== . (WriteSet(Ti) ∩ ReadSet(Tj) = Ø)Tips:
注意:T
1的 W(A) 操作是在本地 workspace 执行的,在提交之前其他事务是看不到的,因此尽管 T1W(A) 在 T2R(A) 之前发生,T2读到的 A 仍然是老的值 123。上面这两个例子的 R W 顺序都一样,唯一的区别是 Validate 的先后顺序:
- 第 1 个例子中 T
1先进入 Validate,因此 T1< T2。因为 T2Read 了 A,因此 WriteSet(T1) ∩ ReadSet(T2) 不是空集,不满足上述条件,不能提交。- T
1< T2表示在串行化调度中 T1在 T2之前执行,那么 T2的 R(A) 一定能看到 T1W(A),但实际情况是 T2R(A) 仍然是旧值 123,不符合串行化调度
- T
- 第 2 个例子中 T
2先进入 Validate,因此 T2< T1。尽管 T2Read 了 A,但是 WriteSet(T2) ∩ ReadSet(T1) 是空集,满足上述条件,可以提交。- T
2< T1表示在串行化调度中 T2在 T1之前执行,那么 T2的 R(A) 是看不到 T1的 W(A) 的,显然这个例子是符合这个要求的
- T
- 第 1 个例子中 T
-
==T
icompletes its Read phase before Tjcompletes its Read phase==, and ==Tidoes not write to any object that is either read or written by Tj== .(WriteSet(T
i) ∩ ReadSet(Tj) = Ø and WriteSet(Ti) ∩ WriteSet(Tj) = Ø)
Tips:
- T
1先进入 Validate,因此 T1< T2,虽然 T1写了 A,而且 T2在后面会读 A,但是在 Validate 的时候 T2并没有开始写 A,因此 T1与 T2之间并没有写冲突,可以提交- T
2R(A) 在 T1COMMIT 之后,读到的是 T1写的新值- T
2进入 Validate 的时候 T1已经完成 COMMIT,系统中没有其他活跃事务了,即第 1 中情况,直接通过- 整个调度等价于 T
1-> T2
- High overhead for copying data locally into the transaction’s private workspace.
- Validation/Write phase bottlenecks.
- Aborts are potentially more wasteful than in other protocols because they only occur after a transaction has already executed.
- Suffers from timestamp allocation bottleneck.
Serializability is useful because it allows programmers to ignore concurrency issues but enforcing it may allow too little parallelism and limit performance. We may want to use a weaker level of consistency to improve scalability.
Isolation levels control the extent that a transaction is exposed to the actions of other concurrent transactions.
Anomalies:
- Dirty Read: Reading uncommitted data.
- Unrepeatable Reads: Redoing a read results in a different result.
- Phantom Reads: Insertion or deletions result in different results for the same range scan queries.
How did this happen? ==Because T1 locked only existing records and not ones under way!== ==Conflict serializability on reads and writes of individual items guarantees serializability only if the set of objects is fixed==.
The DBMS tracks the WHERE clause for all queries that the txn executes. Have to retain the scan set for every range query in a txn.
Upon commit, re-execute just the scan portion of each query and check whether it generates the same result.
Example: Run the scan for an UPDATE query but do not modify matching tuples.
Proposed locking scheme from System R. Shared lock on the predicate in a WHERE clause of a SELECT query.
Exclusive lock on the predicate in a WHERE clause of any UPDATE, INSERT, or DELETE query.
Never implemented in any system except for HyPer (precision locking).
If there is an index on the status attribute then the txn can lock index page containing the data with status='lit'.
If there are no records with status='lit', the txn must lock the index page where such a data entry would be, if it existed.
If there is no suitable index, then the txn must obtain:
- A lock on every page in the table to prevent a record’s status='lit' from being changed to lit.
- The lock for the table itself to prevent records with status='lit' from being added or deleted.
Tips: MySQL 的 Next-Key-Lock
The isolation levels defined as part of SQL-92 standard only focused on anomalies that can occur in a 2PL-based DBMS. There are two additional isolation levels:
- CURSOR STABILITY
- Between repeatable reads and read committed
- Prevents Lost Update Anomaly.
- Default isolation level in IBM DB2.
- ==SNAPSHOT ISOLATION==
- Guarantees that all reads made in a transaction see a consistent snapshot of the database that existed at the time the transaction started.
- ==A transaction will commit only if its writes do not conflict with any concurrent updates made since that snapshot.==
- Susceptible to ==write skew== anomaly.