A database transaction has to preserve ACID
- Atomicity
- Consistency
- Isolation
- Durability
Implementing transaction required components
- transaction manager
- coordinates, schedules, tracks transactions and their individual steps
- log manager
- guards access to the resources and prevents concurrent access violating data integrity
- page cache
- serves as an intermediary between persistent storage and the rest of the storage engine
- log manager
- holds a history of operations
Buffer Management
page cache (buffer pool)
- keeps cached page contents in memory
- allows modification to be buffered
- when a requested page isn’t present in memory, it is paged in by the page cache
- if an already cached page is requested, its cached version is returned
- if there’s not enough space available, some other page is evicted and its contents are flushed to disk
Caching Semantics
- many database using
O_DIRECT
flag to bypass the kernal page cache - as an application specific equivalent of the kernal page cache
- accesses the block device directly
- decouples logical and physical write operations
- if the page is not pinned or referenced, it can be evicted right away
- dirty pages have to be flushed before they are evicted
- PostgreSQL has a background flush writer cycles through the dirty pages that are likely to be evicted
- to make sure all the changes are persisted (durability), flushes are coordinated by the checkpoint process with WAL and page cache
Trade-off objectives:
- Postpone flushed to reduce the number of disk accesses
- Preemptively flush pages to allow quick eviction
- Pick pages for eviction and flush in the optimal order
- Keep cache size within its memory bounds
- Avoid losing data as it is not persisted to the primary storage
Locking Pages in Cache
- the higher levels of B-Tree nodes could be pinned in cache permanently,
- since it just a small fraction of the tree
- saving in every lookup path
- disk access only required in lower levels nodes
Prefetching & Immediate Eviction
- Page cache also allows the storage engine to have fine-grained control over prefetching and eviction
- Prefetching
- leaf nodes traversed in a range scan
- Immediate Eviction
- maintenance process, unlikely to be used for the in-flight queries
Page Replacement
- FIFO (first in, first out)
- impractical, ex. higher level of page nodes
- LRU (least-recently used)
- 2Q LRU
- LRU-K keeping track of the last K accesses
- CLOCK
- as an approximated, compact version of LRU
- Linux uses a variant of CLOCK
- access bit
- set to
1
, whenever the page is accessed - around the circular buffer
- if access bit is
1
, but the page is unreferenced, then set to0
- if access bit is already
0
, then the page becomes a candidate and is scheduled for eviction
- if access bit is
- set to
- advantage
- use compare-and-swap (CAS) operations, and do not require locking
- LFU (least-frequency used)
- frequency histogram
- TinyLFU
- three queues
- Admission: newly added elements with LRU policy
- Probation: holding elements most likely to get evicted
- Protected: holding elements that are to stay for longer
- three queues
Recovery
Write-Ahead Log
WAL (write-ahead log), commit log
- an append-only auxiliary disk-resident structure
- used for crash and transaction recovery
- functionalities
- allow page cache to buffer updates while ensuring durability
- persist all operations on disk until the cache copies of pages are synchronized
- allow lost in-memory changes to be reconstructed
LSN (log sequence number)
- a unique, monotonically increasing number
- with an internal counter or a timestamp
- as the order index of the operation records in the WAL
Checkpoint
- sync checkpoint
- force all dirty pages to be flushed on disk
- fully synchronizes the primary storage structure
- impractical, require pausing all operations
- fuzzy checkpoint
last_checkpoint
pointer in log header, (with LSN of thebegin_checkpoint
record)begin_checkpoint
log record- info about the dirty pages
- transaction table
end_checkpoint
log record, until all the specified dirty pages are flushed
Operation Versus Data Log
- physical log
- before-image <=> after-image
- store complete page stat or byte-wise changes
- logical log
- redo <=> undo operation
- store operation that to be performed against the current state
Steal and Force Policies
Steal | No-steal |
---|---|
allow flushing uncommitted | only flushing committed |
could use only redo entries in recovery |
Force | No-force |
---|---|
only committing flushed | allow committing unflushed |
no need additional work on recovery | |
take longer to commit due to necessary I/O |
ARIES (Algorithm for Recovery and Isolation Exploiting Sematics)
- ARIES is a steal/no-force recovery algorithm
- uses
- LSNs for identifying log records
- dirty page table to track page modified
- physical redo to improve performance during recovery
- logical undo to improve concurrency during normal operations
- fuzzy checkpointing
- three phases in recovery proceeds
- analysis phase: identify dirty pages, identify the starting point for the redo phase
- redo phase: repeat the history up to the point of a crash
- undo phase: roll back all incomplete transactions and restore the database to the last consistent state
Concurrency Control
- Optimistic Concurrency Control (OCC)
- check conflict “before” the commit
- Multiversion Concurrency Control (MVCC)
- allowing multiple timestamped versions of the record to be present
- Pessimistic Concurrency Control (PCC)
- manage and grant access to shared resources
Transaction Isolation
- Serializability
- a schedule is a list of operations required to execute a set of transactions
- to be serial for a schedule is when transactions are executed completely independently without any interleaving
- a schedule is serializable, if it’s equivalent to some complete serial schedule
- Read & Write Anomalies
- read anomalies
- dirty read
- a transaction can read uncommitted changes from other transactions
- nonrepeatable read (fuzzy read)
- a transaction queries the same row twice and gets different results
- phantom read
- a transaction queries the same set of rows twice and gets different results
- dirty read
- write anomalies
- lost update
- two transactions update the same record without awareness about each other’s existence
- dirty write
- transaction results are based on the values that have never been committed
- write skew
- the combination of individual transactions does not satisfy the required invariant
- lost update
- read anomalies
- Isolation Levels
Dirty | Non-Repeatable | Phantom | |
---|---|---|---|
Read Uncommitted | Allowed | Allowed | Allowed |
Read Committed | - | Allowed | Allowed |
Repeatable Read | - | - | Allowed |
Serializable | - | - | - |
Lost Update | Dirty | Write Skew | |
---|---|---|---|
Snapshot Isolation | - | - | Allowed |
Optimistic Concurrency Control
- Transaction execution phases
- Read Phase
- Identify the read set & write set
- Validation Phase
- check serializability
- if the read set out-of-date
- if the write set will overwrite the other transactions committing during the read phase
- if conflict found, restart from the read phase
- else, start commit and write phase
- check serializability
- Write Phase
- commit the write set from private context to the database state
- Read Phase
- critical section: Validation Phase & Write Phase
- efficient if the validations usually succeeds and no need to retry
Multiversion Concurrency Control
- allowing multiple record versions
- using monotonically incremented transaction IDs or timestamps
- distinguishes between committed & uncommitted versions
- last committed version: current
- to keep at most one uncommitted value at a time
- major use cases for implementing snapshot isolation
Pessimistic Concurrency Control
Lock-Free Scheme
- timestamp ordering
max_read_timestamp
andmax_write_timestamp
- if read operations attempt to read value, which timestamp lower than
max_write_timestamp
, then abort - if write operations attempt to write value which timestamp lower than
max_read_timestamp
, then abort - if write operations attempt to write value which timestamp lower than
max_write_timestamp
, then just ignore the outdated written values
Lock-Based Scheme
- two phase locking (2PL)
- growing phase (locks acquiring)
- shrinking phase (locks releasing)
- deadlocks
- timeout and abort
- conservative 2PL
- requires to acquire all the locks before any execution operations
- significant limit concurrency
- wait-for graph
- maintained by the transaction manager
- applying either one of the restrictions
- wait-die: a transaction can be blocked only by a transaction with lower priority
- wounds-wait: a transaction can be blocked only by a transaction with higher priority
Locks & Latches
Locks | Latches |
---|---|
Guard the logical data integrity | Guard the physical data integrity |
Guard a specific key or key range | Guard a page node in the storage structure |
Heavyweight | Lightweight |
Lock-free concurrency still need latches |
- reader-writer lock (RW Lock)
Reader | Writer | |
---|---|---|
Reader | Shared | Exclusive |
Writer | Exclusive | Exclusive |
- manage access to pages
- busy-wait
- queueing, compare-and-swap (CAS)
- Latch Crabbing
- read path: the parent node’s latch can be release, as soon as the child node’s latch is acquired
- insert path: the parent node’s latch can be release, if the child node is not full
- delete path: the parent node’s latch can be release, if the child node holds enough elements
- Latch Upgrading
- acquisition of shared locks along the search path, then upgrading them to exclusive locks when necessary
- always latch root to avoid the root bottleneck (how?)
B^link-Trees
- B-Trees with *high keys and sibling link pointers
- allow the state of half-split
- referenced by sibling pointer, not child pointer
- if the search key larger than the high key, then follow the sibling link
- therefore,
- do not have to hold the parent lock when descending to the child level, even if child node splitting
- reduce the number of locks held
- allows reads concurrent to tree structural change, and prevents deadlocks
- slightly less efficient when encounter splitting (relative rare case)