Database-Internals-Ch-9-Failure-Detection

  • a prerequisite for consensus, atomic broadcast, and other distributed algorithms
  • link failure and process failure
  • essential properties
    • completeness
      • every nonfaulty member eventually notices the failed process
      • the algorithm eventually reaches its final result
    • efficiency
      • how fast a failed process can be identified
    • accuracy
      • how precisely process failures are detected (avoiding false positives)

Heartbeats & Pings

  • Fixed Time Interval
    • simple, need to carefully select frequency and timeout
  • Timeout-Free Failure Detector
    • under asynchronous assumption
  • Outsourced Heartbeats
    • ask neighbor nodes to send heartbeats on one's behalf, as a fallback path
  • Phi-Accural Failure Detector
    • uses a sliding window of heartbeat inter-arrival times to estimate a continuous suspicion level (phi) instead of a binary verdict (see the sketch after this list)
  • Gossip and Failure Detection
    • each node propagates a vector of heartbeat counters to random neighbors and merges the updates it receives
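
A minimal single-node sketch of the phi-accrual idea above (class and parameter names are made up for illustration; the real detector uses a more careful arrival-time distribution estimate):

    import math
    import statistics
    from collections import deque

    class PhiAccrualDetector:
        """Sketch: derive a suspicion level (phi) from heartbeat inter-arrival times."""
        def __init__(self, window_size=100):
            self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
            self.last_heartbeat = None

        def heartbeat(self, now):
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
            self.last_heartbeat = now

        def phi(self, now):
            if len(self.intervals) < 2:
                return 0.0
            mean = statistics.mean(self.intervals)
            stdev = statistics.stdev(self.intervals) or 1e-9  # avoid a zero spread
            elapsed = now - self.last_heartbeat
            # probability that the next heartbeat is still on its way after `elapsed`
            p_later = max(1.0 - statistics.NormalDist(mean, stdev).cdf(elapsed), 1e-12)
            return -math.log10(p_later)  # higher phi means higher suspicion

Instead of an alive/dead verdict, callers compare phi against a configurable threshold, so the accuracy versus detection-time trade-off becomes a tuning knob.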

Database-Internals-Ch-7-Log-Structured-Storage

  • B-Tree: in-place update, optimized for read performance
  • LSM Tree: append-only, optimized for write performance
  • RUM Conjecture
    • trade-off between Read, Update, & Memory

LSM Trees

  • immutable on-disk storage structure
  • introduced by Patrick O’Neil, Edward Cheng, et al. in 1996
  • sequential writes, prevents fragmentation, achieves higher storage density
  • Amplification sources
    • read amplification: from needing to consult multiple tables to resolve a key
    • write amplification: from rewriting data repeatedly across compaction runs
    • space amplification: from duplicate records living in multiple tables

LSM Tree Structure

  • memtable
    • mutable in-memory
    • serves read & write
    • triggered periodically or size threshold, flush to disk
    • recoverable with WAL
  • Two-component LSM Tree
    • disk-resident tree & memory-resident tree
    • drawback: frequent merges, since every memtable flush triggers one
  • Multicomponent LSM Trees
    • multiple disk-resident tables (components)
    • periodic compaction for several tables
  • life cycles
    • current memtable: receives writes & serves reads
    • flushing memtable: still available for read, but not writable
    • on-disk flushing target: not readable, since still incomplete
    • flushed tables: available for read as soon as the flushed memtable is discarded
    • compacting tables: currently merging disk-resident tables
    • compacted tables: created from flushed or other compacted tables
  • Deletion
    • just removing records from the memtable is not enough: older versions on disk would resurrect
    • done by delete entry / tombstone / dormant certificate
    • range delete: predicate deletes / range tombstones, ex. Cassandra
  • Lookups
    • read from each component, then merge & reconcile the contents
  • Merge-Iteration
    • given a cursor or iterator per table to navigate through file contents
    • use a multiway merge-sort backed by a priority queue / min-heap (see the sketch after this list)
  • Reconciliation
    • reconciliation & conflict resolution of the data records associated with the same key
    • with records holding metadata, ex. timestamps
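
A rough sketch of the merge-iteration and reconciliation steps above, assuming each source yields (key, timestamp, value) records already sorted by key:

    import heapq

    def merge_iterate(sources):
        """K-way merge over sorted iterators (memtable + SSTables) backed by a min-heap,
        reconciling records with the same key by keeping the latest timestamp."""
        merged = heapq.merge(*sources, key=lambda record: record[0])
        current_key, best = None, None
        for key, ts, value in merged:
            if key != current_key:
                if best is not None:
                    yield current_key, best[1]
                current_key, best = key, (ts, value)
            elif ts > best[0]:          # reconciliation: the newer record wins
                best = (ts, value)
        if best is not None:
            yield current_key, best[1]

    memtable = [("a", 5, "a-new"), ("c", 1, "c")]
    sstable = [("a", 3, "a-old"), ("b", 2, "b")]
    print(list(merge_iterate([memtable, sstable])))  # [('a', 'a-new'), ('b', 'b'), ('c', 'c')]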

Compaction

  • Maintenance
    • has a memory usage upper bound, since it only holds iterator heads
    • multiple compactions can be executed (nonintersecting)
    • not only for merge but also allow repartition
    • preserve tombstones during compaction; only remove them once no associated records remain, ex. RocksDB’s bottommost level, Cassandra’s GC grace period
  • Leveled Compaction
    • one of the compaction strategies used by RocksDB
    • Level 0: tables flushed from the memtable; their key ranges may overlap; when the size threshold is reached, they are merged and partitioned into level 1
    • Level 1: partitioned into disjoint key ranges; when the size threshold is reached, tables are merged and partitioned into level 2
    • Level k: the size threshold grows exponentially per level; the bottommost level holds the oldest data
  • Size-tiered Compaction
    • the level of a table is decided by its size
    • similarly sized small tables are merged into larger ones, which are promoted to higher levels (see the sketch after this list)
  • Time-Window Compaction
    • if records have a TTL (time-to-live) set, ex. in Cassandra, tables whose records have all expired can be dropped entirely
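
A rough sketch of the size-tiered grouping idea referenced above; the parameter names loosely mirror Cassandra's STCS options, but this is not its actual algorithm:

    def size_tiered_buckets(table_sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
        """Group tables of similar size into buckets; any bucket with at least
        min_threshold members becomes a compaction candidate."""
        buckets = []  # each bucket: [average_size, [member sizes]]
        for size in sorted(table_sizes):
            for bucket in buckets:
                average, members = bucket
                if bucket_low * average <= size <= bucket_high * average:
                    members.append(size)
                    bucket[0] = sum(members) / len(members)
                    break
            else:
                buckets.append([size, [size]])
        return [members for average, members in buckets if len(members) >= min_threshold]

    # four ~64 MB tables end up in one bucket and would be merged into a larger table
    print(size_tiered_buckets([60, 62, 64, 70, 512, 2048]))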

Implementation Details

Sorted String Tables
  • SSTables
  • consist of index files and data files
  • index file for lookup, ex. B-Trees or hash tables
  • data records, concatenation of key-value, are ordered by key, so allows the sequential reading
  • immutable disk-resident contents
  • SSTable-Attached Secondary Indexes (SASI), implemented in Cassandra: secondary index files are created and maintained along with the SSTable primary key index
Bloom Filter
  • conceived by Burton Howard Bloom in 1970
  • uses a large bits array and multiple hash functions apply on keys
  • the resulting bit positions are OR-ed into the filter; a lookup reports that the key might be in the set, or definitely is not (see the sketch below)
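
A small illustrative Bloom filter in Python (sizes and hash choice are arbitrary; real engines size the filter from the expected key count and a target false-positive rate):

    import hashlib

    class BloomFilter:
        """k hash functions set/test bits in a bit array; false positives are
        possible, false negatives are not."""
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, key: bytes):
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(bytes([seed]) + key).digest()
                yield int.from_bytes(digest[:8], "little") % self.num_bits

        def add(self, key: bytes):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)      # OR the bit into the filter

        def might_contain(self, key: bytes) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(b"user:42")
    assert bf.might_contain(b"user:42")     # added keys always test positive
    print(bf.might_contain(b"user:999"))    # usually False; True would be a false positive
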
Skiplist
  • probabilistic complexity guarantees close to those of search trees
  • node heights are assigned randomly
  • each node links to the next node at every level up to its own height
  • fully_linked flag & compare-and-swap for concurrency
  • ex. Cassandra’s secondary index memtable, some of WiredTiger’s in-memory operations
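
A single-threaded skiplist sketch of the structure above (no fully_linked flag or CAS, and duplicate keys are not handled):

    import random

    class SkipNode:
        def __init__(self, key, value, height):
            self.key, self.value = key, value
            self.next = [None] * height           # one forward pointer per level

    class SkipList:
        MAX_HEIGHT = 16

        def __init__(self):
            self.head = SkipNode(None, None, self.MAX_HEIGHT)

        @staticmethod
        def _random_height():
            height = 1
            while height < SkipList.MAX_HEIGHT and random.random() < 0.5:
                height += 1                       # each extra level with probability 1/2
            return height

        def _predecessors(self, key):
            preds, node = [None] * self.MAX_HEIGHT, self.head
            for level in range(self.MAX_HEIGHT - 1, -1, -1):
                while node.next[level] is not None and node.next[level].key < key:
                    node = node.next[level]
                preds[level] = node               # last node before `key` on this level
            return preds

        def insert(self, key, value):
            preds = self._predecessors(key)
            node = SkipNode(key, value, self._random_height())
            for level in range(len(node.next)):   # splice in at every level up to its height
                node.next[level] = preds[level].next[level]
                preds[level].next[level] = node

        def get(self, key):
            candidate = self._predecessors(key)[0].next[0]
            return candidate.value if candidate is not None and candidate.key == key else None
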
Compression & Disk Access
  • compressed pages do not align with page boundaries
  • so an indirection layer is needed: an offset table storing the offsets and sizes of compressed pages

Unordered LSM Storage

  • Bitcask
    • one of the storage engines used in Riak
    • no memtable; records are appended to the log file directly, avoiding an extra write
    • keydir: an in-memory hash table pointing from each key to the latest record in the log file (see the sketch after this list)
    • garbage collection happens during compaction
    • does not allow range scans
  • WiscKey
    • unsorted data records live in append-only vLogs
    • sorted keys live in an LSM tree
    • this allows range scans
    • when scanning a range, internal SSD parallelism is used to prefetch blocks, reducing random I/O cost
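
A Bitcask-style sketch of the keydir idea mentioned above (hypothetical and heavily simplified: no CRCs, hint files, startup scan, or compaction):

    import os
    import struct

    class KeydirLog:
        """Append records to a log file; an in-memory keydir maps each key to the
        offset of its latest record."""
        HEADER = struct.Struct("<II")  # key_size, value_size as little-endian uint32

        def __init__(self, path):
            self.file = open(path, "a+b")
            self.keydir = {}                      # key -> offset of the latest record

        def put(self, key: bytes, value: bytes):
            self.file.seek(0, os.SEEK_END)
            offset = self.file.tell()
            self.file.write(self.HEADER.pack(len(key), len(value)) + key + value)
            self.file.flush()
            self.keydir[key] = offset             # newer writes simply shadow older offsets

        def get(self, key: bytes):
            if key not in self.keydir:
                return None
            self.file.seek(self.keydir[key])
            key_size, value_size = self.HEADER.unpack(self.file.read(self.HEADER.size))
            return self.file.read(key_size + value_size)[key_size:]

    store = KeydirLog("/tmp/keydir.log")          # path is a placeholder
    store.put(b"k", b"v1")
    store.put(b"k", b"v2")
    print(store.get(b"k"))                        # b'v2': the latest record wins

Compaction would rewrite only the offsets currently referenced by the keydir, garbage-collecting the shadowed records.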

Concurrency in LSM Trees

  • Cassandra uses operation order barriers
  • Memtable switch: after this, all writes go only to the new memtable, while the old one is still available for read
  • Flush finalization: replace the old memtable with a flushed disk-resident table in the table view
  • Write-ahead log truncation: discard a log segment holding records associated with a flushed memtable

Log Stacking

  • SSDs also use log-structured storage to deal with small random writes
  • stacking multiple log-structured systems can run into several problems
    • write amplification
    • fragmentation
    • poor performance
  • Flash Translation Layer
    • flash translation layer (FTL) is used by SSD
    • translate logical page addresses to their physical locations
    • keep track of pages status (live, discarded, empty)
    • garbage collection
      • relocate live pages
      • erase by block (group of pages, 64 to 512 pages)
    • wear leveling distributes the load evenly across the medium, to extend device lifetime
  • Filesystem Logging
    • problems it causes
      • redundant logging
      • different GC patterns
      • misaligned segment writes
      • interleaving of data records and log records due to multiple write streams

LLAMA & Mindful Stacking

  • latch-free, log-structured, access-method aware (LLAMA)
  • allows Bw-Trees to place delta nodes of the same chain in contiguous physical locations
  • more efficient GC, less fragmentation
  • reduce read latency
  • Open-Channel SSDs
    • expose internal control of wear-leveling, garbage collection, data placement, & scheduling
    • skipping the flash translation layer allows single-layer GC and minimizes write amplification
    • Software Defined Flash (SDF): read in page, write in block
    • LOCS (LSM Tree-based KV Store on Open-Channel SSD), 2013
    • LightNVM, implemented in the Linux kernel, 2017

Database-Internals-Ch-6-BTrees-Variants

  • Abstracting node updates allows each representation to have its own life cycle:
    • on-disk pages
    • in-memory raw binary cached versions
    • in-memory language-native representations (materialized)
  • Three problems for in-place update B-Tree implementation
    • write amplification
      • updating a disk-resident page copy on every update
    • space amplification
      • reserving unused buffer space in nodes for future insertions and updates, which is also carried along in transfers
    • complexity of concurrency
      • solving concurrency and dealing with latches

Copy-on-Write B-Trees

  • content updating
    • make a copy on the modified nodes
    • on completion, switch the topmost pointer
  • pros
    • tree is immutable
    • readers require no sync, no lock
    • inherent structure of MVCC (multiversioned)
  • cons
    • requires extra page copying
    • requires more space (but not much, since the tree depth is shallow)
  • ex. Lightning Memory-Mapped Database (LMDB)
    • k-v store used by the OpenLDAP
    • single-level data store
    • direct memory map, no application-level cache, no materialization

Lazy B-Trees

  • buffer updates and propagate them with a delay

WiredTiger

  • MongoDB’s default storage engine
  • data structures:
    • per node, a clean page consists of just an index, initially constructed from the on-disk page image
    • per node, an update buffer, implemented using skiplists: complexity similar to search trees, with a better concurrency profile
  • operations
    • on update, changes are saved into the node’s update buffer list
    • on read, the update buffers are merged with the on-disk page content
    • on flush, the update buffer contents are reconciled and persisted on disk
      • split or merge according to the reconciled page size
  • pros
    • page updates and structural modifications are performed by the background thread

Lazy-Adaptive Tree

  • update buffers attach to subtrees
  • when a buffer fills up, its contents are propagated to buffers at lower tree levels
  • when the propagation reaches the leaf level and that buffer fills up, the contents are flushed to disk and any required tree structure changes are applied at once

FD-Trees

  • small mutable head tree & multiple immutable sorted runs
    • limits the surface area of random write I/O to the head tree
    • the head tree is a small B-Tree buffering the updates
    • once the head tree gets full, its contents are transferred to an immutable run
    • propagating records from upper level run to lower level
  • Fractional Cascading
    • helps to reduce the cost of locating an item in the lower cascading levels
    • bridges between neighboring levels
    • pull every N-th item from the lower level
  • Logarithmic Runs
    • each run is larger than the previous level’s by a factor of k
    • records are propagated downward when a run gets full

Bw-Trees

  • Buzzword-Tree; tries to resolve all three problems (write amplification, space amplification, concurrency complexity) at once
  • Update Chains
    • each logical B-Tree node consists of a linked list headed by the latest update: delta -> delta -> … -> base
    • base node: cache of the disk copy page contents
    • delta nodes: hold all the modifications; can represent inserts, updates, or deletes
    • logical rather than physical
      • node sizes are unlikely to be page aligned
      • no need to pre-allocate space
  • Concurrency
    • each logical node for B-Tree has a logical identifier
    • maintained with an in-memory mapping table
    • the mapping table contains virtual links to the update chain‘s head (latest delta node)
    • updated with Compare-and-Swap (lock-free)
  • Structural Modification Operations
    • SMO
    • Split
      • append a special split delta node to the splitting node
      • split delta node with the midpoint separator key & the pointer to the new logical sibling node
      • similar to a B^link-Tree’s half-split
      • update the parent with the new child node
    • Merge
      • append a special remove delta node to the right sibling, indicating the start of merge SMO
      • append a special merge delta node to the left sibling, pointing to the right sibling, so as to logically merge the contents
      • update the parent to remove the link to the right sibling
    • Prevent concurrent SMO
      • an abort delta node is installed on the parent, like a write lock
      • remove when SMO completes
  • Consolidation & GC
    • once the delta chain length reaches a threshold, consolidate the chain into a new node
    • then write to disk and update the mapping table
    • but must wait for all readers of the old chain to complete before reclaiming memory: epoch-based reclamation
  • ex. Sled, OpenBw-Tree (by CMU Database Group)

Cache-Oblivious B-Trees

  • automatically optimizes parameters (ex. block size, page size, cache line size) for any two adjacent levels of the memory hierarchy
  • van Emde Boas Layout
    • split the B-Tree at the middle level of its height, then recurse into the subtrees
    • results in subtrees of size sqrt(N)
    • any recursive subtree is stored in a contiguous block of memory
    • packed array with parameter density threshold to allow gaps for insertion or update
    • array grow or shrink when become too dense or sparse

Database-Internals-Ch-5-Transaction-Processing-and-Recovery

A database transaction has to preserve ACID

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Components required to implement transactions

  • transaction manager
    • coordinates, schedules, tracks transactions and their individual steps
  • lock manager
    • guards access to the resources and prevents concurrent accesses from violating data integrity
  • page cache
    • serves as an intermediary between persistent storage and the rest of the storage engine
  • log manager
    • holds a history of operations

Buffer Management

page cache (buffer pool)

  • keeps cached page contents in memory
  • allows modification to be buffered
  • when a requested page isn’t present in memory, it is paged in by the page cache
  • if an already cached page is requested, its cached version is returned
  • if there’s not enough space available, some other page is evicted and its contents are flushed to disk

Caching Semantics

  • many databases use the O_DIRECT flag to bypass the kernel page cache
  • the page cache acts as an application-specific equivalent of the kernel page cache
  • accesses the block device directly
  • decouples logical and physical write operations
  • if the page is not pinned or referenced, it can be evicted right away
  • dirty pages have to be flushed before they are evicted
  • PostgreSQL has a background flush writer that cycles through the dirty pages that are likely to be evicted
  • to make sure all the changes are persisted (durability), flushes are coordinated by the checkpoint process with WAL and page cache

Trade-off objectives:

  • Postpone flushes to reduce the number of disk accesses
  • Preemptively flush pages to allow quick eviction
  • Pick pages for eviction and flush in the optimal order
  • Keep cache size within its memory bounds
  • Avoid losing data that has not yet been persisted to primary storage

Locking Pages in Cache

  • the higher levels of B-Tree nodes can be pinned in the cache permanently
    • they are only a small fraction of the tree
    • this saves disk accesses on every lookup path
    • disk access is then only required for the lower-level nodes

Prefetching & Immediate Eviction

  • Page cache also allows the storage engine to have fine-grained control over prefetching and eviction
  • Prefetching
    • leaf nodes traversed in a range scan
  • Immediate Eviction
    • maintenance process, unlikely to be used for the in-flight queries

Page Replacement

  • FIFO (first in, first out)
    • impractical, since e.g. frequently used higher-level node pages would be evicted first
  • LRU (least-recently used)
    • 2Q LRU
    • LRU-K keeping track of the last K accesses
  • CLOCK
    • as an approximated, compact version of LRU
    • Linux uses a variant of CLOCK
    • access bit
      • set to 1, whenever the page is accessed
      • around the circular buffer
        • if access bit is 1, but the page is unreferenced, then set to 0
        • if access bit is already 0, then the page becomes a candidate and is scheduled for eviction
    • advantage
      • can be implemented with compare-and-swap (CAS) operations and does not require locking (see the single-threaded sketch after this list)
  • LFU (least-frequently used)
    • frequency histogram
  • TinyLFU
    • three queues
      • Admission: newly added elements with LRU policy
      • Probation: holding elements most likely to get evicted
      • Protected: holding elements that are to stay for longer
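
A single-threaded sketch of the CLOCK scheme described above (real implementations operate on the access bits with CAS; the pinned handling here is illustrative):

    class ClockCache:
        """Pages sit in a circular buffer with an access bit; the clock hand clears
        set bits and evicts the first unpinned page whose bit is already 0."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.pages = []                       # entries: [page_id, access_bit, pinned]
            self.hand = 0

        def access(self, page_id):
            for entry in self.pages:
                if entry[0] == page_id:
                    entry[1] = 1                  # set the access bit on every hit
                    return
            if len(self.pages) >= self.capacity:
                self._evict()                     # assumes not every page is pinned
            self.pages.append([page_id, 1, False])

        def _evict(self):
            while True:
                entry = self.pages[self.hand]
                if entry[2]:                      # pinned/referenced pages are skipped
                    self.hand = (self.hand + 1) % len(self.pages)
                elif entry[1] == 1:
                    entry[1] = 0                  # second chance: clear the bit, move on
                    self.hand = (self.hand + 1) % len(self.pages)
                else:
                    self.pages.pop(self.hand)     # bit already 0: evict this page
                    self.hand = self.hand % len(self.pages) if self.pages else 0
                    return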

Recovery

Write-Ahead Log

WAL (write-ahead log), commit log

  • an append-only auxiliary disk-resident structure
  • used for crash and transaction recovery
  • functionalities
    • allow page cache to buffer updates while ensuring durability
    • persist all operations on disk until the cache copies of pages are synchronized
    • allow lost in-memory changes to be reconstructed

LSN (log sequence number)

  • a unique, monotonically increasing number
  • with an internal counter or a timestamp
  • as the order index of the operation records in the WAL

Checkpoint

  • sync checkpoint
    • force all dirty pages to be flushed on disk
    • fully synchronizes the primary storage structure
    • impractical, require pausing all operations
  • fuzzy checkpoint
    • last_checkpoint pointer in log header, (with LSN of the begin_checkpoint record)
    • begin_checkpoint log record
    • info about the dirty pages
    • transaction table
    • end_checkpoint log record; the checkpoint is not considered complete until all the specified dirty pages are flushed

Operation Versus Data Log

  • physical log
    • before-image <=> after-image
    • stores the complete page state or byte-wise changes
  • logical log
    • redo <=> undo operation
    • store operation that to be performed against the current state

Steal and Force Policies

  • Steal: allows flushing pages modified by uncommitted transactions
  • No-steal: only pages of committed transactions are flushed; recovery can use only redo entries
  • Force: a transaction commits only after all its modified pages are flushed; no additional redo work is needed during recovery, but commits take longer due to the necessary I/O
  • No-force: allows a transaction to commit even if its pages are not yet flushed

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)

  • ARIES is a steal/no-force recovery algorithm
  • uses
    • LSNs for identifying log records
    • a dirty page table to track modified pages
    • physical redo to improve performance during recovery
    • logical undo to improve concurrency during normal operations
    • fuzzy checkpointing
  • three phases in recovery proceeds
    • analysis phase: identify dirty pages, identify the starting point for the redo phase
    • redo phase: repeat the history up to the point of a crash
    • undo phase: roll back all incomplete transactions and restore the database to the last consistent state

Concurrency Control

  • Optimistic Concurrency Control (OCC)
    • check conflict “before” the commit
  • Multiversion Concurrency Control (MVCC)
    • allowing multiple timestamped versions of the record to be present
  • Pessimistic Concurrency Control (PCC)
    • manage and grant access to shared resources

Transaction Isolation

  • Serializability
    • a schedule is a list of operations required to execute a set of transactions
    • a schedule is serial when its transactions are executed one after another without any interleaving
    • a schedule is serializable, if it’s equivalent to some complete serial schedule
  • Read & Write Anomalies
    • read anomalies
      • dirty read
        • a transaction can read uncommitted changes from other transactions
      • nonrepeatable read (fuzzy read)
        • a transaction queries the same row twice and gets different results
      • phantom read
        • a transaction queries the same set of rows twice and gets different results
    • write anomalies
      • lost update
        • two transactions update the same record without awareness about each other’s existence
      • dirty write
        • transaction results are based on the values that have never been committed
      • write skew
        • the combination of individual transactions does not satisfy the required invariant
  • Isolation Levels
Isolation level    | Dirty read | Non-repeatable read | Phantom
Read Uncommitted   | Allowed    | Allowed             | Allowed
Read Committed     | -          | Allowed             | Allowed
Repeatable Read    | -          | -                   | Allowed
Serializable       | -          | -                   | -

Isolation level    | Lost Update | Dirty Write | Write Skew
Snapshot Isolation | -           | -           | Allowed

Optimistic Concurrency Control

  • Transaction execution phases
    • Read Phase
      • Identify the read set & write set
    • Validation Phase
      • check serializability
        • whether the read set is out-of-date
        • whether the write set would overwrite values written by transactions that committed during the read phase
      • if conflict found, restart from the read phase
      • else, start commit and write phase
    • Write Phase
      • commit the write set from private context to the database state
  • critical section: Validation Phase & Write Phase
  • efficient if validation usually succeeds and transactions rarely need to be retried (see the sketch below)
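
A toy sketch of the three OCC phases above, using per-key version counters (single-threaded; a real engine would guard the validation + write phases as a critical section):

    class OccTransaction:
        """Reads and writes go into a private context; commit validates the read
        set against versions committed since the reads, then applies the write set."""
        def __init__(self, db, versions):
            self.db, self.versions = db, versions       # shared store + per-key version counters
            self.read_set, self.write_set = {}, {}

        def read(self, key):                            # read phase
            if key in self.write_set:
                return self.write_set[key]
            self.read_set[key] = self.versions.get(key, 0)   # remember the version we saw
            return self.db.get(key)

        def write(self, key, value):                    # buffered in the private context
            self.write_set[key] = value

        def commit(self):                               # validation + write phase
            for key, seen_version in self.read_set.items():
                if self.versions.get(key, 0) != seen_version:
                    return False                        # read set is out of date: retry
            for key, value in self.write_set.items():
                self.db[key] = value
                self.versions[key] = self.versions.get(key, 0) + 1
            return True

    db, versions = {"x": 1}, {}
    txn = OccTransaction(db, versions)
    txn.write("x", txn.read("x") + 1)
    print(txn.commit(), db)                             # True {'x': 2}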

Multiversion Concurrency Control

  • allowing multiple record versions
  • using monotonically incremented transaction IDs or timestamps
  • distinguishes between committed & uncommitted versions
    • last committed version: current
    • to keep at most one uncommitted value at a time
  • major use cases for implementing snapshot isolation

Pessimistic Concurrency Control

Lock-Free Scheme

  • timestamp ordering
    • each value tracks a max_read_timestamp and a max_write_timestamp
    • a read by a transaction whose timestamp is lower than the value’s max_write_timestamp is aborted (the value was already overwritten by a newer transaction)
    • a write by a transaction whose timestamp is lower than the value’s max_read_timestamp is aborted (a newer transaction has already read the older value)
    • a write by a transaction whose timestamp is lower than the value’s max_write_timestamp is simply ignored as outdated (the Thomas write rule)
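
A per-value sketch of the timestamp-ordering rules above (illustrative only; timestamps would come from the transaction manager):

    class Abort(Exception):
        """Forces the transaction to restart with a new timestamp."""

    class TimestampOrderedValue:
        def __init__(self, value):
            self.value = value
            self.max_read_timestamp = 0
            self.max_write_timestamp = 0

        def read(self, txn_ts):
            if txn_ts < self.max_write_timestamp:
                raise Abort("value was already overwritten by a newer transaction")
            self.max_read_timestamp = max(self.max_read_timestamp, txn_ts)
            return self.value

        def write(self, txn_ts, new_value):
            if txn_ts < self.max_read_timestamp:
                raise Abort("a newer transaction already read the older value")
            if txn_ts < self.max_write_timestamp:
                return                              # outdated write: silently ignored
            self.value = new_value
            self.max_write_timestamp = txn_ts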

Lock-Based Scheme

  • two phase locking (2PL)
    • growing phase (locks acquiring)
    • shrinking phase (locks releasing)
  • deadlocks
    • timeout and abort
    • conservative 2PL
      • requires acquiring all the locks before any operations execute
      • significantly limits concurrency
    • wait-for graph
      • maintained by the transaction manager
      • applying either one of the restrictions
        • wait-die: a transaction can be blocked only by a transaction with lower priority
        • wound-wait: a transaction can be blocked only by a transaction with higher priority
Locks & Latches
  • Locks
    • guard logical data integrity
    • guard a specific key or key range
    • heavyweight
  • Latches
    • guard physical data integrity
    • guard a page/node in the storage structure
    • lightweight
  • lock-free concurrency schemes still need latches
  • reader-writer lock (RW Lock) compatibility
    • reader vs. reader: shared
    • reader vs. writer: exclusive
    • writer vs. writer: exclusive
  • manage access to pages
    • busy-wait
    • queueing, compare-and-swap (CAS)
  • Latch Crabbing
    • read path: the parent node’s latch can be released as soon as the child node’s latch is acquired
    • insert path: the parent node’s latch can be released if the child node is not full (no split can propagate up)
    • delete path: the parent node’s latch can be released if the child node holds enough elements (no merge can propagate up)
  • Latch Upgrading
    • acquire shared latches along the search path, then upgrade them to exclusive latches when necessary
    • drawback: the root has to be latched (and potentially upgraded) by every write, which can make it a bottleneck
  • B^link-Trees: B-Trees with high keys and sibling link pointers
  • allow the state of half-split
    • referenced by sibling pointer, not child pointer
      • if the search key is larger than the high key, follow the sibling link
  • therefore,
    • no need to hold the parent latch while descending to the child level, even if the child node is being split
    • reduce the number of locks held
    • allows reads concurrent to tree structural change, and prevents deadlocks
    • slightly less efficient when a split is encountered (a relatively rare case)

Database-Internals-Ch-4-Implementing-B-Trees

  • Page header contents are implementation-specific:
    • PostgreSQL: page size, layout version
    • MySQL InnoDB: number of heap records, level, implementation-specific values
    • SQLite: number of cells, a rightmost pointer

Magic Numbers

  • a multibyte constant block, ex. hex 50 41 47 45 (“PAGE”)
  • validation & sanity check
  • identify version

Sibling Links

  • forward / backward links
  • help to locate neighboring nodes without ascending back to the parent / root
  • add complexity to splits and merges
    • may require additional locking to update the sibling node
    • could be useful in Blink-Trees (discussed later)

Rightmost Pointers

  • each cell has 1 separator key and 1 child pointer
  • the rightmost pointer is stored in header
  • used by SQLite

Node High Keys

  • high key, represents the highest possible key of the subtree
  • used by PostgreSQL, called B^link-Trees
  • pros:
    • separator keys and child pointers are stored pairwise
    • less edge case handling
    • more explicit search space

Overflow Pages

  • primary page, followed by multiple linked overflow pages
    • page ID of the next page could be stored in the page header
  • most of implementation
    • keep a fixed-size portion of the payload (up to max_payload_size) in the primary page
    • spill the rest of the payload out to overflow pages
    • max_payload_size is calculated as page size / fanout
  • require extra bookkeeping for defragmentation
  • keys reside in the primary page for frequent comparison
  • reading a data record may require traversing several overflow pages to assemble all its parts

Operation

Propagating Splits and Merges

  • breadcrumbs
    • used to keep track of the traversal path
    • PostgreSQL implements this with BTStack
    • equivalent to parent pointers, since child nodes are always reached from the root through their parent(s)
    • build and maintained in memory
  • deadlocks may happen when using sibling pointers
    • WiredTiger uses parent pointers for leaf traversal

Rebalancing

  • Operation Cost
    • to postpone split & merge operations
    • to amortize the cost of split & merge by rebalancing
  • Occupancy
    • to improve the occupancy
      • B*-Trees keep distributing between the neighboring nodes until both are full
      • then split the two nodes into three nodes
    • lower the tree height
      • fewer pages traversal
  • SQLite implements this as the balance-siblings algorithm

Right-Only Appends

  • An optimization for auto-incremented monotonically increasing primary index
  • fastpath in PostgreSQL
    • cache the rightmost leaf, to skip the whole read path from the root
  • quickbalance in SQLite
    • when the rightmost page is full, “create” a new empty page instead of “splitting” a full page into two half-full pages
  • bulk loading
    • bulk loading presorted data or rebuilding the tree
    • compose bottom up, avoid splits & merges
      • fill the leaf level pages
      • propagate the first keys of each leaf node up to its parent node
  • Immutable B-Trees or auto-incremented primary index
    • can fill nodes up completely, without leaving any space for future insertions in the middle

Compression

  • Compression Level
    • entire index file
      • impractical
    • page-wise
      • may not align with the disk blocks
    • row-wise
    • field-wise
  • Compression Evaluation, ex. Squash Compression Benchmark
    • memory overhead
    • compression performance
    • decompression performance
    • compression ratio

Vacuum & Maintenance

  • Rewrite the page keeping only the live cell data
  • Similar terminologies:
    • defragmentation
    • garbage collection
    • compaction
    • vacuum
    • maintenance

Database-Internals-Ch-3-File-Formats

Motivation

For on-disk layout structures, we have to deal with:

  • fixed-size primitive structures & variable-size structures
  • garbage collection & fragmentation
  • serialization & deserialization
  • tracking and management of the segments usage

Binary Encoding

  • numeric (fixed-size)
    • Big-endian: most-significant byte (MSB) first
    • Little-endian: least-significant byte (LSB) first
    • IEEE 754: 32-bit float for single-precision values
      • bit 31: sign
      • bits 30-23: exponent
      • bits 22-0: fraction
  • string (variable-size)
    • (size|data): UCSD String or Pascal String
      • could get length in constant time without iterating through
      • could slice copy from memory
    • (data|null): null-terminated string
  • bit-packed data
    • booleans
    • enum
    • flags, bitmasks
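
The encodings above, illustrated with Python’s struct module (the format strings and the 2-byte length prefix are arbitrary choices for the sketch):

    import struct

    # fixed-size numeric: the same uint32 in big-endian vs. little-endian byte order
    assert struct.pack(">I", 0x12345678) == b"\x12\x34\x56\x78"   # MSB first
    assert struct.pack("<I", 0x12345678) == b"\x78\x56\x34\x12"   # LSB first

    # variable-size string with a length prefix (Pascal / UCSD style): |size|data|
    def encode_string(s: str) -> bytes:
        data = s.encode("utf-8")
        return struct.pack("<H", len(data)) + data                # 2-byte size, then the bytes

    def decode_string(buf: bytes, offset: int = 0):
        (size,) = struct.unpack_from("<H", buf, offset)
        start = offset + 2
        return buf[start:start + size].decode("utf-8"), start + size   # value, next offset

    # IEEE 754 single precision: 1 sign bit, 8 exponent bits, 23 fraction bits
    (bits,) = struct.unpack(">I", struct.pack(">f", -1.5))
    sign, exponent, fraction = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    print(sign, exponent, fraction)                               # 1 127 4194304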

Structures & Layouts

File Organization

| header | page | page | page | ... | trailer |

Page Organization for Fixed-size Records

Page for a B-Tree node:

| P0 | k1 | v1 | P1 | k2 | v2 | ... | kn | vn | Pn | unused |

Downside:

  • key insertion requires relocating elements
  • does not allow managing or accessing variable-size records efficiently

Slotted Pages

| header | offset0 | offset1 | offset2 | ... | unused | ... | cell2 | cell1 | cell0 |

Allow:

  • storing variable-size records with minimal overhead
  • reclaiming space occupied by the removed records
  • dynamic layout to hide the exact location internally

Cells

  • separator key cells

| [int] key_size | [int] child_page_id | [bytes] key |

  • key-value cells

| [int] key_size | [int] value_size | [bytes] key | [bytes] data_record |

Management of Cells in Slotted Pages

  • Keep offsets sorted by keys
    • no need to relocate cells
  • Maintain an Availability List for inserting a new cell
    • holds the list of offsets of freed segments and their sizes
    • first fit strategy
      • can leave many small unusable segments, which is effectively wasted space
    • best fit strategy
      • find the segment that leaves the smallest remainder (see the sketch after this list)
    • if cannot find in availability list
      • and if there are enough fragmented bytes available
        • => defragmentation
      • if there are not enough available
        • => create an overflow page
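
A sketch of the two availability-list strategies mentioned above, over (offset, size) entries for freed segments:

    def first_fit(free_list, size):
        """Index of the first free segment large enough, or None."""
        for i, (offset, segment_size) in enumerate(free_list):
            if segment_size >= size:
                return i
        return None

    def best_fit(free_list, size):
        """Index of the fitting segment that leaves the smallest remainder, or None."""
        best_index, best_remainder = None, None
        for i, (offset, segment_size) in enumerate(free_list):
            remainder = segment_size - size
            if remainder >= 0 and (best_remainder is None or remainder < best_remainder):
                best_index, best_remainder = i, remainder
        return best_index

    free_list = [(96, 40), (160, 18), (220, 64)]    # (offset, size) of freed segments
    print(first_fit(free_list, 16))                 # 0: leaves a 24-byte remainder
    print(best_fit(free_list, 16))                  # 1: leaves only a 2-byte remainder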

Versioning

Could be done in several ways:

  • identified by filename prefix (ex. Cassandra)
  • separated file (ex. PG_VERSION in PostgreSQL)
  • stored in index file header
  • file format using magic number

Checksumming

Usually the page checksum is put in the page header.

Distinguish between:

  • checksum
    • the weakest form of guarantee; unable to detect corruption in multiple bits
  • cyclic redundancy check (CRC)
    • make sure there were no unintended and accidental changes in data
    • not designed to resist attacks and intentional changes in data
  • cryptographic hash function
    • for security
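
For illustration, computing and verifying a page checksum with Python’s standard library (storing the CRC as a 4-byte little-endian header slot is an arbitrary choice for the sketch):

    import hashlib
    import zlib

    page = b"\x00" * 4096

    crc = zlib.crc32(page)                        # CRC-32: catches accidental corruption
    digest = hashlib.sha256(page).hexdigest()     # cryptographic hash: resists tampering

    header_slot = crc.to_bytes(4, "little")       # store alongside the page...
    assert int.from_bytes(header_slot, "little") == zlib.crc32(page)   # ...and verify on read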

Database-Internals-Ch-2-B-Tree-Basics

Binary Search Trees

  • BST (Binary Search Trees)
  • Tree balancing, pathological tree
  • rebalancing, pivot, rotate

Why binary search trees are impractical for disk-based storage:

  • locality: node child pointers may span across several disk pages
  • tree height: low fanout means a high number of disk seeks to locate the searched element

Disk-Based Structures

  • HDD (Hard Disk Drives)
    • read/write head movements (seek for random access): most expensive
    • sequential operations: relatively cheap
    • smallest transfer unit: sector (512Bytes - 4 KB)
  • SSD (Solid State Drives)
    • no disk spin or head movements
      • the diff between random versus sequential I/O is not as large as HDD
    • is built of
      • memory cells
      • strings (32 - 64 cells per string)
      • arrays
      • pages (2 - 16 KB)
      • blocks (64 - 512 pages per blocks)
      • planes
      • dies
    • smallest unit for read/write: page (write to empty cells only)
    • smallest unit for erase: block
    • FTL (Flash Translation Layer), responsible for
      • mapping page ID to physical locations
      • tracking empty, written, discarded pages
      • garbage collection, relocate live pages, erase unused blocks

On-Disk Structures

  • Block Device abstraction
    • hides the internal disk structure of HDDs & SSDs from the OS
    • even though garbage collection is usually done in the background, it may impact write performance for random and unaligned write workloads
    • writing full block, combining subsequent writes to the same block
      • buffering, immutability
  • Pointer for disk structures
    • on disk, the data layout is managed manually
    • offsets are either
      • precomputed: if the pointer is written before the part it refers to
      • or cached in memory
    • prefer
      • keeping the number of pointers and their spans to a minimum
      • over having long dependency chains
  • Fewer disk accesses by reducing “out-of-page” pointers
    • Paged Binary Trees
      • group nodes into pages to improve locality
      • but still need to update out-of-page pointers during balancing
    • B-Trees
      • increase node fanout
      • reduce tree height
      • reduce the node pointers
      • reduce the frequency of balancing operations

Ubiquitous B-Trees

B-Trees support both point and range queries.

B-Tree Hierarchy

  • node: holds up to N keys and N + 1 pointers
  • key in the node: index entries, separator keys, divider cells
  • root node, internal nodes, leaf nodes
  • page as node
  • occupancy
    • balancing operations are triggered when full or nearly empty
  • B+-Trees: only holds value on the leaf nodes
  • some variants also have sibling node pointers, to simplify range scan

B-Tree Lookup

  • Complexity
    • block transfers: log_k(M)
    • comparisons: log_2(M) (binary search within each node)
  • Lookup objective
    • exact match: point queries, updates, deletions
    • predecessor: range scan, inserts

B-Tree Node Splits & Merges

  • Splits
    • triggered by inserts that overflow the node capacity (leaf nodes: N k-v pairs; internal nodes: N + 1 pointers)
    • promote the split point (midpoint key) to the parent node
  • Merges
    • triggered by deletes that bring occupancy under a threshold
    • merge siblings (otherwise rebalance); demote / delete the separator key for internal / leaf nodes

Database-Internals-Ch-1-Introduction-and-Overview

  • storage medium
    • Memory- vs. Disk-Based
  • layout
    • Column- vs. Row-Oriented
  • other taxonomy (not discussed)
    • OLTP vs. OLAP vs. HTAP (Hybrid Transactional & Analytical Processing)
    • k-v store, relational, document-oriented, graph databases

DBMS Architecture

  • Transport
    • Cluster Communication
    • Client Communication
  • Query Processor
    • Query Parser
      • parse, interpret, validate, access control
    • Query Optimizer
      • based on internal statistics, index cardinality, approx. intersection size
      • data placement
      • usually represented as a dependency tree: the execution plan / query plan
  • Execution Engine
    • Remote Execution
      • read/write, replication
    • Local Execution
  • Storage Engine
    • Transaction Manager
      • schedule transaction
      • ensures logical consistency
    • Lock Manager
      • ensure physical data integrity
    • Access Methods
      • manage access to and organization of data on disk
      • heap file, B-Trees, LSM Trees (discussed later)
    • Buffer Manager
      • cache data pages
    • Recovery Manager
      • operation logs and restoring

Memory- Versus Disk-Based DBMS

  • main memory DBMS
    • primary data resides in memory
    • uses disk for recovery & logging
    • usually simpler, because the OS abstracts memory management
    • limited by volatility; this may change as NVM (Non-Volatile Memory) matures
    • cheap random access allows choosing from a larger pool of data structures
    • durability via backup copies, batch compaction, snapshots, checkpointing
  • disk-based DBMS
    • primary data resides on disk
    • uses memory for caching
    • has to manage data references, serialization, freed memory, and fragmentation
    • usually uses wide and short trees

Column- Versus Row-Oriented DBMS

  • Classified according to how the data is stored on disk:
  • column-oriented
    • partitioned vertically
    • ex. Parquet, ORC, RCFile, Kudu, ClickHouse
    • suits analytical workloads
    • rows reconstructed with implicit identifiers or offsets
    • computational efficiency with the CPU’s vectorized instructions; better compression efficiency
  • row-oriented
    • partitioned horizontally
    • ex. MySQL, PostgreSQL
    • suits transactional workloads
    • rows identified by key
  • wide column store
    • grouped into column families, stored row-wise within each column family
    • ex. BigTable, HBase
    • suits retrieving data by a sequence of keys
    • values identified by key & qualifier

Data Files and Index Files

DBMS use specialized file organization for the purposes of:

  • storage efficiency
  • access efficiency
  • update efficiency

Some terminologies:

  • data records: consisting of multiple fields
  • index: efficiently locates data records without a full scan
  • data files & index files: usually separated
  • page: files are partitioned into pages, as size of one or multiple disk blocks
  • deletion markers (tombstones): shadow the deleted record until reclaiming during garbage collection

Data Files

Also called primary files.

Implemented as:

  • heap-organized tables
    • no ordering required
    • append with new records
    • require additional index structures to be searchable
  • hash-organized tables
    • records are stored in buckets
    • inside the bucket, could be sorted or not
  • index-organized tables (IOTs)
    • store data records in the index
    • range scans can be performed as sequential scans
    • reduces disk seeks by at least one (no separate data file lookup)

Index Files

Primary index & Secondary index

  • Primary index
    • is built over a primary key or a set of keys identified as primary
    • unique entry per search key
  • Secondary index
    • all other indexes
    • may hold several entries per search key
    • may point to the same record from multiple secondary indexes

Clustered & Non-clustered

  • Clustered
    • the order of data records follows the search key order
    • primary indexes are most often clustered
    • IOTs are clustered by definition
    • secondary indexes are non-clustered by definition

Referencing directly or primary index as an indirection (when search by secondary index)

  • Referencing directly
    • reduce the number of disk seek
  • Indirection
    • reduces the cost of pointer updates when records are relocated
    • ex. MySQL InnoDB
  • Hybrid
    • store both data file offset and primary keys
    • try the direct offset first; if that fails, fall back to the primary key
    • update index after finding a new offset

Buffering, Immutability, and Ordering

Three common variables for storage structures.

  • Buffering
    • ex. in-memory buffers to B-Tree nodes to amortize the I/O costs
    • ex. two-component LSM Trees combine buffering with immutability
  • Immutability
    • modifications are appended
    • copy-on-write
    • distinction between LSM and B-Trees is drawn as immutable against in-place update storage
  • Ordering
    • whether data is stored in key order, enabling efficient range scans

Hadoop Application Architectures Ch.7 Near-Real-Time Processing with Hadoop

Stream processing tools

  • Apache Storm
  • Apache Spark Streaming
  • Apache Samza
  • Apache Flume via Flume interceptors
  • Apache Flink

Not include

  • Kafka
    • is only a message bus; it does not process streaming data
  • Impala, Apache Drill, or Presto
    • are the low-latency, massively parallel processing (MPP) query engines

NRT: near-real-time processing

Stream Processing

Common functions,

  • Aggregation
  • Windowing averages
  • Record level enrichment
  • Record level alerting / validation
  • Persistence of transient data (storing state)
  • Support for Lambda Architectures
  • Higher-level functions (sorting, grouping, partitioning, joining)
  • Integration with HDFS / HBase
THE LAMBDA ARCHITECTURE

The Lambda Architecture, as defined by Nathan Marz and James Warren and described more thoroughly in their book Big Data (Manning), is a general framework for scalable and fault-tolerant processing of large volumes of data.

  • Batch layer
    • immutable copy of the master data
    • precomputes the batch views
  • Serving layer
    • indexes the batch views, loads them, and makes them available for low-latency querying
  • Speed layer
    • is essentially the real-time layer in the architecture
    • creates views on data as it arrives in the system

Processing Flow,

  • new data will be sent to the batch and speed layers
    • in the batch layer, appended to the master data set
    • in the speed layer, used to do incremental updates of the real-time views
  • at query time
    • data from both layers will be combined
    • when the data is available in the batch and serving layers, it can be discarded from the speed layer

Advantage,

  • fault tolerant
  • low latency
  • error correction
MICROBATCHING VERSUS STREAMING
  • microbatching simplifies the processing logic
  • lowers per-message processing overhead (batched puts)
  • makes exactly-once semantics easier to support
  • Storm is a pure streaming tool

Apache Storm

Storm High-Level Architecture

(figure: Storm high-level architecture)

Storm Topologies

(figure: Storm topology)

  • In Storm you are building a topology that is solving a problem.
  • In Trident and Spark Streaming you are expressing how to solve a problem, and the topology is constructed for you behind the scenes.

Tuples and Streams

A stream is an unbounded sequence of tuples between any two nodes in a topology.

Spouts and Bolts

  • Spouts
    • provide the source of streams in a topology
    • read data from some external source, and emit one or more streams
  • Bolts
    • consume streams in a topology
    • do some processing on the tuples in the stream
    • then emit zero or more tuples to downstream bolts or to an external system such as a data persistence layer

Stream Groupings

An important feature that Storm has over Flume.

Groupings,

  • Shuffle grouping
    • distributes tuples evenly and randomly across the downstream bolts
  • Fields grouping
    • tuples with the same value(s) of the specified field(s) are sent to the same bolt
    • like partitioning by hash key
  • All grouping
    • fan-out
    • replicates stream to all bolts
  • Global grouping
    • fan-in, collecting
    • sends an entire stream to a single bolt

Reliability of Storm Applications

The ability to guarantee message processing relies on having a reliable message source, ex. Kafka.

  • At-most-once processing
  • At-least-once processing
  • Exactly-once processing
    • leverage an additional abstraction over core Storm, like Trident

Storm Example: Simple Moving Average

  • Linked list for the windowing buffer
  • shuffleGrouping and fieldsGrouping(new Fields("ticker")) in buildTopology()
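
A generic windowing-buffer sketch of the moving-average idea (plain Python with a deque, not the book’s Storm/Java bolt):

    from collections import deque

    class SimpleMovingAverage:
        """Keep the last N prices per ticker and emit their average on each update."""
        def __init__(self, window_size):
            self.window_size = window_size
            self.windows = {}                     # ticker -> deque of recent prices

        def update(self, ticker, price):
            window = self.windows.setdefault(ticker, deque(maxlen=self.window_size))
            window.append(price)                  # the oldest tick falls off automatically
            return sum(window) / len(window)

    sma = SimpleMovingAverage(window_size=3)
    for price in (10.0, 11.0, 13.0, 9.0):
        print(sma.update("AAPL", price))          # 10.0, 10.5, 11.33..., 11.0

In the Storm version, the fields grouping on "ticker" ensures all ticks for a given symbol land on the same bolt instance, so per-ticker windows like this stay consistent.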

Evaluating Storm

SUPPORT FOR AGGREGATION AND WINDOWING
  • easy to implement
  • state and counters are not fault-tolerant, since they use local storage
    • if using external storage, like HBase or Memcached, notice the sync overhead, and progress loss trade-off
ENRICHMENT AND ALERTING
LAMBDA ARCHITECTURE
  • batch processes implemented with ex. MapReduce or Spark

Trident

  • a higher-level abstraction over Storm
  • wrap Storm in order to provide support for transactions over Storm
  • follows a declarative programming model similar to SQL
  • use operations for processing, such as filters, splits, merges, joins, and groupings
  • follows a microbatching model
    • providing a model where exactly-once semantics can be more easily supported
    • provides the ability to replay tuples in the event of failure
    • provides management of batches to ensure that tuples are processed exactly once

Evaluating Trident

SUPPORT FOR AGGREGATION AND WINDOWING
  • can now persist to external storage systems to maintain state with higher throughput
ENRICHMENT AND ALERTING
  • the batches are merely wrappers, nothing more than a marker at the end of a group of tuples
LAMBDA ARCHITECTURE
  • still need to implement the batch process in something like MapReduce or Spark

Spark Streaming

  • Reliable persistence of intermediate data for your counting and rolling averages.
  • Supported integration with external storage systems like HBase.
  • Reusable code between streaming and batch processing.
  • The Spark Streaming microbatch model allows for processing patterns that help to mitigate the risk of duplicate events.

Concept:

  • normal RDD: a reference to a distributed immutable collection
  • DStream: a reference to a distributed immutable collection tied to a batch window (a chunk of the stream)

Spark Streaming simple count example

Spark Streaming multiple stream

Maintaining state in Spark Streaming

  • updateStateByKey()
  • checkpoint
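
A minimal PySpark-style sketch of updateStateByKey with checkpointing (the host, port, and checkpoint path are placeholders; this uses the classic Spark Streaming DStream API):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="RunningCount")
    ssc = StreamingContext(sc, batchDuration=5)          # 5-second microbatches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   # required for stateful operations

    def update_count(new_values, running_count):
        # new_values: this key's values in the current microbatch; running_count: prior state
        return sum(new_values) + (running_count or 0)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
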
DSTREAMS PROVIDES FAULT TOLERANCE
  • state is saved to the checkpoint directory every N microbatches
  • recreate from cache in memory or disk
SPARK STREAMING FAULT TOLERANCE
  • WAL for driver process failure recovery
  • resilient RDD, configurable

Evaluating Spark Streaming

SUPPORT FOR AGGREGATION AND WINDOWING
  • counting, windowing, and rolling averages are straightforward in Spark Streaming
ENRICHMENT AND ALERTING
  • has throughput advantages when the enrichment and/or alerting require lookups from external systems like HBase
  • the major downside is latency: microbatching operates at the level of seconds
LAMBDA ARCHITECTURE
  • code can be reused between Spark & Spark Streaming