Database-Internals-Ch-9-Failure-Detection

  • a prerequisite for consensus, atomic broadcast, and other distributed algorithms
  • link failure and process failure
  • essential properties
    • completeness
      • every nonfaulty member eventually notices the failed process
      • the algorithm eventually reaches its final result
    • efficiency
      • how fast a failed process can be identified
    • accuracy
      • how precisely process failures are detected (avoiding false positives)

Heartbeats & Pings

  • Fixed Time Interval
    • simple, need to carefully select frequency and timeout
  • Timeout-Free Failure Detector
    • under asynchronous assumption
  • Outsourced Heartbeats
    • ask neighbor nodes to send heartbeats on one's behalf, as a fallback path
  • Phi-Accural Failure Detector
    • uses a sliding window of heartbeat inter-arrival times to estimate a continuous suspicion level (phi) instead of a binary verdict (see the sketch after this list)
  • Gossip and Failure Detection
    • each node propagates a vector of heartbeat counters to random neighbors and merges the updates it receives
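
A minimal single-node sketch of the phi-accrual idea above (class and parameter names are made up for illustration; the real detector uses a more careful arrival-time distribution estimate):

    import math
    import statistics
    from collections import deque

    class PhiAccrualDetector:
        """Sketch: derive a suspicion level (phi) from heartbeat inter-arrival times."""
        def __init__(self, window_size=100):
            self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
            self.last_heartbeat = None

        def heartbeat(self, now):
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
            self.last_heartbeat = now

        def phi(self, now):
            if len(self.intervals) < 2:
                return 0.0
            mean = statistics.mean(self.intervals)
            stdev = statistics.stdev(self.intervals) or 1e-9  # avoid a zero spread
            elapsed = now - self.last_heartbeat
            # probability that the next heartbeat is still on its way after `elapsed`
            p_later = max(1.0 - statistics.NormalDist(mean, stdev).cdf(elapsed), 1e-12)
            return -math.log10(p_later)  # higher phi means higher suspicion

Instead of an alive/dead verdict, callers compare phi against a configurable threshold, so the accuracy versus detection-time trade-off becomes a tuning knob.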

Database-Internals-Ch-7-Log-Structured-Storage

  • B-Tree: in-place update, optimized for read performance
  • LSM Tree: append-only, optimized for write performance
  • RUM Conjecture
    • trade-off between Read, Update, & Memory

LSM Trees

  • immutable on-disk storage structure
  • introduced by Patrick O’Neil, Edward Cheng, et al. in 1996
  • sequential writes, prevents fragmentation, achieves higher storage density
  • Amplification sources
    • read amplification: from needing to consult multiple tables to resolve a key
    • write amplification: from rewriting data repeatedly across compaction runs
    • space amplification: from duplicate records living in multiple tables

LSM Tree Structure

  • memtable
    • mutable in-memory
    • serves read & write
    • triggered periodically or size threshold, flush to disk
    • recoverable with WAL
  • Two-component LSM Tree
    • disk-resident tree & memory-resident tree
    • drawback: frequent merges, since every memtable flush triggers one
  • Multicomponent LSM Trees
    • multiple disk-resident tables (components)
    • periodic compaction for several tables
  • life cycles
    • current memtable: receives writes & serves reads
    • flushing memtable: still available for read, but not writable
    • on-disk flushing target: not readable, since still incomplete
    • flushed tables: available for read as soon as the flushed memtable is discarded
    • compacting tables: currently merging disk-resident tables
    • compacted tables: created from flushed or other compacted tables
  • Deletion
    • just removing records from the memtable is not enough: older versions on disk would resurrect
    • done by delete entry / tombstone / dormant certificate
    • range delete: predicate deletes / range tombstones, ex. Cassandra
  • Lookups
    • read from each component, then merge & reconcile the contents
  • Merge-Iteration
    • given a cursor or iterator per table to navigate through file contents
    • use a multiway merge-sort backed by a priority queue / min-heap (see the sketch after this list)
  • Reconciliation
    • reconciliation & conflict resolution of the data records associated with the same key
    • with records holding metadata, ex. timestamps
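
A rough sketch of the merge-iteration and reconciliation steps above, assuming each source yields (key, timestamp, value) records already sorted by key:

    import heapq

    def merge_iterate(sources):
        """K-way merge over sorted iterators (memtable + SSTables) backed by a min-heap,
        reconciling records with the same key by keeping the latest timestamp."""
        merged = heapq.merge(*sources, key=lambda record: record[0])
        current_key, best = None, None
        for key, ts, value in merged:
            if key != current_key:
                if best is not None:
                    yield current_key, best[1]
                current_key, best = key, (ts, value)
            elif ts > best[0]:          # reconciliation: the newer record wins
                best = (ts, value)
        if best is not None:
            yield current_key, best[1]

    memtable = [("a", 5, "a-new"), ("c", 1, "c")]
    sstable = [("a", 3, "a-old"), ("b", 2, "b")]
    print(list(merge_iterate([memtable, sstable])))  # [('a', 'a-new'), ('b', 'b'), ('c', 'c')]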

Compaction

  • Maintenance
    • has a memory usage upper bound, since it only holds iterator heads
    • multiple compactions can be executed (nonintersecting)
    • not only for merge but also allow repartition
    • preserve tombstones during compaction; only remove them once no associated records remain, ex. RocksDB’s bottommost level, Cassandra’s GC grace period
  • Leveled Compaction
    • one of the compaction strategies used by RocksDB
    • Level 0: tables flushed from the memtable; their key ranges may overlap; when the size threshold is reached, they are merged and partitioned into level 1
    • Level 1: partitioned into disjoint key ranges; when the size threshold is reached, tables are merged and partitioned into level 2
    • Level k: the size threshold grows exponentially per level; the bottommost level holds the oldest data
  • Size-tiered Compaction
    • the level of a table is decided by its size
    • similarly sized small tables are merged into larger ones, which are promoted to higher levels (see the sketch after this list)
  • Time-Window Compaction
    • if records have a TTL (time-to-live) set, ex. in Cassandra, tables whose records have all expired can be dropped entirely
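
A rough sketch of the size-tiered grouping idea referenced above; the parameter names loosely mirror Cassandra's STCS options, but this is not its actual algorithm:

    def size_tiered_buckets(table_sizes, bucket_low=0.5, bucket_high=1.5, min_threshold=4):
        """Group tables of similar size into buckets; any bucket with at least
        min_threshold members becomes a compaction candidate."""
        buckets = []  # each bucket: [average_size, [member sizes]]
        for size in sorted(table_sizes):
            for bucket in buckets:
                average, members = bucket
                if bucket_low * average <= size <= bucket_high * average:
                    members.append(size)
                    bucket[0] = sum(members) / len(members)
                    break
            else:
                buckets.append([size, [size]])
        return [members for average, members in buckets if len(members) >= min_threshold]

    # four ~64 MB tables end up in one bucket and would be merged into a larger table
    print(size_tiered_buckets([60, 62, 64, 70, 512, 2048]))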

Implementation Details

Sorted String Tables
  • SSTables
  • consist of index files and data files
  • index file for lookup, ex. B-Trees or hash tables
  • data records, concatenation of key-value, are ordered by key, so allows the sequential reading
  • immutable disk-resident contents
  • SSTable-Attached Secondary Indexes (SASI), implemented in Cassandra: secondary index files are created and maintained along with the SSTable primary key index
Bloom Filter
  • conceived by Burton Howard Bloom in 1970
  • uses a large bits array and multiple hash functions apply on keys
  • the resulting bit positions are OR-ed into the filter; a lookup reports that the key might be in the set, or definitely is not (see the sketch below)
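
A small illustrative Bloom filter in Python (sizes and hash choice are arbitrary; real engines size the filter from the expected key count and a target false-positive rate):

    import hashlib

    class BloomFilter:
        """k hash functions set/test bits in a bit array; false positives are
        possible, false negatives are not."""
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, key: bytes):
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(bytes([seed]) + key).digest()
                yield int.from_bytes(digest[:8], "little") % self.num_bits

        def add(self, key: bytes):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)      # OR the bit into the filter

        def might_contain(self, key: bytes) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(b"user:42")
    assert bf.might_contain(b"user:42")     # added keys always test positive
    print(bf.might_contain(b"user:999"))    # usually False; True would be a false positive
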
Skiplist
  • probabilistic complexity guarantees close to those of search trees
  • node heights are assigned randomly
  • each node links to the next node at every level up to its own height
  • fully_linked flag & compare-and-swap for concurrency
  • ex. Cassandra’s secondary index memtable, some of WiredTiger’s in-memory operations
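
A single-threaded skiplist sketch of the structure above (no fully_linked flag or CAS, and duplicate keys are not handled):

    import random

    class SkipNode:
        def __init__(self, key, value, height):
            self.key, self.value = key, value
            self.next = [None] * height           # one forward pointer per level

    class SkipList:
        MAX_HEIGHT = 16

        def __init__(self):
            self.head = SkipNode(None, None, self.MAX_HEIGHT)

        @staticmethod
        def _random_height():
            height = 1
            while height < SkipList.MAX_HEIGHT and random.random() < 0.5:
                height += 1                       # each extra level with probability 1/2
            return height

        def _predecessors(self, key):
            preds, node = [None] * self.MAX_HEIGHT, self.head
            for level in range(self.MAX_HEIGHT - 1, -1, -1):
                while node.next[level] is not None and node.next[level].key < key:
                    node = node.next[level]
                preds[level] = node               # last node before `key` on this level
            return preds

        def insert(self, key, value):
            preds = self._predecessors(key)
            node = SkipNode(key, value, self._random_height())
            for level in range(len(node.next)):   # splice in at every level up to its height
                node.next[level] = preds[level].next[level]
                preds[level].next[level] = node

        def get(self, key):
            candidate = self._predecessors(key)[0].next[0]
            return candidate.value if candidate is not None and candidate.key == key else None
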
Compression & Disk Access
  • compressed pages do not align with page boundaries
  • so an indirection layer is needed: an offset table storing the offsets and sizes of compressed pages

Unordered LSM Storage

  • Bitcask
    • one of the storage engines used in Riak
    • no memtable; records are appended to the log file directly, avoiding an extra write
    • keydir: an in-memory hash table pointing from each key to the latest record in the log file (see the sketch after this list)
    • garbage collection happens during compaction
    • does not allow range scans
  • WiscKey
    • unsorted data records live in append-only vLogs
    • sorted keys live in an LSM tree
    • this allows range scans
    • when scanning a range, internal SSD parallelism is used to prefetch blocks, reducing random I/O cost
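
A Bitcask-style sketch of the keydir idea mentioned above (hypothetical and heavily simplified: no CRCs, hint files, startup scan, or compaction):

    import os
    import struct

    class KeydirLog:
        """Append records to a log file; an in-memory keydir maps each key to the
        offset of its latest record."""
        HEADER = struct.Struct("<II")  # key_size, value_size as little-endian uint32

        def __init__(self, path):
            self.file = open(path, "a+b")
            self.keydir = {}                      # key -> offset of the latest record

        def put(self, key: bytes, value: bytes):
            self.file.seek(0, os.SEEK_END)
            offset = self.file.tell()
            self.file.write(self.HEADER.pack(len(key), len(value)) + key + value)
            self.file.flush()
            self.keydir[key] = offset             # newer writes simply shadow older offsets

        def get(self, key: bytes):
            if key not in self.keydir:
                return None
            self.file.seek(self.keydir[key])
            key_size, value_size = self.HEADER.unpack(self.file.read(self.HEADER.size))
            return self.file.read(key_size + value_size)[key_size:]

    store = KeydirLog("/tmp/keydir.log")          # path is a placeholder
    store.put(b"k", b"v1")
    store.put(b"k", b"v2")
    print(store.get(b"k"))                        # b'v2': the latest record wins

Compaction would rewrite only the offsets currently referenced by the keydir, garbage-collecting the shadowed records.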

Concurrency in LSM Trees

  • Cassandra uses operation order barriers
  • Memtable switch: after this, all writes go only to the new memtable, while the old one is still available for read
  • Flush finalization: replace the old memtable with a flushed disk-resident table in the table view
  • Write-ahead log truncation: discard a log segment holding records associated with a flushed memtable

Log Stacking

  • SSDs also use log-structured storage to deal with small random writes
  • stacking multiple log-structured systems can run into several problems
    • write amplification
    • fragmentation
    • poor performance
  • Flash Translation Layer
    • flash translation layer (FTL) is used by SSD
    • translate logical page addresses to their physical locations
    • keep track of pages status (live, discarded, empty)
    • garbage collection
      • relocate live pages
      • erase by block (group of pages, 64 to 512 pages)
    • wear leveling distributes the load evenly across the medium, to extend device lifetime
  • Filesystem Logging
    • problems it causes
      • redundant logging
      • different GC patterns
      • misaligned segment writes
      • interleaving of data records and log records due to multiple write streams

LLAMA & Mindful Stacking

  • latch-free, log-structured, access-method aware (LLAMA)
  • allows Bw-Trees to place delta nodes of the same chain in contiguous physical locations
  • more efficient GC, less fragmentation
  • reduce read latency
  • Open-Channel SSDs
    • expose internal control of wear-leveling, garbage collection, data placement, & scheduling
    • skipping the flash translation layer allows single-layer GC and minimizes write amplification
    • Software Defined Flash (SDF): read in page, write in block
    • LOCS (LSM Tree-based KV Store on Open-Channel SSD), 2013
    • LightNVM, implemented in the Linux kernel, 2017

Database-Internals-Ch-6-BTrees-Variants

  • Abstracting node updates allows each representation to have its own life cycle:
    • on-disk pages
    • in-memory raw binary cached versions
    • in-memory language-native representations (materialized)
  • Three problems for in-place update B-Tree implementation
    • write amplification
      • updating a disk-resident page copy on every update
    • space amplification
      • reserving unused buffer space in nodes for future insertions and updates, which is also carried along in transfers
    • complexity of concurrency
      • solving concurrency and dealing with latches

Copy-on-Write B-Trees

  • content updating
    • make a copy on the modified nodes
    • on completion, switch the topmost pointer
  • pros
    • tree is immutable
    • readers require no sync, no lock
    • inherent structure of MVCC (multiversioned)
  • cons
    • requires extra page copying
    • requires more space (but not much, since the tree depth is shallow)
  • ex. Lightning Memory-Mapped Database (LMDB)
    • k-v store used by the OpenLDAP
    • single-level data store
    • direct memory map, no application-level cache, no materialization

Lazy B-Trees

  • buffer updates and propagate them with a delay

WiredTiger

  • MongoDB’s default storage engine
  • data structures:
    • per node, a clean page consists of just an index, initially constructed from the on-disk page image
    • per node, an update buffer, implemented using skiplists: complexity similar to search trees, with a better concurrency profile
  • operations
    • on update, changes are saved into the node’s update buffer list
    • on read, the update buffers are merged with the on-disk page content
    • on flush, the update buffer contents are reconciled and persisted on disk
      • split or merge according to the reconciled page size
  • pros
    • page updates and structural modifications are performed by the background thread

Lazy-Adaptive Tree

  • update buffers attach to subtrees
  • when a buffer fills up, its contents are propagated to buffers at lower tree levels
  • when the propagation reaches the leaf level and that buffer fills up, the contents are flushed to disk and any required tree structure changes are applied at once

FD-Trees

  • small mutable head tree & multiple immutable sorted runs
    • limits the surface area of random write I/O to the head tree
    • the head tree is a small B-Tree buffering the updates
    • once the head tree gets full, its contents are transferred to an immutable run
    • propagating records from upper level run to lower level
  • Fractional Cascading
    • helps to reduce the cost of locating an item in the lower cascading levels
    • bridges between neighboring levels
    • pull every N-th item from the lower level
  • Logarithmic Runs
    • each run is larger than the previous level’s by a factor of k
    • records are propagated downward when a run gets full

Bw-Trees

  • Buzzword-Tree; tries to resolve all three problems (write amplification, space amplification, concurrency complexity) at once
  • Update Chains
    • each logical B-Tree node consists of a linked list headed by the latest update: delta -> delta -> … -> base
    • base node: cache of the disk copy page contents
    • delta nodes: hold all the modifications; can represent inserts, updates, or deletes
    • logical rather than physical
      • node sizes are unlikely to be page aligned
      • no need to pre-allocate space
  • Concurrency
    • each logical node for B-Tree has a logical identifier
    • maintained with an in-memory mapping table
    • the mapping table contains virtual links to the update chain‘s head (latest delta node)
    • updated with Compare-and-Swap (lock-free)
  • Structural Modification Operations
    • SMO
    • Split
      • append a special split delta node to the splitting node
      • split delta node with the midpoint separator key & the pointer to the new logical sibling node
      • similar to a B^link-Tree’s half-split
      • update the parent with the new child node
    • Merge
      • append a special remove delta node to the right sibling, indicating the start of merge SMO
      • append a special merge delta node to the left sibling, pointing to the right sibling, so as to logically merge the contents
      • update the parent to remove the link to the right sibling
    • Prevent concurrent SMO
      • an abort delta node is installed on the parent, like a write lock
      • remove when SMO completes
  • Consolidation & GC
    • once the delta chain length reaches a threshold, consolidate the chain into a new node
    • then write to disk and update the mapping table
    • but must wait for all readers of the old chain to complete before reclaiming memory: epoch-based reclamation
  • ex. Sled, OpenBw-Tree (by CMU Database Group)

Cache-Oblivious B-Trees

  • automatically optimizes parameters (ex. block size, page size, cache line size) for any two adjacent levels of the memory hierarchy
  • van Emde Boas Layout
    • split the B-Tree at the middle level of its height, then recurse into the subtrees
    • results in subtrees of size sqrt(N)
    • any recursive subtree is stored in a contiguous block of memory
    • packed array with parameter density threshold to allow gaps for insertion or update
    • array grow or shrink when become too dense or sparse

Database-Internals-Ch-5-Transaction-Processing-and-Recovery

A database transaction has to preserve ACID

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Components required to implement transactions

  • transaction manager
    • coordinates, schedules, tracks transactions and their individual steps
  • lock manager
    • guards access to the resources and prevents concurrent accesses from violating data integrity
  • page cache
    • serves as an intermediary between persistent storage and the rest of the storage engine
  • log manager
    • holds a history of operations

Buffer Management

page cache (buffer pool)

  • keeps cached page contents in memory
  • allows modification to be buffered
  • when a requested page isn’t present in memory, it is paged in by the page cache
  • if an already cached page is requested, its cached version is returned
  • if there’s not enough space available, some other page is evicted and its contents are flushed to disk

Caching Semantics

  • many databases use the O_DIRECT flag to bypass the kernel page cache
  • the page cache acts as an application-specific equivalent of the kernel page cache
  • accesses the block device directly
  • decouples logical and physical write operations
  • if the page is not pinned or referenced, it can be evicted right away
  • dirty pages have to be flushed before they are evicted
  • PostgreSQL has a background flush writer that cycles through the dirty pages that are likely to be evicted
  • to make sure all the changes are persisted (durability), flushes are coordinated by the checkpoint process with WAL and page cache

Trade-off objectives:

  • Postpone flushes to reduce the number of disk accesses
  • Preemptively flush pages to allow quick eviction
  • Pick pages for eviction and flush in the optimal order
  • Keep cache size within its memory bounds
  • Avoid losing data that has not yet been persisted to primary storage

Locking Pages in Cache

  • the higher levels of B-Tree nodes can be pinned in the cache permanently
    • they are only a small fraction of the tree
    • this saves disk accesses on every lookup path
    • disk access is then only required for the lower-level nodes

Prefetching & Immediate Eviction

  • Page cache also allows the storage engine to have fine-grained control over prefetching and eviction
  • Prefetching
    • leaf nodes traversed in a range scan
  • Immediate Eviction
    • maintenance process, unlikely to be used for the in-flight queries

Page Replacement

  • FIFO (first in, first out)
    • impractical, since e.g. frequently used higher-level node pages would be evicted first
  • LRU (least-recently used)
    • 2Q LRU
    • LRU-K keeping track of the last K accesses
  • CLOCK
    • as an approximated, compact version of LRU
    • Linux uses a variant of CLOCK
    • access bit
      • set to 1, whenever the page is accessed
      • around the circular buffer
        • if access bit is 1, but the page is unreferenced, then set to 0
        • if access bit is already 0, then the page becomes a candidate and is scheduled for eviction
    • advantage
      • can be implemented with compare-and-swap (CAS) operations and does not require locking (see the single-threaded sketch after this list)
  • LFU (least-frequently used)
    • frequency histogram
  • TinyLFU
    • three queues
      • Admission: newly added elements with LRU policy
      • Probation: holding elements most likely to get evicted
      • Protected: holding elements that are to stay for longer
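
A single-threaded sketch of the CLOCK scheme described above (real implementations operate on the access bits with CAS; the pinned handling here is illustrative):

    class ClockCache:
        """Pages sit in a circular buffer with an access bit; the clock hand clears
        set bits and evicts the first unpinned page whose bit is already 0."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.pages = []                       # entries: [page_id, access_bit, pinned]
            self.hand = 0

        def access(self, page_id):
            for entry in self.pages:
                if entry[0] == page_id:
                    entry[1] = 1                  # set the access bit on every hit
                    return
            if len(self.pages) >= self.capacity:
                self._evict()                     # assumes not every page is pinned
            self.pages.append([page_id, 1, False])

        def _evict(self):
            while True:
                entry = self.pages[self.hand]
                if entry[2]:                      # pinned/referenced pages are skipped
                    self.hand = (self.hand + 1) % len(self.pages)
                elif entry[1] == 1:
                    entry[1] = 0                  # second chance: clear the bit, move on
                    self.hand = (self.hand + 1) % len(self.pages)
                else:
                    self.pages.pop(self.hand)     # bit already 0: evict this page
                    self.hand = self.hand % len(self.pages) if self.pages else 0
                    return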

Recovery

Write-Ahead Log

WAL (write-ahead log), commit log

  • an append-only auxiliary disk-resident structure
  • used for crash and transaction recovery
  • functionalities
    • allow page cache to buffer updates while ensuring durability
    • persist all operations on disk until the cache copies of pages are synchronized
    • allow lost in-memory changes to be reconstructed

LSN (log sequence number)

  • a unique, monotonically increasing number
  • with an internal counter or a timestamp
  • as the order index of the operation records in the WAL

Checkpoint

  • sync checkpoint
    • force all dirty pages to be flushed on disk
    • fully synchronizes the primary storage structure
    • impractical, require pausing all operations
  • fuzzy checkpoint
    • last_checkpoint pointer in log header, (with LSN of the begin_checkpoint record)
    • begin_checkpoint log record
    • info about the dirty pages
    • transaction table
    • end_checkpoint log record; the checkpoint is not considered complete until all the specified dirty pages are flushed

Operation Versus Data Log

  • physical log
    • before-image <=> after-image
    • stores the complete page state or byte-wise changes
  • logical log
    • redo <=> undo operation
    • store operation that to be performed against the current state

Steal and Force Policies

  • Steal: allows flushing pages modified by uncommitted transactions
  • No-steal: only pages of committed transactions are flushed; recovery can use only redo entries
  • Force: a transaction commits only after all its modified pages are flushed; no additional redo work is needed during recovery, but commits take longer due to the necessary I/O
  • No-force: allows a transaction to commit even if its pages are not yet flushed

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)

  • ARIES is a steal/no-force recovery algorithm
  • uses
    • LSNs for identifying log records
    • a dirty page table to track modified pages
    • physical redo to improve performance during recovery
    • logical undo to improve concurrency during normal operations
    • fuzzy checkpointing
  • three phases in recovery proceeds
    • analysis phase: identify dirty pages, identify the starting point for the redo phase
    • redo phase: repeat the history up to the point of a crash
    • undo phase: roll back all incomplete transactions and restore the database to the last consistent state

Concurrency Control

  • Optimistic Concurrency Control (OCC)
    • check conflict “before” the commit
  • Multiversion Concurrency Control (MVCC)
    • allowing multiple timestamped versions of the record to be present
  • Pessimistic Concurrency Control (PCC)
    • manage and grant access to shared resources

Transaction Isolation

  • Serializability
    • a schedule is a list of operations required to execute a set of transactions
    • a schedule is serial when its transactions are executed one after another without any interleaving
    • a schedule is serializable, if it’s equivalent to some complete serial schedule
  • Read & Write Anomalies
    • read anomalies
      • dirty read
        • a transaction can read uncommitted changes from other transactions
      • nonrepeatable read (fuzzy read)
        • a transaction queries the same row twice and gets different results
      • phantom read
        • a transaction queries the same set of rows twice and gets different results
    • write anomalies
      • lost update
        • two transactions update the same record without awareness about each other’s existence
      • dirty write
        • transaction results are based on the values that have never been committed
      • write skew
        • the combination of individual transactions does not satisfy the required invariant
  • Isolation Levels
Isolation level    | Dirty read | Non-repeatable read | Phantom
Read Uncommitted   | Allowed    | Allowed             | Allowed
Read Committed     | -          | Allowed             | Allowed
Repeatable Read    | -          | -                   | Allowed
Serializable       | -          | -                   | -

Isolation level    | Lost Update | Dirty Write | Write Skew
Snapshot Isolation | -           | -           | Allowed

Optimistic Concurrency Control

  • Transaction execution phases
    • Read Phase
      • Identify the read set & write set
    • Validation Phase
      • check serializability
        • whether the read set is out-of-date
        • whether the write set would overwrite values written by transactions that committed during the read phase
      • if conflict found, restart from the read phase
      • else, start commit and write phase
    • Write Phase
      • commit the write set from private context to the database state
  • critical section: Validation Phase & Write Phase
  • efficient if validation usually succeeds and transactions rarely need to be retried (see the sketch below)
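
A toy sketch of the three OCC phases above, using per-key version counters (single-threaded; a real engine would guard the validation + write phases as a critical section):

    class OccTransaction:
        """Reads and writes go into a private context; commit validates the read
        set against versions committed since the reads, then applies the write set."""
        def __init__(self, db, versions):
            self.db, self.versions = db, versions       # shared store + per-key version counters
            self.read_set, self.write_set = {}, {}

        def read(self, key):                            # read phase
            if key in self.write_set:
                return self.write_set[key]
            self.read_set[key] = self.versions.get(key, 0)   # remember the version we saw
            return self.db.get(key)

        def write(self, key, value):                    # buffered in the private context
            self.write_set[key] = value

        def commit(self):                               # validation + write phase
            for key, seen_version in self.read_set.items():
                if self.versions.get(key, 0) != seen_version:
                    return False                        # read set is out of date: retry
            for key, value in self.write_set.items():
                self.db[key] = value
                self.versions[key] = self.versions.get(key, 0) + 1
            return True

    db, versions = {"x": 1}, {}
    txn = OccTransaction(db, versions)
    txn.write("x", txn.read("x") + 1)
    print(txn.commit(), db)                             # True {'x': 2}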

Multiversion Concurrency Control

  • allowing multiple record versions
  • using monotonically incremented transaction IDs or timestamps
  • distinguishes between committed & uncommitted versions
    • last committed version: current
    • to keep at most one uncommitted value at a time
  • major use cases for implementing snapshot isolation

Pessimistic Concurrency Control

Lock-Free Scheme

  • timestamp ordering
    • each value tracks a max_read_timestamp and a max_write_timestamp
    • a read by a transaction whose timestamp is lower than the value’s max_write_timestamp is aborted (the value was already overwritten by a newer transaction)
    • a write by a transaction whose timestamp is lower than the value’s max_read_timestamp is aborted (a newer transaction has already read the older value)
    • a write by a transaction whose timestamp is lower than the value’s max_write_timestamp is simply ignored as outdated (the Thomas write rule)
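
A per-value sketch of the timestamp-ordering rules above (illustrative only; timestamps would come from the transaction manager):

    class Abort(Exception):
        """Forces the transaction to restart with a new timestamp."""

    class TimestampOrderedValue:
        def __init__(self, value):
            self.value = value
            self.max_read_timestamp = 0
            self.max_write_timestamp = 0

        def read(self, txn_ts):
            if txn_ts < self.max_write_timestamp:
                raise Abort("value was already overwritten by a newer transaction")
            self.max_read_timestamp = max(self.max_read_timestamp, txn_ts)
            return self.value

        def write(self, txn_ts, new_value):
            if txn_ts < self.max_read_timestamp:
                raise Abort("a newer transaction already read the older value")
            if txn_ts < self.max_write_timestamp:
                return                              # outdated write: silently ignored
            self.value = new_value
            self.max_write_timestamp = txn_ts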

Lock-Based Scheme

  • two phase locking (2PL)
    • growing phase (locks acquiring)
    • shrinking phase (locks releasing)
  • deadlocks
    • timeout and abort
    • conservative 2PL
      • requires acquiring all the locks before any operations execute
      • significantly limits concurrency
    • wait-for graph
      • maintained by the transaction manager
      • applying either one of the restrictions
        • wait-die: a transaction can be blocked only by a transaction with lower priority
        • wound-wait: a transaction can be blocked only by a transaction with higher priority
Locks & Latches
  • Locks
    • guard logical data integrity
    • guard a specific key or key range
    • heavyweight
  • Latches
    • guard physical data integrity
    • guard a page/node in the storage structure
    • lightweight
  • lock-free concurrency schemes still need latches
  • reader-writer lock (RW Lock) compatibility
    • reader vs. reader: shared
    • reader vs. writer: exclusive
    • writer vs. writer: exclusive
  • manage access to pages
    • busy-wait
    • queueing, compare-and-swap (CAS)
  • Latch Crabbing
    • read path: the parent node’s latch can be released as soon as the child node’s latch is acquired
    • insert path: the parent node’s latch can be released if the child node is not full (no split can propagate up)
    • delete path: the parent node’s latch can be released if the child node holds enough elements (no merge can propagate up)
  • Latch Upgrading
    • acquire shared latches along the search path, then upgrade them to exclusive latches when necessary
    • drawback: the root has to be latched (and potentially upgraded) by every write, which can make it a bottleneck
  • B^link-Trees: B-Trees with high keys and sibling link pointers
  • allow the state of half-split
    • referenced by sibling pointer, not child pointer
      • if the search key is larger than the high key, follow the sibling link
  • therefore,
    • no need to hold the parent latch while descending to the child level, even if the child node is being split
    • reduce the number of locks held
    • allows reads concurrent to tree structural change, and prevents deadlocks
    • slightly less efficient when a split is encountered (a relatively rare case)

Database-Internals-Ch-4-Implementing-B-Trees

  • Page header contents are implementation-specific:
    • PostgreSQL: page size, layout version
    • MySQL InnoDB: number of heap records, level, implementation-specific values
    • SQLite: number of cells, a rightmost pointer

Magic Numbers

  • a multibyte constant block, ex. hex 50 41 47 45 (“PAGE”)
  • validation & sanity check
  • identify version

Sibling Links

  • forward / backward links
  • help to locate neighboring nodes without ascending back to the parent / root
  • add complexity to splits and merges
    • may require additional locking to update the sibling node
    • could be useful in Blink-Trees (discussed later)

Rightmost Pointers

  • each cell has 1 separator key and 1 child pointer
  • the rightmost pointer is stored in header
  • used by SQLite

Node High Keys

  • high key, represents the highest possible key of the subtree
  • used by PostgreSQL, called B^link-Trees
  • pros:
    • separator keys and child pointers are stored pairwise
    • less edge case handling
    • more explicit search space

Overflow Pages

  • primary page, followed by multiple linked overflow pages
    • page ID of the next page could be stored in the page header
  • most of implementation
    • keep a fixed-size portion of the payload (up to max_payload_size) in the primary page
    • spill the rest of the payload out to overflow pages
    • max_payload_size is calculated as page size / fanout
  • require extra bookkeeping for defragmentation
  • keys reside in the primary page for frequent comparison
  • reading a data record may require traversing several overflow pages to assemble all its parts

Operation

Propagating Splits and Merges

  • breadcrumbs
    • used to keep track of the traversal path
    • PostgreSQL implements this with BTStack
    • equivalent to parent pointers, since child nodes are always reached from the root through their parent(s)
    • build and maintained in memory
  • deadlocks may happen when using sibling pointers
    • WiredTiger uses parent pointers for leaf traversal

Rebalancing

  • Operation Cost
    • to postpone split & merge operations
    • to amortize the cost of split & merge by rebalancing
  • Occupancy
    • to improve the occupancy
      • B*-Trees keep distributing between the neighboring nodes until both are full
      • then split the two nodes into three nodes
    • lower the tree height
      • fewer pages traversal
  • SQLite implements this as the balance-siblings algorithm

Right-Only Appends

  • An optimization for auto-incremented monotonically increasing primary index
  • fastpath in PostgreSQL
    • cache the rightmost leaf, to skip the whole read path from the root
  • quickbalance in SQLite
    • when the rightmost page is full, “create” a new empty page instead of “splitting” a full page into two half-full pages
  • bulk loading
    • bulk loading presorted data or rebuilding the tree
    • compose bottom up, avoid splits & merges
      • fill the leaf level pages
      • propagate the first keys of each leaf node up to its parent node
  • Immutable B-Trees or auto-incremented primary index
    • can fill nodes up completely, without leaving any space for future insertions in the middle

Compression

  • Compression Level
    • entire index file
      • impractical
    • page-wise
      • may not align with the disk blocks
    • row-wise
    • field-wise
  • Compression Evaluation, ex. Squash Compression Benchmark
    • memory overhead
    • compression performance
    • decompression performance
    • compression ratio

Vacuum & Maintenance

  • Rewrite the page keeping only the live cell data
  • Similar terminologies:
    • defragmentation
    • garbage collection
    • compaction
    • vacuum
    • maintenance

Database-Internals-Ch-3-File-Formats

Motivation

For on-disk layout structures, we have to deal with:

  • fixed-size primitive structures & variable-size structures
  • garbage collection & fragmentation
  • serialization & deserialization
  • tracking and management of the segments usage

Binary Encoding

  • numeric (fixed-size)
    • Big-endian: most-significant byte (MSB) first
    • Little-endian: least-significant byte (LSB) first
    • IEEE 754: 32-bit float for single-precision values
      • bit 31: sign
      • bits 30-23: exponent
      • bits 22-0: fraction
  • string (variable-size)
    • (size|data): UCSD String or Pascal String
      • could get length in constant time without iterating through
      • could slice copy from memory
    • (data|null): null-terminated string
  • bit-packed data
    • booleans
    • enum
    • flags, bitmasks
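
The encodings above, illustrated with Python’s struct module (the format strings and the 2-byte length prefix are arbitrary choices for the sketch):

    import struct

    # fixed-size numeric: the same uint32 in big-endian vs. little-endian byte order
    assert struct.pack(">I", 0x12345678) == b"\x12\x34\x56\x78"   # MSB first
    assert struct.pack("<I", 0x12345678) == b"\x78\x56\x34\x12"   # LSB first

    # variable-size string with a length prefix (Pascal / UCSD style): |size|data|
    def encode_string(s: str) -> bytes:
        data = s.encode("utf-8")
        return struct.pack("<H", len(data)) + data                # 2-byte size, then the bytes

    def decode_string(buf: bytes, offset: int = 0):
        (size,) = struct.unpack_from("<H", buf, offset)
        start = offset + 2
        return buf[start:start + size].decode("utf-8"), start + size   # value, next offset

    # IEEE 754 single precision: 1 sign bit, 8 exponent bits, 23 fraction bits
    (bits,) = struct.unpack(">I", struct.pack(">f", -1.5))
    sign, exponent, fraction = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    print(sign, exponent, fraction)                               # 1 127 4194304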

Structures & Layouts

File Organization

| header | page | page | page | ... | trailer |

Page Organization for Fixed-size Records

Page for a B-Tree node:

| P0 | k1 | v1 | P1 | k2 | v2 | ... | kn | vn | Pn | unused |

Downside:

  • key insertion requires relocating elements
  • does not allow managing or accessing variable-size records efficiently

Slotted Pages

| header | offset0 | offset1 | offset2 | ... | unused | ... | cell2 | cell1 | cell0 |

Allow:

  • storing variable-size records with minimal overhead
  • reclaiming space occupied by the removed records
  • dynamic layout to hide the exact location internally

Cells

  • separator key cells

| [int] key_size | [int] child_page_id | [bytes] key |

  • key-value cells

| [int] key_size | [int] value_size | [bytes] key | [bytes] data_record |

Management of Cells in Slotted Pages

  • Keep offsets sorted by keys
    • no need to relocate cells
  • Maintain an Availability List for inserting a new cell
    • holds the list of offsets of freed segments and their sizes
    • first fit strategy
      • can leave many small unusable segments, which is effectively wasted space
    • best fit strategy
      • find the segment that leaves the smallest remainder (see the sketch after this list)
    • if cannot find in availability list
      • and if there are enough fragmented bytes available
        • => defragmentation
      • if there are not enough available
        • => create an overflow page
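
A sketch of the two availability-list strategies mentioned above, over (offset, size) entries for freed segments:

    def first_fit(free_list, size):
        """Index of the first free segment large enough, or None."""
        for i, (offset, segment_size) in enumerate(free_list):
            if segment_size >= size:
                return i
        return None

    def best_fit(free_list, size):
        """Index of the fitting segment that leaves the smallest remainder, or None."""
        best_index, best_remainder = None, None
        for i, (offset, segment_size) in enumerate(free_list):
            remainder = segment_size - size
            if remainder >= 0 and (best_remainder is None or remainder < best_remainder):
                best_index, best_remainder = i, remainder
        return best_index

    free_list = [(96, 40), (160, 18), (220, 64)]    # (offset, size) of freed segments
    print(first_fit(free_list, 16))                 # 0: leaves a 24-byte remainder
    print(best_fit(free_list, 16))                  # 1: leaves only a 2-byte remainder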

Versioning

Could be done in several ways:

  • identified by filename prefix (ex. Cassandra)
  • separated file (ex. PG_VERSION in PostgreSQL)
  • stored in index file header
  • file format using magic number

Checksumming

Usually the page checksum is put in the page header.

Distinguish between:

  • checksum
    • the weakest form of guarantee; unable to detect corruption in multiple bits
  • cyclic redundancy check (CRC)
    • make sure there were no unintended and accidental changes in data
    • not designed to resist attacks and intentional changes in data
  • cryptographic hash function
    • for security
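
For illustration, computing and verifying a page checksum with Python’s standard library (storing the CRC as a 4-byte little-endian header slot is an arbitrary choice for the sketch):

    import hashlib
    import zlib

    page = b"\x00" * 4096

    crc = zlib.crc32(page)                        # CRC-32: catches accidental corruption
    digest = hashlib.sha256(page).hexdigest()     # cryptographic hash: resists tampering

    header_slot = crc.to_bytes(4, "little")       # store alongside the page...
    assert int.from_bytes(header_slot, "little") == zlib.crc32(page)   # ...and verify on read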

Database-Internals-Ch-2-B-Tree-Basics

Binary Search Trees

  • BST (Binary Search Trees)
  • Tree balancing, pathological tree
  • rebalancing, pivot, rotate

Why binary search trees are impractical for disk-based storage:

  • locality: node child pointers may span across several disk pages
  • tree height: low fanout means a high number of disk seeks to locate the searched element

Disk-Based Structures

  • HDD (Hard Disk Drives)
    • read/write head movements (seek for random access): most expensive
    • sequential operations: relatively cheap
    • smallest transfer unit: sector (512Bytes - 4 KB)
  • SSD (Solid State Drives)
    • no disk spin or head movements
      • the diff between random versus sequential I/O is not as large as HDD
    • is built of
      • memory cells
      • strings (32 - 64 cells per string)
      • arrays
      • pages (2 - 16 KB)
      • blocks (64 - 512 pages per blocks)
      • planes
      • dies
    • smallest unit for read/write: page (write to empty cells only)
    • smallest unit for erase: block
    • FTL (Flash Translation Layer), responsible for
      • mapping page ID to physical locations
      • tracking empty, written, discarded pages
      • garbage collection, relocate live pages, erase unused blocks

On-Disk Structures

  • Block Device abstraction
    • hides the internal disk structure of HDDs & SSDs from the OS
    • even though garbage collection is usually done in the background, it may impact write performance for random and unaligned write workloads
    • writing full block, combining subsequent writes to the same block
      • buffering, immutability
  • Pointer for disk structures
    • on disk, the data layout is managed manually
    • offsets are either
      • precomputed: if the pointer is written before the part it refers to
      • or cached in memory
    • prefer
      • keeping the number of pointers and their spans to a minimum
      • over having long dependency chains
  • Fewer disk accesses by reducing “out-of-page” pointers
    • Paged Binary Trees
      • group nodes into pages to improve locality
      • but still need to update out-of-page pointers during balancing
    • B-Trees
      • increase node fanout
      • reduce tree height
      • reduce the node pointers
      • reduce the frequency of balancing operations

Ubiquitous B-Trees

B-Trees support both point and range queries.

B-Tree Hierarchy

  • node: holds up to N keys and N + 1 pointers
  • key in the node: index entries, separator keys, divider cells
  • root node, internal nodes, leaf nodes
  • page as node
  • occupancy
    • balancing operations are triggered when full or nearly empty
  • B+-Trees: only holds value on the leaf nodes
  • some variants also have sibling node pointers, to simplify range scan

B-Tree Lookup

  • Complexity
    • block transfers: log_k(M)
    • comparisons: log_2(M) (binary search within each node)
  • Lookup objective
    • exact match: point queries, updates, deletions
    • predecessor: range scan, inserts

B-Tree Node Splits & Merges

  • Splits
    • triggered by inserts that overflow the node capacity (leaf nodes: N k-v pairs; internal nodes: N + 1 pointers)
    • promote the split point (midpoint key) to the parent node
  • Merges
    • triggered by deletes that bring occupancy under a threshold
    • merge siblings (otherwise rebalance); demote / delete the separator key for internal / leaf nodes

Database-Internals-Ch-1-Introduction-and-Overview

  • storage medium
    • Memory- vs. Disk-Based
  • layout
    • Column- vs. Row-Oriented
  • other taxonomy (not discussed)
    • OLTP vs. OLAP vs. HTAP (Hybrid Transactional & Analytical Processing)
    • k-v store, relational, document-oriented, graph databases

DBMS Architecture

  • Transport
    • Cluster Communication
    • Client Communication
  • Query Processor
    • Query Parser
      • parse, interpret, validate, access control
    • Query Optimizer
      • based on internal statistics, index cardinality, approx. intersection size
      • data placement
      • usually represented as a dependency tree: the execution plan / query plan
  • Execution Engine
    • Remote Execution
      • read/write, replication
    • Local Execution
  • Storage Engine
    • Transaction Manager
      • schedule transaction
      • ensures logical consistency
    • Lock Manager
      • ensure physical data integrity
    • Access Methods
      • manage access to and organization of data on disk
      • heap file, B-Trees, LSM Trees (discussed later)
    • Buffer Manager
      • cache data pages
    • Recovery Manager
      • operation logs and restoring

Memory- Versus Disk-Based DBMS

  • main memory DBMS
    • primary data resides in memory
    • uses disk for recovery & logging
    • usually simpler, because the OS abstracts memory management
    • limited by volatility; this may change as NVM (Non-Volatile Memory) matures
    • cheap random access allows choosing from a larger pool of data structures
    • durability via backup copies, batch compaction, snapshots, checkpointing
  • disk-based DBMS
    • primary data resides on disk
    • uses memory for caching
    • has to manage data references, serialization, freed memory, and fragmentation
    • usually uses wide and short trees

Column- Versus Row-Oriented DBMS

  • Classified according to how the data is stored on disk:
  • column-oriented
    • partitioned vertically
    • ex. Parquet, ORC, RCFile, Kudu, ClickHouse
    • suits analytical workloads
    • rows reconstructed with implicit identifiers or offsets
    • computational efficiency with the CPU’s vectorized instructions; better compression efficiency
  • row-oriented
    • partitioned horizontally
    • ex. MySQL, PostgreSQL
    • suits transactional workloads
    • rows identified by key
  • wide column store
    • grouped into column families, stored row-wise within each column family
    • ex. BigTable, HBase
    • suits retrieving data by a sequence of keys
    • values identified by key & qualifier

Data Files and Index Files

DBMS use specialized file organization for the purposes of:

  • storage efficiency
  • access efficiency
  • update efficiency

Some terminologies:

  • data records: consisting of multiple fields
  • index: efficiently locates data records without a full scan
  • data files & index files: usually separated
  • page: files are partitioned into pages, as size of one or multiple disk blocks
  • deletion markers (tombstones): shadow the deleted record until reclaiming during garbage collection

Data Files

Also called primary files.

Implemented as:

  • heap-organized tables
    • no ordering required
    • append with new records
    • require additional index structures to be searchable
  • hash-organized tables
    • records are stored in buckets
    • inside the bucket, could be sorted or not
  • index-organized tables (IOTs)
    • store data records in the index
    • range scans can be performed as sequential scans
    • reduces disk seeks by at least one (no separate data file lookup)

Index Files

Primary index & Secondary index

  • Primary index
    • is built over a primary key or a set of keys identified as primary
    • unique entry per search key
  • Secondary index
    • all other indexes
    • may hold several entries per search key
    • may point to the same record from multiple secondary indexes

Clustered & Non-clustered

  • Clustered
    • the order of data records follows the search key order
    • primary indexes are most often clustered
    • IOTs are clustered by definition
    • secondary indexes are non-clustered by definition

Referencing directly or primary index as an indirection (when search by secondary index)

  • Referencing directly
    • reduce the number of disk seek
  • Indirection
    • reduces the cost of pointer updates when records are relocated
    • ex. MySQL InnoDB
  • Hybrid
    • store both data file offset and primary keys
    • try the direct offset first; if that fails, fall back to the primary key
    • update index after finding a new offset

Buffering, Immutability, and Ordering

Three common variables for storage structures.

  • Buffering
    • ex. in-memory buffers to B-Tree nodes to amortize the I/O costs
    • ex. two-component LSM Trees combine buffering with immutability
  • Immutability
    • modifications are appended
    • copy-on-write
    • distinction between LSM and B-Trees is drawn as immutable against in-place update storage
  • Ordering
    • whether data is stored in key order, enabling efficient range scans

Hadoop Application Architectures Ch.7 Near-Real-Time Processing with Hadoop

Stream processing tools

  • Apache Storm
  • Apache Spark Streaming
  • Apache Samza
  • Apache Flume via Flume interceptors
  • Apache Flink

Not include

  • Kafka
    • is only a message bus; it does not process streaming data
  • Impala, Apache Drill, or Presto
    • are the low-latency, massively parallel processing (MPP) query engines

NRT: near-real-time processing

Stream Processing

Common functions,

  • Aggregation
  • Windowing averages
  • Record level enrichment
  • Record level alerting / validation
  • Persistence of transient data (storing state)
  • Support for Lambda Architectures
  • Higher-level functions (sorting, grouping, partitioning, joining)
  • Integration with HDFS / HBase
THE LAMBDA ARCHITECTURE

The Lambda Architecture, as defined by Nathan Marz and James Warren and described more thoroughly in their book Big Data (Manning), is a general framework for scalable and fault-tolerant processing of large volumes of data.

  • Batch layer
    • immutable copy of the master data
    • precomputes the batch views
  • Serving layer
    • indexes the batch views, loads them, and makes them available for low-latency querying
  • Speed layer
    • is essentially the real-time layer in the architecture
    • creates views on data as it arrives in the system

Processing Flow,

  • new data will be sent to the batch and speed layers
    • in the batch layer, appended to the master data set
    • in the speed layer, used to do incremental updates of the real-time views
  • at query time
    • data from both layers will be combined
    • when the data is available in the batch and serving layers, it can be discarded from the speed layer

Advantage,

  • fault tolerant
  • low latency
  • error correction
MICROBATCHING VERSUS STREAMING
  • microbatching simplifies the processing logic
  • lowers per-message processing overhead (batched puts)
  • makes exactly-once semantics easier to support
  • Storm is a pure streaming tool

Apache Storm

Storm High-Level Architecture

(figure: Storm high-level architecture)

Storm Topologies

(figure: Storm topology)

  • In Storm you are building a topology that is solving a problem.
  • In Trident and Spark Streaming you are expressing how to solve a problem, and the topology is constructed for you behind the scenes.

Tuples and Streams

A stream is an unbounded sequence of tuples between any two nodes in a topology.

Spouts and Bolts

  • Spouts
    • provide the source of streams in a topology
    • read data from some external source, and emit one or more streams
  • Bolts
    • consume streams in a topology
    • do some processing on the tuples in the stream
    • then emit zero or more tuples to downstream bolts or to an external system such as a data persistence layer

Stream Groupings

An important feature that Storm has over Flume.

Groupings,

  • Shuffle grouping
    • distributes tuples evenly and randomly across the downstream bolts
  • Fields grouping
    • tuples with the same value(s) of the specified field(s) are sent to the same bolt
    • like partitioning by hash key
  • All grouping
    • fan-out
    • replicates stream to all bolts
  • Global grouping
    • fan-in, collecting
    • sends an entire stream to a single bolt

Reliability of Storm Applications

The ability to guarantee message processing relies on having a reliable message source, ex. Kafka.

  • At-most-once processing
  • At-least-once processing
  • Exactly-once processing
    • leverage an additional abstraction over core Storm, like Trident

Storm Example: Simple Moving Average

  • Linked list for the windowing buffer
  • shuffleGrouping and fieldsGrouping(new Fields("ticker")) in buildTopology()
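
A generic windowing-buffer sketch of the moving-average idea (plain Python with a deque, not the book’s Storm/Java bolt):

    from collections import deque

    class SimpleMovingAverage:
        """Keep the last N prices per ticker and emit their average on each update."""
        def __init__(self, window_size):
            self.window_size = window_size
            self.windows = {}                     # ticker -> deque of recent prices

        def update(self, ticker, price):
            window = self.windows.setdefault(ticker, deque(maxlen=self.window_size))
            window.append(price)                  # the oldest tick falls off automatically
            return sum(window) / len(window)

    sma = SimpleMovingAverage(window_size=3)
    for price in (10.0, 11.0, 13.0, 9.0):
        print(sma.update("AAPL", price))          # 10.0, 10.5, 11.33..., 11.0

In the Storm version, the fields grouping on "ticker" ensures all ticks for a given symbol land on the same bolt instance, so per-ticker windows like this stay consistent.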

Evaluating Storm

SUPPORT FOR AGGREGATION AND WINDOWING
  • easy to implement
  • state and counters are not fault-tolerant, since they use local storage
    • if using external storage, like HBase or Memcached, notice the sync overhead, and progress loss trade-off
ENRICHMENT AND ALERTING
LAMBDA ARCHITECTURE
  • batch processes implemented with ex. MapReduce or Spark

Trident

  • a higher-level abstraction over Storm
  • wrap Storm in order to provide support for transactions over Storm
  • follows a declarative programming model similar to SQL
  • use operations for processing, such as filters, splits, merges, joins, and groupings
  • follows a microbatching model
    • providing a model where exactly-once semantics can be more easily supported
    • provides the ability to replay tuples in the event of failure
    • provides management of batches to ensure that tuples are processed exactly once

Evaluating Trident

SUPPORT FOR AGGREGATION AND WINDOWING
  • can now persist to external storage systems to maintain state with higher throughput
ENRICHMENT AND ALERTING
  • the batches are merely wrappers, nothing more than a marker at the end of a group of tuples
LAMBDA ARCHITECTURE
  • still need to implement the batch process in something like MapReduce or Spark

Spark Streaming

  • Reliable persistence of intermediate data for your counting and rolling averages.
  • Supported integration with external storage systems like HBase.
  • Reusable code between streaming and batch processing.
  • The Spark Streaming microbatch model allows for processing patterns that help to mitigate the risk of duplicate events.

Concept:

  • normal RDD: a reference to a distributed immutable collection
  • DStream: a reference to a distributed immutable collection tied to a batch window (a chunk of the stream)

Spark Streaming simple count example

Spark Streaming multiple stream

Maintaining state in Spark Streaming

  • updateStateByKey()
  • checkpoint
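
A minimal PySpark-style sketch of updateStateByKey with checkpointing (the host, port, and checkpoint path are placeholders; this uses the classic Spark Streaming DStream API):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="RunningCount")
    ssc = StreamingContext(sc, batchDuration=5)          # 5-second microbatches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   # required for stateful operations

    def update_count(new_values, running_count):
        # new_values: this key's values in the current microbatch; running_count: prior state
        return sum(new_values) + (running_count or 0)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
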
DSTREAMS PROVIDES FAULT TOLERANCE
  • state is saved to the checkpoint directory every N microbatches
  • recreate from cache in memory or disk
SPARK STREAMING FAULT TOLERANCE
  • WAL for driver process failure recovery
  • resilient RDD, configurable

Evaluating Spark Streaming

SUPPORT FOR AGGREGATION AND WINDOWING
  • counting, windowing, and rolling averages are straightforward in Spark Streaming
ENRICHMENT AND ALERTING
  • has throughput advantages when the enrichment and/or alerting require lookups from external systems like HBase
  • the major downside is latency: microbatching operates at the level of seconds
LAMBDA ARCHITECTURE
  • code can be reused between Spark & Spark Streaming