Database-Internals-Ch-6-BTrees-Variants

  • Abstracting Node Updates, allow to have different life cycles
    • on-disk pages
    • in-memory raw binary cached versions
    • in-memory language-native representations (materialized)
  • Three problems for in-place update B-Tree implementation
    • write amplification
      • updating a disk-resident page copy on every update
    • space amplification
      • preserve some unused buffer space for future insertion and update, and included in transferring
    • complexity of concurrency
      • solving concurrency and dealing with latches

Copy-on-Write B-Trees

  • content updating
    • make a copy on the modified nodes
    • on completion, switch the topmost pointer
  • pros
    • tree is immutable
    • readers require no sync, no lock
    • inherent structure of MVCC (multiversioned)
  • cons
    • requires extra page copying
    • requires more space (but not too much since the shallow tree depth)\
  • ex. Lightning Memory-Mapped Database (LMDB)
    • k-v store used by the OpenLDAP
    • single-level data store
    • direct memory map, no application-level cache, no materialization

Lazy B-Trees

  • buffer updates and propagate them with a delay

WiredTiger

  • default MongoDB’s storage engine
  • data structures:
    • for a node, clean page consists of just index, initially constructed from the on-disk page image
    • for a node, update buffer, implemented using skiplists, complexity similar to search trees, better concurrency profile
  • operations
    • when content updating, save into the update buffer list
    • when reading, the update buffers are merge with the on-disk page content
    • when flushing, the update buffers contents are reconciled and persisted on disk
      • split or merge according to the reconciled page size
  • pros
    • page updates and structural modifications are performed by the background thread

Lazy-Adaptive Tree

  • update buffers attach to subtrees
  • when buffer full, it’s propagating to lower tree levels’ buffer
  • when the propagation reaching the leaf level and the buffer full, if flush to disk and change tree structure at once.

FD-Trees

  • small mutable head tree & multiple immutable sorted runs
    • limit the surface area, where random write I/O in the head tree
    • head tree is a small B-Tree buffering the updates
    • once head tree get full, contents are transferred to the immutable run
    • propagating records from upper level run to lower level
  • Fractional Cascading
    • helps to reduce the cost of locating an item in the lower cascading levels
    • bridges between neighboring-level
    • pull every N-th item from the lower level
  • Logarithmic Runs
    • increasing by factor k to previous level
    • propagating from up to down when run get full

Bw-Trees

  • Buzzword-Tree, try to resolve the three problems at once
  • Update Chains
    • each logical node for B-Tree consist of a linked list head from latest update: delta -> delta -> … -> base
    • base node: cache of the disk copy page contents
    • delta node: all the modifications, can represents inserts, updates, or deletes
    • logical rather than physical
      • node sizes are unlikely to be page aligned
      • no need to pre-allocate space
  • Concurrency
    • each logical node for B-Tree has a logical identifier
    • maintained with an in-memory mapping table
    • the mapping table contains virtual links to the update chain‘s head (latest delta node)
    • updated with Compare-and-Swap (lock-free)
  • Structural Modification Operations
    • SMO
    • Split
      • append a special split delta node to the splitting node
      • split delta node with the midpoint separator key & the pointer to the new logical sibling node
      • similar to B_link-Tree‘s half-split
      • update the parent with the new child node
    • Merge
      • append a special remove delta node to the right sibling, indicating the start of merge SMO
      • append a special merge delta node to the left sibling to point to the right sibling so to logical merge the contents
      • update the parent to remove the link to the right sibling
    • Prevent concurrent SMO
      • an abort delta node is installed on the parent, like a write lock
      • remove when SMO completes
  • Consolidation & GC
    • once the delta chain length reaches a threshold, consolidate the chain into a new node
    • then write to disk and update the mapping table
    • but need to wait for all reader complete to reclaim the memory, epoch-based reclaimation
  • ex. Sled, OpenBw-Tree (by CMU Database Group)

Cache-Oblivious B-Trees

  • automatically optimize the parameters, ex. block size, page size, cache line size for arbitrary adjacent two levels of memory hierarchy
  • van Emde Boas Layout
    • split the B-tree from the middle level, recursive for the subtree
    • result in the sqrt(N) size subtrees
    • any recursive subtree is stored in a contiguous block of memory
    • packed array with parameter density threshold to allow gaps for insertion or update
    • array grow or shrink when become too dense or sparse