Database Internals, Ch. 9: Failure Detection

  • prerequisite for consensus, atomic broadcast, and other distributed algorithms
  • link failure and process failure
  • essential properties
    • completeness
      • every non-faulty member eventually notices the failed process
      • the algorithm can still make progress and eventually reach its final result
    • efficiency
      • how fast a failed process can be identified
    • accuracy
      • whether process failures are detected precisely (live processes are not falsely suspected)

Heartbeats & Pings

  • Fixed Time Interval
    • simple, but the ping frequency and timeout must be chosen carefully
  • Timeout-Free Failure Detector
    • works under asynchronous system assumptions; counts heartbeats instead of relying on timeouts
  • Outsourced Heartbeats
    • when a direct ping gets no response, neighbor nodes ping the target on the sender's behalf as a fallback
  • Phi-Accrual Failure Detector
    • sliding window of heartbeat arrival times used to estimate the probability of failure
  • Gossip and Failure Detection
    • each node keeps a vector of heartbeat counters, increments its own entry, and periodically sends the vector to a random neighbor, which merges it into its own
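
The last two detectors in the list can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `PhiAccrualDetector`, the function `merge_heartbeats`, the window size, and the `min_stddev` floor are all assumptions made for the example. The sketch also assumes heartbeat inter-arrival times are roughly normally distributed, so the suspicion level is phi = -log10(P(next heartbeat arrives later than the time already elapsed)).

```python
import math
from collections import deque
from statistics import NormalDist, fmean, pstdev


class PhiAccrualDetector:
    """Suspicion level computed from a sliding window of inter-arrival times."""

    def __init__(self, window_size=100, min_stddev=0.05):
        self.intervals = deque(maxlen=window_size)  # sliding window of intervals
        self.last_arrival = None
        # Assumed floor so perfectly regular heartbeats do not yield stddev = 0.
        self.min_stddev = min_stddev

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level; higher means failure is more likely."""
        if not self.intervals:
            return 0.0  # not enough samples to suspect anything yet
        mean = fmean(self.intervals)
        stddev = max(pstdev(self.intervals), self.min_stddev)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat is still on its way.
        p_later = 1.0 - NormalDist(mean, stddev).cdf(elapsed)
        p_later = max(p_later, 1e-300)  # avoid log10(0) for very long silences
        return -math.log10(p_later)


def merge_heartbeats(local, remote):
    """Gossip merge: keep the highest heartbeat counter seen for each node."""
    return {
        node: max(local.get(node, 0), remote.get(node, 0))
        for node in local.keys() | remote.keys()
    }
```

A caller would compare `phi(now)` against a threshold it picks itself, trading detection speed (low threshold) against false suspicions (high threshold). In the gossip variant, a node whose counter has not increased after several merge rounds would be the one to suspect.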