- failure detection is a prerequisite for consensus and atomic broadcast algorithms in distributed systems
- link failure and process failure
- essential properties
- completeness
- every non-faulty member eventually notices the failed process
- the algorithm can still make progress and eventually reach its final result
- efficiency
- how fast a failed process can be identified
- accuracy
- whether process failures are detected precisely, without falsely suspecting live processes
Heartbeats & Pings
- Fixed Time Interval
- simple, but the heartbeat frequency and the timeout must be chosen carefully
- Timeout-Free Failure Detector
- operates under asynchronous system assumptions: counts heartbeats instead of relying on timeouts
- Outsourced Heartbeats
- when a direct ping gets no response, neighbor nodes probe the suspected process on the sender's behalf
- Phi-Accrual Failure Detector
- keeps a sliding window of heartbeat arrival times to estimate the probability that the process has failed
- Gossip and Failure Detection
- each node periodically sends its vector of heartbeat counters to a random neighbor, which merges it into its own
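
The phi-accrual idea listed above can be illustrated with a minimal sketch. It assumes heartbeat inter-arrival times are roughly normally distributed and computes phi = -log10(P_later(t_now - t_last)), where P_later is the probability that a heartbeat arrives later than the time already elapsed. The class name `PhiAccrualDetector` and the parameters `window_size` and `min_std` are hypothetical choices for illustration, not taken from any particular library.

```python
import math
from collections import deque
from statistics import NormalDist, mean, pstdev


class PhiAccrualDetector:
    """Illustrative phi-accrual detector (names and defaults are assumptions)."""

    def __init__(self, window_size=100, min_std=0.05):
        # Sliding window of the most recent heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the standard deviation so perfectly regular
        # heartbeats do not produce a zero-width distribution.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level at time `now`; higher means more suspect."""
        if not self.intervals:
            return 0.0  # not enough samples to estimate anything yet
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat arrives later than `elapsed`.
        p_later = 1.0 - NormalDist(mu, sigma).cdf(elapsed)
        p_later = max(p_later, 1e-300)  # avoid log10(0)
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in range(11):  # heartbeats at t = 0, 1, ..., 10 seconds
    detector.heartbeat(float(t))

on_time = detector.phi(10.5)   # well before the next expected heartbeat
overdue = detector.phi(13.0)   # two heartbeats have been missed
```

Instead of a binary alive/dead answer, the caller compares `phi` against a threshold of its own choosing (for example, suspecting the process once `phi` exceeds some value), which lets different consumers trade detection speed against false suspicions.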