Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing
Mayank Pundir and Luke Leslie
Proactive Failure Recovery
Periodic and synchronous disk IO
Overhead of Proactive Recovery
Per-iteration checkpointing slowdown with 16 servers
Desirable Properties
● Zero Overhead [ZO]: No overhead is incurred during
failure-free execution.
● Complete Recovery [CR]: Results in the face of failures
are fully accurate.
● Fast Recovery [FR]: Recovery after failure is quick and
does not require additional iterations.
GAS Model
Reactive Failure Recovery
Out-neighbor Replication
Examples: LFGraph, Pregel, Giraph, Hama
All-neighbor Replication
Examples: Distributed GraphLab, PowerGraph
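The two replication schemes can be illustrated with a minimal sketch. It assumes a simplified model in which every vertex has exactly one owning server; the function name `replica_servers`, the `owner` map, and the toy graph are hypothetical and are not taken from any of the frameworks listed above.

```python
from collections import defaultdict

def replica_servers(edges, owner, scheme):
    """Map each vertex to the set of remote servers that hold a copy of its value.

    edges:  iterable of directed (src, dst) pairs
    owner:  dict vertex -> server id (hypothetical partitioning)
    scheme: "out" copies a vertex to the servers of its out-neighbors;
            "all" copies a vertex to the servers of its in- and out-neighbors
    """
    replicas = defaultdict(set)
    for src, dst in edges:
        if owner[src] == owner[dst]:
            continue  # both endpoints live on the same server, no remote copy
        # The server of dst needs the value of src to compute along this edge.
        replicas[src].add(owner[dst])
        if scheme == "all":
            # Assumption: under all-neighbor replication the server of src
            # also keeps a copy of dst.
            replicas[dst].add(owner[src])
    return dict(replicas)

# Toy graph: three vertices, one per server.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
owner = {"a": 0, "b": 1, "c": 2}
out_rep = replica_servers(edges, owner, "out")
all_rep = replica_servers(edges, owner, "all")
```

In this toy graph, vertex b is copied to one remote server under out-neighbor replication and to two under all-neighbor replication, which is the redundancy that reactive recovery relies on.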
Why checkpoint when it’s already replicated?
a. All-Neighbor Replication
b. Out-Neighbor Replication
Analysis
Detailed proofs in paper
● The expected number of recovered vertices and the
probability of recovery are both dependent on the
fraction of servers that fail, rather than the actual
number of server failures.
● The probability of vertex recovery exhibits rapid
convergence to 1 as the number of neighbors
increases, or as the fraction of servers that fail
decreases.
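Both observations can be reproduced with a back-of-the-envelope model. This is not the paper's proof: it assumes each of a vertex's neighbors is placed independently and uniformly at random on a server, and that the vertex is recoverable if at least one neighbor sits on a surviving server. Under that assumption the recovery probability is 1 - f^d for failed fraction f and d neighbors. The function names below are hypothetical.

```python
import random

def recovery_probability(failed_fraction, num_neighbors):
    """Closed form under the uniform-placement assumption: 1 - f^d."""
    return 1.0 - failed_fraction ** num_neighbors

def simulate(num_servers, num_failed, num_neighbors, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity.

    Under uniform placement it does not matter which servers fail,
    so servers 0..num_failed-1 are treated as the failed ones.
    """
    rng = random.Random(seed)
    recovered = 0
    for _ in range(trials):
        # Draw one server per neighbor; a draw >= num_failed is a survivor.
        placements = (rng.randrange(num_servers) for _ in range(num_neighbors))
        if any(server >= num_failed for server in placements):
            recovered += 1
    return recovered / trials

# Same failed fraction (1/2) on clusters of different size.
estimate_16 = simulate(num_servers=16, num_failed=8, num_neighbors=3)
estimate_32 = simulate(num_servers=32, num_failed=16, num_neighbors=3)
closed_form = recovery_probability(0.5, 3)
```

The closed form depends only on the ratio f, and with half the servers failed it already exceeds 0.999 at ten neighbors.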
Three R’s of Zorr(r)o
● Replace: Each failed server is substituted by a new
server (replacement server).
● Rebuild: Each replacement server collects state
information from surviving servers, rebuilds local state.
● Resume: Computation restarts from the beginning of the
last iteration before failure.
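The three steps can be sketched on an in-memory stand-in for a cluster. Everything here is an assumption made for illustration: the `servers` layout (per-server "values" and "replicas" dicts), the `owner` map, and the `recover` function are hypothetical, and the sketch leaves open how vertices with no surviving replica are handled.

```python
def recover(servers, owner, failed, resume_iteration):
    """Replace failed servers, rebuild their vertex values, report where to resume.

    servers:          dict server id -> {"values": {...}, "replicas": {...}}
    owner:            dict vertex -> id of the server that owns it
    failed:           set of failed server ids
    resume_iteration: iteration to restart from, chosen by the caller
    Returns (resume_iteration, lost) where lost lists vertices with no replica.
    """
    survivors = [sid for sid in servers if sid not in failed]
    lost = []

    # Replace: each failed server is substituted by a new, empty server.
    for sid in failed:
        servers[sid] = {"values": {}, "replicas": {}}

    # Rebuild: each replacement collects state from surviving servers.
    for sid in sorted(failed):
        for vertex, home in owner.items():
            if home != sid:
                continue
            for survivor in survivors:
                copy = servers[survivor]["replicas"]
                if vertex in copy:
                    servers[sid]["values"][vertex] = copy[vertex]
                    break
            else:
                # No surviving server held a copy of this vertex.
                lost.append(vertex)

    # Resume: the caller restarts computation from resume_iteration.
    return resume_iteration, lost

# Toy cluster: each server owns one vertex and holds one remote copy.
cluster = {
    0: {"values": {"a": 1.0}, "replicas": {"c": 3.0}},
    1: {"values": {"b": 2.0}, "replicas": {"a": 1.0}},
    2: {"values": {"c": 3.0}, "replicas": {"b": 2.0}},
}
vertex_owner = {"a": 0, "b": 1, "c": 2}
restart_at, lost_vertices = recover(cluster, vertex_owner, {0}, resume_iteration=5)
```

The sketch only restores owned vertex values; a fuller version would also repopulate the replacement's own replica table before resuming.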
Zorro Recovery Protocol
Evaluation
● Graphs: CA Road (Exponential), Twitter (Power-law),
UK Web (Power-law)
● Frameworks: PowerGraph, LFGraph
● Applications: PageRank, Single-source Shortest Paths
(SSSP), Connected Components, K-core
Decomposition
● Setup: 16 machines, 16 cores, 64 GB RAM
PowerGraph: PageRank Inaccuracy
LFGraph: PageRank Inaccuracy
PowerGraph: SSSP Inaccuracy
LFGraph: SSSP Inaccuracy
Communication Overhead
Relative to total failure-free 10 PageRank iterations
a. PowerGraph
b. LFGraph
Recovery Time
a. PowerGraph
b. LFGraph
Partitioning Functions: PageRank
a. 8 failed servers
b. 4 failed servers at last iteration
Partitioning Functions: SSSP
a. 8 failed servers
b. 4 failed servers at last iteration
Future Work
● Don’t let failures stop you!
● Asynchronous computation
● Hybrid, tunable proactive + reactive
replication
● Delay scheduling for cascading failures prediction model
● Gossip-style propagation among
replacements
Conclusion
● Zorro borrows from the rich and gives to the
poor!
“We live in a system of approximations.”
- Ralph Waldo Emerson
Backup Slides
Trade-off Space