Visigoth Fault Tolerance
Daniel Porto§, Joao Leitao§, Cheng Li†, Allen Clement†, Aniket Kate‡, Flavio Junqueira∏, Rodrigo Rodrigues§
§NOVA Univ. of Lisbon, ‡MMCI, ∏MSR Cambridge, †MPI-SWS

1. Non-crash outages matter
•Bit-flip causes an outage (Amazon S3)
•Two machines in a cluster compute wrong XOR results (Google)
•Message corruption is often undetected by the TCP checksum (Microsoft)

2. Classical models = extremes
Too optimistic:
•Crash: doesn’t capture arbitrary faults
•Synchronous: predictable message delay
Too pessimistic:
•Byzantine: assumes worst case behavior
•Asynchronous: unbounded message delay

3. Visigoth: optimize for likely faults
•u: faulty processes (crash + arbitrary)
•o: independent faults with identical behavior
•s: slow processes (take ≥ T to reply)
[Figure: quorum diagrams with node labels "Slow" and "crash". Panels:
Synchronous BFT (f=2), Q1. Assume the worst: collusion.
Asynchronous CFT (f=2), Q2. Assume the worst: indefinite delay.
VFT (u=2,o=2,s=0), Q1. Independent wrong replies: arbitrary faults lead to different outputs.
VFT (u=2,o=0,s=1), Q2. Adjusting to the environment (e.g. u=2): safe to assume crashes after T.]
Fills spectrum between existing models
[Figure: spectrum of models. Timing labels: Sync (s=0), Async (s=1), Async (s=2). Fault labels: Crash (o=0); up to 2 arbitrary but independent; arbitrary (o=2). Configurations shown: o=2,n=5; o=1,n=5; o=0,n=5.]

4. What faults are likely?
•Independent faults: security stops coordinated attacks
•Latency is bounded: data center (DC) networks are well provisioned
•Measurement study of RPC times on a private DC*: in 99.9% of cases, T = 1 s suffices for receiving 6 out of 7 replies
[Figure: time at which the 6th reply is received. Axis label: Log(100 - %comm steps).]
*We found similar results in Amazon EC2

5. Adapting protocols to VFT
•VFT basic primitives are provided
•Configure T and s values
•Adjust Q size after T
•Consensus lower bound: n = u + min(u,s) + o + 1 (vs. BFT 3f+1)
[Figure: Cost effectiveness (BFT vs. VFT)]
Initial results show that VFT achieves comparable throughput at a lower cost
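
The consensus lower bound n = u + min(u,s) + o + 1 from section 5 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names `vft_min_replicas` and `bft_min_replicas` are hypothetical and are not part of any VFT implementation. The sample configurations reuse parameter values that appear on the poster (u=2 with different o and s).

```python
def vft_min_replicas(u: int, o: int, s: int) -> int:
    """Lower bound on replicas for VFT consensus: n = u + min(u, s) + o + 1.

    u, o, s are the poster's parameters; the function name is illustrative.
    """
    if u < 0 or o < 0 or s < 0:
        raise ValueError("u, o and s must be non-negative")
    return u + min(u, s) + o + 1


def bft_min_replicas(f: int) -> int:
    """Classical asynchronous BFT bound quoted on the poster: 3f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


if __name__ == "__main__":
    # (u, o, s) triples; the first three are the n=5 points of the spectrum figure
    # if its labels are read as pairing s=0,1,2 with o=2,1,0 (an assumption).
    for u, o, s in [(2, 2, 0), (2, 1, 1), (2, 0, 2), (2, 0, 1), (2, 2, 2)]:
        print(f"u={u} o={o} s={s} -> n={vft_min_replicas(u, o, s)}")
    print(f"BFT f=2 -> n={bft_min_replicas(2)}")
```

With u=2, the configurations (o=2,s=0), (o=1,s=1) and (o=0,s=2) all need 5 replicas, while BFT with f=2 needs 7. Setting u=o=s=2 yields 7 again, so under this formula the fully pessimistic VFT configuration costs the same as BFT.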
6. Questions and final remarks
•Does the model approximate DC faults?
•Is s bounded during partitions?
•VFT withstands faults other than crash
•Fewer resources than BFT/asynchronous models