Slides - Yale University

Traffic Engineering with
Forward Fault Correction (FFC)
Hongqiang “Harry” Liu,
Srikanth Kandula, Ratul Mahajan, Ming Zhang,
David Gelernter (Yale University)
1
Cloud services require large network capacity
Cloud Services
Growing traffic
Cloud Networks
Expensive
(e.g. cost of WAN: $100M/year)
2
TE is critical to effectively utilizing networks
Traffic Engineering
• Microsoft SWAN
• Google B4
• ……
WAN Network
• Devoflow
• MicroTE
• ……
Datacenter Network
3
But, TE is also vulnerable to faults
TE controller
Network
view
Network
configuration
Control-plane
faults
Frequent updates for
high utilization
Data-plane
faults
Network
4
Control plan faults
Failures or long delays to configure a network device
TE Controller
Switch
TE configurations
RPC failure
Firmware bugs Overloaded CPU
Memory shortage
5
Congestion due to control plane faults
s2
Link Capacity: 10
s4 (10)
s2
New Flows (traffic demands):
s1 s2 (10)
10
s1 s3 (10)
s1 s4 (10)
7
3
s1
s4
s1
s2
10
10
s4
3
s3
s3
7
10
s4 (10)
10
s3
10
Configuration
s2 failure
7
3
s1
s4
Congestion
10
10
s3
10
6
Data plane faults
Link and switch failures
Rescaling: Sending traffic
proportionally to residual paths
s2
s4 (10) s2
s2
7 failure
link
10
3
s1
s4
3
s3
s4 (10) s3
link failure
7
Link Capacity: 10
s1
congestion
s4
3
s3
7
7
Control and data plane faults in practice
In production networks:
• Faults are common.
• Faults cause severe congestion.
Control plane:
fault rate = 0.1% -- 1% per TE update.
Data plane:
fault rate = 25% per 5 minutes.
8
State of the art for handling faults
•Heavy over-provisioning
Big loss in throughput
•Reactive handling of faults
• Control plane faults: retry
• Data plane faults: re-compute TE and update networks
Cannot prevent
congestion
Slow
(seconds -- minutes)
Blocked by control
plane faults
9
How about handling congestion
proactively?
10
Forward fault correction (FFC) in TE
• [Bad News] Individual faults are unpredictable.
• [Good News] Simultaneous #faults is small.
FEC guarantees no information loss
under up to k arbitrary packet drops.
Packet loss
with careful data encoding
FFC guarantees no congestion
under up to k arbitrary faults.
Network faults
with careful traffic distribution
11
Example: FFC for control plane faults
Link Capacity: 10
s2
7
10
3
s1
s4
s2
10
10
s1
s4
3
s3
7
10
s3
10
Non-FFC
10
Configuration
s2 failure
7
3
s1
s4
Congestion
10
10
s3
10
12
Example: FFC for control plane faults
s2
Link Capacity: 10
7
s2
10
3
10
7
s1
s4
s1
s4
3
10
7
s3
s3
10
Control Plane FFC (k=1)
10
Configuration
s2 failure
7
10
3
s4
s1
s4
3
7
s3
10
7
s1
10
s2
10
10
Configuration
failure
s3
7
13
Trade-off: network utilization vs. robustness
10
s2
10
10
s3
10
10
s4
10
Non-FFC
Throughput: 50
s1
s4
10
s3
s2
10
4
7
10
s1
10
s2
10
K=1
(Control Plane FFC)
Throughput: 47
s1
s4
10
s3
10
K=2
(Control Plane FFC)
Throughput: 44
14
Systematically realizing FFC in TE
Formulation:
How to merge FFC into existing TE framework?
Computation:
How to find FFC-TE efficiently?
15
Basic TE linear programming formulations
Sizes of flows
LP formulations
𝑏𝑓
TE decisions:
𝑙𝑓,𝑡
Traffic on paths
TE objective: Maximizing throughput
Deliver all granted flows
Basic TE constraints:
FFC constraints:
No overloaded link up to
No overloaded link
…
∀𝑓 𝑏𝑓
max.
s.t. ∀𝑓:
∀𝑒:
∀𝑡 𝑙𝑓,𝑡
∀𝑓
≥ 𝑏𝑓
∀𝑡∋𝑒 𝑙𝑓,𝑡
…
𝑘𝑐 control plane faults
𝑘𝑒 link failures
𝑘𝑣 switch failures
16
≤ 𝑐𝑒
Formulating control plane FFC
𝑓1
𝑓2
𝑓3
𝑓1 ’s load in old TE
s1
s2
Total load under faults?
𝑓2 ’s load in new TE
Fault on 𝑓1 : 𝑙1𝑜𝑙𝑑 + 𝑙2𝑛𝑒𝑤 + 𝑙3𝑛𝑒𝑤 ≤ link cap
Fault on 𝑓2 : 𝑙1𝑛𝑒𝑤 + 𝑙2𝑜𝑙𝑑 + 𝑙3𝑛𝑒𝑤 ≤ link cap
𝟑
𝟏
Fault on 𝑓3 : 𝑙1𝑛𝑒𝑤 + 𝑙2𝑛𝑒𝑤 + 𝑙3𝑜𝑙𝑑 ≤ link cap
Challenge: too many constraints
With n flows and FFC protection k:
#constraints = 𝒏𝟏 + … + 𝒏𝒌 for each link.
17
An efficient and precise solution to FFC
Our approach:
A lossless compression from O( 𝑛𝑘 ) constraints to O(kn) constraints.
Total load under faults
≤ link capacity
Total additional load due to faults
≤ link spare capacity
𝑥𝑖 : additional load due to fault-i
Given 𝑋 = {𝑥1 , 𝑥2 , … , 𝑥𝑛 }, FFC requires that
O(
the sum of arbitrary k elements in 𝑋 is ≤ link spare capacity
Define 𝑦𝑚 as the mth largest element in 𝑋:
𝑘
𝑚=1 𝑦𝑚 ≤ link spare capacity
Expressing 𝑦𝑚 with 𝑋?
O(1)
19
𝑛
𝑘
)
Sorting network
𝑥1
𝑥2
𝑥3
𝑥4
𝑧2
𝑧1
1st round
𝑧4
𝑧3
𝑧5
𝑧7
𝑧6
2nd round
𝑧8
𝑦1 (1st largest)
𝑦2 (2nd largest)
A comparison
𝑥1
𝑧1 =max{𝑥1 , 𝑥2 }
𝑥2
𝑧2 =min{𝑥1 , 𝑥2 }
𝑦1 + 𝑦2 ≤ link spare capacity
• Complexity: O(kn) additional variables and constraints.
• Throughput: optimal in control-plane and data plane if
paths are disjoint.
19
FFC extensions
• Differential protection for different traffic priorities
• Minimizing congestion risks without rate limiters
• Control plane faults on rate limiters
• Uncertainty in current TE
• Different TE objectives (e.g. max-min fairness)
•…
20
Evaluation overview
• Testbed experiment
• FFC can be implemented in commodity switches
• FFC has no data loss due to congestion under faults
• Large-scale simulation
A WAN network with O(100)
switches and O(1000) links
Injecting faults based
on real failure reports
Single priority traffic in a
well-provisioned network
Multiple priority traffic in a
well-utilized network
21
FFC prevents congestion with negligible
throughput loss
160%
Single priority
High priority (High FFC protection)
Medium priority (Low FFC protection)
Low priority (No FFC protection)
100
Ratio (%)
80
60
40
20
0
<0.01%
FFC Data-loss / Non-FFC Data-loss
FFC Throughput / Optimal Throughput
22
Conclusions
• Centralized TE is critical to high network utilization but is
vulnerable to control and data plane faults.
• FFC proactively handle these faults.
• Guarantee: no congestion when #faults ≤ k.
• Efficiently computable with low throughput overhead in practice.
FFC
High risk of
congestion
Heavy network
over-provisioning
23