Why is P2P the Most Effective Way to
Deliver Internet Media Content
Xiaodong Zhang
Ohio State University
In collaboration with
Lei Guo, Yahoo!
Songqing Chen, George Mason
Enhua Tan, Ohio State
Zhen Xiao, IBM T. J. Watson Research
Media contents on the Internet
• Video applications are mainstream
• Video traffic is doubling every 3 to 4 months
• YouTube ranks No. 3 (1. Yahoo, 2. Google, 3. YouTube)
Different media delivery approaches
• Downloading (HTTP): complete fetch
• Pseudo streaming (HTTP): progressive download & play
• Streaming (RTSP): user controlled access
• Content Delivery Network
• Overlay multicast
• P2P exchange
• P2P swarming
The Power of measurements and modeling
• Media delivery on the Internet
– Internet is an open, complex system
– Media traffic is user-behavior driven
• Challenges
– Lack of QoS support
– Lack of Internet management and control for media flow
– Thousands of concurrent streams from diverse clients
• Measurements and modeling are critical for
– Evaluating system performance under the Internet environment
– Understanding user access patterns in media systems
– Providing guidance to media system design and management
Zipf distribution is believed to be the general
model of Internet traffic patterns
• Zipf distribution (power law): $y_i \propto i^{-\alpha}$, α: 0.6~0.8
– i: rank of objects
– $y_i$: number of references
– Straight line with slope -α in log-log scale (log y vs. log i)
– Characterizes the property of scale invariance
– Heavy tailed, scale free
• 80-20 rule
– Income distribution: 80% of social wealth owned by 20% of people (Pareto law)
– Web traffic: 80% of Web requests access 20% of pages (Breslau, INFOCOM’99)
• System implications
– Objectively caching the working set in proxy
– Significantly reduces network traffic
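As a rough illustration of how concentrated a Zipf-like workload is, the sketch below computes the share of references that go to the top 20% of objects when the i-th ranked object receives references proportional to i^(-α). The function name `zipf_top_share` and the object count are assumptions made for this example, not values from the measured workloads.

```python
def zipf_top_share(n_objects: int, alpha: float, top_fraction: float = 0.2) -> float:
    """Share of all references that go to the top `top_fraction` of objects
    when the i-th ranked object gets references proportional to i**(-alpha)."""
    weights = [i ** (-alpha) for i in range(1, n_objects + 1)]
    k = max(1, int(n_objects * top_fraction))
    return sum(weights[:k]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical catalogue of 100,000 objects; alpha values from the slide's range.
    for alpha in (0.6, 0.7, 0.8):
        share = zipf_top_share(100_000, alpha)
        print(f"alpha={alpha}: top 20% of objects receive {share:.0%} of references")
```

The share grows with α, which is why a small proxy cache holding the working set can absorb a large part of a Zipf-like workload.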
Does Internet media traffic follow Zipf’s law?
Web media systems
Chesire, USITS’01: Zipf-like
Cherkasova, NOSSDAV’02: non-Zipf
P2P media systems
Gummadi, SOSP’03: non-Zipf
Iamnitchi, INFOCOM’04: Zipf-like
VoD media systems
Acharya, MMCN’00: non-Zipf
Yu, EUROSYS’06: Zipf-like
Live streaming and IPTV systems
Veloso, IMW’02: Zipf-like
Sripanidkulchai, IMC’04: non-Zipf
Inconsistent media access pattern models
• Still based on the Zipf model (heuristic assumptions)
– Zipf with exponential cutoff
– Zipf-Mandelbrot distribution
– Generalized Zipf-like distribution
– Two-mode Zipf distribution
– Fetch-at-most-once effect
– Parabolic fractal distribution
– …
• All case studies
– Based on one or two workloads
– Different from, or even in conflict with, each other
• An insightful understanding is essential to
– Content delivery system design
– Internet resource provisioning
– Performance optimization
Challenges of addressing the issues
• Existing studies cannot identify a general media access pattern
– Limited number of workloads
– Constrained scope of media traffic
– Biased measurements and noise in the data sets
• Model should be accurate, simple, and meaningful
– Characterize the unique properties
– Have clear physical meanings
– Observable and verifiable predictions
– Impacts on system designs
• Model validation methodology
– Goodness-of-fit test
– Reexamination of previous observations
– Reappraisal of other models
Research Objectives
• Discover a general distribution model of media
access patterns
– Comprehensive measurements and experiments
– Rigorous mathematical analysis and modeling
– Insights into media system designs
Outline
• Motivation and objectives
• Stretched exponential model of Internet media traffic
• Dynamics of access patterns in media systems
• Caching implications
• Concluding remarks
Workload summary
• 16 workloads in different media systems (nearly all workloads available on the Internet)
– Web, VoD, P2P, and live streaming
– Both client side and server side
• Different delivery techniques (all major delivery techniques)
– Downloading, streaming, pseudo streaming
– Overlay multicast, P2P exchange, P2P swarming
• Data set characteristics (data sets of different scales)
– Workload duration: 5 days - two years
– Number of users: $10^3$ - $10^5$
– Number of requests: $10^4$ - $10^8$
– Number of objects: $10^2$ - $10^6$
Stretched exponential distribution
• Media reference rank follows stretched exponential distribution
(passed Chi-square test)
• Probability distribution: Weibull
$P(X \le x) = 1 - \exp\left[-\left(\frac{x}{x_0}\right)^c\right]$
– c: stretch factor
• Rank distribution: $P(y > y_i) = \frac{i}{N}$
$y_i^c = -a \log i + b \quad (1 \le i \le N,\ a = x_0^c)$
$b = 1 + a \log N$ (assuming $y_N = 1$)
– i: rank of media objects (N objects)
– y: number of references
• Fat head and thin tail in log-log scale
• Straight line in log x - $y^c$ scale, with slope -a and intercept b
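The rank relation y_i^c = -a log i + b suggests a simple way to check a workload against the SE model: for a candidate stretch factor c, regress y^c on log i and inspect R². The sketch below is a minimal least-squares version of that idea, not the authors' fitting procedure (which also uses a Chi-square test); the function names, the use of the natural logarithm, and the synthetic workload are assumptions made for illustration.

```python
import math


def synthetic_se_counts(n, c, a):
    """Hypothetical workload drawn exactly from the SE rank model with y_N = 1."""
    b = 1.0 + a * math.log(n)
    return [(b - a * math.log(i)) ** (1.0 / c) for i in range(1, n + 1)]


def fit_stretched_exponential(ref_counts, c):
    """Least-squares fit of y_i**c = -a*log(i) + b for a fixed stretch factor c.

    ref_counts: number of references per object, in any order.
    Returns (a, b, r_squared).
    """
    ys = sorted(ref_counts, reverse=True)            # rank 1 = most referenced
    xs = [math.log(i) for i in range(1, len(ys) + 1)]
    ts = [y ** c for y in ys]                        # values in y^c scale
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    ss_res = sum((t - (slope * x + intercept)) ** 2 for x, t in zip(xs, ts))
    ss_tot = sum((t - mean_t) ** 2 for t in ts)
    return -slope, intercept, 1.0 - ss_res / ss_tot


def best_stretch_factor(ref_counts, candidates):
    """Pick the candidate c whose y^c vs. log i line has the highest R^2."""
    return max(candidates, key=lambda c: fit_stretched_exponential(ref_counts, c)[2])


if __name__ == "__main__":
    counts = synthetic_se_counts(1000, c=0.2, a=0.25)
    print(fit_stretched_exponential(counts, 0.2))
    print(best_stretch_factor(counts, [0.1, 0.2, 0.3, 0.5, 1.0]))
```

On real traces the candidate grid for c would need to be finer, and a high R² alone is weaker evidence than a goodness-of-fit test.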
Evidence: Web media systems (server logs)
[Figure: three panels, *HPC-98 (14 MB), *HPLabs-99 (120 MB), and ST-SVR-01 (15 MB). Each panel plots # of references against rank (log scale) twice: in log-log scale, showing a fat head and thin tail, and in log-$y^c$ scale with the SE model fit.]
SE fits shown: c = 0.22, a = 1.160, b = 10.198, R² = 0.976731; c = 0.3, a = 1.234, b = 8.350, R² = 0.989825; c = 0.2, a = 0.738, b = 6.533, R² = 0.997444
x: rank of media object, y: number of references to the object. Panel title: workload name (median file size)
R²: coefficient of determination (1 means a perfect fit)
HPC-98: enterprise streaming media server logs of HP corporation (29 months)
HPLabs: logs of video streaming server for employees in HP Labs (21 months)
ST-SVR-01: an enterprise streaming media server log workload like HPC-98 (4 months)
Evidence: Web media systems (req packets)
[Figure: three panels, PS-CLT-04 (1.5 MB), ST-CLT-04 (2 MB), and ST-CLT-05 (4.5 MB). Each panel plots # of references against rank (log scale) twice: in log-log scale, showing a fat head and thin tail, and in log-$y^c$ scale with the SE model fit.]
SE fits shown (c = 0.2 in all panels): a = 0.214, b = 2.910, R² = 0.998192; a = 0.239, b = 3.520, R² = 0.998088; a = 0.250, b = 3.192, R² = 0.995964
All collected from a large cable network hosted by a well-known ISP
PS-CLT-04: first IP packets of HTTP requests for media objects (downloading and
pseudo streaming), 9 days
ST-CLT-04: RTSP/MMS streaming requests (on-demand media), 9 days
ST-CLT-05: RTSP/MMS streaming requests (on-demand media), 11 days
Evidence: VoD media systems
[Figure: four panels, *mMoD-98 (125 MB), *CTVoD-04 (300 MB), IFILM-06 (2.25 MB), and YouTube-06 (3.4 MB). Each panel plots # of references against rank (log scale) twice: in log-log scale, showing a fat head and thin tail, and in log-$y^c$ scale with the SE model fit.]
SE fits shown: c = 0.55, a = 5.147, b = 26.388, R² = 0.993933; c = 0.4, a = 12.961, b = 118.864, R² = 0.985824; c = 0.17, a = 0.998, b = 17.029, R² = 0.99262; c = 0.15, a = 0.552, b = 6.308, R² = 0.998219
• mMoD-98: logs of a multicast Media-on-Demand video server, 194 days
• CTVoD-04: streaming server logs of a large VoD system by China Telecom, 219 days, reported as Zipf in EUROSYS’06
• IFILM-06: number of web page clicks to video clips in IFILM site, 16 weeks (one week for the figure)
• YouTube-06: cumulative number of requests to YouTube video clips, by crawling on web pages publishing the data
Evidence: P2P media systems
[Figure: three panels, *KaZaa-02 (300 MB), *KaZaa-03 (5 MB), and BT-03 (636 MB). Each panel plots # of references against rank (log scale) twice: in log-log scale and in log-$y^c$ scale with the SE model fit.]
SE fits shown: c = 0.45, a = 1.251, b = 12.573, R² = 0.999378; c = 0.14, a = 0.221, b = 3.481, R² = 0.999633; c = 0.52, a = 7.234, b = 58.142, R² = 0.999138
KaZaa-02: large video files (> 100 MB; files smaller than 100 MB are intentionally removed) transferred in the KaZaa network, collected in a campus network, 203 days.
KaZaa-03: music files, movie clips, and movie files downloaded in the KaZaa network, 5 days, reported as Zipf in INFOCOM’04.
BT-03: 48 days of BitTorrent file downloading (large video and DVD images) recorded by two tracker sites
Evidence: Live streaming and other systems
[Figure: three panels, Akamai-03 (# of references), Movie-02 (box office sales in $million), and IMDB-06 (# of votes). Each panel plots the quantity against rank (log scale) twice: in log-log scale and in log-$y^c$ scale with the SE model fit.]
SE fits shown: c = 0.65, a = 847.645, b = 4515.774, R² = 0.999094; c = 1.15, a = 259580.882, b = 1404808.965, R² = 0.989619; c = 0.2, a = 2.662, b = 22.694, R² = 0.99764
Akamai-03: server logs of live streaming media collected from akamai CDN, 3 months,
reported as two-mode Zipf in IMC’04
Movie-02: US movie box office ticket sales of year 2002.
IMDB-06: cumulative number of votes for top 250 movies in Internet Movie Database web site
Why was Zipf observed before?
[Figure: media delivery path with an ad server, cache, proxy, and media server]
• Media traffic is driven by user requests
• Intermediate systems may affect traffic pattern
– Effect of extraneous traffic
– Filtering effect due to caching
• Biased measurements may cause Zipf observation
Extraneous media traffic
[Figure: web server (meta file link), ads server, and streaming media server; ads clips and flag clips are interleaved with video prog 1 and video prog 2]
• Ad and flag video clips are pushed to clients mandatorily
Effects of extraneous traffic on
reference rank distributions
Reference rates
• Do not represent user access patterns
– High request rate (high popularity)
– High total number of requests
• Not necessarily Zipf with extraneous traffic
– Extraneous traffic changes
– Always SE without extraneous traffic
• Small object sizes, small traffic volume
[Figure: reference rank distributions of the 2004 and 2005 workloads, with and without extraneous traffic, # of references vs. rank (log scale). With extraneous traffic (raw data in log-log scale, Zipf-like model fits with α = 0.71 and α = 0.92): 2004 is Zipf, 2005 is non-Zipf. Without extraneous traffic (data in log-$y^c$ scale with SE model fit): both are SE; 2004: c = 0.2, a = 0.214, b = 2.910, R² = 0.998192; 2005: c = 0.2, a = 0.250, b = 3.192, R² = 0.995964.]
[Figure: average # of references per object for prog, ads, and flag clips; 2004: 2 objects; 2005: merged into 1 object]
Caching effect
• Web workload: caching can cause a “flattened head” in log-log scale
• Stretched exponential is not caused by caching effect
• Local replay events can be traced by WM/RM streaming media protocols
– Before replay: cache validation
– After replay: send feedback
– Recorded in server logs
– Captured in our network measurement
[Figure: log y vs. log i sketches of Zipf filtered by Web cache and of stretched exponential; DESCRIBE and SET_PARAMETER (Logplaystats) messages, server logs, packet sniffer]
Fetch-at-most-once effect
• SOSP’03: “flattened head” of P2P access pattern
– Media access pattern is Zipf-like
– Users fetch a file at most once
– Unlimited cache for all users
• Contradicts streaming media measurements
• Small streaming media objects
– Users fetch an object multiple times
– SE access pattern, without caching
• Large streaming media objects
– User may fetch and view only once
• Conclusion
– No relation with “fetch-at-most-once”
[Figure: multiple fetches by the same user in ST-CLT-04 and ST-CLT-05: # of requests (log) vs. rank of object-user pair (log), raw data]
[Figure: CTVoD-04 (streaming media, file size 300 MB, c = 0.4) and KaZaa-02 (KaZaa network, file size 300 MB, c = 0.45): # of references vs. rank (log scale), in log-log scale and in log-$y^c$ scale with SE model fit]
Why media access pattern is not Zipf
• Web accesses are Zipf
– Pareto, power law, …
– The structure of WWW
• “Rich-get-richer” phenomenon
– Popular pages can attract more users
– Pages update to keep popular
– Yahoo has ranked No. 1 for more than six years
– Zipf-like for long duration
• Media accesses are different
– Popularity decreases with time exponentially
– Media objects are immutable
– Rich-get-richer not present
– Non-Zipf in long duration
[Figure: number of distinct weekly top N popular objects in 16 weeks vs. popularity rank, Web vs. video. Top 1 Web object never changes; top 1 video object changes every week.]
[Figure: BitTorrent media file: number of .torrent file downloads vs. time after torrent birth (day), raw data with linear fit; CCDF of requests (log) vs. time after object birth (day)]
Outline
• Motivation and objectives
• Stretched exponential model of Internet media traffic
• Dynamics of access patterns in media systems
• Caching implications
• Concluding remarks
Dynamics of Access Patterns in Media Systems
• Media reference rank distribution in log-log scale
– Different systems have different access patterns
– The distribution changes over time in a system (NOSSDAV’02)
• All follow stretched exponential distribution
– Stretch factor c
– Minus of slope a
• Physical meanings
– Media file sizes
– Aging effects of media objects
– Deviation from the Zipf model
[Figure: $y^c$ vs. log i: straight line with slope -a and intercept b; c: stretch factor]
Stretch factors of different systems
[Figure: bar charts of stretch factor c (0.00 to 0.60) against median file size. Panels: streaming systems, different file sizes; KaZaa systems, different file sizes (5 MB vs. 300 MB); different systems (streaming vs. P2P), similar file sizes (two panels).]
Stretch factor and media file sizes
• File size vs. stretch factor c
– 0 – 5 MB: c <= 0.2
– 5 – 100 MB: 0.2 ~ 0.3
– > 100 MB: c >= 0.3
• c increases with file size
[Figure: stretch factor c (0.00 to 0.60) vs. median file size (MB, log scale from 1 to 1000); points labeled EDU and BIZ]
• Other factors besides file size
– Different encoding rates and compression ratios
– Video and audio are different
– Different content type: entertainment, educational, business
Conservations in dynamic media systems
• Media requests over time
– Constant media request rate λreq
– Constant object birth rate λobj
• Number of accessed objects:
$N(t) = \lambda_{obj} t + N'(t) = \lambda_{obj} t + O(\log t)$
– $\lambda_{obj} t$: objects created in [0, t)
– $N'(t)$: objects created in (-∞, 0]
[Figure: cumulative requests and objects over time, with slope = λreq and slope = λobj; a system upgrade is marked]
Stretched exponential parameters
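• In a media system
– Constant request rate
– Constant object birth rate
– Constant median file size
• Stretch factor c is a time invariant constant
• Parameter a increases with time
$a = \left[ \frac{\lambda_{req}}{\lambda_{obj}} \cdot \frac{1}{1 + \frac{N'(t)}{\lambda_{obj} t}} \cdot \frac{1}{\Gamma\left(1 + \frac{1}{c}\right)} \right]^c$
• $t \to \infty$: $\bar{y} \to \frac{\lambda_{req}}{\lambda_{obj}}$; $a \to \left[ \frac{\lambda_{req}}{\lambda_{obj}} \cdot \frac{1}{\Gamma\left(1 + \frac{1}{c}\right)} \right]^c$
[Figure: $y^c$ vs. log i: straight line with slope -a and intercept b]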
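The expression for a on this slide can be evaluated directly. The sketch below does so under stated assumptions: `old_objects` is a hypothetical stand-in for N'(t) with the O(log t) growth mentioned earlier in the talk, and the rates used in the demo are made up for illustration.

```python
import math


def old_objects(t):
    """Hypothetical N'(t): accessed objects created before t = 0, growing as O(log t)."""
    return 500.0 * math.log(1.0 + t)


def mean_references(lam_req, lam_obj, t, n_old):
    """Average references per accessed object: (lam_req * t) / (lam_obj * t + N'(t))."""
    return (lam_req * t) / (lam_obj * t + n_old)


def se_parameter_a(lam_req, lam_obj, c, t, n_old):
    """a = [ mean references / Gamma(1 + 1/c) ] ** c, as on the slide."""
    return (mean_references(lam_req, lam_obj, t, n_old) / math.gamma(1.0 + 1.0 / c)) ** c


def se_parameter_a_limit(lam_req, lam_obj, c):
    """Limit of a as t -> infinity, where N'(t) / (lam_obj * t) vanishes."""
    return ((lam_req / lam_obj) / math.gamma(1.0 + 1.0 / c)) ** c


if __name__ == "__main__":
    lam_req, lam_obj, c = 1000.0, 10.0, 0.2   # illustrative rates per day
    for t in (1, 10, 100, 1000):
        print(t, round(se_parameter_a(lam_req, lam_obj, c, t, old_objects(t)), 4))
    print("limit", round(se_parameter_a_limit(lam_req, lam_obj, c), 4))
```

With these inputs a rises over time and approaches the limit from below, matching the two bullets above.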
Evolution of media reference rank distribution
[Figure: parameter a vs. time (day) for IFILM-06, BT-03, and ST-SVR-01, with “slow down” and “system upgrade” periods marked]
[Figure: reference rank distributions ($y^c$ scale vs. rank in log scale) with SE model fits: BT-03 at 7, 28, 35, and 48 days; IFILM-06 at 1, 4, 8, and 16 weeks]
• a increases with time
– Caused by objects created before workload collection
• a converges to a constant
Deviation from the Zipf model
• $\frac{|EF|}{|OE|} \to 1$ when $a \log N \to \infty$
$a = \left[ \frac{\lambda_{req}}{\lambda_{obj}} \cdot \frac{1}{1 + \frac{N'(t)}{\lambda_{obj} t}} \cdot \frac{1}{\Gamma\left(1 + \frac{1}{c}\right)} \right]^c$
• a increases with c (c < 2)
• a increases with λreq/λobj
• a increases with t
• Big media files have large deviation
• Deviation increases with time
[Figure: number of references vs. file reference rank (log-log scale), with points O, A, B, C, D, E, F and (X, Y) marking the deviation from the Zipf line]
[Figure: |EF|/|OE| (0 to 0.5) vs. parameter a (log scale) for N = 1000, N = 10000, and N = 100000]
Example: YouTube Video Measurements in IMC’07
• YouTube video downloads in a campus network
– Campus users: small request rate λreq
– Old objects dominant
– Small a, small deviation
• Total number of downloads of each video clip, for all YouTube up time
– Global users: large request rate λreq
– No old objects
– Large a, large deviation
$a = \left[ \frac{\lambda_{req}}{\lambda_{obj}} \cdot \frac{1}{1 + \frac{N'(t)}{\lambda_{obj} t}} \cdot \frac{1}{\Gamma\left(1 + \frac{1}{c}\right)} \right]^c$
Outline
• Motivation and objectives
• Stretched exponential model of Internet media traffic
• Dynamics of access patterns in media systems
• Caching implications
• Conclusion
Caching analysis methodology
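• Analyze caching with reference rank distribution
– Requests are independent
– Objects occupy unit storage volume
• Optimal hit ratio
– Unlimited cache
$H_{opt} = \frac{\#\text{ of hits}}{\#\text{ of reqs}} = \frac{\#\text{ of reqs} - \#\text{ of objs}}{\#\text{ of reqs}} = \frac{N\bar{y} - N}{N\bar{y}} = 1 - \frac{1}{\bar{y}}$
– N objects, cache size is ηN
• Zipf-like distribution
$H_{zf}(\eta) = \eta^{1-\alpha} - \eta(1-\alpha)$ for $\alpha < 1$
• Stretched exponential
$H_{se}(\eta) = \frac{\Gamma\left(1+\frac{1}{c}\right) - \gamma\left(1+\frac{1}{c},\ \frac{1}{a} - \log\eta\right)}{\Gamma\left(1+\frac{1}{c}\right) - \gamma\left(1+\frac{1}{c},\ \frac{1}{a}\right)} - \frac{\eta}{\bar{y}}$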
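The closed forms on this slide can be cross-checked numerically by building a rank distribution and counting hits directly. The sketch below does that under the slide's assumptions (independent requests, unit-size objects, the ηN most referenced objects cached, one compulsory miss per cached object). The function names and the object count are assumptions for illustration; the Zipf counts are normalized so that y_N = 1, mirroring the SE normalization.

```python
import math


def zipf_counts(n, alpha):
    """Zipf-like reference counts y_i = (N / i)**alpha, so that y_N = 1."""
    return [(n / i) ** alpha for i in range(1, n + 1)]


def se_counts(n, c, a):
    """SE reference counts from y_i**c = -a*log(i) + b with b = 1 + a*log(N)."""
    b = 1.0 + a * math.log(n)
    return [(b - a * math.log(i)) ** (1.0 / c) for i in range(1, n + 1)]


def hit_ratio(counts, eta):
    """Hit ratio when the eta*N most referenced objects are cached.

    Each cached object misses once (its first reference) and hits afterwards;
    references to uncached objects always miss.
    """
    ranked = sorted(counts, reverse=True)
    k = int(round(eta * len(ranked)))
    return (sum(ranked[:k]) - k) / sum(ranked)


if __name__ == "__main__":
    n = 10_000                                  # hypothetical number of objects
    web = zipf_counts(n, alpha=0.8)             # typical Web workload per the talk
    media = se_counts(n, c=0.2, a=0.25)         # typical streaming workload per the talk
    for eta in (0.01, 0.1, 1.0):
        print(eta, round(hit_ratio(web, eta), 3), round(hit_ratio(media, eta), 3))
```

At η = 1 the function reduces to 1 - 1/ȳ, the optimal hit ratio with unlimited cache.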
Modeling caching performance
• Asymptotic analysis for small cache size k (k << N)
– Zipf: $H_{zf}\left(\frac{k}{N}\right) = \sum_{i=1}^{k} \frac{1}{i^{\alpha}} \times \frac{1-\alpha}{N^{1-\alpha}}$
– SE: $H_{se}\left(\frac{k}{N}\right) = \frac{k}{N} \times \frac{(\log N)^{1/c}}{\bar{y}}$
$\lim_{N \to \infty} \frac{H_{se}(k/N)}{H_{zf}(k/N)} = \lim_{N \to \infty} c_1 \frac{(\log N)^{1/c}}{N^{\alpha}} = 0$
• Parameter selection
– Zipf: typical Web workload (α = 0.8)
– SE: typical streaming workload (c = 0.2, a = 0.25, same as ST-CLT-05)
• Media caching is far less efficient than Web caching
[Figure: cache hit ratio vs. η (log scale) for Zipf (α = 0.8), labeled Web, and SE (c = 0.2, a = 0.25), labeled media]
Potential of long term media caching
$H_{opt} = 1 - \frac{1}{\bar{y}(t)} = 1 - \frac{\lambda_{obj}}{\lambda_{req}} \left(1 + \frac{N'(t)}{\lambda_{obj} t}\right)$
• Short term
– Requests dominated by old objects: N’(t) >> λobj t
• Long term
– Requests dominated by new objects: λobj t >> N’(t)
• Optimal hit ratio of caching 10% objects
– PS-CLT-04: 0.52 for 9 days, 0.85 maximal
– ST-CLT-04: 0.48 for 9 days, 0.84 maximal
– ST-CLT-05: 0.54 for 11 days, 0.85 maximal
– Great improvement when λobj t >> N’(t)
• Request correlation can be further exploited
– Object popularity decreases with time
[Figure: cumulative number of accessed objects (old objects vs. new objects) vs. time (day); CDF of old object ages vs. time (day) for PS-CLT-04, ST-CLT-04, and ST-CLT-05]
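To see why the hit ratio improves with longer caching, the optimal hit ratio H_opt = 1 - (λobj/λreq)(1 + N'(t)/(λobj t)) can be evaluated for growing t. The sketch below uses made-up rates and a fixed number of old objects; none of these numbers come from the measured workloads.

```python
def optimal_hit_ratio(lam_req, lam_obj, t, n_old):
    """H_opt = 1 - (lam_obj / lam_req) * (1 + n_old / (lam_obj * t)).

    n_old plays the role of N'(t), the accessed objects created before t = 0.
    """
    return 1.0 - (lam_obj / lam_req) * (1.0 + n_old / (lam_obj * t))


if __name__ == "__main__":
    lam_req, lam_obj, n_old = 1000.0, 100.0, 5000.0   # illustrative values only
    for days in (9, 90, 900, 9000):
        h = optimal_hit_ratio(lam_req, lam_obj, days, n_old)
        print(f"{days:>5} days: H_opt = {h:.3f}")
    print("upper bound:", 1.0 - lam_obj / lam_req)
```

In the short term the old-object term dominates and keeps H_opt low; as λobj t outgrows N'(t), H_opt climbs toward 1 - λobj/λreq.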
Long time to reach optimal
[Figure: CDFs of the cumulative number of requested objects and of the cumulative number of requests vs. age of object (day, log scale) for PS-CLT-04, ST-CLT-04, and ST-CLT-05; the 50% points are marked at 200 days and 150 days]
• Media objects have long lifespan
– Most requested objects were created a long time ago
– Most requests are for objects created a long time ago
• To achieve maximal concentration
– Very long time (months to years)
– Huge amount of storage
– Only peer-to-peer systems provide such a huge space over a long time
Summary
• Media access patterns do not fit Zipf model
• We give reasons why previous results were confusing
• Media access patterns are stretched exponential
• Our findings imply that
– Client-server based proxy systems are not effective to deliver
media contents
– P2P systems are most suitable for this purpose
• We provide an analytical basis for the effectiveness of a
P2P media content delivery infrastructure
Stretched Exponential Distribution:
Decentralized Content Delivery in Internet
• Centralized Internet accesses follow Zipf
• Decentralized Internet accesses (in an organized way, such as
P2P) follow SE
• Other P2P-like accesses fitting SE reported since PODC’08
– IPTV, user channel selection distribution (SIGMETRICS’09)
– PPLive, P2P streaming request distribution (ICDCS’09)
– Wikipedia, Yahoo answers, social network posting distribution (KDD’09)
– Access distribution in PPStream is converting from Zipf (2007) to stretched
exponential (2009) (a report from Nanjing Statistical Institute)
References
• The stretched exponential distribution, PODC’08
• Social network contributors’ distribution, KDD’09
• PSM-throttling, streaming in WLAN with low power, ICNP’07
• SCAP, wireless AP caching for streaming, ICDCS’07
• Quality and resource utilization of Internet streaming, IMC’06
• Internet streaming workload analysis, WWW’05
• Measuring and modeling BitTorrent, IMC’05
• Sproxy, caching for streaming, INFOCOM’04
Caching effect on SE distribution
• Pages can be cached by web browser and proxy
• Page reload events can be accumulated over time
– Trivial in one week
– Increase with time gradually
• Number of affected objects is small
– Movie replay events are not common
[Figure: IFILM-06 (1-16 weeks): page clicks of movie trailers published in IFILM Web site; # of references ($y^c$ scale) vs. rank (log scale) for 1, 4, 8, and 16 weeks, each with SE model fit]
Client requests with time
[Figure: cumulative number of accessed objects vs. time (day) for PS-CLT-04 (9 days), ST-CLT-04 (9 days), and ST-CLT-05 (11 days), split into old objects and new objects]
Cumulative number of requested objects born before and after workload collection
• Both are linear: $N'(t) \propto t$
• $a = \left[ \frac{\lambda_{req}}{\lambda_{obj}} \cdot \frac{1}{1 + \frac{N'(t)}{\lambda_{obj} t}} \cdot \frac{1}{\Gamma\left(1 + \frac{1}{c}\right)} \right]^c$ is constant
• In short duration, media reference rank distribution is stationary
Segment-based streaming media caching
• Streaming media are often partially accessed
– Segment caching is efficient
• “Ideal” segment reference rank distribution
– M segments per object, no partial access
– Two mode SE distribution
[Figure: number of references ($y^c$ scale) vs. object reference rank (log scale), and vs. segment reference rank (log scale) with M segments per object]
Segment-based streaming media caching
[Figure: segment reference rank distributions, # of references ($y^c$ scale) vs. rank (log scale), raw data and SE model, for ST-SVR-01, ST-CLT-04, and ST-CLT-05; c = 0.2 in all panels]
Two-mode SE fits shown: 1st mode a = 0.274, b = 6.886, 2nd mode a = 0.587, b = 8.532; 1st mode a = 0.097, b = 2.775, 2nd mode a = 0.186, b = 3.220; 1st mode a = 0.128, b = 2.999, 2nd mode a = 0.215, b = 3.461
• Segmentation: 5 seconds of media data
• Segments rank distribution: two-mode SE
– Same stretch factor c
– Smaller a than object rank distribution
• Less temporal locality
Segment caching performance
• Segment hit ratio ≈ byte hit ratio
– Much lower than object hit ratio
• LRU is less efficient for media caching
– Less request concentration
– Larger working set
• Segment LFU is even worse
– Sequential order in an object not captured
• Conventional replacement policies are not efficient
[Figure: hit ratio vs. normalized cache size (η) for ST-CLT-04 and ST-CLT-05: optimal object hit ratio, optimal segment hit ratio, LRU segment hit ratio, and perfect LFU segment hit ratio]
[Figure: LRU efficiency: ηLRU − ηopt vs. ηopt for the SE model and the Zipf-like model; η(i): minimum cache size to hold object i]