Why is P2P the Most Effective Way to Deliver Internet Media Content Xiaodong Zhang Ohio State University In collaborations with Lei Guo, Yahoo! Songqing Chen, George Mason Enhua Tan, Ohio State Zhen Xiao, IBM T. J. Watson Research 1 Media contents on the Internet • Video applications are mainstream • Video traffic is doubling every 3 to 4 months No. 3 1. Yahoo 2. Google 3. YouTube 2 Different media delivery approaches Complete fetch Content Delivery Network HTTP Downloading User controlled access RTSP Streaming Pseudo Streaming Progressive download & play HTTP ♫ ♫ Overlay multicast P2P exchange P2P swarming 3 The Power of measurements and modeling • Media delivery on the Internet – Internet is an open, complex system – Media traffic is user-behavior driven • Challenges – Lack of QoS support – Lack of Internet management and control for media flow – Thousands of concurrent streams from diverse clients • Measurements and modeling are critical for – Evaluating system performance under the Internet environment – Understanding user access patterns in media systems – Providing guidance to media system design and management 4 Zipf distribution is believed the general model of Internet traffic patterns • Zipf distribution (power law) logy y – Characterizes the property of scale invariance – Heavy tailed, scale free slope: -α • 80-20 rule heavy tail – Income distribution: 80% of social wealth owned by 20% people (Pareto law) – Web traffic: 80% Web requests access 20% pages (Breslau, INFOCOM’99) • System implications – Objectively caching the working set in proxy – Significantly reduce network traffic log i i yi ∝ i −α α: 0.6~0.8 i : rank of objects yi : number of references 5 Does Internet media traffic follow Zipf’s law? Web media systems Chesire, USITS’01: Zipf-like Cherkasova, NOSSDAV’02: non-Zipf P2P media systems Gummadi, SOSP’03: non-Zipf Iamnitchi, INFOCOM’04: Zipf-like VoD media systems Acharya, MMCN’00: non-Zipf Yu, EUROSYS’06: Zipf-like Live streaming and IPTV systems Veloso, IMW’02: Zipf-like Sripanidkulchai, IMC’04: non-Zipf 6 Inconsistent media access pattern models • Still based on the Zipf model – – – – – – – Zipf with exponential cutoff Zipf-Mandelbrot distribution Generalized Zipf-like distribution Two-mode Zipf distribution Fetch-at-most-once effect Parabolic fractal distribution … heuristic assumptions • All case studies – Based on one or two workloads – Different from or even conflict with each other • An insightful understanding is essential to – Content delivery system design – Internet resource provisioning – Performance optimization 7 Challenges of addressing the issues • Existing studies cannot identify a general media access pattern – Limited number of workloads – Constrained scope of media traffic – Biased measurements and noises in the data set • Model should be accurate, simple, and meaningful – – – – Characterize the unique properties Have clear physical meanings Observable and verifiable predictions Impacts on system designs • Model validation methodology – Goodness-of-fit test – Reexamination of previous observations – Reappraisal of other models 8 Research Objectives • Discover a general distribution model of media access patterns – Comprehensive measurements and experiments – Rigorous mathematical analysis and modeling – Insights into media system designs 9 Outline • Motivation and objectives • Stretched exponential model of Internet media traffic • Dynamics of access patterns in media systems • Caching implications • Concluding remarks 10 Workload summary • 16 workloads in different media systems – Web, VoD, P2P, and live streaming – Both client side and server side nearly all workloads available on the Internet • Different delivery techniques – Downloading, streaming, pseudo streaming – Overlay multicast, P2P exchange, P2P swarming all major delivery techniques • Data set characteristics – Workload duration: 5 days - two years data sets of different scales – Number of users: 103 - 105 – Number of requests: 104 - 108 – Number of objects: 102 - 106 11 Stretched exponential distribution • Media reference rank follows stretched exponential distribution (passed Chi-square test) log y fat head Probability distribution: Weibull x P( X ≤ x) = 1 − exp[−( )c ] x0 c: stretch factor thin tail Rank distribution: • fat head and thin tail in log-log scale log i • straight line in logx-y scale c yc c: stretch factor i : rank of media objects (N objects) y : number of references P( y > yi ) = i N b slope: -a yic = −a log i + b (1 ≤ i ≤ N , a = x0c ) b = 1 + a log N (assuming yN = 1) log i 12 Evidences: Web media systems (server logs) fat *HPC-98 head (14 MB)thin tail 10 data in log−log scale 2 10 2 10 1 76 1024 10 2 c 3 10 2 data in log−y scale SE model fit 0 2 10 Rank (log scale) 1 98 c = 0.2, a= 0.738, b = 6.533 R = 0.997444 c 1 10 10 10 data in log−y scale SE model fit 1 0 10 10 c = 0.3, a= 1.234, b = 8.350 2 R = 0.989825 1 3 5033 # of References (log scale) c = 0.22 c = 0.22, R2a=~1.160,1b = 10.198 147 R = 0.976731 1504 # of References (yc scale) 10 # of References (log scale) 6940 2 470 c log scale 3 # of References (y scale) # of References (log scale) 10 1 0 10 10 10 log scale in x axis data in log−yc scale SE model fit 0 1 1 0 10 10 2 10 Rank (log scale) 10 1 10 0 2 10 3 10 Rank (log scale) 10 x: rank of media object, y: number of references to the object. Title: workload name (median file size) data in stretched exponential scale data in log-log scale R2: coefficient of determination (1 means a perfect fit) HPC-98: enterprise streaming media server logs of HP corporation (29 months) HPLabs: logs of video streaming server for employees in HP Labs (21 months) ST-SVR-01: an enterprise streaming media server log workload like HPC-98 (4 months) 13 Evidences: Web media systems (req packets) 3 243 10 1 13 10 c = 0.2, a= 0.214, b = 2.910 2 R = 0.998192 data in log−yc scale SE model fit 1 10 2 0 3 10 10 Rank (log scale) 4 10 log scale in x axis 10 1 0 10 1 10 2 2 243 10 c # of References (y scale) 10 c data in log−y scale SE model fit 1 0 10 # of References (log scale) # of References (y scale) 10 c = 0.2, a= 0.239, b = 3.520 2 R = 0.998088 2 69 c log scale # of References (log scale) c 1 32 10 data in log−log scale data in log−log scale 2 10 3 1024 10 data in log−log scale 243 ST-CLT-05 (4.5 MB) ST-CLT-04 (2 MB) 3 1024 # of References (log scale) PS-CLT-04 (1.5 MB)tail fat head thin # of References (y scale) powered scale yc c 4 16807 10 data in log−log scale 4 21750 # of References (y scale) 3 1516 10 data in log−log scale powered scale yc ST-SVR-01 (15 MB) *HPLabs-99 (120 MB) 5 54151 1 32 10 c = 0.2, a= 0.250, b = 3.192 2 R = 0.995964 c 0 3 10 10 Rank (log scale) 4 10 10 data in log−y scale SE model fit 1 0 10 1 10 2 0 3 10 10 Rank (log scale) 4 10 10 All collected from a large cable network hosted by a well-known ISP PS-CLT-04: first IP packets of HTTP requests for media objects (downloading and pseudo streaming), 9 days ST-CLT-04: RTSP/MMS streaming requests (on-demand media), 9 days ST-CLT-05: RTSP/MMS streaming requests (on-demand media), 11 days 14 Evidences: VoD media systems fat *CTVoD-04 head thin (300 MB)tail 3 6 223716 10 powered scale yc 10 data in log−log scale data in log−log scale 10 62 10 1 c = 0.55, a= 5.147, b = 26.388 2 R = 0.993933 10 40268 10 14877 10 3 1 10 2 10 Rank (log scale) c = 0.4, a= 12.961, b = 118.864 2 # of References (log scale) 3 10321 10 1516 10 2 10 9128740 10 3372290 10 1016239 10 224855 10 29419 1 10 6 5 4 3 2 c = 0.17, a= 0.998, b = 17.029 2 R = 0.99262 10 1 1281 c 1 7 21388057 # of References (y scale) # of References (yc scale) 4 10 10 data in log−log scale 10 data in log−y scale SE model fit 10 data in log−yc scale SE model fit 0 2 10 Rank (log scale) 10 4 10 3 10 YouTube-06: cumulative number of requests to YouTube video clips, by crawling on web pages publishing the data 8 44990169 5 10 c = 0.15, a= 0.552, b = 6.308 2 102 R = 0.998219 • 10 3 10 6 10 data in log−log scale 45688 IFILM-06: number of web page clicks to video clips in IFILM site, 16 weeks (one week for the figure) 0 2 10 Rank (log scale) YouTube-06 (3.4 MB) IFILM-06 (2.25 MB) 1 0 10 1 10 log scale in x axis 430514 154055 • 1 10 data in log−yc scale SE model fit 1 0 10 10 CTVoD-04: streaming serer logs of a large VoD system by China telecom, 219 days, reported as Zipf in EUROSYS’06 2 # of References (log scale) 1 0 10 0 • 4 81921 2773 R = 0.985824 c data in log−y scale SE model fit mMoD-98: logs of a multicast Media-on-Demand video server, 194 days 10 log scale # of References (yc scale) # of References (log scale) # of References (yc scale) 2 198 • 5 142337 # of References (log scale) *mMoD-98 (125 MB) 400 1 0 10 1 2 10 10 0 3 4 5 10 10 Rank (log scale) 10 6 15 10 10 Evidences: P2P media systems *KaZaa-03 (5 MB) 3 10 data in log−yc scale SE model fit 1 10 696 10 c 10 2 c = 0.14, a= 0.221, b = 3.481 2 R = 0.999633 1 54 10 3 10 4 10 10 data in log−y scale SE model fit 1 0 10 1 10 2 0 3 10 10 Rank (log scale) 4 10 10 5 10 3 1479 10 693 10 c = 0.52, a= 7.234, b = 58.142 195 R2 = 0.999138 10 # of References (yc scale) 3 4532 2 1 data in log−yc scale SE model fit c 0 2 10 Rank (log scale) 10 data in log−log scale # of References (log scale) 10 # of References (y scale) 1 36 c = 0.45, a= 1.251, b = 12.573 2 R = 0.999378 1 0 10 # of References (log scale) # of References (yc scale) 2 10 4 2544 data in log−log scale data in log−log scale 132 BT-03 (636 MB) 4 19972 10 1 0 10 1 10 0 2 10 Rank (log scale) 3 10 10 KaZaa-02: large video file (> 100 MB. Files smaller than 100 MB are intensively removed) transferring in KaZaa network, collected in a campus network, 203 days. KaZaa-03: music files, movie clips, and movie files downloading in KaZaa network, 5 days, reported as Zipf in INFOCOM’04. BT-03: # of References (log scale) *KaZaa-02 (300 MB) 299 48 days BitTorrent file downloading (large video and DVD images) recorded by two tracker sites 16 Evidences: Live streaming and other systems Movie-02 7 10 3 2 10 1 10 data in log−yc scale SE model fit 1 0 10 1 10 224834 10 144452 10 77439 10 3 3 2 c = 0.65, a= 847.645, b = 4515.774 2 26686 R = 0.999094 0 2 10 Rank (log scale) 4 10 10 1 10 10 155823 10 121336 10 85284 10 4 3 2 c = 1.15, a= 259580.882, b = 1404808.965 2 46677 R = 0.989619 1 10 c data in log−yc scale SE model fit 1 0 10 5 189191 c 10 10 # of Votes (y scale) 123346 4 316892 Box office sales in $million (log scale) 10 1220 10 data in log−log scale 5 c 10 460392 # of References (yc scale) 1304485 # of References (log scale) 5 Box office sales in $million (y scale) 10 c = 0.2, a= 2.662, b = 22.694 2 R = 0.99764 6 221694 data in log−log scale 6 3087335 20529 IMDB-06 6 419468 10 data in log−log scale # of Votes (log scale) Akamai-03 6436343 data in log−y scale SE model fit 0 1 10 Rank (log scale) 2 10 10 1 0 10 0 1 2 10 Rank (log scale) 10 10 Akamai-03: server logs of live streaming media collected from akamai CDN, 3 months, reported as two-mode Zipf in IMC’04 Movie-02: US movie box office ticket sales of year 2002. IMDB-06: cumulative number of votes for top 250 movies in Internet Movie Database web site 17 Why Zipf observed before? ad server cache proxy media server • Media traffic is driven by user requests • Intermediate systems may affect traffic pattern – Effect of extraneous traffic – Filtering effect due to caching • Biased measurements may cause Zipf observation 18 Extraneous media traffic ads clip flag clip video prog 1 flag clip video prog 2 ads server ads clip meta file link ads clip flag clip video prog 1 flag clip web videoserver prog 2 flag clip ad and flag video are pushed to clients mandatorily video program streaming media server 19 Effects of extraneous traffic on reference rank distributions Reference rates • Do not represent user access patterns – High request rate (high popularity) – High total number of requests • Not necessary Zipf with extraneous traffic – Extraneous traffic changes – Always SE without extraneous traffic • Small object sizes, small traffic volume without extraneous traffic with extraneous traffic 3 10 1 13 1 10 c = 0.2, a= 0.214, b = 2.910 2 R = 0.998192 data in log−y 2004 c 1 1 1010 SE Non-Zipf 2005 0 prog prog ads ads flag flag 2005: merged into 1 object 1 32 10 1 10 c = 0.2, a= 0.250, b = 3.192 2 R = 0.995964 2005 data in log−y scale c 0 2 2 10 5 2 10 2 10 scale SE model fit 0 101 0 0 10 10 3 10 243 c 2 10 # of References(log (y scale) # of References scale) 2 10 Zipf−like model (α = 0.92) # of References (log scale) # of References (y scale) # of References (logc scale) SE 2004 Zipf 10 10 10 data in log−log scale raw data Zipf−like model (α = 0.71) 69 15 2004: 2 objects 3 4 1024 10 data in log−log scale raw data 3 10 20 # of References (log scale) 243 Avg # of references per object 25 3 10 10 10 Rank (log scale) 3 Rank (log scale) 4 10 10 4 10 105 0 101 00 10 10 SE model fit 1 1 1010 0 2 2 3 3 10 10 10 10 Rank (log scale) Rank (log scale) 4 4 10 10 10 5 10 20 Caching effect log y Zipf Filtered by Web cache • Web workload: caching can cause a “flattened head” in log-log scale • Stretched exponential is not caused by caching effect • Local replay events can be traced by WM/RM streaming media protocols log i log y Stretched exponential – – – – Before replay: cache validation After replay: send feed back Recorded in server logs Captured in our network measurement Server logs DESCRIBE packet sniffer log i SET_PARAMETER Logplaystats 21 Fetch-at-most-once effect Unlimited cache for all users Contradict with streaming media measurements ST-CLT-05 2 2 10 10 raw data raw data # of References (log scale) – Users fetch a file at most once ST-CLT-04 # of References (log scale) – Media access pattern is Zipf-like • Multiple fetches by the same user SOSP’03: “flattened head” of P2P access pattern # of requests (log) • 1 10 0 1 10 0 10 0 10 1 2 10 3 10 10 Rank (log scale) 4 10 10 0 10 5 10 1 2 10 3 4 10 10 Rank (log scale) 10 Rank of object-user pair (log) 5 10 – SE access pattern, without caching CTVoD-04 Small streaming media objects KaZaa-02 6 223716 10 • Conclusion – No relation with “fetch-at-most-once” 4 81921 10 40268 10 3 2 14877 10 c = 0.4, a= 12.961, b = 118.864 2 2773 R = 0.985824 1 1 36 c = 0.45, a= 1.251, b = 12.573 2 R = 0.999378 10 1 c data in log−y scale SE model fit data in log−y scale SE model fit 10 10 10 c 1 0 10 2 132 c # of References (log scale) – User may fetch and view only once 10 c Large streaming media objects 142337 # of References (y scale) • 10 data in log−log scale SOSP’03 5 # of References (y scale) – Users fetch an object multiple times 3 299 data in log−log scale # of References (log scale) • 0 2 10 Rank (log scale) 3 10 Streaming media File size 300 MB, c = 0.4 10 1 0 10 1 10 0 2 10 Rank (log scale) 3 10 4 10 10 KaZaa network File size 300 MB, c = 22 0.45 Why media access pattern is not Zipf • “Rich-get-richer” phenomenon Popular pages can attract more users Pages update to keep popular Yahoo ranks No.1 more than six years Zipf-like for long duration • Media accesses are different – Popularity decreases with time exponentially – Media objects are immutable – Rich-get-richer not present – Non-Zipf in long duration Number of distinct objects – – – – 2 10 1 10 0 3 103Web 10 raw data Video number of .torrent file downloading • Web accesses are Zipf CCDF of req (log) – Pareto, power law, … – The structure of WWW BitTorrent media file 3 10 ------ rawlinear datafit ------ linear fit 2 102 10 1 10 10 16 10 0 10 1 00 10 10 00 1 50 1001 150 200 100 200 10 torrent birth (day) time after Popularity rank 2250 10 Time after object birth (day) Number of distinct weekly top N popular objects in 16 weeks Top 1 Web object never changes Top 1 video object changes every week 23 Outline • Motivation and objectives • Stretched exponential model of Internet media traffic • Dynamics of access patterns in media systems • Caching implications • Concluding remarks 24 Dynamics of Access Patterns in Media Systems • Media reference rank distribution in log-log scale – Different systems have different access patterns – The distribution changes over time in a system (NOSSDAV’02) • All follow stretched exponential distribution – Stretch factor c – Minus of slope a yc • Physical meanings c: stretch factor slope: -a b – Media file sizes – Aging effects of media objects – Deviation from the Zipf model log i 25 Stretched factors of different systems Streaming systems, different file sizes 0.60 Stretch factor c Stretch factor c KaZaa systems, different file sizes 0.50 0.40 0.30 0.20 0.10 0.00 5 MB 300 MB 0.60 0.50 0.40 0.30 0.20 0.10 0.00 2.25 MB Median file size streaming 300 MB 300 MB Median file size 300 MB Different systems, similar file sizes Stretch factor c Stretch factor c P2P 120 MB Median file size Different systems, similar file sizes 0.60 0.50 0.40 0.30 0.20 0.10 0.00 4.5 MB 0.60 0.50 0.40 0.30 0.20 0.10 0.00 streaming 2.25 MB P2P 5 MB Median file size 26 Stretched factor and media file sizes Stretch factor c 0.60 EDU 0.50 file size vs. stretch factor c 0.40 0.30 • 0 – 5 MB: c <= 0.2 • 5 – 100 MB: 0.2 ~ 0.3 • > 100 MB: c >= 0.3 BIZ 0.20 0.10 c increases with file size 0.00 1 10 100 1000 Median file size (MB) • Other factors besides file size – Different encoding rates and compression ratios – Video and audio are different – Different content type: entertainment, educational, business 27 Conservations in dynamic media systems slope = λreq slope = λobj system upgrade • Media requests over time – Constant media request rate λreq – Constant object birth rate λobj Number of accessed objects: N (t ) = λobj t + N ′(t ) = λobj t + O (log t ) Objects created in [0, t) Objects created in (-∞, 0] 28 Stretched exponential parameters • In a media system yc – Constant request rate – Constant object birth rate – Constant median file size slope: -a b • Stretch factor c is a time invariant constant • Parameter a increases with time log i ⎡ λreq 1 1 ⎤ a=⎢ ⎥ N ′(t ) 1 ⎢⎣ λobj 1 + λobj t Γ(1 + c ) ⎥⎦ t →∞: λreq λobj y → ; a → ⎡ λreq ⎣ obj λ 1 Γ (1+ 1c ) ⎤ ⎦ c c 29 Evolution of media reference rank distribution IFILM-06 BT-03 ST-SVR-01 7.4 0.9 0.75 7.2 0.7 7 0.65 0.75 0.7 Parameter a 6.8 Parameter a Parameter a 0.8 6.6 6.4 6.2 6 0.65 0.6 Slow down 0.55 0.5 system upgrade 5.8 0.6 0.45 5.6 0.55 0 20 40 60 80 Time (day) 100 5.4 0 120 4641589 20 30 Time (day) 40 0.4 0 50 7 days SE model 28 days .. 35 days .. 48 days .. 1850 c # of References (y scale) c 154055 10 10321 1205 0 1 10 2 10 3 10 Rank (log scale) 4 10 40 60 80 Time (day) 100 120 a increases with time Caused by objects created before workload collection 693 318 a converge to a constant 84 102 10 20 2627 1 weeks SE model 4 weeks .. 8 weeks .. 16 weeks .. 1048576 # of References (y scale) Parameter a 0.85 5 10 0 10 1 10 2 10 Rank (log scale) Reference rank distribution 3 10 4 10 30 Deviation from the Zipf model 4 | EF | → 1 when a log N → ∞ | OE | 10 Number of references (log scale) C 3 10 A F 2 10 (X , Y ) 0 E 0 1 10 0 O 10 0 10 B 1 2 D 3 10 10 10 File reference rank (log scale) 4 10 ⎡ λreq 1 1 ⎤ a=⎢ ⎥ N ′(t ) 1 ⎢⎣ λobj 1 + λobj t Γ(1 + c ) ⎥⎦ c 0.5 0.45 • a increases with c (c < 2) • a increases with λreq/λobj • a increases with t 0.4 |EF|/|OE| 0.35 0.3 0.25 0.2 0.15 Big media files have large deviation Deviation increases with time 0.1 N = 1000 N = 10000 N = 100000 0.05 0 −1 10 0 10 Parameter a 1 10 31 Example: YouTube Video Measurements in IMC’07 YouTube video downloads in a campus network Campus users: small request rate λreq Old object dominant Total number of downloads of each video clip, for all YouTube up time Global users: large request rate λreq ⎡ λreq 1 1 ⎤ a=⎢ ⎥ N ′(t ) 1 ⎢⎣ λobj 1 + λobj t Γ(1 + c ) ⎥⎦ Small a, small deviation c No old objects Large a, large deviation 32 Outline • Motivation and objectives • Stretched exponential model of Internet media traffic • Dynamics of access patterns in media systems • Caching implications • Conclusion 33 Caching analysis methodology • Analyze caching with reference rank distribution – Requests are independent – Objects occupy unit storage volume • Optimal hit ratio – Unlimited cache H opt = 1 # of hits # of reqs - # of objs N y − N = = = 1− N y y # of reqs # of reqs – N objects, cache size is ηN • Zipf-like distribution H zf (η ) = η 1−α − η (1 − α ) for α < 1 • Stretched exponential H se (η ) = Γ (1+ 1c ) −γ (1+ 1c , 1a − logη ) Γ (1+ 1c ) −γ (1+ 1c , 1a ) − η y 34 Modeling caching performance 0.9 Asymptotic analysis for small cache size k (k << N) 0.8 cache hit ratio 0.7 • Zipf H zf ( k k 1− α 1 ) = ∑ α × 1−α N N i =1 i • SE H se ( k k (log N ) c )= × N y N 0.6 Web 0.5 0.4 1 0.3 media 0.2 Zipf (α = 0.8) SE (c = 0.2, a = 0.25) 0.1 −2 10 −1 1 0 10 η 10 Parameter selection Zipf: typical Web workload (α=0.8) SE: typical streaming workload (c = 0.2, a = 0.25, same as ST-CLT-05) H (k) (log N ) c lim se Nk = lim c1 =0 N →∞ H ( ) N →∞ Nα zf N Media caching is far less efficient than Web caching 35 Potential of long term media caching CDF of requests CDF of object ages 1 old objects new objects old objects 5000 3000 2000 • 2 4 6 8 Time (day) 10 • 0.7 0.6 0.4 • 0.3 PS−CLT−04 ST−CLT−04 ST−CLT−05 0.1 12 0 0 500 1000 Time (day) Requests dominated by old objects N’(t) >> λobjt Long term – Requests dominated by new objects λobjt >> N’(t) 1500 Optimal hit ratio of caching 10% objects – PS-CLT-04: 0.52 for 9 days, 0.85 maximal – ST-CLT-04: 0.48 for 9 days, 0.84 maximal – ST-CLT-05: 0.54 for 11 days, 0.85 maximal • Short term – 0.5 0.2 new objects 0 0 λobj ⎛ N ′(t ) ⎞ 1 ⎜1 + ⎟ = 1− y (t ) λreq ⎜⎝ λobj t ⎟⎠ 0.8 4000 1000 H opt = 1 − 0.9 6000 CDF of old object ages Cumulative number of accessed objects 7000 Great improvement when λobjt >> N’(t) Request correlation can be further exploited – Object popularity decreases with time 36 Long time to reach optimal 0.8 50% 0.6 0.4 200 days 0.2 0 0 10 • 1 PS−CLT−04 ST−CLT−04 ST−CLT−05 1 10 2 10 Age of object (day) 3 10 Cumulative number of requests Cumulative number of requested objects 1 0.8 50% 0.6 0.4 150 days 0.2 0 0 10 PS−CLT−04 ST−CLT−04 ST−CLT−05 1 10 2 10 Age of object (day) 3 10 Media objects have long lifespan – Most requested objects are created long time ago – Most requests are for objects created long time ago • To achieve maximal concentration – Very long time (months to years) – Huge amount of storage – Only peer-to-peer systems provide such a huge space with a long time 37 Summary • Media access patterns do not fit Zipf model • We give reasons why previous results were confusing • Media access patterns are stretched exponential • Our findings imply that – Client-server based proxy systems are not effective to deliver media contents – P2P systems are most suitable for this purpose • We provide an analytical basis for the effectiveness of a P2P media content delivery infrastructure 38 Stretched Exponential Distribution: Decentralized Content Delivery in Internet • Centralized Internet accesses follows zipf • Decentralized Internet accesses (in an organized way, such as P2P) follow SE • Other P2P-like accesses fitting SE reported since PODC’08 – IPTV, user channel selection distribution (SIGMETRICS’09) – PPLive, P2P streaming request distribution (ICDCS’09) – Wikipedia, Yahoo answers, social network posting distribution (KDD’09) – Access distribution in PPStream is converting from zipf (2007) to stretched exponential (2009) (a report from Nanjing Statistical Instutute) 39 References The stretched exponential distribution, PODC’08 Social network contributors’ distribution, KDD’09 PSM-throttling, streaming in WLAN with low power, ICNP’07 SCAP, wireless AP caching for streaming, ICDCS’07. Quality and resource utilization of Internet streaming, IMC’06 Internet streaming workload analysis, WWW’05 Measuring and modeling BitTorrent, IMC’05 Sproxy, caching for streaming, INFOCOM’04 40 Caching effect on SE distribution • Pages can be cached by web browser and proxy IFILM-06 (1-16 weeks) 4641589 1 weeks SE model 4 weeks .. 8 weeks .. 16 weeks .. 154055 • Page reload events can be accumulated over time – Trivial in one week – Increase with time gradually 10321 102 0 1 10 2 10 3 10 4 10 • Number of affected objects is small 5 10 10 Rank (log scale) – Movie replay events are not common Page clicks of movie trailers, published in IFILM Web site 41 Client requests with time PS-CLT-04 (9 days) ST-CLT-04 (9 days) 4 x 10 Cumulative number of accessed objects 5 4 3 2 1 0 0 2 4 6 Time (day) ST-CLT-05 (11 days) 8000 old objects new objects 8 10 7000 7000 old objects new objects Cumulative number of accessed objects 6 Cumulative number of accessed objects # of References (yc scale) 1048576 6000 5000 4000 3000 2000 1000 0 0 2 4 6 Time (day) 8 10 old objects new objects 6000 5000 4000 3000 2000 1000 0 0 2 4 6 8 Time (day) 10 12 Cumulative number of requested objects born before and after workload collection Before workload collection After workload collection Both are linear N’(t) ∝ t ⎡ λreq 1 1 ⎤ a=⎢ ⎥ N ′(t ) 1 ⎢⎣ λobj 1 + λobj t Γ(1 + c ) ⎥⎦ c constant In short duration, media reference rank distribution is stationary42 Number of references (y scale) 1845 1845 1024 c 1024 c Number of references (y scale) Segment-based streaming media caching 525 243 98 32 8 1 0 10 1 2 525 243 M 32 8 1 0 10 3 10 10 Object reference rank (log scale) 98 10 1 2 3 4 5 10 10 10 10 Segment reference rank (log scale) • Streaming media are often partially accessed • “Ideal” segment reference rank distribution 10 – Segment caching is efficient – M segments per object, no partial access – Two mode SE distribution 43 Segment-based streaming media caching ST-SVR-01 ST-CLT-04 59049 3125 1024 243 32 98 32 c = 0.2 1st mode: a= 0.274, b = 6.886 2nd mode: a= 0.587, b = 8.532 1 0 10 1 10 2 10 3 10 8 c = 0.2 1st mode: a= 0.097, b = 2.775 2nd mode: a= 0.186, b = 3.220 4 10 raw data SE model 243 c 7776 243 # of References (y scale) 16807 525 raw data SE model # of References (yc scale) # of References (yc scale) 32768 ST-CLT-05 525 raw data SE model 5 10 Rank (log scale) 1 0 10 1 10 2 10 3 98 32 8 c = 0.2 1st mode: a= 0.128, b = 2.999 2nd mode: a= 0.215, b = 3.461 4 10 10 Rank (log scale) 5 10 1 0 10 1 10 2 10 3 4 10 10 Rank (log scale) 5 10 • Segmentation: 5 seconds of media data • Segments rank distribution: two-mode SE – Same stretch factor c – Smaller a than object rank distribution • Less temporal locality 44 Segment caching performance ST-CLT-04 ST-CLT-05 LRU efficiency 0.7 0.7 0.12 0.6 0.6 0.1 0.5 0.5 SE model Zipf−like model 0.3 0.2 • 0.2 0.4 0.6 0.8 Normalized Cache Size (η) 0.04 Optimal object hit ratio Optimal segment hit ratio LRU segment hit ratio Perfect LFU segment hit ratio 0.1 1 0 0 0.06 0.2 0.4 0.6 0.8 Normalized Cache Size (η) Much lower than object hit ratio 0.02 1 0 0 0.2 0.4 opt 0.6 0.8 1 η η(i): minimum cache size to hold object i LRU is less efficient for media caching – – • 0.3 Segment hit ratio ≈ byte hit ratio – • 0.4 0.2 Optimal object hit ratio Optimal segment hit ratio LRU segment hit ratio Perfect LFU segment hit ratio 0.1 0 0 ηLRU − ηopt Hit Ratio Hit Ratio 0.08 0.4 Less request concentration Larger working set Segment LFU is even worse – Sequential order in an object not captured Conventional replacement policies are not efficient 45
© Copyright 2024