Spring 01/02                K. Olukotun / S. Richardson                Handout #2
EE282

EE282 Computer Architecture
Lecture 1: What is Computer Architecture?

Kunle Olukotun and Stephen Richardson
Computer Systems Laboratory
Stanford University

WJD/KO                          EE282 Lecture 1                          1

Logistics

• Lectures: Mon. & Wed., 9:00am-10:15am
• Lecturers: Kunle Olukotun and Stephen Richardson
• TAs: John Kim, Lee Kenyon
• Grading
  – Final exam (1): 35%
  – Midterm (1): 25%
  – Homework (5+2): 40%
• Text: Hennessy & Patterson, Computer Architecture: A Quantitative
  Approach (2nd Edition)
• See Information Sheet for details

EE282 Online

• URL: http://www.stanford.edu/class/ee282
• e-mail: [email protected], [email protected]

Wanted: A Few Good Graders

• Grade a portion of every problem set
• In exchange:
  – requirement to take problem sets is waived
  – full credit on all problem sets

Goals

• Understand how computer systems are organized and why they are
  organized that way
  – application behavior
  – instruction-set architecture
  – microarchitecture
  – memory hierarchy
• Be conversant with measures of computer performance and methods for
  increasing performance
  – metrics
  – benchmarks
  – performance impact of organization
    • pipelining, multiple issue, branch prediction, caches

What is Computer Architecture?

[Figure: the computer architect works at the intersection of
applications, technology, machine organization, and measurement &
evaluation, mediated by interfaces such as the ISA, API, IR, registers,
I/O channels, and links.]
Technology Constraints

• Yearly improvement
  – Semiconductor technology
    • 60% more devices per chip (doubles every 18 months)
    • 15% faster devices (doubles every 5 years)
    • slower wires
  – Magnetic disks
    • 60% increase in density
  – Circuit boards
    • 5% increase in wire density
  – Cables
    • no change
• From 1992 to 2001: 64x more devices, 4x faster devices

Changing Technology Leads to Changing Architecture

• 1970s
  – multi-chip CPUs
  – semiconductor memory very expensive
  – complex instruction sets (good code density)
  – microcoded control
• 1980s
  – 5K - 500K transistors
  – single-chip CPUs, on-chip RAM feasible
  – simple, hard-wired control
  – simple instruction sets
  – small on-chip caches
• 1990s
  – 1M - 64M transistors
  – complex control to exploit instruction-level parallelism
  – super-deep pipelines
• 2000s
  – 100M - 5B transistors
  – slow wires
  – limited cooling capacity
  – managing complexity
  – ????
• Keeps computer architecture interesting and challenging

Application Constraints

• Applications drive machine 'balance'
  – Numerical simulations
    • floating-point performance
    • main memory bandwidth
  – Transaction processing
    • I/Os per second
    • memory bandwidth
    • integer CPU performance
  – Media processing
    • low-precision 'pixel' arithmetic
    • multiply-accumulate rates
    • bit manipulation
  – Embedded control
    • I/O timing
    • real-time behavior
• Architecture concepts often exploit application behavior

The Architecture Process: A Quantitative Approach

[Figure: new concepts are created, their cost and performance are
estimated, and the ideas are sorted into bad, mediocre, and good.]
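The scaling numbers on the Technology Constraints slide can be checked with simple compound-growth arithmetic. The sketch below is not part of the course materials; it just verifies that 60%/year growth really does double about every 18 months, and that nine years of it (1992-2001) gives roughly the 64x the slide quotes.

```python
import math

def doubling_time(annual_growth):
    """Years to double at a compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

def growth_factor(annual_growth, years):
    """Total improvement after `years` of compound growth."""
    return (1 + annual_growth) ** years

# 60%/year more devices doubles roughly every 1.5 years (18 months)
print(round(doubling_time(0.60), 2))     # ~1.47

# 15%/year faster devices doubles roughly every 5 years
print(round(doubling_time(0.15), 2))     # ~4.96

# Over the 9 years from 1992 to 2001:
print(round(growth_factor(0.60, 9)))     # ~69x more devices (slide rounds to 64x)
print(round(growth_factor(0.15, 9), 1))  # ~3.5x faster devices (slide rounds to 4x)
```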
Evaluation Tools

• Benchmarks, traces, & mixes
  – macrobenchmarks & suites
    • application execution time
  – microbenchmarks
    • measure one aspect of performance
  – mixes (e.g., MOVE 39%, BR 20%, LOAD 20%, STORE 10%, ALU 11%)
  – traces (e.g., LD 5EA3, ST 31FF, ..., LD 1EA2, ...)
    • replay recorded accesses
    • memory, branch, register
• Simulation at many levels
  – ISA, microarchitecture, RTL, gate, circuit
    • trade fidelity for simulation rate
• Area and delay estimation
• Analysis
  – e.g., queuing theory
  – back-of-the-envelope

Performance Basics

• Performance equation

    T = I × CPI × t_cy

  – I: instructions executed
  – CPI: clock cycles per instruction
  – t_cy: clock cycle time
  – Need to consider all three elements when making changes
    • ISA, architecture, technology
• Amdahl's Law — speed up a fraction p of the work by a factor S:

    T1 = T0 × [(1 - p) + p/S]

Performance Based on Instruction Types

• Another way to compute CPU time, based on n different instruction
  types with different execution times:

    CPU time = [ Σ_{i=1..n} (CPI_i × IC_i) ] × Clock cycle time

• CPI is now:

    CPI = Σ_{i=1..n} (CPI_i × IC_i / Instruction count)

  where

    Frequency of instruction i = IC_i / Instruction count

CPU Performance Example

• Assume a simple load/store machine with the following instruction
  frequencies:

    Instruction type    Frequency    Cycles
    loads               25%          2
    stores              15%          2
    branches            20%          2
    ALU                 40%          1

• Conditional branches currently use a simple test against zero
  (BEQZ Rn,loc; BNEZ Rn,loc)
• Should we add a complex comparison/branch combination?
  (BEQ Rn,Rm,loc; BNE Rn,Rm,loc)
  – 25% of branches can use the complex scheme and save the preceding
    ALU instruction
  – The CPU cycle time of the new machine has to be 10% longer
  – Will this increase CPU performance?
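The performance equation and the mix-weighted CPI can be sketched in a few lines of Python. This is an illustrative sketch, not course code; the instruction mix is the one from the example above.

```python
def weighted_cpi(mix):
    """CPI = sum over instruction types of CPI_i * (IC_i / Instruction count)."""
    return sum(freq * cycles for freq, cycles in mix.values())

def cpu_time(instruction_count, cpi, cycle_time):
    """Performance equation: T = I * CPI * t_cy."""
    return instruction_count * cpi * cycle_time

# (frequency, cycles) per instruction type, from the example
mix = {
    "loads":    (0.25, 2),
    "stores":   (0.15, 2),
    "branches": (0.20, 2),
    "ALU":      (0.40, 1),
}

print(round(weighted_cpi(mix), 2))            # 1.6
# 1 billion instructions at CPI 1.6 on a 1 GHz clock (t_cy = 1 ns):
print(cpu_time(1e9, weighted_cpi(mix), 1e-9)) # ~1.6 seconds
```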
CPU Performance Example: Solution #1

• Old CPU performance
    CPI_old = 0.25×2 + 0.15×2 + 0.20×2 + 0.40×1 = 1.6
    CPUtime_old = 1.6 × IC_old × CCT_old
• New CPU performance (0.25×0.20 of all instructions are the saved
  ALU instructions)
    CPI_new = [0.25×2 + 0.15×2 + 0.20×2 + (0.40 - 0.25×0.20)×1]
              / (1 - 0.25×0.20)
            = 1.63
    IC_new  = 0.95 × IC_old
    CCT_new = 1.1 × CCT_old
    CPUtime_new = IC_new × CPI_new × CCT_new
                = 1.63 × (0.95 × IC_old) × (1.1 × CCT_old)
                = 1.71 × IC_old × CCT_old
• Answer: Don't do it!

CPU Performance Example: Solution #2

• If you don't need to know CPI_new, count new cycles per old
  instruction instead:
• Old CPU performance
    CPI_old = 0.25×2 + 0.15×2 + 0.20×2 + 0.40×1 = 1.6
    CPUtime_old = 1.6 × IC_old × CCT_old
• New cycles needed for the same instruction mix:
    Cycles_new/IC_old = 0.25×2 + 0.15×2 + 0.20×2 + (0.40 - 0.25×0.20)×1
                      = 1.55
    CPUtime_new = (Cycles_new/IC_old) × IC_old × CCT_new
                = 1.55 × IC_old × (1.1 × CCT_old)
                = 1.71 × IC_old × CCT_old

Megaflops — Floating-Point Intensive Applications

•   MFLOPS = FP op count / (Execution time × 10^6)

• Applies only to floating-point (non-integer) problems
• Assumes that the same number and type of floating-point operations
  are executed on all machines
• Livermore Loop benchmarks use "normalized" MFLOPS based on
  operations in the source code, weighted as follows:

    Original FP operation     Normalized FP ops
    add, sub, cmp, mult       1
    divide, sqrt              4
    exp, sin, etc.            8

Amdahl's Law

• The performance gain from an improvement is limited by how much the
  improved feature is used:

    T1 = T0 × [(1 - p) + p/S]

[Figure: speedup vs. fraction of code vectorizable — speedup stays
modest until the vectorizable fraction approaches 1, then rises
steeply.]

Benchmarks

• Macrobenchmarks
  – application execution time
• Microbenchmarks
  – measure one performance dimension
    • cache bandwidth
    • main memory bandwidth
    • procedure call overhead
    • FP performance
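Both solution paths for the branch-combination example can be reproduced numerically. A minimal sketch (normalizing IC_old and CCT_old to 1, which the slide's ratios permit):

```python
# Old machine: (frequency, cycles) per instruction type
old_mix = {"loads": (0.25, 2), "stores": (0.15, 2),
           "branches": (0.20, 2), "ALU": (0.40, 1)}
cpi_old = sum(f * c for f, c in old_mix.values())   # 1.6
time_old = cpi_old * 1.0 * 1.0                      # IC_old = CCT_old = 1

# New machine: 25% of branches (0.25 * 0.20 of all instructions) absorb
# the preceding ALU instruction; cycle time grows 10%.
saved = 0.25 * 0.20                  # fraction of instructions removed
cycles_new = cpi_old - saved * 1     # Solution #2: 1.55 cycles per old instruction
ic_new = 1.0 - saved                 # 0.95 * IC_old
cpi_new = cycles_new / ic_new        # Solution #1: ~1.63
time_new = cycles_new * 1.0 * 1.1    # = 1.55 * IC_old * (1.1 * CCT_old)

print(round(cpi_new, 2))             # 1.63
print(round(time_new, 3))            # ~1.705, vs. 1.6 for the old machine
print(time_new > time_old)           # True -> don't do it
```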
[Figure: macro- and microbenchmarks plotted against performance
dimensions covered vs. applications covered.]
  – macrobenchmarks measure overall performance, but on just one
    application, so an application suite is needed
  – microbenchmarks give insight into the cause of performance, but
    are not a good predictor of application performance

Some Warnings about Benchmarks

• Benchmarks measure the whole system
  – application
  – input
  – compiler
  – operating system
  – architecture
  – implementation
• Performance on some benchmarks can be misleading
  – kernels
  – synthetic benchmarks
• Popular benchmarks typically reflect yesterday's programs
  – computers need to be designed for tomorrow's programs
• Benchmarks become obsolete
  – SPEC92, SPEC95, SPEC2000
  – TPC-A, TPC-B, TPC-C

Summarizing Performance

• Means

    Arithmetic mean:  (1/n) × Σ_{i=1..n} T_i
      – for times; can be weighted; represents total execution time

    Harmonic mean:    n / Σ_{i=1..n} (1/R_i)
      – for rates (MFLOPS, MIPS); can be weighted

    Geometric mean:   ( Π_{i=1..n} T_i/Tref_i )^(1/n)
                      = exp( (1/n) × Σ_{i=1..n} log(T_i/Tref_i) )
      – consistent independent of the reference machine
      – does not represent total execution time

Cost

• Chip cost is primarily a function of die area
  – increases faster than linearly due to yield
  – a very powerful processor can be built on a small die (10mm)
  – going larger gives diminishing performance returns
  – but ...

    Wafer cost: $2,500.00   Wafer diameter: 200 mm   Wafer area: 31416 mm^2

    Chip size (mm)  Die area (mm^2)  Die/Wafer  Yield  Good die  Cost/Die
     1                 1.00          30971      0.98   30351     $0.08
     5                25.00           1167      0.96    1122     $2.23
     7.5              56.25            499      0.81     406     $6.16
    10               100.00            269      0.62     166     $15.06
    15               225.00            110      0.35      38     $65.79
    17.5             306.25             77      0.27      21     $119.05
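The die-cost table can be approximated with the standard dies-per-wafer and yield formulas from Hennessy & Patterson. The slide does not state which yield model or defect density it used, so the yield parameters below are illustrative assumptions; the gross die counts do match the table, but the per-die costs will only be in the right ballpark.

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Gross dies per wafer: wafer area over die area, minus an
    edge-loss term proportional to the wafer circumference (H&P)."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def die_yield(die_area_mm2, defects_per_mm2=0.005, alpha=3.0):
    """H&P-style yield model; defect density and alpha are assumed,
    not taken from the slide."""
    return (1 + defects_per_mm2 * die_area_mm2 / alpha) ** -alpha

def cost_per_good_die(wafer_cost, wafer_diameter_mm, die_area_mm2):
    gross = dies_per_wafer(wafer_diameter_mm, die_area_mm2)
    return wafer_cost / (gross * die_yield(die_area_mm2))

# 10 mm x 10 mm die on the slide's $2,500, 200 mm wafer:
print(dies_per_wafer(200, 100))     # 269, as in the table
# Cost grows much faster than area: 2.25x the area, roughly 4x the cost
ratio = cost_per_good_die(2500, 200, 225) / cost_per_good_die(2500, 200, 100)
print(round(ratio, 2))
```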
Chip Cost is a Function of Size

[Figure: unpackaged cost ($0 to $250) vs. chip size (0 to 20 mm) —
cost rises much faster than linearly with die edge length.]

Other Costs and Real Examples

• IC testing cost
  – cost/die = (cost/hour × test time) / yield
  – could be $10-$20 or more for complex chips
• IC packaging cost
  – depends on die size, power dissipation, and number of pins
• Chip costs: Microprocessor Report, March 25, 1996

Cost

[Figure: cost comparison, Microprocessor Report, March 2001]

Performance

[Figure: performance comparison, Microprocessor Report, March 2001]