EE382A
Advanced Processor Architecture

Christos Kozyrakis & John Shen
Department of Electrical Engineering
Stanford University
http://eeclass.stanford.edu/ee382a

EE382A – Spring 2009 Lecture 1 - 1 Christos Kozyrakis

A Few Words About Christos
• Associate professor of EE & CS
  – Ph.D. from U.C. Berkeley
  – B.Sc. from University of Crete
• Current research
  – Parallel systems (scheduling, TM)
  – Energy efficient data-centers
  – Security systems
  – More info at http://csl.stanford.edu/~christos
• Systems I have worked on
  – Networking chips: ATLAS & Telegraphos switches
  – Processor chips: VIRAM media-processor
    • 125 million transistors, 9.6 billion ops/sec
  – FPGA prototypes: Raksha & Atlas
  – Server prototypes: CoolSort
[Figures: VIRAM media-processor; IRAM test chip; ATLAS ATM Switch; Telegraphos DSM switch; Raksha Security System; ATLAS TM System]

A Few Words About John
• Head of Nokia Research Center in Palo Alto
  – Ph.D. from USC
  – B.Sc. from University of Michigan
• Prior to Nokia
  – Director of the Microarchitecture Research Lab (MRL) at Intel
    • Superscalar architecture, speculative multithreading and memory prefetching, 3D die-stacking technology, and heterogeneous multi-sequencer architectures
  – Professor of Computer Engineering at CMU
• Author of the main textbook for EE382a

EE382a Team
• Instructors: Christos Kozyrakis & John Shen
• Teaching assistant: David Signiorelli
• Guest lectures: Ben Lee + one more
• Administrative support: Teresa Lynn
• Contact info & office hours: up-to-date info on class webpage
  – http://eeclass.stanford.edu/ee382a
  – Check frequently

You…
• Class participation is EXTREMELY important in EE382a
• Your goals
  – Ask questions
  – Offer answers
  – Suggest discussion topics
  – Make us learn your name
• Will take and post photos of everyone next week

Class Basics
• Lectures: Mo & We, 11am-12.15pm, Hewlett 101
  – There will also be some discussion sessions on Fridays
    • Friday 2-3pm, Gates Hall 498
    • Discussion sessions will be explicitly announced
  – The class is not available on SCPD this quarter
• Web page: http://eeclass.stanford.edu/ee382a
  – Announcements, handouts, office hours, latest schedule, bulletin board
  – Check frequently
  – Sign up on the webpage for on-line access to grades
    • We will let you know when registration is open…

The Bulletin Board
• The preferred way to ask class-related questions
  – We promise to check & answer often, especially close to deadlines
  – We encourage you to contribute to answers & have on-line discussions on class material
• The bulletin rules
  – Before posting a new question
    • Check if the question has already been asked or even answered
      – Use the search capabilities of your web browser
    • Check the FAQ page for the assignment
  – Choose an appropriate subject for your question
    • E.g. “HW2, problem 3, definition of memory latency”
• For questions not appropriate for the public: send us an email

EE382a Topics
• Pipelining overview and analysis
• Architectures for instruction level parallelism
  – Superscalar: instruction fetch, branch prediction, dynamic scheduling & register renaming, memory disambiguation
  – VLIW and dynamic binary translation
• Architecture for task and data level parallelism
  – Multithreading, multi-core architectures, vector processing, GPUs, tradeoffs in designing multi-core chips, memory hierarchy for multi-core
• Cross-cutting issues
  – Checkpointed processors, phase-change memory, …

Textbooks and Papers
• Textbooks
  – Required: "Modern Processor Design: Fundamentals of Superscalar Processors", J.P. Shen and M. Lipasti, 1st edition, McGraw-Hill
    • Do not use/buy the beta edition!
  – Reference: “Computer Architecture: A Quantitative Approach”, J. Hennessy & D. Patterson, 4th edition, Morgan Kaufmann
  – Reference: “Computer Organization and Design: The Hardware/Software Interface”, D. Patterson & J. Hennessy, 4th edition, Morgan Kaufmann
• Papers (check handouts link on the webpage)
  – A few required papers
    • These papers are included in the exam materials
    • Have to submit a 1-page paper summary by the next lecture
  – Several optional papers
    • Further in-depth information, references for projects, …

Assignments, Exams, and Class Load
• Single exam and 1+2 homework assignments
• Large research project
  – On an open question in computer architecture
  – Work in groups of up to 3 students
  – See topic suggestions on-line or suggest your own project
  – Milestones: proposal, halfway review/status, presentation, paper…
• Grade breakdown (tentative)
  – Exam 40%, Project 40%, HW + summaries + participation 20%
  – All deadlines are final, no extensions, no exceptions
  – Remember the honor code (more info on web page)
• Warnings
  – This will be a loaded class!!
  – This class will be as good as your participation…

Prerequisites and Registration
• Prerequisites: EE108B or equivalent
  – Expected to know: simple pipelines, basic caching, virtual memory, main memory
    • EE282 is not a required prerequisite
• Class registration:
  – Limited to 30 students; all students must receive instructor’s approval
• Homework 1: prerequisite assessment
  – Due in class on Monday
  – Work on it on your own
  – Will send you email about your registration by Wednesday

Should I Take EE382A?
• Good reason to take EE382A
  – Prepare for research in computer architecture
  – Broaden your Ph.D. research perspective
  – Become a digital systems architect in industry
  – Honest curiosity (how do Intel/AMD/… processors work?)
  – Want to take a class with a research project
• Not a good reason to take EE382A
  – Prepare for quals, comps, etc…
  – Need another course for your degree program
    • “EE382A is supposed to be an easy A, right?”
  – Learn about digital circuits and CAD tools

On Reading & Summarizing Papers
• Look for the following
  – The issue or problem addressed by the paper
  – The original contributions (real or claimed, you have to check)
  – Critique: what are the major strengths and weaknesses of the paper?
    • Look at the claims and assumptions, the methodology, the analysis of data, and the presentation style
  – Future work: what are the natural extensions or improvements to this work?
    • Or, can we apply a similar methodology to other problems of interest?
• Do not submit the paper abstract as your summary :)
• Helpful tips
  – Read the abstract, introduction, and conclusions sections first.
  – Read the rest of the paper twice
    • First a quick pass to get a rough idea of the details, then a detailed reading
  – Underline/highlight the important parts of the paper
  – Keep notes in the paper margins about comments or questions
    • Important insights, questionable claims, relevance to other topics, ways to improve some technique etc.
  – Look up references that seem to be important or missing
    • In some cases, you may also want to check who cites this paper and how

EE382A Lecture 1:
Introduction to Advanced Processor Architecture
Department of Electrical Engineering
Stanford University
http://eeclass.stanford.edu/ee382a

Historical Perspectives on Processors
• The Decade of the 1970’s: “Birth of Microprocessors”
  – Programmable Controller
  – Single-Chip Microprocessors
  – Personal Computers (PC)
• The Decade of the 1980’s: “Quantitative Architecture”
  – Instruction Pipelining
  – Fast Cache Memories
  – Compiler Considerations
  – Workstations
• The Decade of the 1990’s: “Instruction-Level Parallelism”
  – Superscalar, Speculative Microarchitectures
  – Aggressive Compiler Optimizations
  – Low-Cost Desktop Supercomputing

Performance Growth
• Doubling every 18 months (1982-2000):
  – total of 3,200X
  – Cars travel at 176,000 MPH; get 64,000 miles/gal.
  – Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
  – Wheat yield: 320,000 bushels per acre
• Doubling every 24 months (1971-2001):
  – total of 36,000X
  – Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
  – Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
  – Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!!
[John Crawford, Intel, 1993]

Convergence of Key Enabling Technologies
• CMOS VLSI:
  – Submicron feature sizes: 0.3µm → 0.25µm → 0.18µm → 0.13µm → 90nm → 65nm → 45nm…
  – Metal layers: 3 → 4 → 5 → 6 → 7 (copper) → 12 …
  – Power supply voltage: 5V → 3.3V → 2.4V → 1.8V → 1.3V → 1.1V …
• CAD Tools:
  – Interconnect simulation and critical path analysis
  – Clock signal propagation analysis
  – Process simulation and yield analysis/learning
• Architecture & Microarchitecture:
  – Superpipelined and superscalar machines
  – Speculative and dynamic microarchitectures
  – Simulation tools and emulation systems
• Compilers:
  – Extraction of instruction-level parallelism
  – Aggressive and speculative code scheduling
  – Object code translation and optimization

Evolution of Single-Chip Processors
                     1970’s      1980’s      1990’s        2010
Transistor Count     10K-100K    100K-1M     1M-100M       0.5-1B
Clock Frequency      0.2-2MHz    2-20MHz     20MHz-1GHz    1-5GHz
Instruction/Cycle    < 0.1       0.1-0.9     0.9-2.0       10
MIPS or MFLOPS       < 0.2       0.2-20      20-2,000      100,000
Watt                 < 2         < 10        < 40          1-100+ (?)
CPUs/chip            1           1           1             4-10

Aspects of Computer Architecture
• ARCHITECTURE (instruction set architecture)
  – programmer/compiler view
  - “Functional appearance to its immediate user/system programmer”
• IMPLEMENTATION (microarchitecture)
  – processor designer view
  - “Logical structure or organization that implements the instruction set”
• DESIGN (chip realization)
  – chip/system designer view
  - “Physical structure that embodies the implementation”

Our Objective for this Quarter
• The “What’s-How’s-Why’s” of Processor Design
1. Knowledge (“what’s”)
   - Technology
   - Techniques
2. Design Skills (“how’s”)
   - Critical Issues
   - Trade-off Intuitions
3. Understanding (“why’s”)
   - Deeper Insights
   - Fundamental Principles

Basic Tools and Principles for Architects

Amdahl’s Law
• $\text{Speedup} = \text{time}_{\text{without enhancement}} / \text{time}_{\text{with enhancement}}$
• Suppose an enhancement speeds up a fraction $f$ of a task by a factor of $S$
  $\text{time}_{new} = \text{time}_{old} \cdot \left( (1-f) + f/S \right)$
  $S_{overall} = 1 / \left( (1-f) + f/S \right)$
[Figure: $\text{time}_{old}$ split into $(1-f)$ and $f$; $\text{time}_{new}$ split into $(1-f)$ and $f/S$]

Amdahl’s Law (continued)
• Real life analogy: After driving through 60 minutes of traffic jam, how much time can you make up by speeding in the final mile?
• Applications in Computer Architecture
  – RISC - Reduced Instruction Set Computer
  – Optimized to execute frequently used instructions quickly
  – Infrequently used instructions take longer, or are even emulated in SW
We should concentrate efforts on improving frequently occurring events or frequently used mechanisms

Pipelining
• Latency: Elapsed time from start to completion of a particular task
• Throughput: How many tasks can be completed per unit of time
• A pipeline is like an assembly line!
[Figure: start → stage1 → stage2 → stage3 → stage4 → stage5 → finish]
• Pipelining only improves throughput
  – Latency: each job still takes 5 cycles to complete
  – Throughput: 1 job per cycle if pipelined vs. 1 job per 5 cycles if not pipelined

Pipelining (continued)
• Real life analogy: Henry Ford’s automobile assembly line.
• Example in computer architecture:
  – 5-stage Instruction Execution Pipeline
  – Fetch-Decode-Execute-Memory-Writeback

Stages       t0   t1   t2   t3   t4   t5   t6   t7   ....
Fetch        I1   I2   I3   I4   I5
Decode            I1   I2   I3   I4   I5
Execute                I1   I2   I3   I4   I5
Memory                      I1   I2   I3   I4   I5
Writeback                        I1   I2   I3   I4   I5

Parallel Processing
• Parallelism - the amount of independent sub-tasks available
• If sub-tasks are independent, the order in which they are carried out does not matter
• Thus by executing the independent sub-tasks concurrently, we can finish the entire task faster
Improve Speedup!!!

Parallel Processing
• Real life analogy: collaboration on problem sets (although not always encouraged)
• Examples in computer architecture:
  – Parallel computers
  – Superscalar processors
  – Multi-core processors

Out-of-order Execution
• Specification (or Program) Order vs Dataflow Order
• Dataflow: Data-driven scheduling of events
  – The start of an event should be enabled by the availability of its required input (data dependency)
  – The completion of an event will produce an output that will enable the start of other events
x = a + b; y = b * 2
z = (x-y) * (x+y)
[Figure: dataflow graph. Inputs a and b feed "+" (producing x) and "*2" (producing y); x and y feed "-" and "+"; their results feed the final "*"]

Out-of-order Execution
• Real life analogy:
  – A tip on taking tests: work on the questions you know first
• Examples in computer architecture
  – Most modern microprocessors (Intel P4, Opteron etc) schedule instruction execution in dataflow order

Work and Critical Path
• Work $T_1$ - time to complete a computation on a sequential system
• Critical Path $T_\infty$ - time to complete the same computation on an infinitely-parallel system
• Average Parallelism
  $P_{avg} = T_1 / T_\infty$
• For a p wide system
  $T_p \ge \max\{ T_1/p,\ T_\infty \}$
  $P_{avg} \gg p \Rightarrow T_p \approx T_1/p$
x = a + b; y = b * 2
z = (x-y) * (x+y)
[Figure: the same dataflow graph as in Out-of-order Execution]

Work and Critical Path
• Real life analogy: undergraduate degree requirements
  – Work = unit requirement
  – Critical Path to graduation is determined by course sequences and their prerequisites
    • Added constraints: classes are only available in specific quarters…
• Applications to computer architecture
  – Parallel job scheduling
  – Given a collection of inter-dependent tasks:
    • How many resources should be allocated?
    • Which sequence of tasks should be given priority?

Speculation
Is it possible to parallelize the critical path? i.e. violate data dependence?
• Guess the outcome of an operation from its inputs without performing the operation
• Even better, guess the outcome of an operation before the inputs to the operation are even known
• Speculation techniques must also include mechanisms for
  1. Checking if the guesses are correct
  2. Undoing “speculative execution” after wrong guesses

Speculation (continued)
• Real life analogy:
  – Another tip on taking tests: You can often guess what is going to be on an exam by looking at lectures and HWs.
• Examples in computer architecture
  – Circuit-level speculations: Carry Select Adder
  – Architectural-level speculations
    • Branch target predictions
    • Load value predictions
    • Speculative loop execution

Locality Principle
• One’s recent past is a very good indication of his near future
  – Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon
  – Spatial Locality: If you just did something, it is very likely you will do something related or similar next
• Locality == Patterns == Predictability
  – Converse:
    • Anti-locality: If you haven’t done something for a very long time, it is very likely you won’t do it in the near future either

Locality Principle (continued)
• Real life analogy:
  – spatial locality - where you choose to sit in a room
  – temporal locality - will you be here again next week?
• Examples in computer architecture:
  – Execution of program loops
    • Spatial locality - after you execute an instruction, with very good probability, you will execute the next instruction
    • Temporal locality - you are very likely to repeat the same instructions many times

Memoization
• If something is expensive to compute, you might want to remember the answer for a while, just in case you will need the same answer again
Why does memoization work??
• Real life analogy:
  – Keeping a list of frequently used phone numbers by your telephone
• Examples in computer architecture
  – ?

Amortization
• Overhead cost: one-time cost to set something up
• Per-unit cost: cost per unit of operation
  $\text{total cost} = \text{overhead} + \text{per-unit cost} \times N$
• It is often okay to have a high overhead cost if the cost can be distributed over a large number of units ⇒ lower the average cost
  $\text{average cost} = \text{total cost} / N = (\text{overhead} / N) + \text{per-unit cost}$

Amortization (continued)
• Real life analogy: economy of scale
  – Why is pasta sauce cheaper when bought by the gallon?
• Examples in computer architecture: Cache Access Latency
  $T_{miss} = 50$ cycles, $T_{hit} = 1$ cycle
  If on average a cache line is reused $n$ times before being evicted:
  $T_{avg} = \left( T_{miss} + (n-1)\,T_{hit} \right) / n \approx T_{miss}/n + T_{hit}$
  $n = 50 \Rightarrow T_{avg} \approx 2$
  $n = 2 \Rightarrow T_{avg} \approx 25$

Basic Equations and Metrics
• Performance
  – $\text{CPU time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$
  – $\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$
  – Amdahl’s law, amortization
• Cost
  – $\text{Processor cost} = f(\text{die area}^4)$
• Power Consumption
  – $\text{Power} = C \cdot V_{dd}^2 \cdot F + V_{dd} \cdot I_{shortcircuit} \cdot F + V_{dd} \cdot I_{leakage}$
  – $\text{Energy} = \text{Power} \times \text{Time}$
  – $E \cdot D$, $E \cdot D^2$, $E \cdot D^3$, …
• Fault tolerance: MTTF, MTTR, …
• Design complexity: ?

Ready to Learn More?
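
Worked example: the performance equations from the Basic Equations and Metrics slide can be tried out numerically. The sketch below is only an illustration; the function names and the sample inputs in the usage block are assumptions chosen for this example, not part of the lecture.

```python
# Minimal sketch of the lecture's performance formulas.
# All function names are hypothetical helpers made up for this illustration.

def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)


def amortized_access_time(t_miss: float, t_hit: float, n: int) -> float:
    """Average access time for a cache line touched n times: one miss, then n-1 hits."""
    return (t_miss + (n - 1) * t_hit) / n


def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty


def cpu_time(instruction_count: float, cpi: float, clock_cycle_time: float) -> float:
    """CPU time = instruction count * cycles per instruction * seconds per cycle."""
    return instruction_count * cpi * clock_cycle_time


if __name__ == "__main__":
    # Sample inputs are assumptions, except T_miss = 50 and T_hit = 1 from the slides.
    print("Amdahl, f=0.9, S=10:", amdahl_speedup(0.9, 10.0))
    print("T_avg, n=50:", amortized_access_time(50, 1, 50))
    print("T_avg, n=2:", amortized_access_time(50, 1, 2))
    print("AMAT:", amat(1.0, 0.05, 50.0))
    print("CPU time (s):", cpu_time(1e9, 1.5, 1e-9))
```

With the slide's cache numbers, the exact formula gives 1.98 cycles for n = 50 and 25.5 cycles for n = 2, in line with the approximate values of 2 and 25 quoted on the Amortization slide.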
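
Worked example: the Work and Critical Path slide uses the computation x = a + b; y = b * 2; z = (x-y) * (x+y). The sketch below computes the work, the critical path, the average parallelism, and the lower bound on the time for a p-wide system for that dataflow graph. It assumes every operation takes one time unit and that a and b are available at time zero; the operation names and helper functions are hypothetical, chosen for this illustration.

```python
from functools import lru_cache

# Dependence graph for the slide's example: each operation maps to the
# operations whose results it consumes. Names are made up for illustration.
DEPS = {
    "x": (),               # x = a + b
    "y": (),               # y = b * 2
    "diff": ("x", "y"),    # x - y
    "sum": ("x", "y"),     # x + y
    "z": ("diff", "sum"),  # z = (x - y) * (x + y)
}


def work(deps, latency=1):
    """T_1: time on a sequential system, i.e. the sum of all operation latencies."""
    return latency * len(deps)


def critical_path(deps, latency=1):
    """T_inf: length of the longest dependence chain (assumes unit-latency operations)."""

    @lru_cache(maxsize=None)
    def finish_time(op):
        # An operation can start once all of its producers have finished.
        return latency + max((finish_time(d) for d in deps[op]), default=0)

    return max(finish_time(op) for op in deps)


def time_lower_bound(t1, t_inf, p):
    """Lower bound on T_p for a p-wide system: max(T_1 / p, T_inf)."""
    return max(t1 / p, t_inf)


if __name__ == "__main__":
    t1 = work(DEPS)
    t_inf = critical_path(DEPS)
    print("T_1 =", t1, " T_inf =", t_inf, " P_avg =", t1 / t_inf)
    for p in (1, 2, 4):
        print("p =", p, " T_p >=", time_lower_bound(t1, t_inf, p))
```

Under these assumptions the graph has five operations and a three-level dependence chain, so adding width beyond two units cannot push the bound below the critical path.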
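
Worked example: the Pipelining slides contrast latency and throughput for a 5-stage pipeline. The sketch below counts cycles with and without pipelining and rebuilds the stage-by-cycle occupancy for five instructions. It assumes an ideal pipeline (one cycle per stage, no stalls or hazards); the function names are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def unpipelined_cycles(num_jobs, num_stages):
    """Each job runs through all stages before the next one starts."""
    return num_jobs * num_stages


def pipelined_cycles(num_jobs, num_stages):
    """Ideal pipeline: fill time for the first job, then one job completes per cycle."""
    if num_jobs == 0:
        return 0
    return num_stages + num_jobs - 1


def pipeline_schedule(num_jobs, stages):
    """Map each stage to {cycle: instruction}; instruction j enters stage s at cycle j + s."""
    schedule = {stage: {} for stage in stages}
    for job in range(num_jobs):
        for depth, stage in enumerate(stages):
            schedule[stage][job + depth] = f"I{job + 1}"
    return schedule


if __name__ == "__main__":
    n = 5
    total = pipelined_cycles(n, len(STAGES))
    print("unpipelined cycles:", unpipelined_cycles(n, len(STAGES)))
    print("pipelined cycles:  ", total)
    sched = pipeline_schedule(n, STAGES)
    for stage in STAGES:
        row = [sched[stage].get(t, "--") for t in range(total)]
        print(f"{stage:<10}", " ".join(row))
```

Each instruction still spends five cycles in flight, so latency is unchanged; only the total time for the batch shrinks, which is the throughput gain the slide describes.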