EE382A Advanced Processor Architecture Christos Kozyrakis & John Shen Department of Electrical Engineering

EE382A
Advanced Processor Architecture
Christos Kozyrakis & John Shen
Department of Electrical Engineering
Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Spring 2009
A Few Words About Christos
• Associate professor of EE & CS
– Ph.D. from U.C. Berkeley
– B.Sc. from University of Crete
• Current research
– Parallel systems (scheduling, TM)
– Energy efficient data-centers
– Security systems
– More info at http://csl.stanford.edu/~christos
• Systems I have worked on
– Networking chips: ATLAS & Telegraphos switches
– Processor chips: VIRAM media-processor
– FPGA prototypes: Raksha & Atlas
– Server prototypes: CoolSort
[Figures: VIRAM media-processor; IRAM test chip; ATLAS ATM switch; Telegraphos DSM switch; Raksha security system; ATLAS TM system. Caption: 125 million transistors, 9.6 billion ops/sec]
A Few Words About John
• Head of Nokia Research Center in Palo Alto
– Ph.D. from USC
– B.Sc. from University of Michigan
• Prior to Nokia
– Director of the Microarchitecture Research Lab (MRL) at Intel
• Superscalar architecture, speculative multithreading and memory prefetching,
3D die-stacking technology, and heterogeneous multi-sequencer architectures
– Professor of Computer Engineering at CMU
• Author of the main textbook for EE382a
EE382a Team
• Instructors: Christos Kozyrakis & John Shen
• Teaching assistant: David Signiorelli
• Guest lectures: Ben Lee + one more
• Administrative support: Teresa Lynn
• Contact info & office hours: up-to-date info on class webpage
– http://eeclass.stanford.edu/ee382a
– Check frequently
You…
• Class participation is EXTREMELY important in EE382a
• Your goals
– Ask questions
– Offer answers
– Suggest discussion topics
– Make us learn your name
• Will take and post photos of everyone next week
Class Basics
• Lectures: Mo & We, 11am-12.15pm, Hewlett 101
– There will also be some discussion sessions on Fridays
• Friday 2-3pm, Gates Hall 498
• Discussion sessions will be explicitly announced
– The class is not available on SCPD this quarter
• Web page: http://eeclass.stanford.edu/ee382a
– Announcements, handouts, office hours, latest schedule, bulletin board
– Check frequently
– Sign up on the webpage for on-line access to grades
• We will let you know when registration is open…
The Bulletin Board
• The preferred way to ask class-related questions
– We promise to check & answer often, especially close to deadlines
– We encourage you to contribute to answers & have on-line discussions on
class material
• The bulletin rules
– Before posting a new question
• Check if question has already been asked or even answered
– Use the search capabilities of your web browser
• Check the FAQ page for the assignment
– Choose an appropriate subject for your question
• E.g. “HW2, problem 3, definition of memory latency”
• For questions not appropriate for the public: send us an email
EE382a Topics
• Pipelining overview and analysis
• Architectures for instruction level parallelism
– Superscalar: instruction fetch, branch prediction, dynamic scheduling &
register renaming, memory disambiguation
– VLIW and dynamic binary translation
• Architecture for task and data level parallelism
– Multithreading, multi-core architectures, vector processing, GPUs, tradeoffs
in designing multi-core chips, memory hierarchy for multi-core
• Cross-cutting issues
– Checkpointed processors, phase-change memory, …
Textbooks and Papers
• Textbooks
– Required: "Modern Processor Design: Fundamentals of Superscalar
Processors", J.P. Shen and M. Lipasti, 1st edition, McGraw-Hill
• Do not use/buy the beta edition!
– Reference: “Computer Architecture: A Quantitative Approach”, J. Hennessy
& D. Patterson, 4th edition, Morgan Kaufmann
– Reference: “Computer Organization and Design: The Hardware/Software
Interface”, D. Patterson & J. Hennessy, 4th edition, Morgan Kaufmann
• Papers (check handouts link on the webpage)
– A few required papers
• These papers are included in the exam materials
• Have to submit a 1-page paper summary by the next lecture
– Several optional papers
• Further in-depth information, references for projects, …
Assignments, Exams, and Class Load
• Single exam and 1+2 homework assignments
• Large research project
– On an open question in computer architecture
– Work in groups of up to 3 students
– See topic suggestions on-line or suggest your own project
– Milestones: proposal, halfway review/status, presentation, paper…
• Grade breakdown (tentative)
– Exam 40%, Project 40%, HW + summaries + participation 20%
– All deadlines are final, no extensions, no exceptions
– Remember the honor code (more info on web page)
• Warnings
– This will be a loaded class!!
– This class will be as good as your participation…
Prerequisites and Registration
• Prerequisites: EE108B or equivalent
– Expected to know: simple pipelines, basic caching, virtual memory, main memory
• EE282 is not a required prerequisite
• Class registration:
– Limited to 30 students; all students must receive instructor’s approval
• Homework 1: prerequisite assessment
– Due in class on Monday
– Work on it on your own
– Will send you email about your registration by Wednesday
Should I Take EE382A?
• Good reason to take EE382A
– Prepare for research in computer architecture
– Broaden your Ph.D. research perspective
– Become a digital systems architect in industry
– Honest curiosity (how do Intel/AMD/… processors work?)
– Want to take a class with a research project
• Not a good reason to take EE382A
– Prepare for quals, comps, etc…
– Need another course for your degree program
• “EE382A is supposed to be an easy A, right?”
– Learn about digital circuits and CAD tools
On Reading & Summarizing Papers
• Look for the following
– The issue or problem addressed by the paper
– The original contributions (real or claimed, you have to check)
– Critique: what are the major strengths and weaknesses of the paper?
  • Look at the claims and assumptions, the methodology, the analysis of data, and the presentation style
– Future work: what are the natural extensions or improvements to this work?
  • Or, can we apply a similar methodology to other problems of interest?
• Do not submit the paper abstract as your summary :)
• Helpful tips
– Read the abstract, introduction, and conclusions sections first
– Read the rest of the paper twice
  • First a quick pass to get a rough idea of the details, then a detailed reading
– Underline/highlight the important parts of the paper
– Keep notes in the paper margins with comments or questions
  • Important insights, questionable claims, relevance to other topics, ways to improve some technique, etc.
– Look up references that seem to be important or missing
  • In some cases, you may also want to check who cites this paper and how
EE382A Lecture 1:
Introduction to Advanced Processor Architecture
Historical Perspectives on Processors
• The Decade of the 1970’s: “Birth of Microprocessors”
– Programmable Controller
– Single-Chip Microprocessors
– Personal Computers (PC)
• The Decade of the 1980’s: “Quantitative Architecture”
– Instruction Pipelining
– Fast Cache Memories
– Compiler Considerations
– Workstations
• The Decade of the 1990’s: “Instruction-Level Parallelism”
– Superscalar, Speculative Microarchitectures
– Aggressive Compiler Optimizations
– Low-Cost Desktop Supercomputing
Performance Growth
• Doubling every 18 months (1982-2000):
– total of 3,200X
– Cars travel at 176,000 MPH; get 64,000 miles/gal.
– Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
– Wheat yield: 320,000 bushels per acre
• Doubling every 24 months (1971-2001):
– total of 36,000X
– Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
– Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
– Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!!
[John Crawford, Intel, 1993]
Convergence of Key Enabling Technologies
• CMOS VLSI:
– Submicron feature sizes: 0.3 µm → 0.25 µm → 0.18 µm → 0.13 µm → 90 nm → 65 nm → 45 nm …
– Metal layers: 3 → 4 → 5 → 6 → 7 (copper) → 12 …
– Power supply voltage: 5 V → 3.3 V → 2.4 V → 1.8 V → 1.3 V → 1.1 V …
• CAD Tools:
– Interconnect simulation and critical path analysis
– Clock signal propagation analysis
– Process simulation and yield analysis/learning
• Architecture & Microarchitecture:
– Superpipelined and superscalar machines
– Speculative and dynamic microarchitectures
– Simulation tools and emulation systems
• Compilers:
– Extraction of instruction-level parallelism
– Aggressive and speculative code scheduling
– Object code translation and optimization
Evolution of Single-Chip Processors
                     1970’s      1980’s      1990’s        2010
Transistor Count     10K-100K    100K-1M     1M-100M       0.5-1B
Clock Frequency      0.2-2MHz    2-20MHz     20MHz-1GHz    1-5GHz
Instruction/Cycle    < 0.1       0.1-0.9     0.9-2.0       10
MIPS or MFLOPS       < 0.2       0.2-20      20-2,000      100,000
Watt                 < 2         < 10        < 40          1-100+ (?)
CPUs/chip            1           1           1             4-10
Aspects of Computer Architecture
• ARCHITECTURE (instruction set architecture)
– programmer/compiler view - “Functional appearance to its immediate user/system programmer”
• IMPLEMENTATION (microarchitecture)
– processor designer view - “Logical structure or organization that implements the instruction set”
• DESIGN (chip realization)
– chip/system designer view - “Physical structure that embodies the implementation”
Our Objective for this Quarter
• The “What’s-How’s-Why’s” of Processor Design
1. Knowledge (“what’s”)
- Technology
- Techniques
2. Design Skills (“how’s”)
- Critical Issues
- Trade-off Intuitions
3. Understanding (“why’s”)
- Deeper Insights
- Fundamental Principles
Basic Tools and Principles for Architects
Amdahl’s Law
• Speedup = $time_{\text{without enhancement}} / time_{\text{with enhancement}}$
• Suppose an enhancement speeds up a fraction f of a task by a factor of S
$time_{new} = time_{old} \cdot \left( (1 - f) + f/S \right)$
$S_{overall} = 1 / \left( (1 - f) + f/S \right)$
[Figure: the time_old bar is split into (1 - f) and f; the time_new bar is split into (1 - f) and f/S]
Amdahl’s Law (continued)
• Real life analogy: After driving through 60 minutes of traffic jam, how
much time can you make up by speeding in the final mile?
• Applications in Computer Architecture
– RISC - Reduced Instruction Set Computer
– Optimized to execute frequently used instructions quickly
– Infrequently used instructions take longer, or even emulated with SW
We should concentrate efforts on improving frequently occurring events or
frequently used mechanisms
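Amdahl's Law is easy to check numerically. The following is a minimal sketch; the function names and the sample values of f and S are illustrative assumptions, not taken from the lecture.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def speedup_limit(f: float) -> float:
    """Upper bound on the overall speedup as s grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    # Hypothetical (f, S) pairs chosen only for illustration.
    for f, s in [(0.5, 2.0), (0.9, 10.0), (0.9, 1000.0)]:
        print(f"f={f}, S={s}: overall speedup = {amdahl_speedup(f, s):.3f} "
              f"(limit {speedup_limit(f):.1f})")
```

Even with a very large S, the overall speedup stays below 1 / (1 - f), which is why the unenhanced fraction dominates.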
Pipelining
• Latency: elapsed time from start to completion of a particular task
• Throughput: how many tasks can be completed per unit of time
• A pipeline is like an assembly line!
start → stage1 → stage2 → stage3 → stage4 → stage5 → finish
• Pipelining only improves throughput
– Latency: each job still takes 5 cycles to complete
– Throughput: 1 job per cycle if pipelined vs. 1 job per 5 cycles if not pipelined
Pipelining (continued)
• Real life analogy: Henry Ford’s automobile assembly line.
• Example in computer architecture:
– 5-stage Instruction Execution Pipeline
– Fetch-Decode-Execute-Memory-Writeback
Stages      t0  t1  t2  t3  t4  t5  t6  t7  ....
Fetch       I1  I2  I3  I4  I5
Decode          I1  I2  I3  I4  I5
Execute             I1  I2  I3  I4  I5
Memory                  I1  I2  I3  I4  I5
Writeback                   I1  I2  I3  I4  I5
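The pipeline diagram can be regenerated with a few lines of code. This is a sketch that assumes an ideal pipeline (one instruction enters per cycle, no stalls or hazards); the function names are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def pipeline_schedule(num_instructions, stages):
    """Return {stage: {cycle: label}} for an ideal pipeline with no stalls."""
    table = {stage: {} for stage in stages}
    for i in range(num_instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters stage s exactly s cycles after it is fetched.
            table[stage][i + s] = f"I{i + 1}"
    return table


def total_cycles(num_instructions, num_stages, pipelined=True):
    """Cycles needed to finish all instructions."""
    if num_instructions == 0:
        return 0
    if pipelined:
        return num_stages + num_instructions - 1
    return num_stages * num_instructions


def render(table, stages, num_cycles):
    """Format the schedule as a text diagram, one row per stage."""
    rows = ["Stage".ljust(10) + "".join(f"t{c}".ljust(4) for c in range(num_cycles))]
    for stage in stages:
        cells = "".join(table[stage].get(c, "").ljust(4) for c in range(num_cycles))
        rows.append((stage.ljust(10) + cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    n = 5
    cycles = total_cycles(n, len(STAGES))
    print(render(pipeline_schedule(n, STAGES), STAGES, cycles))
    print(f"pipelined: {cycles} cycles, "
          f"not pipelined: {total_cycles(n, len(STAGES), pipelined=False)} cycles")
```

Each instruction still spends 5 cycles in flight (latency is unchanged), but under these assumptions 5 instructions finish in 9 cycles instead of 25.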
Parallel Processing
• Parallelism - the amount of independent sub-tasks available
• If sub-tasks are independent, the order in which they are carried out does
not matter
• Thus by executing the independent subtasks concurrently, we can
finish the entire task faster
Improve Speedup!!!
Parallel Processing (continued)
• Real life analogy: collaboration on problem sets
(although not always encouraged)
• Examples in computer architecture:
– Parallel computers
– Superscalar processors
– Multi-core processors
Out-of-order Execution
• Specification (or Program) Order vs Dataflow Order
• Dataflow: Data-driven scheduling of events
– The start of an event should be enabled by the availability of its required
input (data dependency)
– The completion of an event will produce an output that will enable the start
of other events
x = a + b;
y = b * 2;
z = (x-y) * (x+y)
[Figure: dataflow graph. Inputs a and b feed "+" (producing x) and "*2" (producing y); x and y feed "-" and "+", whose results feed the final "*"]
Out-of-order Execution (continued)
• Real life analogy:
– A tip on taking tests: work on the questions you know first
• Examples in computer architecture
– Most modern microprocessors (Intel P4, Opteron, etc.) schedule
instruction execution in dataflow order
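Data-driven scheduling can be sketched for the small x/y/z example from the out-of-order execution slide. The sketch assumes unit latency for every operation and unlimited functional units; the temporary names d and s (for x-y and x+y) are hypothetical and introduced only for illustration.

```python
# Each operation maps to the values it reads. Inputs a and b are ready at time 0.
OPS = {
    "x": ("a", "b"),  # x = a + b
    "y": ("b",),      # y = b * 2
    "d": ("x", "y"),  # d = x - y   (hypothetical temporary)
    "s": ("x", "y"),  # s = x + y   (hypothetical temporary)
    "z": ("d", "s"),  # z = d * s
}


def dataflow_schedule(ops, inputs, latency=1):
    """Start each operation as soon as all of its source values are available."""
    ready_time = {name: 0 for name in inputs}
    start = {}
    pending = dict(ops)
    while pending:
        progressed = False
        for name, srcs in list(pending.items()):
            if all(src in ready_time for src in srcs):
                # The operation fires when its last input arrives.
                start[name] = max(ready_time[src] for src in srcs)
                ready_time[name] = start[name] + latency
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("cyclic or unsatisfiable dependencies")
    return start


if __name__ == "__main__":
    for op, t in sorted(dataflow_schedule(OPS, ["a", "b"]).items(), key=lambda kv: kv[1]):
        print(f"t={t}: start {op}")
```

Under these assumptions x and y start together, then d and s, then z, regardless of the order in which the statements were written.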
Work and Critical Path
• Work
$T_1$ - time to complete a computation on a sequential system
• Critical Path
$T_\infty$ - time to complete the same computation on an infinitely-parallel system
• Average Parallelism
$P_{avg} = T_1 / T_\infty$
• For a p wide system
$T_p \ge \max\{ T_1/p,\ T_\infty \}$
$P_{avg} \gg p \Rightarrow T_p \approx T_1/p$
x = a + b;
y = b * 2;
z = (x-y) * (x+y)
[Figure: dataflow graph. Inputs a and b feed "+" (producing x) and "*2" (producing y); x and y feed "-" and "+", whose results feed the final "*"]
Work and Critical Path (continued)
• Real life analogy: undergraduate degree requirements
– Work = unit requirement
– Critical Path to graduation is determined by course sequences and their
prerequisites
• Added constraints: classes are only available in specific quarters…
• Applications to computer architecture
– Parallel job scheduling
– Given a collection of inter-dependent tasks:
• How many resources should be allocated?
• Which sequence of tasks should be given priority?
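Work, critical path, and average parallelism can be computed directly from a dependence graph. The sketch below uses the x/y/z example from the Work and Critical Path slide and assumes every operation costs one time unit; the node names d and s (for x-y and x+y) are hypothetical.

```python
# Dependence graph: node -> list of nodes it depends on.
DEPS = {
    "x": [],          # x = a + b
    "y": [],          # y = b * 2
    "d": ["x", "y"],  # x - y   (hypothetical temporary)
    "s": ["x", "y"],  # x + y   (hypothetical temporary)
    "z": ["d", "s"],  # z = d * s
}


def work(deps, cost=1):
    """T1: total time on a sequential system (assumes uniform operation cost)."""
    return cost * len(deps)


def critical_path(deps, cost=1):
    """T_inf: length of the longest dependence chain."""
    memo = {}

    def finish(node):
        if node not in memo:
            memo[node] = cost + max((finish(p) for p in deps[node]), default=0)
        return memo[node]

    return max(finish(n) for n in deps)


def lower_bound_tp(t1, t_inf, p):
    """Lower bound on the time with p-wide execution: max(T1 / p, T_inf)."""
    return max(t1 / p, t_inf)


if __name__ == "__main__":
    t1, t_inf = work(DEPS), critical_path(DEPS)
    print(f"T1 = {t1}, T_inf = {t_inf}, P_avg = {t1 / t_inf:.2f}")
    for p in (1, 2, 4):
        print(f"p = {p}: T_p >= {lower_bound_tp(t1, t_inf, p)}")
```

For this graph the bound stops improving at p = 2, since the critical path of 3 operations cannot be shortened by adding more execution width.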
Speculation
Is it possible to parallelize the critical path?
i.e. violate data dependence?
• Guess the outcome of an operation from its inputs without performing
the operation
• Even better, guess the outcome of an operation before the inputs to
the operation are even known
• Speculation techniques must also include mechanisms for
1. Checking if the guesses are correct
2. Undoing “speculative execution” after wrong guesses
Speculation (continued)
• Real life analogy:
– Another tip on taking tests: You can often guess what is going to be on an
exam by looking at lectures and HWs.
• Examples in computer architecture
– Circuit-level speculations: Carry Select Adder
– Architectural-level speculations
• Branch target predictions
• Load value predictions
• Speculative loop execution
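The carry-select adder mentioned above can be modeled functionally. This is a behavioral sketch, not a circuit description: it assumes an 8-bit adder split into two halves, and the function names are hypothetical. The upper half is computed for both possible carry-in values before the real carry is known, then the correct result is selected.

```python
def ripple_add(a, b, carry_in, width):
    """Add two width-bit values; return (sum modulo 2**width, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width=8):
    """Behavioral model of a two-block carry-select adder for width-bit inputs."""
    half = width // 2
    low_mask = (1 << half) - 1
    a_lo, b_lo = a & low_mask, b & low_mask
    a_hi, b_hi = a >> half, b >> half
    hi_width = width - half

    # Speculate on both outcomes of the lower carry. In hardware these two
    # additions would run in parallel with the lower-half addition.
    hi_if_carry0 = ripple_add(a_hi, b_hi, 0, hi_width)
    hi_if_carry1 = ripple_add(a_hi, b_hi, 1, hi_width)

    lo_sum, lo_carry = ripple_add(a_lo, b_lo, 0, half)

    # Check the guess: keep the matching result and discard the other.
    hi_sum, carry_out = hi_if_carry1 if lo_carry else hi_if_carry0
    return (hi_sum << half) | lo_sum, carry_out


if __name__ == "__main__":
    print(carry_select_add(0x0F, 0x01))
    print(carry_select_add(0xFF, 0x01))
```

Because both outcomes are precomputed, a wrong guess costs only the discarded work; other forms of speculation, such as branch prediction, need an explicit undo mechanism instead.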
Locality Principle
• One’s recent past is a very good indication of one’s near future
– Temporal Locality: If you just did something, it is very likely that you will do
the same thing again soon
– Spatial Locality: If you just did something, it is very likely you will do
something related or similar next
• Locality == Patterns == Predictability
– Converse:
• Anti-locality : If you haven’t done something for a very long time, it is very likely
you won’t do it in the near future either
Locality Principle (continued)
• Real life analogy:
– spatial locality - where you choose to sit in a room
– temporal locality - will you be here again next week?
• Examples in computer architecture:
– Execution of program loops
• Spatial locality - after you execute an instruction, with very good probability, you
will execute the next instruction
• Temporal locality - you are very likely to repeat the same instructions many
times
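The effect of both kinds of locality on a program loop can be illustrated with a toy instruction cache. This is a sketch under stated assumptions: a direct-mapped cache, a hypothetical 16-instruction loop body executed 100 times, and instruction addresses counted in whole instructions. None of these parameters come from the lecture.

```python
def hit_rate(trace, num_lines=8, words_per_line=4):
    """Fraction of accesses that hit in a direct-mapped cache."""
    stored_block = [None] * num_lines
    hits = 0
    for addr in trace:
        block = addr // words_per_line  # neighbours share a block (spatial)
        index = block % num_lines
        if stored_block[index] == block:
            hits += 1                   # the block was fetched earlier
        else:
            stored_block[index] = block # miss: fetch the whole block
    return hits / len(trace)


# Straight-line code: each instruction is executed exactly once.
straight_line = list(range(16))
# Loop: the same 16 instructions are executed 100 times.
loop_trace = [pc for _ in range(100) for pc in range(16)]

if __name__ == "__main__":
    print("spatial only :", hit_rate(straight_line, num_lines=8, words_per_line=4))
    print("temporal only:", hit_rate(loop_trace, num_lines=16, words_per_line=1))
    print("both         :", hit_rate(loop_trace, num_lines=8, words_per_line=4))
```

With multi-word lines the straight-line code already hits on neighbouring instructions; with one-word lines the loop hits only because it repeats; combining both gives the highest hit rate.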
Memoization
• If something is expensive to compute, you might want to remember the
answer for a while, just in case you will need the same answer again
Why does memoization work??
• Real life analogy:
– Keeping a list of frequently used phone numbers by your telephone
• Examples in computer architecture
– ?
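A software version of memoization can be sketched with Python's standard `functools.lru_cache`. The function `slow_function` and the input sequence are hypothetical stand-ins for an expensive computation whose inputs repeat.

```python
from functools import lru_cache

# Counts how many times the expensive body actually runs.
calls = {"count": 0}


@lru_cache(maxsize=128)
def slow_function(n: int) -> int:
    """Stand-in for an expensive computation (hypothetical example)."""
    calls["count"] += 1
    return sum(i * i for i in range(n))


def run(inputs):
    """Evaluate slow_function on each input, reusing remembered answers."""
    return [slow_function(n) for n in inputs]


if __name__ == "__main__":
    results = run([10, 10, 20, 10])
    info = slow_function.cache_info()
    print(results)
    print(f"computed {calls['count']} times, reused {info.hits} times")
```

The remembered answers only pay off when the same inputs come back, which is one plausible link to the locality principle discussed earlier.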
Amortization
• Overhead cost: one-time cost to set something up
• Per-unit cost: cost per unit of operation
total cost = overhead + per-unit cost × N
• It is often okay to have a high overhead cost if the cost can be
distributed over a large number of units
⇒ lower average cost
average cost = total cost / N
= ( overhead / N ) + per-unit cost
Amortization (continued)
• Real life analogy: economy of scale
– Why is pasta sauce cheaper when bought by the gallon?
• Examples in computer architecture:
Cache Access Latency
$T_{miss} = 50$ cycles
$T_{hit} = 1$ cycle
If on average a cache line is used n times before being evicted (one miss followed by n - 1 hits):
$T_{avg} = \left( T_{miss} + (n - 1) T_{hit} \right) / n \approx T_{miss} / n + T_{hit}$
$n = 50 \Rightarrow T_{avg} \approx 2$
$n = 2 \Rightarrow T_{avg} \approx 25$
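The cache latency numbers above can be reproduced with a short calculation. This is a sketch using the slide's values of 50 cycles per miss and 1 cycle per hit as defaults; the function names are illustrative assumptions.

```python
def average_access_time(n, t_miss=50, t_hit=1):
    """Exact average latency: one miss followed by (n - 1) hits."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (t_miss + (n - 1) * t_hit) / n


def approx_access_time(n, t_miss=50, t_hit=1):
    """Approximation T_miss / n + T_hit; it overestimates by T_hit / n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return t_miss / n + t_hit


if __name__ == "__main__":
    for n in (1, 2, 10, 50):
        print(f"n = {n:2d}: exact = {average_access_time(n):.2f} cycles, "
              f"approx = {approx_access_time(n):.2f} cycles")
```

The miss penalty is the overhead being amortized: the more often a line is used before eviction, the closer the average gets to the hit time.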
Basic Equations and Metrics
• Performance
– CPUtime = Instruction Count * CPI * Clock Cycle Time
– AMAT = Hit Time + Miss Rate * Miss Penalty
– Amdahl’s law, amortization
• Cost
– Processor cost = $f(\text{die area}^4)$
• Power Consumption
– Power = $C \cdot V_{dd}^2 \cdot F + V_{dd} \cdot I_{shortcircuit} \cdot F + V_{dd} \cdot I_{leakage}$
– Energy = Power * Time
– $E \cdot D$, $E \cdot D^2$, $E \cdot D^3$, …
• Fault tolerance: MTTF, MTTR, …
• Design complexity: ?
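The performance and power equations above translate directly into code. The sketch below follows the formulas as written on the slide; every numeric input in the example is a made-up value chosen only to exercise the functions.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time = instruction count * CPI * clock cycle time (seconds)."""
    return instruction_count * cpi * clock_cycle_time


def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


def power(c, vdd, f, i_shortcircuit, i_leakage):
    """Switching + short-circuit + leakage terms, as written on the slide."""
    return c * vdd ** 2 * f + vdd * i_shortcircuit * f + vdd * i_leakage


def energy(power_watts, time_seconds):
    """Energy = power * time (joules)."""
    return power_watts * time_seconds


def energy_delay(energy_joules, delay_seconds, exponent=1):
    """E * D^k metric; k = 1, 2, 3 give E*D, E*D^2, E*D^3."""
    return energy_joules * delay_seconds ** exponent


if __name__ == "__main__":
    # Hypothetical workload: 1e9 instructions, CPI of 1.5, 1 ns clock cycle.
    t = cpu_time(1e9, 1.5, 1e-9)
    print(f"CPU time = {t:.2f} s")
    # Hypothetical cache: 1-cycle hit, 5% miss rate, 50-cycle miss penalty.
    print(f"AMAT = {amat(1, 0.05, 50):.2f} cycles")
```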
Ready to Learn More?