EE282 Computer Architecture Lecture 1: What is Computer Architecture? Logistics

Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
EE282
Computer Architecture
Lecture 1: What is Computer Architecture?
Kunle Olukotun and Stephen Richardson
Computer Systems Laboratory
Stanford University
WJD/KO
EE282 Lecture 1
1
Logistics
Lectures
Lecturers
TAs
Mon & Wed. 9:00am-10:15am
Kunle Olukotun and Stephen Richardson
John Kim
Lee Kenyon
Grading
Final Exam
1
35%
Midterm
1
25%
Homework
5+2
40%
Text
Hennessy & Patterson
Computer Architecture: A Quantitative
Approach (2nd Edition)
See Information Sheet for details
WJD/KO
EE282 Lecture 1
2
1
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
EE282 Online
URL:
http://www.stanford.edu/class/ee282
e-mail: [email protected]
[email protected]
WJD/KO
EE282 Lecture 1
3
Wanted: A Few Good Graders
• Grade a portion of every problem set
• In exchange:
– requirement to take problem sets is waived
– full credit on all problem sets
WJD/KO
EE282 Lecture 1
4
2
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Goals
• Understand how computer systems are organized and why they are
organized that way.
–
–
–
–
Application behavior
Instruction-set architecture
Microarchitecture
Memory Hierarchy
• Be conversant with measures of computer performance and methods
for increasing performance
– metrics
– benchmarks
– performance impact of organization
• pipelining, multiple issue, branch prediction, caches
WJD/KO
EE282 Lecture 1
5
Link
I/O Chan
ISA
API
What is Computer Architecture?
Interfaces
Technology
IR
Regs
Machine Organization
Applications
WJD/KO
Computer
Architect
EE282 Lecture 1
Measurement &
Evaluation
6
3
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Technology Constraints
• Yearly improvement
– Semiconductor technology
• 60% more devices per chip
(doubles every 18 months)
• 15% faster devices
(doubles every 5 years)
• Slower wires
1992
1995
1998
– Magnetic Disks
• 60% increase in density
– Circuit boards
• 5% increase in wire density
– Cables
• no change
2001
64x more devices since 1992
4x faster devices
WJD/KO
EE282 Lecture 1
7
Changing Technology leads to
Changing Architecture
• 1970s
• 1990s
– multi-chip CPUs
– semiconductor memory very
expensive
– complex instruction sets (good
code density)
– microcoded control
– 1 M - 64M transistors
– complex control to exploit
instruction-level parallelism
– Super deep pipelines
• 2000s
–
–
–
–
–
• 1980s
– 5K – 500 K transistors
– single-chip CPUs, on-chip
RAM feasible
– simple, hard-wired control
– simple instruction sets
– small on-chip caches
100 M - 5 B transistors
slow wires
limited cooling capacity
Managing complexity
????
Keeps computer architecture interesting and challenging
WJD/KO
EE282 Lecture 1
8
4
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Application Constraints
•
Applications drive machine ‘balance’
– Numerical simulations
• floating-point performance
• main memory bandwidth
– Transaction processing
• I/Os per second
• Memory bandwidth
• integer CPU performance
– Media processing
• low-precision ‘pixel’ arithmetic
• multiply-accumulate rates
• bit manipulation
– Embedded control
• I/O timing
• Real-time behavior
Architecture concepts often exploit application behavior
WJD/KO
EE282 Lecture 1
9
The Architecture Process:
Quantitative Approach
Estimate
Cost &
Performance
Sort
New concepts created
Bad ideas
WJD/KO
EE282 Lecture 1
Mediocre ideas
Good
ideas
10
5
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Evaluation Tools
•
Benchmarks, traces, & mixes
MOVE
BR
LOAD
STORE
ALU
– macrobenchmarks & suites
• application execution time
– microbenchmarks
39%
20%
20%
10%
11%
• measure one aspect of performance
LD 5EA3
ST 31FF
….
LD 1EA2
….
– traces
• replay recorded accesses
–
•
memory, branch, register
Simulation at many levels
– ISA, microarchitecture, RTL, gate, circuit
• trade fidelity for simulation rate
•
•
Area and delay estimation
Analysis
– e.g., queuing theory
– Back-of-the-envelope
WJD/KO
EE282 Lecture 1
11
Performance Basics
• Performance equation
–
–
–
–
I - instructions executed
CPI - clock cycles per instruction
tcy - clock cycle
Need to consider all three elements
when making changes
– ISA, architecture, technology
T = I ¥ CPI ¥ t cy
• Amdahl’s Law
– speedup fraction p by S
p
T1 = T0 È(1 - p) + ˘
ÍÎ
S ˙˚
WJD/KO
EE282 Lecture 1
12
6
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Performance based on instruction types
• Another way to compute CPU time, based on n different
instruction types with different execution times:
n
CPU time = Â ( CPI i ¥ ICi ) ¥ Clock cycletime
i =1
CPI is now:
nÊ
ICi
ˆ
CPI = Â Á CPI i ¥
˜
Instruction
count
Ë
¯
i =1
where
Frequencyof Instruction i =
WJD/KO
ICi
Instruction count
EE282 Lecture 1
13
CPU Performance Example
• Assume a simple load/store machine with the following instruction
frequencies:
Instruction type
loads
stores
branches
ALU
Frequency
25%
15%
20%
40%
Cycles
2
2
2
1
• Conditional branches currently use simple test against zero.
(BEQZ Rn,loc; BNEZ Rn,loc)
• Should we add complex comparison/branch combination?
(BEQ Rn,Rm,loc; BNE Rn,Rm,loc)
– 25% of branches can use the complex scheme and save the preceding ALU
instruction
– The CPU cycle time of the new machine has to be 10% longer
– Will this increase CPU performance?
WJD/KO
EE282 Lecture 1
14
7
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
CPU Performance Example:
Solution #1
• Old CPU performance
CPIold= 0.25x2 + 0.15x2 + 0.20x2 + 0.40x1 = 1.6
CPUtimeold = 1.6 x ICold x CCTold
• New CPU performance
CPInew = 0.25x2 + 0.15x2 + 0.20x2 + (0.40 - 0.25x0.2) x 1
1-0.25x0.2
= 1.63
ICnew = 0.95 x ICold
CCTnew = 1.1 x CCTold
CPUtimenew = ICnew x CPInew x CCTnew
= 1.63 x (0.95xICold) x (1.1xCCTold)
= 1.71 x ICold x CCTold
• Answer: Don’t do it!
WJD/KO
EE282 Lecture 1
15
CPU Performance Example:
Solution #2
• If you don’t need to know CPInew:
• Old CPU performance
CPIold= 0.25x2 + 0.15x2 + 0.20x2 + 0.40x1 = 1.6
CPUtimeold = 1.6 x ICold x CCTold
• New cycles needed for same mix:
CnewPIold = 0.25x2 + 0.15x2 + 0.20x2 + (0.40 - 0.25x0.2) x 1
= 1.55
CPUtimenew= CnewPIold x ICold x CCTnew
= 1.55 x ICold x (1.1 x CCTold)
= 1.71 x ICold x CCTold
WJD/KO
EE282 Lecture 1
16
8
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Megaflops
-- Floating point intensive applications
•
MFLOPS =
FP op count
Execution time ¥106
• Applies only to floating-point (non-integer) problems
• Assumes that the same number and type of floating-point operations
are executed on all machines
• Livermore Loop benchmarks use “normalized” MFLOPS based on
operations in the source code weighted as follows:
Original FP Operation
Normalized FP ops
add, sub, cmp, mult
1
divide, sqrt
4
exp, sin, etc.
8
WJD/KO
EE282 Lecture 1
17
Amdahl’s Law
The performance enhancement of an
improvement is limited by how much the
100
improved feature is used
Speedup vs Vector Fraction
90
80
70
60
Speedup
p
T1 = T0 È(1 - p) + ˘
ÍÎ
S ˙˚
50
40
30
20
10
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fraction of Code Vectorizable
WJD/KO
EE282 Lecture 1
18
9
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Benchmarks
• Macrobenchmarks
• Microbenchmarks
– application execution time
– measure one performance
dimension
cache bandwidth
main memory bandwidth
procedure call overhead
FP performance
Perf. Dimensions
WJD/KO
Applications
– gives insight into the cause of
performance
– not a good predictor of
application performance
Macro
Micro
•
•
•
•
• measures overall
performance, but on just one
application
• Need application suite
EE282 Lecture 1
19
Some Warnings about Benchmarks
• Benchmarks measure the whole
system
–
–
–
–
–
–
application
input
compiler
operating system
architecture
implementation
• Performance on some
benchmarks can be misleading
– kernels
– synthetic benchmarks
• Popular benchmarks typically
reflect yesterday’s programs
– computers need to be designed
for tomorrow’s programs
• Benchmarks become obsolete
– SPEC92, SPEC95, SPEC2000
– TPC-A, TPC-B, TPC-C
WJD/KO
EE282 Lecture 1
20
10
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Summarizing Performance
Means
Arithmetic mean
1 n
 Ti
n i =1
times
Harmonic mean
n
1
Â
i =1 R i
Rates: MFLOPS, MIPS
n
Can be
weighted.
Represents total
execution time
1
Geometric mean
Ê1 n
Ê n Ti ˆ n
Ê T ˆˆ
ÁÁ ’ ˜˜ = expÁÁ  logÁÁ i ˜˜ ˜˜
Ë i =1 Tri ¯
Ë Tri ¯ ¯
Ë n i =1
Consistent
independent of
reference
Does not represent total
execution time
WJD/KO
EE282 Lecture 1
21
Cost
• Chip cost is primarily a function of die area
–
–
–
–
increases faster than linearly due to yield
a very powerful processor can be built on a small die (10mm)
going larger gives diminishing performance returns
but …
Wafer Cost
Wafer Diameter
Wafer Area
2,500.00
200 mm
31416 mm^2
Chip Size Die Area Die/Wafer Yield
Good Die Cost/Die
1
1
30971
0.98
30351
$0.08
5
25
1167
0.96
1122
$2.23
7.5
56.25
499
0.81
406
$6.16
10
100
269
0.62
166
$15.06
15
225
110
0.35
38
$65.79
17.5
306.25
77
0.27
21 $119.05
WJD/KO
EE282 Lecture 1
22
11
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Chip cost is a function of size
$250.00
Unpackaged Cost ($)
$200.00
$150.00
$100.00
$50.00
$0.00
0
2
4
6
8
10
12
14
16
18
20
Chip Size (mm)
WJD/KO
EE282 Lecture 1
23
Other costs and Real Examples
• IC Packaging and testing cost
– cost/die = (cost/hour x test time) / yield
– could be $10-$20 or more for complex chips
– IC Packaging: depends on die size, power dissipation and number of pins
• Chip Costs
WJD/KO
EE282 Lecture 1
24
Microprocessor Report, March 25, 1996
12
Spring 01/02
K. Olukotun / S. Richardson
Handout #2
EE282
Cost
Microprocessor Report, March 2001
WJD/KO
EE282 Lecture 1
25
Performance
Microprocessor
Report
March 2001
WJD/KO
EE282 Lecture 1
26
13