
Computer Architecture Examination Paper
with Sample Solutions and Marking Scheme
CS3 1991–92
Nigel Topham (April 3, 1995)
Question 1
a. Describe what is meant by instruction pipelining, and explain how it can be
used to improve CPU performance.
[5]
b. Explain the causes of the following types of pipeline hazard and outline
briefly how their effects on performance can be minimised.
(i) structural hazards
(ii) data hazards
(iii) control hazards
[9]
c. A certain pipelined processor has the following characteristics:
• fully-pipelined integer ALU
• non-pipelined integer multiplier, with 3-cycle latency
• branches with a single delay slot
• loads with a single delay slot
Determine the average number of cycles per instruction (CPI) for the following code fragment, stating any assumptions you make.
[4]
L1:   load   r1, -8(r15)    /* memory[r15-8] => r1   */
      sub    r4, r5, r1     /* r5 - r1 => r4         */
      mul    r1, r8, r9     /* (int) r8 * r9 => r1   */
      store  r7, -4(r15)    /* r7 => memory[r15-4]   */
      bne    r4, r1, L2     /* if (r4 != r1) goto L2 */
      nop
Identify all data dependencies within this code fragment, specifying whether
they are flow-dependencies, output-dependencies, or anti-dependencies.
[3]
d. If the pipeline is enhanced to permit up to two instructions (of any type) to
be issued in parallel, what is the new CPI value ? (explain your calculations).
[4]
Question 2
a. Describe what is meant by the following terms, and outline briefly how they
can be exploited in high performance memory systems.
(i) program locality
(ii) temporal data locality
(iii) spatial data locality
[9]
b. Two different implementations of the same RISC architecture have the following characteristics:

   Instruction Class   Class Frequency   Timing (cycles)
                                         Machine A    Machine B
   loads               20%               1 + tA       1 + tB
   stores              13%               1            1
   branches            24%               1            1
   ALU ops.            43%               1            1
The values tA and tB represent the effective memory access times of machines
A and B respectively. Machine A has a small on-chip cache, whereas machine
B has a large off-chip cache. The cache and memory system parameters are:
   Parameter          Machine A        Machine B
   cache hit time     1 cycle          2 cycles
   cache miss ratio   5%               0.5%
   block size (b)     16 bytes         64 bytes
   refill time        4 + b/4 cycles   4 + b/4 cycles
   copy back time     —                4 + b/4 cycles
Cache A uses a write through policy, whereas cache B uses a write back policy.
For cache B 40% of all misses are to “dirty” lines, and copying back cache
lines to memory cannot be overlapped with other activities.
(i) What are the effective memory access times of machine A and machine
B?
[4]
(ii) What is the mean number of cycles per instruction (CPI) for each
machine ?
[4]
(iii) If implementation A does not support pipelined memory writes, and
the time for store operations rises to 4 cycles, which of the two machines
is then fastest ?
[3]
c. It is suggested that cache A, which is a direct-mapped cache, might benefit
from being 2-way or 4-way set-associative, since it is believed that a small
cache suffers a high number of collision misses. Discuss the validity, or
otherwise, of this suggestion.
[5]
Question 3
a. Explain what is meant by the term quantitative design, in the context of
computer architecture.
[5]
b. What is Amdahl’s Law ?
[5]
c. A certain manufacturer decides to offer a high-performance version of its
popular RISC architecture, aimed at the scientific market. It believes that
a vector processing facility is the best way to achieve its performance and
cost goals.
Instructions that are able to exploit the vector facility of the new machine
execute in 1/10th of the cycles needed by the same instructions on the original
(scalar) machine. However, due to increased logic delays, the clock frequency
of the vector machine turns out to be only 75% of the clock frequency of
the original. In addition, studies indicate that, on average, 5/6ths of all
operations can exploit the vector facility.
What is the mean relative performance of the vector machine compared with
the original ?
[9]
d. Discuss the ways in which the memory system of the vector machine is likely
to differ from that of the original (scalar) machine.
[6]
Marking Scheme and Outline Solutions
Each question carries 25 marks. Students answer two out of the three questions.
This marking scheme and set of outline solutions illustrates the breakdown of
marks to each sub-question (or part thereof), and describes the type of answer
required to gain the stated marks.
Question 1
a. Describe what is meant by instruction pipelining, and explain how it can be
used to improve CPU performance.
5 marks
This is a relatively easy bookwork question. Typically, answers should describe what an instruction pipeline is, broadly what its structure is, and
explain how the average CPI value can be reduced towards an asymptotic
value of 1.0 by pipelining.
b. Explain the causes of the following types of pipeline hazard and outline briefly
how their effects on performance can be minimised
(i) structural hazards
3 marks
(ii) data hazards
3 marks
(iii) control hazards
3 marks
This is a slightly more difficult follow-on from the first part, but it should
not pose problems for many students; for details see Appendix A.
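As an illustration, the load/sub pair from part (c) below shows a typical data (flow) hazard: sub reads r1 in the cycle immediately after the load produces it, so with a single load delay slot the consumer must be separated from the load by one cycle, either by a hardware stall or by filling the slot (the nop shown here stands in for an independent instruction):

      load  r1, -8(r15)    /* r1 produced by the load                    */
      nop                  /* load delay slot: ideally filled with an    */
                           /* independent instruction rather than a nop  */
      sub   r4, r5, r1     /* r1 consumed one cycle later, with no stall */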
c. A certain pipelined processor has the following characteristics:
• fully-pipelined integer ALU
• non-pipelined integer multiplier, with 3-cycle latency
• branches with a single delay slot
• loads with a single delay slot
Determine the average number of cycles per instruction (CPI) for the following code fragment, stating any assumptions you make.
4 marks
L1:   load   r1, -8(r15)    /* memory[r15-8] => r1   */
      sub    r4, r5, r1     /* r5 - r1 => r4         */
      mul    r1, r8, r9     /* (int) r8 * r9 => r1   */
      store  r7, -4(r15)    /* r7 => memory[r15-4]   */
      bne    r4, r1, L2     /* if (r4 != r1) goto L2 */
      nop
Identify all data dependencies within this code fragment, specifying whether
they are flow-dependencies, output-dependencies, or anti-dependencies.
3 marks
They ought to find three flow dependencies, one output dependency and one
anti-dependency, as listed below.
flow: load -> sub (r1)
flow: mul -> bne (r1)
flow: sub -> bne (r4)
output: load -> mul (r1)
anti: sub -> mul (r1)
This part of the question requires students to work out a numerical answer.
If they assume that the processor stops issuing instructions when a non-pipelined
instruction is issued, the answer is CPI = 2.0; otherwise the answer
is CPI = 1.6. Most students ought to be able to get one of these results, and
I'll accept either one, but full marks are only obtained if they mention that
they have to make an assumption about the behaviour of the non-pipelined
multiply, and the “perfect” nature of the memory (i.e., no extra load stalls).
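For reference, one consistent reading (counting the five non-nop instructions, and assuming the multiplier does not block the issue of independent instructions) gives the 1.6 figure as follows; the cycle assignments are illustrative rather than the only acceptable working:

      cycle 1:  load  r1, -8(r15)
      cycle 2:  (load delay slot: sub needs r1 and cannot issue)
      cycle 3:  sub   r4, r5, r1
      cycle 4:  mul   r1, r8, r9     (3-cycle latency: r1 ready after cycle 6)
      cycle 5:  store r7, -4(r15)
      cycle 6:  (waiting for r1 from the multiply)
      cycle 7:  bne   r4, r1, L2
      cycle 8:  nop                  (branch delay slot)

      CPI = 8 cycles / 5 instructions = 1.6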
d. If the pipeline is enhanced to permit up to two instructions (of any type) to be
issued in parallel, what is the new CPI value ? (explain your calculations). 4 marks
In this part they have to work out all dependencies between instructions and
schedule the code accordingly. The best schedule takes 7 cycles, leading to a
CPI of 1.4. Identifying the dependencies in the previous part will help them
to generate the schedule, which is shown below.
      INSTRUCTION 1         INSTRUCTION 2         CYCLES
      load  r1, -8(r15)     store r7, -4(r15)       1
      sub   r4, r5, r1      nop                     1
      mul   r1, r8, r9      nop                     1
      -stall-                                       1
      -stall-                                       1
      bne   r4, r1, L2      nop                     1
      nop                   nop                     1
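Reading the schedule row by row, the fragment completes in 7 cycles; counting the same five non-nop instructions as in part (c) (an assumption about the intended counting convention), this gives

      CPI = 7 cycles / 5 instructions = 1.4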
Question 2
a. Describe what is meant by the following terms, and outline briefly how they
can be exploited in high performance memory systems.
(i) program locality
3 marks
(ii) temporal data locality
3 marks
(iii) spatial data locality
3 marks
This part of the question is essentially basic knowledge, which all students
ought to know. Further details are given in Appendix B.
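As a quick illustration, a simple array-scaling loop touches all three forms of locality. It is written in the same assembly notation as Question 1; the register assignments and the add-immediate form are assumed purely for the example:

      loop: load  r1, 0(r2)      /* a[i]: consecutive addresses   -> spatial data locality  */
            load  r3, 0(r6)      /* same scale factor every pass  -> temporal data locality */
            mul   r1, r1, r3
            store r1, 0(r2)
            add   r2, r2, 4      /* advance to the next element                             */
            bne   r2, r5, loop   /* the loop body is re-fetched   -> program locality       */
            nop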
b. Two different implementations of the same RISC architecture have the following characteristics:

   Instruction Class   Class Frequency   Timing (cycles)
                                         Machine A    Machine B
   loads               20%               1 + tA       1 + tB
   stores              13%               1            1
   branches            24%               1            1
   ALU ops.            43%               1            1
The values tA and tB represent the effective memory access times of machines
A and B respectively. Machine A has a small on-chip cache, whereas machine
B has a large off-chip cache. The cache and memory system parameters are:
   Parameter          Machine A        Machine B
   cache hit time     1 cycle          2 cycles
   cache miss ratio   5%               0.5%
   block size (b)     16 bytes         64 bytes
   refill time        4 + b/4 cycles   4 + b/4 cycles
   copy back time     —                4 + b/4 cycles
Cache A uses a write through policy, whereas cache B uses a write back
policy. For cache B 40% of all misses are to “dirty” lines, and copying back
cache lines to memory cannot be overlapped with other activities.
(i) What are the effective memory access times of machine A and machine
B?
4 marks
(ii) What is the mean number of cycles per instruction (CPI) for each machine ?
4 marks
(iii) If implementation A does not support pipelined memory writes, and the
time for store operations rises to 4 cycles, which of the two machines is
then fastest ?
3 marks
This is the “quantitative bit” of the question, where candidates have to apply
their knowledge of memory hierarchy behaviour to compare the performance
of two systems. The answers can be computed quite straightforwardly, if the
candidate has a grasp of how cache memories work.
They need to compute the effective memory latency of each cache system,
and then use the latency figures to compute the effective CPI of a load
instruction on each machine. These figures are then combined with the
CPI values for the other instruction types in proportion to their execution
frequency. The numbers have been chosen so that the arithmetic can be
computed without resort to a calculator.
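For reference, one possible working, assuming the usual hit time + miss ratio x miss penalty model, that instruction fetches always hit, and that both machines run at the same clock rate:

      Machine A:  refill time = 4 + 16/4 = 8 cycles
                  tA = 1 + 0.05 x 8 = 1.4 cycles

      Machine B:  refill time = copy back time = 4 + 64/4 = 20 cycles
                  miss penalty = 20 + 0.4 x 20 = 28 cycles
                  tB = 2 + 0.005 x 28 = 2.14 cycles

      CPI(A) = 0.20 x (1 + 1.4)  + 0.80 x 1 = 1.28
      CPI(B) = 0.20 x (1 + 2.14) + 0.80 x 1 = 1.428

      (iii) with 4-cycle stores on machine A:
      CPI(A) = 0.20 x 2.4 + 0.13 x 4 + 0.67 x 1 = 1.67
      so machine B, with CPI = 1.428, would then be the faster of the two.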
c. It is suggested that cache A, which is a direct-mapped cache, might benefit
from being 2-way or 4-way set-associative, since it is believed that a small
cache suffers a high number of collision misses. Discuss the validity, or
otherwise, of this suggestion.
5 marks
This part requires the candidates to discuss the pros and cons of using
set-associative caches. I would expect them at least to mention that the hit
time for set-associative caches is typically higher, but that even very small
degrees of associativity lead to much better hit rates for small caches. I do
not expect any answer to come down unequivocally on one side of the argument.
Question 3
a. Explain what is meant by the term quantitative design, in the context of
computer architecture.
5 marks
Again, this is bookwork. I’m looking for a definition of the term, and what
it means for the design process. Some lecture notes on this subject are
contained in Appendix C.
b. What is Amdahl’s Law ?
5 marks
Now here I’ve given five marks to a simple question, but I am expecting
a definition of the law in algebraic terms – since that is the most effective
way to say “what is” for this particular concept. Again, there are some
photocopied lecture notes covering this question in Appendix D.
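For reference, the standard algebraic statement (using f for the fraction of execution time that can benefit and s for the speedup of that fraction) is along the lines of

      Speedup = 1 / [(1 - f) + f/s]

though the exact notation expected of candidates may of course differ.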
c. A certain manufacturer decides to offer a high-performance version of its
popular RISC architecture, aimed at the scientific market. It believes that a
vector processing facility is the best way to achieve its performance and cost
goals.
Instructions that are able to exploit the vector facility of the new machine
execute in 1/10th of the cycles needed by the same instructions on the original
(scalar) machine. However, due to increased logic delays, the clock frequency
of the vector machine turns out to be only 75% of the clock frequency of
the original. In addition, studies indicate that, on average, 5/6ths of all
operations can exploit the vector facility.
What is the mean relative performance of the vector machine compared with
the original ?
9 marks
For this question I’m looking for an answer along the following lines:
Firstly, to derive (or state) that for a vectorised fraction v we enjoy a
relative performance (speedup) T of:

      T = [v/R + 1 - v]^(-1)                                    (1)

where R is the vector:scalar computation rate.
In addition, the clock period of the vector machine is 4/3 times that of the
scalar machine, so the overall relative performance is:

      T = 1 / [ (4/3) (5/(6 x 10) + 1 - 5/6) ]
        = 1 / [ (4/3) (1/12 + 2/12) ]
        = 1 / (1/3)
        = 3                                                     (2)

i.e., the vector machine is 3 times faster than the scalar machine.
d. Discuss the ways in which the memory system of the vector machine is likely
to differ from that of the original (scalar) machine.
6 marks
These two memory systems will differ in many detailed ways, but the aim
of this question is to get the candidates to discuss the broad differences in
memory requirements of each type of system. Listed below are some of these.
bandwidth the vector machine needs memory bandwidth of the order of 3 words
      per cycle per pair of floating-point pipes (a floating-point bandwidth :
      memory bandwidth ratio of 2/3). In comparison, the scalar machine
      needs only 1 word per cycle.
unit of access the scalar machine will access the memory in units of a
      single word, though this may change to units of a single cache line
      in systems where all accesses are cached. In the vector machine the
      dominant types of access are vector loads and stores. These typically
      consist of VL words read or written to/from consecutive locations (VL
      is the vector length of the machine). However, the vector machine must
      also be capable of accessing vectors with non-unit strides, in order to
      perform efficiently on multi-dimensional structures. The memory thus
      needs to support scatter and gather operations. Implementing a wide
      memory interface will provide the vector bandwidth for unit-stride
      accesses, but will not give adequate performance for non-unit strides.
capacity a rather simple point, but important: the capacity of the vector
      machine's memory will need to be significantly larger than that needed
      on a scalar RISC system (in general).