Solutions

Written Exam / Tentamen∗
Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp
Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp
KTH Royal Institute of Technology
2015-01-14
Examiner / Examinator: David Broman
Teacher on duty / Ansvarig lärare: David Broman, [email protected], +46 73 765 20 44
Instructions in English
• Allowed aids: One sheet of A4 paper with handwritten notes. You may write on both sides
of the paper.
• Explicitly forbidden aids: Textbooks, electronic equipment, calculators, mobile phones, machine-written pages, photocopied pages, pages of different size than A4.
• Please write and draw carefully. Unreadable text may lead to zero points.
• You do not need to return these exam papers when you hand in your exam solutions.
• You may write your answers in either Swedish or English.
The exam consists of two parts:
• Part I: Fundamentals: The maximal number of points for Part I is 48 points (for IS1500)
and 40 points (for IS1200). There are 8 points for each of the six course modules. All questions in Part I expect only short answers. At most a few sentences are needed.
• Part II: Advanced: The maximal number of points for Part II is 50 points. In the answers,
it is required that the student discuss, analyze, or construct. Furthermore, answers to these
questions require clear motivations.
True/False/Don’t know Questions in Part I
These questions can give between 0 and 4 points. Each question consists of 4 statements. For
each statement, you should answer either true, false, or “don’t know”. The points are calculated
as follows: Each correct answer (answer true or false) gives one point. Each incorrect answer
(true or false) gives minus one point. If you answer “don’t know” for a statement, it neither
adds nor removes points. The rationale for introducing “don’t know” answers is to discourage
guessing.
Example: Assume that the correct answers to four statements are: true, false, true, false.
Assume that the student answered: true, true, don’t know, false. The total number of points is
then: 1 - 1 + 0 + 1 = 1 point. Note that even if the answers to all four statements are wrong,
the points for the whole question can never be negative, that is, the final points will always be
between 0 and 4.
∗ This version of the exam paper was updated on February 6, 2015. The updates include changes to the pass
criteria (the pass criterion was lowered from 36 to 33 points in the fundamental part for IS1500 and from 30
to 27 for IS1200). We also updated the criterion for FX (only 11 points are needed on the advanced part) and added
a few clarifications to the solution suggestions.
Grades
To get a pass grade (A, B, C, D, or E), it is required to pass Part I of the exam. For IS1500
students, it is required to get 33 points or more on Part I to pass the exam. IS1200 students
should not answer question 1 in Part I. For IS1200 students, it is required to have 27 points or
more in total for questions 2-6 on Part I.
Grading scale (For both IS1200 and IS1500):
• A: 41-50 points on Part II
• B: 31-40 points on Part II
• C: 21-30 points on Part II
• D: 11-20 points on Part II
• E: 0-10 points on Part II
• FX: 30-32 points (IS1500) or 24-26 (IS1200) on Part I, and 11-50 points on Part II.
• F: otherwise
Results
• The result will be announced no later than 2015-02-04.
• If a student receives grade FX, it is possible to request a complementary oral examination.
Such a complementary oral examination must be requested by the student. Please send an
email to [email protected] no later than 2015-02-25.
Instruktioner på Svenska
• Tillåtna hjälpmedel: En A4-sida med handskrivna anteckningar. Det är tillåtet att skriva på
båda sidorna av pappret.
• Förbjudna hjälpmedel: Läroböcker, elektroniska hjälpmedel, miniräknare, mobiltelefoner,
maskinskrivna sidor, kopierade papper, sidor av andra storlekar än A4.
• Skriv och rita noggrant. Oläsbar text kan resultera i noll poäng.
• Du behöver inte lämna tillbaka dessa tentamenspapper när du lämnar in tentamenslösningarna.
• Du kan skriva dina svar på antingen engelska eller svenska.
Tentamen består av två delar:
• Del I: Fundamentala delen: Maximalt antal poäng för del I är 48 poäng (för IS1500) och
40 poäng (för IS1200). Totalpoängen per kursmodul är 8 poäng (6 moduler totalt). För del I
förväntas det endast korta svar på frågorna. Endast ett fåtal meningar krävs.
• Del II: Avancerade delen: Det maximala antalet poäng för del II är 50 poäng. I svaren för
denna del krävs att studenten diskuterar, analyserar och konstruerar. Vidare kräver svaren till
dessa frågor tydliga motiveringar.
Sant/Falskt/”Vet ej” frågor för Del I
Dessa frågor kan ge mellan 0 och 4 poäng. Varje fråga består av 4 påståenden. För varje
påstående ska du svara antingen sant, falskt, eller “vet ej”. Poängen beräknas enligt följande:
Varje korrekt svar (svar sant eller falskt) ger 1 poäng. Varje felaktigt svar (sant eller falskt)
ger minus en poäng. Om du svarar “vet ej” för ett påstående ger det varken extrapoäng eller
minuspoäng. Svarsalternativet “vet ej” finns för att undvika att studenten gissar.
Exempel: Anta att de korrekta svaren för fyra påståenden är: sant, falskt, sant, falskt. Anta
att studenten svarar: sant, sant, “vet ej”, falskt. Det totala antalet poäng blir då: 1 - 1 + 0 + 1 =
1 poäng. Notera att även om samtliga svar till alla fyra påståenden är felaktiga så kan poängen
för hela frågan aldrig bli negativ. Således är den slutliga poängen för frågan alltid mellan 0 och
4 poäng.
Betyg
För att erhålla godkänt betyg (A, B, C, D eller E) krävs att man får godkänt på del I. För IS1500-studenter krävs det 33 poäng eller mer för del I för att få godkänt på tentamen. IS1200-studenter
ska inte svara på fråga 1 i del I. För IS1200-studenter krävs det 27 poäng eller mer totalt för
frågorna 2-6 på del I för att bli godkänd.
Betygsskala (För både IS1200 och IS1500):
• A: 41-50 poäng på del II
• B: 31-40 poäng på del II
• C: 21-30 poäng på del II
• D: 11-20 poäng på del II
• E: 0-10 poäng på del II
• FX: 30-32 poäng (IS1500) eller 24-26 (IS1200) på del I och 11-50 poäng på del II.
• F: i övriga fall
Resultat
• Resultaten kommer att meddelas senast 2015-02-04.
• Om en student får betyg FX är det möjligt att begära en muntlig examination. En sådan
muntlig examination måste begäras via epost av studenten. Skicka ett epost-meddelande till
[email protected] senast 2015-02-25.
Part I: Fundamentals
1. Module 1: Logic Design (for IS1500 only)
(a) English: Consider the figure below. Assume that the register initially has a zero
value and that it is triggered on a rising clock edge. What are then the values for
signals A, B, Y0, Y1, Y2, and Y3 after the first rising clock edge when all signals
have stabilized? (4 points)
Swedish: Studera figuren nedan. Anta att register initialt har värdet noll och att
stigande klockflank är aktiv. Vad är då värdena på signalerna A, B, Y0, Y1, Y2, och
Y3 efter den första stigande klockflank då alla signaler har stabiliserats? (4 poäng)
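[Figure: circuit with a 2-bit register clocked by CLK, a 2-bit adder with constant input 3₁₀, signals A and B, a four-input multiplexer (inputs 00, 01, 10, 11 tied to 1, 0, 1, 0), a constant 1, and a decoder with outputs Y0, Y1, Y2, and Y3.]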
Solution: A = 10₂, B = 11₂, Y0 = 0, Y1 = 0, Y2 = 0, and Y3 = 1.
Explanation: The initial state value of the register is zero. Hence, after the first
rising clock edge (and after stabilization) the register receives value 3 + 0 = 3, and
therefore B = 11₂ = 3₁₀. The adder then adds 3 plus 3, and since the signal is only
two bits wide, the sum 6 = 110₂ wraps around and A = 10₂. The multiplexer selects the third
input (10), which has value 1. Hence, the signal that is decoded is 11, which means
that only Y3 is 1, and the rest of the output signals from the decoder are zero.
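The reasoning in the explanation above can be traced with a small C sketch. The wiring is an assumption reconstructed from the explanation rather than from the figure itself: the adder computes A = B + 3 on two bits, A is the next register value and the multiplexer select, and the decoder input is the constant 1 combined with the multiplexer output. The names `Signals` and `after_first_edge` are hypothetical.

```c
/* Signal values after the first rising clock edge, under the assumed wiring. */
typedef struct {
    unsigned a, b;   /* 2-bit signals A and B */
    unsigned y[4];   /* decoder outputs Y0..Y3 */
} Signals;

Signals after_first_edge(void) {
    const unsigned mux_in[4] = {1, 0, 1, 0};  /* multiplexer inputs 00, 01, 10, 11 */
    Signals s;
    unsigned reg = 0;                 /* register starts at zero */
    unsigned next = (reg + 3) & 3;    /* adder output before the edge: 3 */
    reg = next;                       /* rising edge: register loads 3 */
    s.b = reg;                        /* B = 11 */
    s.a = (reg + 3) & 3;              /* 3 + 3 = 6, truncated to two bits = 10 */
    unsigned m = mux_in[s.a];         /* select 10 picks the third input, value 1 */
    unsigned dec_in = (1u << 1) | m;  /* assumed bit order; both bits are 1 here */
    for (unsigned i = 0; i < 4; i++) {
        s.y[i] = (i == dec_in);       /* one-hot decoder output */
    }
    return s;
}
```

Calling `after_first_edge()` gives A = 2 (10₂), B = 3 (11₂), and only Y3 set, matching the solution.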
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: Proof by perfect induction means that a theorem in boolean algebra can
be proven correct by exhaustively showing all possible combinations in a truth
table.
Swedish: Bevis med hjälp av perfekt induktion innebär att ett teorem i boolesk
algebra kan bevisas vara korrekt genom att uttömmande visa alla olika kombinationer i en sanningstabell.
Solution: True. Proof by perfect induction is the same as Proof by Exhaustion
(see the lecture slides). Proof by perfect induction (note the word perfect) should
not be mixed up with standard mathematical induction proofs.
• Statement 2:
English: When a tristate buffer is disabled, the output is said to be floating.
Swedish: När en tristate buffer är avaktiverad är dess utsignal flytande (floating).
Solution: True.
• Statement 3:
English: The main difference between a SR latch and a D latch is that a D latch
is triggered on the edge of a clock signal, whereas an SR latch is transparent.
Swedish: Den största skillnaden mellan en SR-latch och en D-latch är att en D-latch aktiveras på flanken av en klocksignal, medan en SR-latch är transparent.
Solution: False. Neither SR latches nor D latches are edge triggered. The main
difference between an SR latch and a D latch is that a D latch is clocked, whereas
an SR latch is not clocked.
• Statement 4:
English: In a synchronous sequential circuit, the propagation delay of the combinational logic in the circuit is an important factor when determining the maximal
clock frequency.
Swedish: För en synkront sekvenskrets är grindfördröjningen av den kombinatoriska logiken i kretsen en viktig faktor när man bestämmer den maximala klockfrekvensen.
Solution: True. Several delays in the circuit determine the maximal clock frequency. The propagation delay of the combinational logic is one of these important delays.
2. Module 2: C and Assembly Programming
(a) English: What is the binary machine code representation of the MIPS instruction
lb $s1,-67($t2)
Answer as a 32-bit binary number. Note that a page with the structure of the encoding of MIPS is available at the end of the exam. (4 points)
Swedish: Vad är den binära maskinkodsrepresentationen för MIPS-instruktionen
lb $s1,-67($t2)
Svara som ett 32-bitars binärt tal. Notera att strukturen för MIPS-kodning är tillgängligt på en sida i slutet av tentan. (4 poäng)
Solution: 1000 0001 0101 0001 1111 1111 1011 1101
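The encoding above can be reproduced with a short C sketch of the MIPS I-type layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate). The helper name `encode_itype` is hypothetical. The values used are the standard MIPS ones: opcode 0x20 for `lb`, register number 10 for `$t2` (rs), and 17 for `$s1` (rt).

```c
#include <stdint.h>

/* Pack a MIPS I-type instruction: opcode(6) | rs(5) | rt(5) | immediate(16). */
uint32_t encode_itype(uint32_t opcode, uint32_t rs, uint32_t rt, int16_t imm) {
    return ((opcode & 0x3Fu) << 26) |
           ((rs & 0x1Fu) << 21) |
           ((rt & 0x1Fu) << 16) |
           (uint16_t)imm;   /* two's complement immediate, low 16 bits */
}
```

With these values, `encode_itype(0x20, 10, 17, -67)` returns 0x8151FFBD, which is the bit pattern 1000 0001 0101 0001 1111 1111 1011 1101 given in the solution.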
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
int x = 5;
int y = 6;
int *p = &x;
*p = x + y;
p = &y;
*p = x + y;
printf("%d,%d", x, y);
English: When executing the C code above, the string 11,17 is printed to the
standard output.
Swedish: När C-programmet ovan exekveras så printas strängen 11,17 till standard output.
Solution: True.
• Statement 2:
English: The datastructure, that is used for storing local variables and additional
arguments during function calls, stores the values as a first-in first-out (FIFO)
queue in the memory.
Swedish: Datastrukturen, som används för att spara lokala variabler och ytterligare argument vid funktionsanrop, sparar värdena som en först-in-först-ut (FIFO)
kö i minnet.
Solution: False. The data structure is a stack, which stores values in last-in first-out
(LIFO) order.
• Statement 3:
English: In the MIPS ISA, a calling function (called the caller) does not have to
save the registers $s0 to $s7 because the callee (the function that is called) is
responsible for saving these registers, if they are used in the called function.
Swedish: För MIPS ISA, behöver inte en anropande funktion (även kallad “caller”)
spara registerna $s0 till $s7, då funktionen som är anropad är ansvarig för att
spara dessa register, om de används i den anropade funktionen.
Solution: True.
• Statement 4:
English: A benefit of PC-relative addressing is that not all bits of the absolute
address need to be stored in the instruction.
Swedish: En fördel med PC-relativ adressering är att alla bitar av den absoluta
adressen inte behöver lagras i instruktionen.
Solution: True. Parts of the current PC address are used when computing the
branch target address (BTA), that is, the absolute address that is written to the PC when the branch is taken.
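As an illustration of the last statement, the following C sketch computes a MIPS branch target from the 16-bit offset stored in the instruction. It uses the standard MIPS rule that the sign-extended offset is shifted left by two and added to PC + 4. The function name `branch_target` is hypothetical.

```c
#include <stdint.h>

/* Branch target address: BTA = (PC + 4) + (sign-extended offset << 2). */
uint32_t branch_target(uint32_t pc, int16_t offset) {
    uint32_t extended = (uint32_t)(int32_t)offset;  /* sign-extend to 32 bits */
    return pc + 4 + (extended << 2);                /* word offset to byte offset */
}
```

Only 16 bits are stored in the instruction, yet together with the PC they produce a full 32-bit target. For example, a branch at 0x00400000 with offset 3 targets 0x00400010.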
3. Module 3: Processor Design
(a) English: Consider the figure below that shows the datapath for a single-cycle MIPS
processor. Assume that the current instruction that is executing is
slt $t0,$t1,$t3. What are then the values of signals MemToReg, ALUSrc,
Branch, and A3? (4 points)
Swedish: Studera figuren nedan som visar en dataväg för en enkel-cyklig MIPSprocessor. Anta att den nuvarande instruktionen som exekveras är
slt $t0,$t1,$t3. Vad är då värdena av signalerna MemToReg, ALUSrc, Branch
och A3? (4 poäng)
[Figure: datapath of a single-cycle MIPS processor, with PC, instruction memory, register file (ports A1, A2, A3, WD3, RD1, RD2, WE3), sign extend, ALU, data memory, and the control signals MemToReg, MemWrite, ALUSrc, ALUControl, Branch, RegWrite, RegDst, and Jump. A1 is fed by instruction bits 25:21, A2 by bits 20:16, and A3 by a multiplexer that selects between bits 20:16 and bits 15:11.]
Solution: MemToReg = 0, ALUSrc = 0, Branch = 0, and A3 = 01000₂ (the destination register $t0 is register number 8).
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: In a five-stage MIPS pipelined datapath, the execute stage is usually
used for reading out the values from the register file.
Swedish: I en fem-stegs MIPS-pipelinad dataväg så läser vanligtvis exekveringssteget ut värdena från registerfilen.
Solution: False. It is done in the decode stage.
• Statement 2:
English: A benefit of an arithmetic logic unit (ALU) is that the same hardware
can be used for different arithmetic operations.
Swedish: En fördel med en aritmetisk-logisk enhet (ALU) är att samma hårdvara
kan användas för olika aritmetiska operationer.
Solution: True.
• Statement 3:
English: A pipelined datapath may, compared to a single-cycle datapath, increase
the average cycles per instruction (CPI).
Swedish: En pipelinad dataväg kan, i jämförelse med en enkel-cyklig dataväg,
öka genomsnittliga värdet för antalet cykler per instruktion (CPI).
Solution: True. Compared to a single-cycle datapath (where each instruction
takes one clock cycle) the pipeline may result in hazards, which can result in
stalling. Hence, the average CPI may increase when a pipelined datapath is used.
• Statement 4:
English: Instructions in a CISC instruction set architecture (ISA) can in general
perform more complex operations than instructions in a RISC ISA.
Swedish: Instruktioner i en CISC instruction set architecture (ISA) kan generellt
utföra mer komplexa operationer än instruktioner i ett RISC ISA.
Solution: True. CISC stands for “Complex Instruction Set Computing”. An example of CISC is x86, where an instruction can perform several tasks, for instance
both load from memory and perform an addition. ISAs that are based on RISC,
which stands for “Reduced Instruction Set Computing”, generally have few, simple instructions.
4. Module 4: Memory Hierarchy
(a) English: Assume that we have a 2-way set associative cache for a processor that
uses 32-bits for addressing. The cache has 1024 cache blocks in total and the block
size is 16 bytes. How many bits are then the tag field, the set field (also called the
index), and the byte offset field, and how many validity bits does the cache contain
in total? (4 points)
Swedish: Anta att vi har en 2-way set associative cache för en processor som
använder 32 bitar för adressering. Cachen har 1024 cache block totalt och en blockstorlek på 16 bytes. Hur många bitar är då adressetiketten (även kallad tag), fältet
för radnummer (även kallad index eller set), och fältet för bytenummer (även kallad
byte offset), samt hur många giltighetsbitar innehåller totalt denna cache? (4 poäng)
Solution: There are in total 1024/2 = 512 sets (Swedish: rader). Hence, the set
field is log₂ 512 = 9 bits. The block size is 16 bytes, making the byte offset field log₂ 16 = 4 bits. As a
consequence, the tag is 32 − 9 − 4 = 19 bits. Finally, there are in total 1024
validity bits, one per block.
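The arithmetic in this solution can be written as a small C sketch. The type `CacheGeometry` and the functions `log2u` and `cache_geometry` are hypothetical names, and the sketch assumes that the block count, block size, and associativity are powers of two.

```c
/* Field widths of a set associative cache (all sizes assumed powers of two). */
typedef struct {
    unsigned sets;
    unsigned offset_bits;
    unsigned set_bits;
    unsigned tag_bits;
    unsigned valid_bits;
} CacheGeometry;

/* Integer base-2 logarithm for powers of two. */
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

CacheGeometry cache_geometry(unsigned addr_bits, unsigned blocks,
                             unsigned block_bytes, unsigned ways) {
    CacheGeometry g;
    g.sets = blocks / ways;                 /* each set holds one block per way */
    g.offset_bits = log2u(block_bytes);     /* selects a byte within the block */
    g.set_bits = log2u(g.sets);             /* selects the set */
    g.tag_bits = addr_bits - g.set_bits - g.offset_bits;
    g.valid_bits = blocks;                  /* one validity bit per block */
    return g;
}
```

For the cache in the question, `cache_geometry(32, 1024, 16, 2)` yields 512 sets, a 4-bit byte offset, a 9-bit set field, a 19-bit tag, and 1024 validity bits.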
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: A common replacement policy for direct mapped caches is Least Recently Used (LRU), since a set does not always have to be replaced directly, even
if the validity bit is 1.
Swedish: En vanlig utbytespolicy för direkt mappade cachar är Least Recently
Used (LRU). Detta då raden inte alltid behöver ersättas, även om giltighetsbiten
är satt till 1.
Solution: False. A direct mapped cache does not need a specific replacement
policy; if the set is already in use, its block must always be replaced.
• Statement 2:
English: One benefit of virtual memory is that each process (each running program) can have its own virtual memory space, which is protected from other concurrently running processes in the operating system.
Swedish: En fördel med virtuellt minne är att varje process (varje exekverande
program) kan ha sitt eget virtuella minnesutrymme, vilket är skyddat från andra
samtidiga processer som exekveras i operativsystemet.
Solution: True.
• Statement 3:
English: Assume that we have a direct mapped cache with block size 16 bytes and
256 blocks in total. If a load byte instruction reads from address 0xff20 215e,
followed by another load byte instruction that reads from address 0x1000 5150,
the second load instruction will result in a cache miss.
Swedish: Anta att vi har en direkt mappad cache med blockstorlek 16 bytes och
256 block totalt. Om en load byte-instruktion läser från adress 0xff20 215e,
vilket följs av en ytterligare load byte-instruktion som läser från adress
0x1000 5150, så kommer den andra load-instruktionen att resultera i en cachemiss.
Solution: True. Both instructions access the same set (0x15), but have different
tags.
• Statement 4:
English: A modern processor, such as the Intel Core-I7, has only one large cache
because the speed of one large cache is typically higher than having several small
caches.
Swedish: En modern processor, som t.ex. Intel Core-I7, har endast en stor cache
då hastigheten på en stor cache typiskt är högre än alternativet att ha flera små
cachar.
Solution: False. A modern processor has several caches in the memory hierarchy
(e.g., L1, L2, and L3 caches).
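The set and tag comparison used in Statement 3 above can be checked with a short C sketch. For a direct mapped cache with 16-byte blocks and 256 blocks, the byte offset is the low 4 address bits, the set index is the next 8 bits, and the tag is the remaining 20 bits. The function names `set_index` and `tag_of` are hypothetical.

```c
#include <stdint.h>

/* Direct mapped cache from Statement 3: 16-byte blocks, 256 blocks. */
uint32_t set_index(uint32_t addr) {
    return (addr >> 4) & 0xFFu;   /* skip 4 offset bits, keep 8 set bits */
}

uint32_t tag_of(uint32_t addr) {
    return addr >> 12;            /* the remaining 20 upper bits */
}
```

Both 0xff20215e and 0x10005150 map to set 0x15, while their tags are 0xff202 and 0x10005, so the second access replaces the block loaded by the first and misses.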
5. Module 5: I/O Systems
(a) English: Assume that an external interrupt occurs, a program A is preempted, and
the program counter is changed so that an interrupt handling routine is executed.
Explain shortly how it is possible to continue to execute program A correctly at the
program point where it was interrupted. (4 points)
Swedish: Anta att ett externt avbrott sker, att ett program A är åsidosatt (preempted)
och programräknaren ändras så att en avbrottshanteringsrutin exekveras. Förklara
kortfattat hur det är möjligt att fortsätta exekveringen av program A på ett korrekt
sätt vid den programpunkt där avbrottet inträffade (4 poäng).
Solution: When the external interrupt occurs, the processor must automatically save
the program counter, before jumping to the interrupt handling routine. For instance,
in MIPS, the PC is saved in a register called EPC. The interrupt handling routine
must save all registers on the stack before performing its task. When the interrupt
handling routine has finished, it restores all registers and returns to the original program by using the saved PC.
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: SPI and UART use synchronous and asynchronous serial communication, respectively.
Swedish: SPI och UART använder synkron respektive asynkron seriell kommunikation.
Solution: True.
• Statement 2:
English: A memory-mapped general-purpose I/O (GPIO) pin can be configured
to be either an output or input port, by writing a configuration parameter to a
specific memory address that is dedicated for configuring this GPIO pin.
Swedish: En minnes-mappad general-purpose I/O (GPIO) pin kan konfigureras
att vara antingen en input eller en output port. Detta görs genom att skriva en
konfigurationsparameter till en specifik minnesadress, vilken är dedikerad till att
konfigurera denna GPIO pin.
Solution: True.
• Statement 3:
English: Direct memory access (DMA) is a good alternative to instruction level
parallelism (ILP) because it enables multiple instructions to be fetched by the
datapath and executed in parallel since the code is read directly from memory.
Swedish: Direct memory access (DMA) är ett bra alternativ till instruktionsparallellism (ILP) då det möjliggör att flera instruktioner kan läsas av datavägen och
sedan exekveras parallellt, eftersom koden läses direkt från minnet.
Solution: False. DMA is not related to ILP and is not used for fetching several
instructions in parallel.
• Statement 4:
English: Declaring a pointer volatile in C (as in the example below) means
that the pointer itself is volatile and may change to point to a different address at
any point in time.
Swedish: Att deklarera en pekare flyktig (volatile) i C (som i exemplet
nedan) betyder att pekaren själv är flyktig och kan när som helst ändras så att
den pekar på en annan adress.
volatile int* x = (volatile int*) 0xff88;
Solution: False. The declaration makes the pointed-to value volatile, not the pointer: the value that the
pointer points to may change at any time, whereas the pointer itself does not change unexpectedly.
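The distinction in the last statement depends on where `volatile` appears in the declaration. The sketch below contrasts the two forms. The variable names and the function `read_twice` are hypothetical, and an ordinary variable stands in for the memory-mapped address 0xff88 so that the code can run on any machine.

```c
/* Pointer to volatile int: the pointed-to value may change outside the program. */
volatile int *ptr_to_volatile;

/* Volatile pointer to int: the pointer variable itself may change. */
int *volatile volatile_ptr;

/* Both reads must be performed; the compiler may not merge them into one,
 * because the object is accessed through a volatile-qualified type. */
int read_twice(volatile int *reg) {
    int first = *reg;
    int second = *reg;
    return first + second;
}
```

The declaration in the statement, `volatile int* x`, has the first form, so it is the value at the address that is treated as volatile.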
6. Module 6: Parallel Processors and Programs
(a) English: A program consists of two parts, part A and part B. Part A is trivial to
parallelize, whereas part B is not possible to parallelize at all and consists only of
sequential code. Assume that the amount of improvement that can be achieved by
parallelizing part A is proportional to the number of cores, i.e., using 4 cores, we
achieve a 4 times improvement on part A. The theoretical maximal speedup is 5, assuming that we have an infinite number of cores. Our competitor can run a sequential
version of the program in 16s, whereas we can, using our parallel implementation,
run the program in 20s on one core. What are then the relative speedup and the
true speedup of our implementation when executing the program on 4 cores? Hint:
recall that true speedup compares with the fastest available sequential implementation, whereas relative speedup compares with your own implementation running
sequentially. (4 points)
Swedish: Ett program består av två delar, del A och del B. Del A är trivialt att
parallellisera, medan del B inte alls är möjlig att parallellisera och består endast
av sekventiell programkod. Anta att förbättringsmöjligheten som man kan uppnå
genom att parallellisera del A är proportionell mot antalet processorkärnor, dvs, om
man använder 4 kärnor så uppnår man 4 gångers förbättring. Den teoretiska maximala speedupen är 5, om man antar att vi har oändligt många kärnor. Vår konkurrent
kan köra en sekventiell version av programmet på 16s, medan vi kan köra vår parallella implementation på 20s om den exekverar på 1 core. Vad blir då den relativa
speedupen och den sanna speedupen (true speedup) för vår implementation om man
exekverar programmet på 4 kärnor? Tips: Notera att sann speedup jämför med den
snabbast tillgängliga sekventiella implementationen, medan relativ speedup jämför
med din egna implementation när den körs sekventiellt. (4 poäng)
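Solution: Since we know that the theoretical speedup limit is 5, 1/5 of the execution time is due to part B, the sequential code. On one core our implementation therefore spends 20/5 = 4 s in part B and 16 s in part A. By applying Amdahl’s law for 4 cores, we get

$$\text{speedup}_{\text{relative}} = \frac{20}{16/4 + 4} = \frac{20}{8} = 2.5 \quad\text{and}\quad \text{speedup}_{\text{true}} = \frac{16}{16/4 + 4} = \frac{16}{8} = 2.0.$$

Hence, the true speedup is, as expected, somewhat lower than the relative speedup.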
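The same calculation can be expressed as a minimal C sketch of Amdahl's law. The function names `parallel_time` and `speedup` are hypothetical. The sequential fraction is derived from the theoretical maximal speedup, as in the solution.

```c
/* Execution time on `cores` cores, given the one-core time t1 and the
 * theoretical maximal speedup (which fixes the sequential part to t1 / max). */
double parallel_time(double t1, double max_speedup, unsigned cores) {
    double sequential = t1 / max_speedup;   /* part B: cannot be parallelized */
    double parallel = t1 - sequential;      /* part A: scales with the cores */
    return sequential + parallel / cores;
}

/* Speedup relative to a chosen sequential reference time. */
double speedup(double reference_time, double t_parallel) {
    return reference_time / t_parallel;
}
```

With t1 = 20 s, a maximal speedup of 5, and 4 cores, the parallel time is 8 s. Using 20 s as the reference gives the relative speedup 2.5, and using the competitor's 16 s gives the true speedup 2.0.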
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: A good example of MIMD is a modern multicore processor.
Swedish: Ett bra exempel på MIMD är en modern multicore-processor.
Solution: True.
• Statement 2:
English: MapReduce can be based on message passing techniques and is today
used extensively in Warehouse-scale computers.
Swedish: MapReduce kan vara baserat på meddelandehanteringsteknik och används
idag i hög omfattning i Warehouse-scale computers.
Solution: True.
• Statement 3:
English: If a semaphore is used for mutual exclusion (mutex), it means the programmer can define critical sections by first locking a mutex before entering the
critical section, and then unlocking the mutex when exiting the critical section.
Hence, a mutex can be used to have controlled access to shared resources in a
concurrent environment.
Swedish: Om en semaphore används för mutual exclusion (mutex), betyder det
att programmeraren kan definiera kritiska sektioner genom att först låsa en mutex
innan man kommer in i den kritiska sektionen, och sen låsa upp mutexen när man
lämnar den kritiska sektionen. Alltså, en mutex kan användas för att kontrollera
tillgång till delade resurser i en samtidig (concurrent) miljö.
Solution: True.
• Statement 4:
English: Assume we have a shared memory processor (SMP) consisting of 4
cores with separate L1 caches and an L2 cache that is shared among the cores.
Assume further that two of the cores frequently read or write to the same address
in memory. As a consequence, we easily get inconsistencies in the L1 caches.
This phenomenon is called false sharing.
Swedish: Anta att vi har en processor med delat minne (shared memory processor) som består av 4 kärnor med separata L1 cachar och en gemensam L2
cache som delas mellan kärnorna. Anta även att två av kärnorna frekvent läser
eller skriver till samma minnesadress. Således leder detta till inkonsistens i L1
cacharna. Detta fenomen kallas falsk delning (false sharing).
Solution: False. The problem described is related to cache coherence, but not
directly to false sharing. If it was about false sharing, the cores would write to the
same cache block, but not to the exact same address.
Part II: Advanced
1. English: Explain in detail how a 2-way set associative data cache works. Your solution
should include the following:
• Sketch a hardware solution that can handle hit and miss detection. The solution
should also return the correct cache value for a cache hit. This illustration should
include and explain the terms of tag, set, byte offset, validity bit, and way.
• Explain the concept of replacement policy. In particular, explain the meaning of
Least Recently Used (LRU). You do not have to provide the hardware solution for
the LRU.
• Show assembly code examples and step-by-step guides for how the execution of
the example affects the cache. Your code examples must illustrate both temporal
locality and spatial locality. Your example code does not have to show the effects of
the replacement policy. You may write NIOS II or MIPS assembly code.
Clearly describe and motivate your answers. Diagrams or figures without explanations
will not give any points. (15 points)
Swedish: Förklara i detalj hur en 2-vägs associativ data-cache fungerar. Din lösning ska
innehålla följande:
• Skissa en hårdvarulösning som kan hantera hit- och miss-detektering. Lösningen
skall även returnera korrekt cache-värde vid en cache träff. Lösningen ska innehålla
och förklara termerna adressetikett (även kallad tag), rad (även kallad set), bytenummer, giltighetsbit samt “way”.
• Förklara konceptet utbytespolicy (replacement policy). Förklara speciellt betydelsen
av Least Recently Used (LRU). Du behöver dock inte tillhandahålla hårdvarulösningen för LRU.
• Visa assembler kodexempel med steg-för-steg guide för hur exekveringen av exemplet påverkar cachen. Dina exempel måste visa både tidsmässig (temporal) och
spatiell lokalitet. Dina exempel behöver inte visa utbytespolicyns effekter. Du kan
välja att skriva antingen assemblerkod för NIOS II eller MIPS.
Beskriv och motivera ditt svar tydligt. Diagram eller figurer utan förklaringar ger inga
poäng. (15 poäng)
Solution: This task can be answered in many different ways. A complete solution of the
task is therefore omitted. Instead, we just give a few comments:
• When describing replacement policies, it is important to relate this to LRU. A good
solution explains this by using an example.
• The code examples should show temporal and spatial locality in the data cache (the
exercise concerns the data cache; an earlier version of this solution text referred to the instruction cache).
Temporal locality can be shown by reading from the same data element several
times, for instance in a loop. Spatial locality can be shown by, for instance, reading
from an array.
• In the hardware solution, it is important to show how the tag from the address is
compared with the tag stored in the cache, and how the validity bit is checked. The two
results are usually combined with an AND gate. Note also that we need a multiplexer to select the
correct output from the cache.
• It is important to explain all the terms (see item 1 in the question).
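The comments above can be made concrete with a small software model of a 2-way set associative cache. This is only a sketch under stated assumptions, not the expected exam answer: it is written in C rather than the assembly the task asks for, the sizes (16-byte blocks, 4 sets) are chosen arbitrarily to keep the example small, it tracks hits and misses but stores no data, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 4                 /* 16-byte blocks (assumed size) */
#define SET_BITS    2                 /* 4 sets (assumed, kept tiny) */
#define NUM_SETS    (1u << SET_BITS)
#define NUM_WAYS    2

typedef struct {
    bool     valid;                   /* validity bit */
    uint32_t tag;                     /* upper address bits of the stored block */
} Line;

typedef struct {
    Line     way[NUM_WAYS];           /* one line per way */
    unsigned lru;                     /* index of the least recently used way */
} Set;

typedef struct {
    Set set[NUM_SETS];
} Cache;

void cache_init(Cache *c) {
    for (unsigned s = 0; s < NUM_SETS; s++) {
        for (unsigned w = 0; w < NUM_WAYS; w++) {
            c->set[s].way[w].valid = false;
            c->set[s].way[w].tag = 0;
        }
        c->set[s].lru = 0;
    }
}

/* Returns true on a hit. On a miss, the LRU way of the set is refilled. */
bool cache_access(Cache *c, uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);
    Set *s = &c->set[index];

    for (unsigned w = 0; w < NUM_WAYS; w++) {
        /* Hit detection: validity bit AND tag match, checked in each way. */
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->lru = 1 - w;           /* the other way is now least recently used */
            return true;
        }
    }
    unsigned victim = s->lru;         /* LRU replacement policy */
    s->way[victim].valid = true;
    s->way[victim].tag = tag;
    s->lru = 1 - victim;
    return false;
}
```

In this model, after a first (missing) access to address 0x1000, an access to 0x1004 hits because it lies in the same block (spatial locality), and a repeated access to 0x1000 hits as well (temporal locality). Addresses 0x2000 and 0x3000 map to the same set with different tags, so accessing them forces the LRU policy to choose a victim.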
2. English: There are many ways to make use of parallelism to achieve better performance
in a computer system. Three important concepts/techniques are SIMD instructions (for
instance multimedia extensions), instruction level parallelism (ILP), and multicore. In
this task, you should analyze, discuss, and compare these different techniques. In what
way are they similar? In what way are they different? Which are their pros and cons?
How do they affect the programmer or the compiler? Where are the limitations? Your
answer should consist of a comprehensive and well-thought-out discussion. (10 points)
Swedish: Det finns många olika sätt att använda parallellism för att uppnå bättre prestanda i ett datorsystem. Tre viktiga koncept/tekniker är SIMD-instruktioner (t.ex. multimedia-utökningar), instruktions-nivå parallellism (instruction level parallelism, ILP), och
multicore (flera kärnor). I denna uppgift ska du analysera, diskutera, och jämföra dessa
olika tekniker. På vilka sätt är de lika? På vilka sätt är de olika? Vilka är deras fördelar och nackdelar? På vilket sätt påverkar de programmeraren eller kompilatorn? Vilka
är begränsningarna? Ditt svar ska innefatta en utförlig och genomtänkt diskussion. (10
poäng)
Solution: This task does not have a single correct solution. Completely different answers can still give
the same number of points. In the following, we list some important aspects that can be
included in an answer.
• SIMD makes use of data-level parallelism. Hence, there are limitations on what kind
of programs can actually utilize this kind of parallelism: the same instruction
needs to be applied to different data.
• ILP can take the form of static or dynamic multiple issue. Dynamic multiple issue is very common in modern processors. The main benefit is that ILP does
not affect the programmer; parallelism “comes for free” even for a sequential program. However, the amount of parallelism is limited by dependencies between
instructions.
• Multicore processors provide parallelism with the help of the programmer. Typically,
for a shared memory multiprocessor (SMP), multithreaded programming can be used
to achieve parallelism. This is a form of task-level parallelism (compared to data-level
parallelism for SIMD).
• These techniques do not have to work in isolation. Instead, a program can
make use of all of them to achieve speedups.
• A similarity between multicore and SIMD is that both techniques typically
need help from the programmer to actually work. Compiler extensions exist, e.g.,
OpenMP, that can be used to program multicore systems in an easy way.
• ILP does not need support from the programmer, but if the programmer
programs in a special way, more ILP can be exploited. One such technique is loop
unrolling, which reduces loop overhead and exposes more independent instructions
that the processor can use to keep the pipeline filled.
• A problem with multicore programming is that it is quite hard, unless there are few
dependencies between different tasks. Programs need to be synchronized using, for
instance, semaphores.
• Today, SIMD is typically programmed with special platform-dependent libraries.
• A problem with SMP multicore systems is cache coherence. If the programmer is
not aware of how communication and memory accesses affect the caches, the
performance improvements may be much less than expected.
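As an illustration of the loop unrolling mentioned above, the following C sketch sums an array in two ways. The function names sum_simple and sum_unrolled are hypothetical. In the unrolled version the four partial sums do not depend on each other, so a multiple-issue processor may execute the additions in parallel. The same "one operation on several data elements" structure is also what a SIMD unit can exploit. Whether any speedup is actually obtained depends on the compiler and the processor.

```c
#include <stddef.h>

/* Straightforward sum: every addition depends on the previous one. */
long sum_simple(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled four times with independent partial sums, which shortens the
   dependency chain between consecutive additions. */
long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* remaining 0 to 3 elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result for any input. The unrolled version trades code size for fewer loop branches and more independent work per iteration.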
3. English: Explain in detail the meaning of the following terms and concepts and their
relationships: Execution time of programs, Cycles Per Instruction (CPI), Clock frequency,
Clock period, Power, and Energy. Explain also how computer architects and processor
manufacturers try to decrease energy consumption and why it is difficult to do (10 points).
Swedish: Förklara i detalj betydelsen av följande termer och koncept, samt deras relationer: Exekveringstid för program, Cykler per instruktion (CPI), Klockfrekvens, Klockperiod, Effekt och Energi. Förklara också hur datorarkitekter och processortillverkare
försöker att minska energiåtgången, samt varför detta är svårt (10 poäng).
Solution: This task does not have a single correct solution. Completely different answers can still give
the same number of points. In the following, we list some important aspects that can be
included in an answer.
• A program's execution time depends on several factors, of which the most important
are i) the number of executed instructions, ii) the cycles per instruction (CPI),
and iii) the clock period. The execution time is the product of these three factors.
• If a processor has a pipeline, the clock frequency can be increased, which gives better
performance. On the other hand, a pipeline also introduces pipeline hazards that
can result in stalls, which increase the CPI.
• Clock frequency is defined as one divided by the clock period.
• For over 30 years, processor manufacturers constantly increased the clock rate
of processors, and thereby also the power. As a consequence, the so-called power
wall was reached around 2006. Instead of increasing the clock frequency of the
processor, manufacturers started to add processor cores.
• The dominating source of energy consumption is the dynamic energy, which is consumed each time a transistor switches. The dynamic energy for switching a transistor is proportional to the capacitive load (which depends on the number of transistors connected to
an output and on the manufacturing technology) and to the square of the voltage. Since
the voltage is the dominating factor (the term is squared), processors today use
a much lower supply voltage than just a few years ago.
• The dynamic power is proportional to the energy used for a transistor transition
multiplied by the switching frequency. This means that power increases when frequency increases, but the voltage still dominates due to the squared term. Note
that the energy for computing a task does not decrease just because the frequency
is lower, but the power becomes lower. Although efforts are made to lower the voltage, there is a limit to how low the voltage can be without increasing the leakage. In server systems, the static power dissipation due to leakage can be as high as
40%. Static energy is consumed (or transformed to other forms of energy) in CMOS
technology even if a transistor is off. As a consequence, increasing the number of
transistors (for instance, increasing the number of cores in a processor) increases the
static energy consumption, even if the transistors are not switching. It is therefore
hard to decrease the energy because of the voltage limitation and the increased demand for more cores and larger caches. There are also techniques for switching off
parts of a chip during execution to decrease energy further.
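The relationships above can be written down as a small C sketch. The function names and the example numbers are hypothetical, and the constant of proportionality in the power formula is taken as 1 for simplicity (the text only states proportionality). Static power is not modelled.

```c
/* Execution time in seconds:
   instructions * CPI * clock period, where clock period = 1 / frequency. */
double execution_time(double instructions, double cpi, double clock_hz)
{
    return instructions * cpi / clock_hz;
}

/* Dynamic power, up to a constant factor:
   capacitive load * voltage^2 * switching frequency. */
double dynamic_power(double cap_load, double voltage, double freq_hz)
{
    return cap_load * voltage * voltage * freq_hz;
}

/* Dynamic energy for a complete task = power * execution time.
   The frequency cancels out, so only load, voltage, and cycle count matter. */
double task_energy(double instructions, double cpi, double clock_hz,
                   double cap_load, double voltage)
{
    return dynamic_power(cap_load, voltage, clock_hz)
         * execution_time(instructions, cpi, clock_hz);
}
```

For example, with 10^9 instructions, a CPI of 2, and a 2 GHz clock, the execution time is 1 second. Halving the voltage reduces the dynamic power by a factor of four, whereas halving only the frequency halves the power but doubles the execution time, so the dynamic energy for the task is unchanged.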
4. English: Carefully analyze the partially commented MIPS assembler program on the
next page. Construct a C program that generates the same result in memory as the MIPS
program. Note that there are two memory arrays. A 16x16 word matrix starting at address
0x1001 0000 and an output array of size 16 words (address computed and stored in
$s3 in part 2 of the MIPS code). Your C program should start with the variable declarations shown below. Assume that there exists code at the end of the program that prints
out the arrays to standard output. Explain in detail how your program works and what the
expected result of the program is, i.e., you should explain what the program is actually
trying to accomplish (15 points).
Swedish: Analysera noggrant det partiellt kommenterade MIPS-assembler-programmet
på nästa sida. Konstruera ett C-program som genererar samma resultat i minnet som
MIPS-programmet. Notera att det finns två stycken minnesarrayer. En 16x16-ords matris
som startar på adress 0x1001 0000 och en “output”-array av storlek 16 ord (adressen
beräknas och sparas i $s3 i del 2 av MIPS-koden). Ditt C-program ska börja med de
variabeldeklarationer som finns nedan. Anta att det finns kod i slutet av programmet som
skriver ut arrayerna till standard output. Förklara i detalj hur ditt program fungerar och
vad det förväntade resultatet av programmet är, dvs. du ska förklara vad programmet
egentligen försöker att göra (15 poäng).
int main(){
    const int maxval = 16;
    int matrix[maxval*maxval];
    int output[maxval];
    // Insert your code here.
    // Assume that there is code here that prints
    // out the arrays to standard output.
}
###### PART 1 #####
        addi  $s0, $0, 16
        sll   $s1, $s0, 4
        lui   $s2, 0x1001             # Address to a matrix that is 16x16 word
        addi  $t1, $0, 0              # counter i
loop1_outer:
        slt   $t3, $t1, $s0
        beq   $t3, $0, done1_outer
        addi  $t2, $0, 0              # counter j
loop1_inner:
        slt   $t3, $t2, $s0
        beq   $t3, $0, done1_inner
        sll   $t3, $t1, 4             # compute address
        add   $t3, $t3, $t2
        sll   $t3, $t3, 2
        add   $t3, $t3, $s2
        addi  $t4, $t1, 2             # compute element value
        addi  $t5, $t2, 2
        mul   $t4, $t4, $t5
        sw    $t4, 0($t3)             # store result in matrix
        addi  $t2, $t2, 1             # increase and loop
        j     loop1_inner
done1_inner:
        addi  $t1, $t1, 1
        j     loop1_outer
done1_outer:

###### PART 2 #####
        sll   $s3, $s1, 2
        add   $s3, $s3, $s2           # address to output array
        addi  $t1, $0, 2              # counter i
loop2_outer:
        slt   $t3, $t1, $s0
        beq   $t3, $0, done2_outer
        addi  $t6, $0, 1
        addi  $t2, $0, 0              # counter j
loop2_inner:
        slt   $t3, $t2, $s1
        beq   $t3, $0, done2_inner
        sll   $t4, $t2, 2             # get element from matrix
        add   $t4, $t4, $s2
        lw    $t5, 0($t4)
        bne   $t1, $t5, no_match      # check if elements match
        addi  $t6, $0, 0
no_match:
        addi  $t2, $t2, 1
        j     loop2_inner             # next
done2_inner:
        beq   $t6, $0, no_output      # check if write output
        sw    $t1, 0($s3)
        addi  $s3, $s3, 4
no_output:
        addi  $t1, $t1, 1
        j     loop2_outer             # next
done2_outer:
Solution: The program computes all prime numbers less than 16 (that is, 2, 3, 5, 7, 11,
and 13) and stores them in the output array. The first part of the program computes a
multiplication matrix, where element (i, j) holds the product (i+2) * (j+2). The second
part then searches the matrix for each number from 2 to 15. A number is prime if it
does not exist in the matrix. This is a simple, but not very efficient, way of computing
prime numbers.
Example C code is shown below.
int main(){
    const int maxval = 16;
    int matrix[maxval*maxval];
    int output[maxval];

    // Compute the multiplication matrix
    int i,j;
    for(i=0; i<maxval; i++){
        for(j=0; j<maxval; j++){
            matrix[i*maxval + j] = (i+2) * (j+2);
        }
    }

    // Extract the prime numbers
    int k = 0;
    for(i=2; i<maxval; i++){
        int isprime = 1;
        for(j=0; j<maxval*maxval; j++){
            if(matrix[j] == i)
                isprime = 0;
        }
        if(isprime)
            output[k++] = i;
    }
    // Assume that there is code here that prints
    // out the arrays to standard output.
}
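To relate the C solution to the MIPS code, the following sketch mirrors the address arithmetic of the assembler program and packages the two loops as a function that can be checked. The names MAXVAL, MATRIX_BASE, element_address, output_base, and find_primes are hypothetical helpers introduced for this illustration. In the C solution the compiler chooses the array addresses, so the fixed addresses below describe only the MIPS program.

```c
#include <stdint.h>

#define MAXVAL      16
#define MATRIX_BASE 0x10010000u  /* value of $s2 after lui $s2, 0x1001 */

/* Byte address of matrix element (i, j), as computed in part 1:
   sll $t3,$t1,4 ; add $t3,$t3,$t2 ; sll $t3,$t3,2 ; add $t3,$t3,$s2 */
uint32_t element_address(uint32_t i, uint32_t j)
{
    return MATRIX_BASE + (((i << 4) + j) << 2);
}

/* Start address of the output array, as computed in part 2:
   $s1 = 16 << 4 = 256 elements, $s3 = ($s1 << 2) + $s2 */
uint32_t output_base(void)
{
    uint32_t s1 = (uint32_t)MAXVAL << 4;
    return (s1 << 2) + MATRIX_BASE;
}

/* Runs both parts of the program. Writes the primes found to output[]
   and returns how many values were written. */
int find_primes(int output[MAXVAL])
{
    int matrix[MAXVAL * MAXVAL];
    int i, j, k = 0;

    for (i = 0; i < MAXVAL; i++)            /* part 1: multiplication matrix */
        for (j = 0; j < MAXVAL; j++)
            matrix[i * MAXVAL + j] = (i + 2) * (j + 2);

    for (i = 2; i < MAXVAL; i++) {          /* part 2: search for each i */
        int isprime = 1;
        for (j = 0; j < MAXVAL * MAXVAL; j++)
            if (matrix[j] == i)
                isprime = 0;
        if (isprime)
            output[k++] = i;
    }
    return k;
}
```

The output array starts at 0x10010400, directly after the 256-word matrix. Calling find_primes writes 2, 3, 5, 7, 11, 13 to the first six elements and returns 6. The remaining ten elements of the output array are never written.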
MIPS Reference Sheet

INSTRUCTION SET (SUBSET)

Name (format, op, funct)                    Syntax            Operation
add (R,0,32)                                add rd,rs,rt      reg(rd) := reg(rs) + reg(rt);
add immediate (I,8,na)                      addi rt,rs,imm    reg(rt) := reg(rs) + signext(imm);
add immediate unsigned (I,9,na)             addiu rt,rs,imm   reg(rt) := reg(rs) + signext(imm);
add unsigned (R,0,33)                       addu rd,rs,rt     reg(rd) := reg(rs) + reg(rt);
and (R,0,36)                                and rd,rs,rt      reg(rd) := reg(rs) & reg(rt);
and immediate (I,12,na)                     andi rt,rs,imm    reg(rt) := reg(rs) & zeroext(imm);
branch on equal (I,4,na)                    beq rs,rt,label   if reg(rs) == reg(rt) then PC = BTA else NOP;
branch on not equal (I,5,na)                bne rs,rt,label   if reg(rs) != reg(rt) then PC = BTA else NOP;
jump and link register (R,0,9)              jalr rs           $ra := PC + 4; PC := reg(rs);
jump register (R,0,8)                       jr rs             PC := reg(rs);
jump (J,2,na)                               j label           PC := JTA;
jump and link (J,3,na)                      jal label         $ra := PC + 4; PC := JTA;
load byte (I,32,na)                         lb rt,imm(rs)     reg(rt) := signext(mem[reg(rs) + signext(imm)]_7:0);
load byte unsigned (I,36,na)                lbu rt,imm(rs)    reg(rt) := zeroext(mem[reg(rs) + signext(imm)]_7:0);
load upper immediate (I,15,na)              lui rt,imm        reg(rt) := concat(imm, 16 bits of 0);
load word (I,35,na)                         lw rt,imm(rs)     reg(rt) := mem[reg(rs) + signext(imm)];
multiply, 32-bit result (R,28,2)            mul rd,rs,rt      reg(rd) := reg(rs) * reg(rt);
nor (R,0,39)                                nor rd,rs,rt      reg(rd) := not(reg(rs) | reg(rt));
or (R,0,37)                                 or rd,rs,rt       reg(rd) := reg(rs) | reg(rt);
or immediate (I,13,na)                      ori rt,rs,imm     reg(rt) := reg(rs) | zeroext(imm);
set less than (R,0,42)                      slt rd,rs,rt      reg(rd) := if reg(rs) < reg(rt) then 1 else 0;
set less than unsigned (R,0,43)             sltu rd,rs,rt     reg(rd) := if reg(rs) < reg(rt) then 1 else 0;
set less than immediate (I,10,na)           slti rt,rs,imm    reg(rt) := if reg(rs) < signext(imm) then 1 else 0;
set less than immediate unsigned (I,11,na)  sltiu rt,rs,imm   reg(rt) := if reg(rs) < signext(imm) then 1 else 0;
shift left logical (R,0,0)                  sll rd,rt,shamt   reg(rd) := reg(rt) << shamt;
shift left logical variable (R,0,4)         sllv rd,rt,rs     reg(rd) := reg(rt) << reg(rs)_4:0;
shift right arithmetic (R,0,3)              sra rd,rt,shamt   reg(rd) := reg(rt) >>> shamt;
shift right logical (R,0,2)                 srl rd,rt,shamt   reg(rd) := reg(rt) >> shamt;
shift right logical variable (R,0,6)        srlv rd,rt,rs     reg(rd) := reg(rt) >> reg(rs)_4:0;
store byte (I,40,na)                        sb rt,imm(rs)     mem[reg(rs) + signext(imm)]_7:0 := reg(rt)_7:0;
store word (I,43,na)                        sw rt,imm(rs)     mem[reg(rs) + signext(imm)] := reg(rt);
subtract (R,0,34)                           sub rd,rs,rt      reg(rd) := reg(rs) - reg(rt);
subtract unsigned (R,0,35)                  subu rd,rs,rt     reg(rd) := reg(rs) - reg(rt);
xor (R,0,38)                                xor rd,rs,rt      reg(rd) := reg(rs) ^ reg(rt);
xor immediate (I,14,na)                     xori rt,rs,imm    reg(rt) := reg(rs) ^ zeroext(imm);

Definitions
• Jump to target address: JTA = concat((PC + 4)_31:28, address(label), 00_2)
• Branch target address: BTA = PC + 4 + imm * 4

Clarifications
• All numbers are given in decimal form (base 10).
• Function signext(x) returns a 32-bit sign extended value of x in two’s complement form.
• Function zeroext(x) returns a 32-bit value, where zeros are added to the most significant side of x.
• Function concat(x, y, …, z) concatenates the bits of expressions x, y, …, z.
• Subscripts, for instance X_8:2, mean that bits with index 8 to 2 are spliced out of the integer X.
• Function address(x) means the address of label x.
• NOP and na mean “no operation” and “not applicable”, respectively.
• shamt is an abbreviation for “shift amount”, i.e. how much bit shifting that should be done.
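
The helper functions of the reference sheet can be sketched in C as follows. The names signext16, zeroext16, bta, and jta are hypothetical, and the sketch assumes that imm is the 16-bit immediate field and that the address argument of jta is the 26-bit address field of a J-type instruction.

```c
#include <stdint.h>

/* signext(imm): copy bit 15 of the 16-bit immediate into bits 31:16. */
uint32_t signext16(uint16_t imm)
{
    return (imm & 0x8000u) ? (0xFFFF0000u | imm) : (uint32_t)imm;
}

/* zeroext(imm): fill bits 31:16 with zeros. */
uint32_t zeroext16(uint16_t imm)
{
    return (uint32_t)imm;
}

/* Branch target address: BTA = PC + 4 + imm * 4, with imm sign extended. */
uint32_t bta(uint32_t pc, uint16_t imm)
{
    return pc + 4u + (signext16(imm) << 2);
}

/* Jump target address: upper four bits of PC + 4, the 26-bit address
   field, and two zero bits, concatenated. */
uint32_t jta(uint32_t pc, uint32_t addr26)
{
    return ((pc + 4u) & 0xF0000000u) | ((addr26 & 0x03FFFFFFu) << 2);
}
```

For instance, a branch at address 0x00400010 with immediate 0xFFFF (that is, -1) has the branch itself as target, since PC + 4 - 4 = PC.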

INSTRUCTION FORMAT

R-Type
  31    26 25    21 20    16 15    11 10     6 5      0
  op       rs       rt       rd       shamt    funct
  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-Type
  31    26 25    21 20    16 15                       0
  op       rs       rt       immediate
  6 bits   5 bits   5 bits   16 bits

J-Type
  31    26 25                                         0
  op       address
  6 bits   26 bits
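
As an illustration of the three formats, the following C sketch packs the fields into a 32-bit instruction word. The function names encode_r, encode_i, and encode_j are hypothetical. Each field is masked to its width before it is shifted into place.

```c
#include <stdint.h>

/* R-Type: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t shamt, uint32_t funct)
{
    return ((op & 0x3Fu) << 26) | ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16)
         | ((rd & 0x1Fu) << 11) | ((shamt & 0x1Fu) << 6) | (funct & 0x3Fu);
}

/* I-Type: op(6) rs(5) rt(5) immediate(16) */
uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, uint32_t imm)
{
    return ((op & 0x3Fu) << 26) | ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16)
         | (imm & 0xFFFFu);
}

/* J-Type: op(6) address(26) */
uint32_t encode_j(uint32_t op, uint32_t address)
{
    return ((op & 0x3Fu) << 26) | (address & 0x03FFFFFFu);
}
```

For example, add $t0, $s1, $s2 (op 0, funct 32, registers 17, 18, and 8) is encoded as 0x02324020, and addi $s0, $0, 16 (op 8) as 0x20100010.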

REGISTERS

Name    Number  Description
$zero   0       constant value 0
$at     1       assembler temp
$v0     2       function return
$v1     3       function return
$a0     4       argument
$a1     5       argument
$a2     6       argument
$a3     7       argument
$t0     8       temporary value
$t1     9       temporary value
$t2     10      temporary value
$t3     11      temporary value
$t4     12      temporary value
$t5     13      temporary value
$t6     14      temporary value
$t7     15      temporary value
$s0     16      saved temporary
$s1     17      saved temporary
$s2     18      saved temporary
$s3     19      saved temporary
$s4     20      saved temporary
$s5     21      saved temporary
$s6     22      saved temporary
$s7     23      saved temporary
$t8     24      temporary value
$t9     25      temporary value
$k0     26      reserved for OS
$k1     27      reserved for OS
$gp     28      global pointer
$sp     29      stack pointer
$fp     30      frame pointer
$ra     31      return address

By David Broman, KTH Royal Institute of Technology
If you find any errors or have any feedback on this document, please send me an email: [email protected]
Version 1.01, January 9, 2015
Excerpt from the Nios II Processor Reference Handbook
IS1200/IS1500 2014-05-22, pages 11–18