
Written Exam / Tentamen∗
Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp
Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp
KTH Royal Institute of Technology
2015-01-14
Examiner / Examinator: David Broman
Teacher on duty / Ansvarig lärare: David Broman, [email protected], +46 73 765 20 44
Instructions in English
• Allowed aids: One sheet of A4 paper with handwritten notes. You may write on both sides
of the paper.
• Explicitly forbidden aids: Textbooks, electronic equipment, calculators, mobile phones, machine-written pages, photocopied pages, and pages of any size other than A4.
• Please write and draw carefully. Unreadable text may lead to zero points.
• You do not need to return these exam papers when you hand in your exam solutions.
• You may write your answers in either Swedish or English.
The exam consists of two parts:
• Part I: Fundamentals: The maximal number of points for Part I is 48 points (for IS1500)
and 40 points (for IS1200). There are 8 points for each of the six course modules. All questions in Part I expect only short answers. At most a few sentences are needed.
• Part II: Advanced: The maximal number of points for Part II is 50 points. In the answers,
it is required that the student discuss, analyze, or construct. Furthermore, answers to these
questions require clear motivations.
True/False/Don’t know Questions in Part I
These questions can give between 0 and 4 points. Each question consists of 4 statements. For
each statement, you should answer either true, false, or “don’t know”. The points are calculated
as follows: Each correct answer (answer true or false) gives one point. Each incorrect answer
(true or false) gives minus one point. If you answer “don’t know” for a statement, it neither
adds nor removes points. The “don’t know” option exists to discourage guessing.
Example: Assume that the correct answers to four statements are: true, false, true, false.
Assume that the student answered: true, true, don’t know, false. The total number of points is
then: 1 - 1 + 0 + 1 = 1 point. Note that even if the answers to all four statements are wrong,
the points for the whole question can never be negative, that is, the final points will always be
between 0 and 4.
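The scoring rule can be written out as a small C function. This is only an illustrative sketch: the `enum answer` type and the name `question_points` are hypothetical and not part of the exam.

```c
#include <stddef.h>

/* Answer codes for one statement; the names are illustrative only. */
enum answer { ANS_FALSE = 0, ANS_TRUE = 1, ANS_DONT_KNOW = 2 };

/* Points for one question with n statements: +1 per correct answer,
 * -1 per incorrect answer, 0 for "don't know", clamped at zero. */
int question_points(const enum answer *key, const enum answer *given, size_t n)
{
    int points = 0;
    for (size_t i = 0; i < n; i++) {
        if (given[i] == ANS_DONT_KNOW)
            continue;                    /* neither adds nor removes points */
        points += (given[i] == key[i]) ? 1 : -1;
    }
    return points < 0 ? 0 : points;      /* a question never scores below 0 */
}
```

With the key true, false, true, false and the answers true, true, don't know, false, the function returns 1, as in the example.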
∗ This version of the exam paper was updated on February 6, 2015. The updates include changes to the pass
criteria (the pass criterion in the fundamental part was lowered from 36 to 33 points for IS1500 and from 30
to 27 points for IS1200). We also updated the criterion for FX (only 11 points are needed on the advanced part) and added
a few clarifications to the solution suggestions.
Grades
To get a pass grade (A, B, C, D, or E), it is required to pass Part I of the exam. For IS1500
students, it is required to get 33 points or more on Part I to pass the exam. IS1200 students
should not answer question 1 in Part I. For IS1200 students, it is required to have 27 points or
more in total for questions 2-6 on Part I.
Grading scale (For both IS1200 and IS1500):
• A: 41-50 points on Part II
• B: 31-40 points on Part II
• C: 21-30 points on Part II
• D: 11-20 points on Part II
• E: 0-10 points on Part II
• FX: 30-32 points (IS1500) or 24-26 (IS1200) on Part I, and 11-50 points on Part II.
• F: otherwise
Results
• The result will be announced no later than 2015-02-04.
• A student who receives the grade FX may request a complementary oral examination.
The examination is held only if the student requests it. Please send an
email to [email protected] no later than 2015-02-25.
Instruktioner på Svenska
• Tillåtna hjälpmedel: En A4-sida med handskrivna anteckningar. Det är tillåtet att skriva på
båda sidorna av pappret.
• Förbjudna hjälpmedel: Läroböcker, elektroniska hjälpmedel, miniräknare, mobiltelefoner,
maskinskrivna sidor, kopierade papper, sidor av andra storlekar än A4.
• Skriv och rita noggrant. Oläsbar text kan resultera i noll poäng.
• Du behöver inte lämna tillbaka dessa tentamenspapper när du lämnar in tentamenslösningarna.
• Du kan skriva dina svar på antingen engelska eller svenska.
Tentamen består av två delar:
• Del I: Fundamentala delen: Maximalt antal poäng för del I är 48 poäng (för IS1500) och
40 poäng (för IS1200). Totalpoängen per kursmodul är 8 poäng (6 moduler totalt). För del I
förväntas det endast korta svar på frågorna. Endast ett fåtal meningar krävs.
• Del II: Avancerade delen: Det maximala antalet poäng för del II är 50 poäng. I svaren för
denna del krävs att studenten diskuterar, analyserar och konstruerar. Vidare kräver svaren till
dessa frågor tydliga motiveringar.
Sant/Falskt/”Vet ej” frågor för Del I
Dessa frågor kan ge mellan 0 och 4 poäng. Varje fråga består av 4 påståenden. För varje
påstående ska du svara antingen sant, falskt, eller “vet ej”. Poängen beräknas enligt följande:
Varje korrekt svar (svar sant eller falskt) ger 1 poäng. Varje felaktigt svar (sant eller falskt)
ger minus en poäng. Om du svarar “vet ej” för ett påstående ger det varken extrapoäng eller
minuspoäng. Svarsalternativet “vet ej” finns för att undvika att studenten gissar.
Exempel: Anta att de korrekta svaren för fyra påståenden är: sant, falskt, sant, falskt. Anta
att studenten svarar: sant, sant, “vet ej”, falskt. Det totala antalet poäng blir då: 1 - 1 + 0 + 1 =
1 poäng. Notera att även om samtliga svar till alla fyra påståenden är felaktiga så kan poängen
för hela frågan aldrig bli negativ. Således är den slutliga poängen för frågan alltid mellan 0 och
4 poäng.
Betyg
För att erhålla godkänt betyg (A, B, C, D eller E) krävs att man får godkänt på del I. För IS1500-studenter krävs det 33 poäng eller mer för del I för att få godkänt på tentamen. IS1200-studenter
ska inte svara på fråga 1 i del I. För IS1200-studenter krävs det 27 poäng eller mer totalt för
frågorna 2-6 på del I för att bli godkänd.
Betygsskala (För både IS1200 och IS1500):
• A: 41-50 poäng på del II
• B: 31-40 poäng på del II
• C: 21-30 poäng på del II
• D: 11-20 poäng på del II
• E: 0-10 poäng på del II
• FX: 30-32 poäng (IS1500) eller 24-26 (IS1200) på del I och 11-50 poäng på del II.
• F: i övriga fall
Resultat
• Resultaten kommer att meddelas senast 2015-02-04.
• Om en student får betyg FX är det möjligt att begära en muntlig examination. En sådan
muntlig examination måste begäras via epost av studenten. Skicka ett epost-meddelande till
[email protected] senast 2015-02-25.
Part I: Fundamentals
1. Module 1: Logic Design (for IS1500 only)
(a) English: Consider the figure below. Assume that the register initially has a zero
value and that it is triggered on a rising clock edge. What are then the values for
signals A, B, Y0, Y1, Y2, and Y3 after the first rising clock edge when all signals
have stabilized? (4 points)
Swedish: Studera figuren nedan. Anta att register initialt har värdet noll och att
stigande klockflank är aktiv. Vad är då värdena på signalerna A, B, Y0 , Y1 , Y2 , och
Y3 efter den första stigande klockflank då alla signaler har stabiliserats? (4 poäng)
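[Figure: a circuit consisting of a 2-bit adder with the constant input $3_{10}$, a 2-bit register clocked by CLK, the 2-bit signals A and B, a multiplexer with the data inputs 1, 0, 1, 0 for the select values 00, 01, 10, 11, a constant 1, and a decoder with the outputs Y0, Y1, Y2, and Y3.]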
Solution: $A = 10_2$, $B = 11_2$, Y0 = 0, Y1 = 0, Y2 = 0, and Y3 = 1.
Explanation: The initial state value of the register is zero. Hence, after the first
rising clock edge (and after stabilization) the register receives the value 3 + 0 = 3, and
therefore $B = 11_2 = 3_{10}$. The adder then adds 3 and 3, and since the signal is only
two bits wide, the sum wraps around ($6 \bmod 4 = 2$) and $A = 10_2$. The multiplexer selects the third
input (select value 10), which has the value 1. Hence, the signal that is decoded is 11, which means
that only Y3 is 1, and the rest of the output signals from the decoder are zero.
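The reasoning in the explanation can be replayed with a small C model. This is a sketch under assumptions: the wiring (register output is B, the adder computes A from B and the constant 3, the multiplexer is selected by A, and the decoder input is the constant 1 together with the multiplexer output) is inferred from the explanation, and all names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of the circuit, reconstructed from the explanation. */
struct circuit_state {
    uint8_t a;      /* 2-bit adder output (signal A) */
    uint8_t b;      /* 2-bit register output (signal B) */
    uint8_t y[4];   /* one-hot decoder outputs Y0..Y3 */
};

/* 2-bit adder: adds the constant 3 and wraps around modulo 4. */
static uint8_t adder(uint8_t b) { return (uint8_t)((b + 3u) & 0x3u); }

/* Advance the circuit by n_edges rising clock edges, starting from 0. */
struct circuit_state simulate(unsigned n_edges)
{
    static const uint8_t mux_inputs[4] = { 1, 0, 1, 0 };  /* inputs 00..11 */
    struct circuit_state s = { 0 };
    uint8_t reg = 0;
    for (unsigned i = 0; i < n_edges; i++)
        reg = adder(reg);          /* the register captures A on the edge */
    s.b = reg;
    s.a = adder(reg);
    /* Decoder input: constant 1 as the high bit, mux output as the low bit.
     * The bit order is an assumption; for one edge both bits are 1. */
    uint8_t dec_in = (uint8_t)((1u << 1) | mux_inputs[s.a]);
    s.y[dec_in] = 1;
    return s;
}
```

Calling `simulate(1)` yields A = 2, B = 3, and only Y3 set, in agreement with the solution.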
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: Proof by perfect induction means that a theorem in boolean algebra can
be proven correct by exhaustively showing all possible combinations in a truth
table.
Swedish: Bevis med hjälp av perfekt induktion innebär att ett teorem i boolesk
algebra kan bevisas vara korrekt genom att uttömmande visa alla olika kombinationer i en sanningstabell.
Solution: True. Proof by perfect induction is the same as Proof by Exhaustion
(see the lecture slides). Proof by perfect induction (note the word perfect) should
not be mixed up with standard mathematical induction proofs.
• Statement 2:
English: When a tristate buffer is disabled, the output is said to be floating.
Swedish: När en tristate buffer är avaktiverad är dess utsignal flytande (floating).
Solution: True.
• Statement 3:
English: The main difference between an SR latch and a D latch is that a D latch
is triggered on the edge of a clock signal, whereas an SR latch is transparent.
Swedish: Den största skillnaden mellan en SR-latch och en D-latch är att en D-latch aktiveras på flanken av en klocksignal, medan en SR-latch är transparent.
Solution: False. Neither SR latches nor D latches are edge triggered. The main
difference between an SR latch and a D latch is that a D latch is clocked, whereas
an SR latch is not clocked.
• Statement 4:
English: In a synchronous sequential circuit, the propagation delay of the combinational logic in the circuit is an important factor when determining the maximal
clock frequency.
Swedish: För en synkront sekvenskrets är grindfördröjningen av den kombinatoriska logiken i kretsen en viktig faktor när man bestämmer den maximala klockfrekvensen.
Solution: True. Several delays in the circuit determine the maximal clock frequency. The propagation delay of the combinational logic is one of these important delays.
2. Module 2: C and Assembly Programming
(a) English: What is the binary machine code representation of the MIPS instruction
lb $s1,-67($t2)
Answer as a 32-bit binary number. Note that a page with the structure of the encoding of MIPS is available at the end of the exam. (4 points)
Swedish: Vad är den binära maskinkodsrepresentationen för MIPS-instruktionen
lb $s1,-67($t2)
Svara som ett 32-bitars binärt tal. Notera att strukturen för MIPS-kodning är tillgängligt på en sida i slutet av tentan. (4 poäng)
Solution: 1000 0001 0101 0001 1111 1111 1011 1101
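The encoding follows the MIPS I-type layout: opcode (6 bits), rs (5 bits), rt (5 bits), and a 16-bit immediate. For `lb` the opcode is 100000 (0x20), the base register `$t2` is register 10, the target register `$s1` is register 17, and -67 in 16-bit two's complement is 0xFFBD. A minimal C sketch of the packing (the function name `encode_i_type` is hypothetical):

```c
#include <stdint.h>

/* Pack a MIPS I-type instruction:
 * opcode[31:26] rs[25:21] rt[20:16] imm[15:0]. */
uint32_t encode_i_type(uint32_t opcode, uint32_t rs, uint32_t rt, int32_t imm)
{
    return ((opcode & 0x3Fu) << 26) |
           ((rs & 0x1Fu) << 21) |
           ((rt & 0x1Fu) << 16) |
           ((uint32_t)imm & 0xFFFFu);   /* 16-bit two's complement immediate */
}
```

`encode_i_type(0x20, 10, 17, -67)` gives 0x8151FFBD, which is the bit pattern in the solution.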
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
int x = 5;
int y = 6;
int *p = &x;
*p = x + y;
p = &y;
*p = x + y;
printf("%d,%d", x, y);
English: When executing the C code above, the string 11,17 is printed to the
standard output.
Swedish: När C-programmet ovan exekveras så printas strängen 11,17 till standard output.
Solution: True.
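The result can be traced step by step. The sketch below wraps the statements of the snippet in a hypothetical function `run_snippet` that reports the final values instead of printing them:

```c
/* Replays the statements from the exam snippet and reports x and y. */
void run_snippet(int *x_out, int *y_out)
{
    int x = 5;
    int y = 6;
    int *p = &x;   /* p points to x */
    *p = x + y;    /* x = 5 + 6 = 11 */
    p = &y;        /* p now points to y */
    *p = x + y;    /* y = 11 + 6 = 17 */
    *x_out = x;
    *y_out = y;
}
```

The first assignment through `p` changes `x`, so the second assignment uses the updated value 11 when it computes the new `y`.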
• Statement 2:
English: The data structure that is used for storing local variables and additional
arguments during function calls stores the values as a first-in first-out (FIFO)
queue in the memory.
Swedish: Datastrukturen, som används för att spara lokala variabler och ytterligare argument vid funktionsanrop, sparar värdena som en först-in-först-ut (FIFO)
kö i minnet.
Solution: False. The data structure is a stack, which stores its values in last-in first-out
(LIFO) order.
• Statement 3:
English: In the MIPS ISA, a calling function (called the caller) does not have to
save the registers $s0 to $s7 because the callee (the function that is called) is
responsible for saving these registers, if they are used in the called function.
Swedish: För MIPS ISA, behöver inte en anropande funktion (även kallad “caller”)
spara registerna $s0 till $s7, då funktionen som är anropad är ansvarig för att
spara dessa register, om de används i den anropade funktionen.
Solution: True.
• Statement 4:
English: A benefit of PC-relative addressing is that not all bits of the absolute
address need to be stored in the instruction.
Swedish: En fördel med PC-relativ adressering är att alla bitar av den absoluta
adressen inte behöver lagras i instruktionen.
Solution: True. Part of the current PC address is used when computing the
branch target address (BTA), that is, the absolute address that is written to the PC when the branch is taken.
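For a MIPS conditional branch, the BTA is commonly computed as the address of the following instruction plus the sign-extended 16-bit immediate shifted left by two. A minimal C sketch of that computation (the function name `branch_target` is hypothetical):

```c
#include <stdint.h>

/* Branch target address for a MIPS branch located at address pc:
 * the 16-bit immediate is sign-extended, multiplied by 4 (word offset),
 * and added to the address of the following instruction (pc + 4). */
uint32_t branch_target(uint32_t pc, int16_t imm)
{
    int32_t offset = (int32_t)imm * 4;
    return pc + 4u + (uint32_t)offset;
}
```

Only the 16-bit offset is stored in the instruction; the remaining address bits come from the PC.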
3. Module 3: Processor Design
(a) English: Consider the figure below that shows the datapath for a single-cycle MIPS
processor. Assume that the current instruction that is executing is
slt $t0,$t1,$t3. What are then the values of signals MemToReg, ALUSrc,
Branch, and A3? (4 points)
Swedish: Studera figuren nedan som visar en dataväg för en enkel-cyklig MIPSprocessor. Anta att den nuvarande instruktionen som exekveras är
slt $t0,$t1,$t3. Vad är då värdena av signalerna MemToReg, ALUSrc, Branch
och A3? (4 poäng)
[Figure: datapath of a single-cycle MIPS processor. It shows the PC, the instruction memory (A, RD), the register file (A1 = Inst 25:21, A2 = Inst 20:16, A3, WD3, WE3, RD1, RD2), the sign-extend unit (Inst 15:0), the ALU with its Zero output, the data memory (A, WD, WE, RD), the adders and multiplexers for the next PC, and the control signals MemToReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst, RegWrite, and Jump.]
Solution: MemToReg = 0, ALUSrc = 0, Branch = 0, and A3 = $01000_2$
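The value of A3 follows from the R-type encoding: `slt` writes its result to the rd field, Inst[15:11], and `$t0` is register 8. The C sketch below packs the instruction and models the write-register multiplexer. It assumes the usual convention that RegDst = 1 selects Inst[15:11] and RegDst = 0 selects Inst[20:16]; the function names are hypothetical.

```c
#include <stdint.h>

/* Pack a MIPS R-type instruction (opcode 0):
 * rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]. */
uint32_t encode_r_type(uint32_t rs, uint32_t rt, uint32_t rd,
                       uint32_t shamt, uint32_t funct)
{
    return ((rs & 0x1Fu) << 21) | ((rt & 0x1Fu) << 16) |
           ((rd & 0x1Fu) << 11) | ((shamt & 0x1Fu) << 6) | (funct & 0x3Fu);
}

/* Write-register select: RegDst = 1 picks rd (R-type),
 * RegDst = 0 picks rt (for example for lw). */
uint32_t write_register(uint32_t inst, int reg_dst)
{
    uint32_t rt = (inst >> 16) & 0x1Fu;
    uint32_t rd = (inst >> 11) & 0x1Fu;
    return reg_dst ? rd : rt;
}
```

For `slt $t0,$t1,$t3` the fields are rs = 9, rt = 11, rd = 8, and funct = 0x2A, so `write_register(encode_r_type(9, 11, 8, 0, 0x2A), 1)` returns 8.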
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: In a five-stage MIPS pipelined datapath, the execute stage is usually
used for reading out the values from the register file.
Swedish: I en fem-stegs MIPS-pipelinad dataväg så läser vanligtvis exekveringssteget ut värdena från registerfilen.
Solution: False. It is done in the decode stage.
• Statement 2:
English: A benefit of an arithmetic logic unit (ALU) is that the same hardware
can be used for different arithmetic operations.
Swedish: En fördel med en aritmetisk-logisk enhet (ALU) är att samma hårdvara
kan användas för olika aritmetiska operationer.
Solution: True.
• Statement 3:
English: A pipelined datapath may, compared to a single-cycle datapath, increase
the average cycles per instruction (CPI).
Swedish: En pipelinad dataväg kan, i jämförelse med en enkel-cyklig dataväg,
öka genomsnittliga värdet för antalet cykler per instruktion (CPI).
Solution: True. Compared to a single-cycle datapath (where each instruction
takes one clock cycle) the pipeline may result in hazards, which can result in
stalling. Hence, the average CPI may increase when a pipelined datapath is used.
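A simple model makes the effect concrete: with an ideal CPI of 1, every stall cycle adds directly to the cycle count. The sketch below is an assumption-laden simplification (it ignores the cycles needed to fill the pipeline and requires at least one instruction), and the function name is hypothetical.

```c
/* Average CPI of a pipeline with an ideal CPI of 1, given the number of
 * executed instructions (must be > 0) and the total number of stall cycles. */
double average_cpi(unsigned long instructions, unsigned long stall_cycles)
{
    return 1.0 + (double)stall_cycles / (double)instructions;
}
```

For example, 25 stall cycles over 100 instructions give an average CPI of 1.25, compared with exactly 1 for the single-cycle datapath.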
• Statement 4:
English: Instructions in a CISC instruction set architecture (ISA) can in general
perform more complex operations than instructions in a RISC ISA.
Swedish: Instruktioner i en CISC instruction set architecture (ISA) kan generellt
utföra mer komplexa operationer än instruktioner i ett RISC ISA.
Solution: True. CISC stands for “Complex Instruction Set Computing”. An example of CISC is x86, where an instruction can perform several tasks, for instance
both load from memory and perform an addition. ISAs that are based on RISC,
which stands for “Reduced Instruction Set Computing”, in general have few, simple instructions.
4. Module 4: Memory Hierarchy
(a) English: Assume that we have a 2-way set associative cache for a processor that
uses 32 bits for addressing. The cache has 1024 cache blocks in total and the block
size is 16 bytes. How many bits wide are the tag field, the set field (also called the
index), and the byte offset field, and how many validity bits does the cache contain
in total? (4 points)
Swedish: Anta att vi har en 2-way set associative cache för en processor som
använder 32 bitar för adressering. Cachen har 1024 cache block totalt och en blockstorlek på 16 bytes. Hur många bitar är då adressetiketten (även kallad tag), fältet
för radnummer (även kallad index eller set), och fältet för bytenummer (även kallad
byte offset), samt hur många giltighetsbitar innehåller totalt denna cache? (4 poäng)
Solution: There are in total 1024/2 = 512 sets (Swedish: rader). Hence, the set
field is 9 bits. The block size is 16 bytes, making the byte offset field 4 bits. As a
consequence, the tag is 32 − 9 − 4 = 19 bits. Finally, there are 1024 validity bits
in total, one per block.
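The same arithmetic works for any power-of-two cache geometry. A minimal C sketch, with hypothetical names, that computes the field widths from the address width, the number of blocks, the associativity, and the block size:

```c
/* log2 of a power of two (address bits needed to index n items). */
static unsigned log2_pow2(unsigned n)
{
    unsigned bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}

struct cache_fields {
    unsigned tag_bits, set_bits, offset_bits, valid_bits;
};

/* Field widths for a set-associative cache; sizes must be powers of two. */
struct cache_fields cache_layout(unsigned addr_bits, unsigned blocks,
                                 unsigned ways, unsigned block_bytes)
{
    struct cache_fields f;
    f.set_bits = log2_pow2(blocks / ways);   /* number of sets = blocks / ways */
    f.offset_bits = log2_pow2(block_bytes);
    f.tag_bits = addr_bits - f.set_bits - f.offset_bits;
    f.valid_bits = blocks;                   /* one validity bit per block */
    return f;
}
```

`cache_layout(32, 1024, 2, 16)` returns 19 tag bits, 9 set bits, 4 byte offset bits, and 1024 validity bits.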
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: A common replacement policy for direct mapped caches is Least Recently Used (LRU), since a set does not always have to be replaced directly, even
if the validity bit is 1.
Swedish: En vanlig utbytespolicy för direkt mappade cachar är Least Recently
Used (LRU). Detta då raden inte alltid behöver ersättas, även om giltighetsbiten
är satt till 1.
Solution: False. A direct mapped cache does not need a specific replacement
policy. Each address maps to exactly one block, so if that block is in use, it must always be replaced.
• Statement 2:
English: One benefit of virtual memory is that each process (each running program) can have its own virtual memory space, which is protected from other concurrently running processes in the operating system.
Swedish: En fördel med virtuellt minne är att varje process (varje exekverande
program) kan ha sitt eget virtuella minnesutrymme, vilket är skyddat från andra
samtidiga processer som exekveras i operativsystemet.
Solution: True.
• Statement 3:
English: Assume that we have a direct mapped cache with block size 16 bytes and
256 blocks in total. If a load byte instruction reads from address 0xff20 215e,
followed by another load byte instruction that reads from address 0x1000 5150,
the second load instruction will result in a cache miss.
Swedish: Anta att vi har en direkt mappad cache med blockstorlek 16 bytes och
256 block totalt. Om en load byte-instruktion läser från adress 0xff20 215e,
vilket följs av en ytterligare load byte-instruktion som läser från adress
0x1000 5150, så kommer den andra load-instruktionen att resultera i en cachemiss.
Solution: True. Both instructions access the same set (0x15), but have different
tags.
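The set and tag can be extracted with shifts and masks. With 16-byte blocks the byte offset is 4 bits, and with 256 blocks the set field is 8 bits. The constant and function names in this C sketch are hypothetical:

```c
#include <stdint.h>

/* Direct mapped cache from the statement: 16-byte blocks (4 offset bits)
 * and 256 blocks (8 set bits). */
enum { DM_OFFSET_BITS = 4, DM_SET_BITS = 8 };

/* Set (index) field: address bits [11:4]. */
uint32_t set_index(uint32_t addr)
{
    return (addr >> DM_OFFSET_BITS) & ((1u << DM_SET_BITS) - 1u);
}

/* Tag field: address bits [31:12]. */
uint32_t tag_of(uint32_t addr)
{
    return addr >> (DM_OFFSET_BITS + DM_SET_BITS);
}
```

Both addresses map to set 0x15, with tags 0xff202 and 0x10005, so the second access cannot hit on the block loaded by the first.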
• Statement 4:
English: A modern processor, such as the Intel Core-I7, has only one large cache
because the speed of one large cache is typically higher than having several small
caches.
Swedish: En modern processor, som t.ex. Intel Core-I7, har endast en stor cache
då hastigheten på en stor cache typiskt är högre än alternativet att ha flera små
cachar.
Solution: False. A modern processor has several caches in the memory hierarchy
(e.g., L1, L2, and L3 caches).
5. Module 5: I/O Systems
(a) English: Assume that an external interrupt occurs, a program A is preempted, and
the program counter is changed so that an interrupt handling routine is executed.
Explain briefly how it is possible to continue to execute program A correctly at the
program point where it was interrupted. (4 points)
Swedish: Anta att ett externt avbrott sker, att ett program A är åsidosatt (preempted)
och programräknaren ändras så att en avbrottshanteringsrutin exekveras. Förklara
kortfattat hur det är möjligt att fortsätta exekveringen av program A på ett korrekt
sätt vid den programpunkt där avbrottet inträffade (4 poäng).
Solution: When the external interrupt occurs, the processor must automatically save
the program counter, before jumping to the interrupt handling routine. For instance,
in MIPS, the PC is saved in a register called EPC. The interrupt handling routine
must save all registers on the stack before performing its task. When the interrupt
handling routine has finished, it restores all registers and returns to the original program by using the saved PC.
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: SPI and UART use synchronous and asynchronous serial communication, respectively.
Swedish: SPI och UART använder synkron respektive asynkron seriell kommunikation.
Solution: True.
• Statement 2:
English: A memory-mapped general-purpose I/O (GPIO) pin can be configured
to be either an output or input port, by writing a configuration parameter to a
specific memory address that is dedicated for configuring this GPIO pin.
Swedish: En minnes-mappad general-purpose I/O (GPIO) pin kan konfigureras
att vara antingen en input eller en output port. Detta görs genom att skriva en
konfigurationsparameter till en specifik minnesadress, vilken är dedikerad till att
konfigurera denna GPIO pin.
Solution: True.
• Statement 3:
English: Direct memory access (DMA) is a good alternative to instruction level
parallelism (ILP) because it enables multiple instructions to be fetched by the
datapath and executed in parallel since the code is read directly from memory.
Swedish: Direct memory access (DMA) är ett bra alternativ till instruktionsparallellism (ILP) då det möjliggör att flera instruktioner kan läsas av datavägen och
sedan exekveras parallellt, eftersom koden läses direkt från minnet.
Solution: False. DMA is not related to ILP and is not used for fetching several
instructions in parallel.
• Statement 4:
English: Declaring a pointer volatile in C (as in the example below) means
that the pointer itself is volatile and may change to point to a different address at
any point in time.
Swedish: Att deklarera en pekare flyktig (volatile) i C (som i exemplet
nedan) betyder att pekaren själv är flyktig och kan när som helst ändras så att
den pekar på en annan adress.
volatile int* x = (volatile int*) 0xff88;
Solution: False. The volatile qualifier applies to the value that the pointer points to, not
to the pointer itself. The pointed-to value may change at any time, but the pointer does not.
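The position of the qualifier decides what is volatile. The C sketch below contrasts the two declarations; the function names are hypothetical, and an ordinary variable stands in for a memory-mapped register.

```c
/* Pointer to a volatile int: the pointed-to value is volatile. */
int read_twice(volatile int *reg)
{
    int first = *reg;    /* the compiler must perform both reads ... */
    int second = *reg;   /* ... because the value may change in between */
    return first + second;
}

/* By contrast, this is a volatile pointer to an ordinary int:
 * here the pointer variable itself is the volatile object. */
int read_through(int *volatile ptr)
{
    return *ptr;
}
```

The declaration in the statement, `volatile int* x`, has the first form.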
6. Module 6: Parallel Processors and Programs
(a) English: A program consists of two parts, part A and part B. Part A is trivial to
parallelize, whereas part B is not possible to parallelize at all and consists only of
sequential code. Assume that the amount of improvement that can be achieved by
parallelizing part A is proportional to the number of cores, i.e., using 4 cores, we
achieve 4 times improvement on part A. The theoretical maximal speedup is 5, assuming that we have an infinite number of cores. Our competitor can run a sequential
version of the program in 16s, whereas we can, using our parallel implementation,
run the program in 20s on one core. What are then the relative speedup and the
true speedup of our implementation when executing the program on 4 cores? Hint:
recall that true speedup compares with the fastest available sequential implementation, whereas relative speedup compares with your own implementation running
sequentially. (4 points)
Swedish: Ett program består av två delar, del A och del B. Del A är trivialt att
parallellisera, medan del B inte alls är möjlig att parallellisera och består endast
av sekventiell programkod. Anta att förbättringsmöjligheten som man kan uppnå
genom att parallellisera del A är proportionell mot antalet processorkärnor, dvs, om
man använder 4 kärnor så uppnår man 4 gångers förbättring. Den teoretiska maximala speedupen är 5, om man antar att vi har oändligt många kärnor. Vår konkurrent
kan köra en sekventiell version av programmet på 16s, medan vi kan köra vår parallella implementation på 20s om den exekverar på 1 core. Vad blir då den relativa
speedupen och den sanna speedupen (true speedup) för vår implementation om man
exekverar programmet på 4 kärnor? Tips: Notera att sann speedup jämför med den
snabbast tillgängliga sekventiella implementationen, medan relativ speedup jämför
med din egna implementation när den körs sekventiellt. (4 poäng)
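Solution: Since we know that the theoretical speedup limit is 5, $\frac{1}{5}$ of the execution time is due to part B, the sequential code. On one core, part B therefore takes $20/5 = 4$ s and part A takes the remaining $16$ s. By applying Amdahl’s law, we get
$$\text{speedup}_{\text{relative}} = \frac{20}{16/4 + 4} = \frac{20}{8} = 2.5 \quad\text{and}\quad \text{speedup}_{\text{true}} = \frac{16}{16/4 + 4} = \frac{16}{8} = 2.0.$$
Hence, the true speedup is, as expected, somewhat lower than the relative speedup.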
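The calculation can be checked with a few lines of C. This is a sketch with hypothetical function names; it assumes, as the question does, that only part A scales with the number of cores.

```c
/* Execution time on n cores when only the parallel part scales with n. */
double parallel_time(double t_parallel, double t_sequential, unsigned cores)
{
    return t_parallel / (double)cores + t_sequential;
}

/* Speedup of that parallel run relative to a reference sequential time. */
double speedup(double t_reference, double t_parallel, double t_sequential,
               unsigned cores)
{
    return t_reference / parallel_time(t_parallel, t_sequential, cores);
}
```

With part A = 16 s and part B = 4 s, the run on 4 cores takes 8 s. Using our own 20 s as the reference gives the relative speedup 2.5, and using the competitor's 16 s gives the true speedup 2.0.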
(b) English: For each of the following statements, answer if the statement is true, false,
or “don’t know”. Note that incorrect answers give minus points. Please see the
first pages of the exam for an explanation of how the points are calculated for these
true/false/don’t know questions. (4 points)
Swedish: För följande påståenden, svara om påståendet är sant, falskt, eller “vet
ej”. Notera att felaktigt svar ger minuspoäng. För en mer detaljerad förklaring av
hur poäng beräknas för sant/falskt/”vet ej” frågor, se de första sidorna av tentamen.
(4 poäng)
• Statement 1:
English: A good example of MIMD is a modern multicore processor.
Swedish: Ett bra exempel på MIMD är en modern multicore-processor.
Solution: True.
• Statement 2:
English: MapReduce can be based on message passing techniques and is today
used extensively in Warehouse-scale computers.
Swedish: MapReduce kan vara baserat på meddelandehanteringsteknik och används
idag i hög omfattning i Warehouse-scale computers.
Solution: True.
• Statement 3:
English: If a semaphore is used for mutual exclusion (mutex), it means the programmer can define critical sections by first locking a mutex before entering the
critical section, and then unlocking the mutex when exiting the critical section.
Hence, a mutex can be used to have controlled access to shared resources in a
concurrent environment.
Swedish: Om en semaphore används för mutual exclusion (mutex), betyder det
att programmeraren kan definiera kritiska sektioner genom att först låsa en mutex
innan man kommer in i den kritiska sektionen, och sen låsa upp mutexen när man
lämnar den kritiska sektionen. Alltså, en mutex kan användas för att kontrollera
tillgång till delade resurser i en samtidig (concurrent) miljö.
Solution: True.
• Statement 4:
English: Assume we have a shared memory processor (SMP) consisting of 4
cores with separate L1 caches and an L2 cache that is shared among the cores.
Assume further that two of the cores frequently read or write to the same address
in memory. As a consequence, we easily get inconsistencies in the L1 caches.
This phenomenon is called false sharing.
Swedish: Anta att vi har en processor med delat minne (shared memory processor) som består av 4 kärnor med separata L1 cachar och en gemensam L2
cache som delas mellan kärnorna. Anta även att två av kärnorna frekvent läser
eller skriver till samma minnesadress. Således leder detta till inkonsistens i L1
cacharna. Detta fenomen kallas falsk delning (false sharing).
Solution: False. The problem described is related to cache coherence, but not
directly to false sharing. With false sharing, the cores would write to the
same cache block, but not to the exact same address.
Part II: Advanced
1. English: Explain in detail how a 2-way set associative data cache works. Your solution
should include the following:
• Sketch a hardware solution that can handle hit and miss detection. The solution
should also return the correct cache value for a cache hit. This illustration should
include and explain the terms tag, set, byte offset, validity bit, and way.
• Explain the concept of replacement policy. In particular, explain the meaning of
Least Recently Used (LRU). You do not have to provide the hardware solution for
the LRU.
• Show assembly code examples and step-by-step guides for how the execution of
the example affects the cache. Your code examples must illustrate both temporal
locality and spatial locality. Your example code does not have to show the effects of
the replacement policy. You may write NIOS II or MIPS assembly code.
Clearly describe and motivate your answers. Diagrams or figures without explanations
will not give any points. (15 points)
Swedish: Förklara i detalj hur en 2-vägs associativ data-cache fungerar. Din lösning ska
innehålla följande:
• Skissa en hårdvarulösning som kan hantera hit- och miss-detektering. Lösningen
skall även returnera korrekt cache-värde vid en cache träff. Lösningen ska innehålla
och förklara termerna adressetikett (även kallad tag), rad (även kallad set), bytenummer, giltighetsbit samt “way”.
• Förklara konceptet utbytespolicy (replacement policy). Förklara speciellt betydelsen
av Least Recently Used (LRU). Du behöver dock inte tillhandahålla hårdvarulösningen för LRU.
• Visa assembler kodexempel med steg-för-steg guide för hur exekveringen av exemplet påverkar cachen. Dina exempel måste visa både tidsmässig (temporal) och
spatiell lokalitet. Dina exempel behöver inte visa utbytespolicyns effekter. Du kan
välja att skriva antingen assemblerkod för NIOS II eller MIPS.
Beskriv och motivera ditt svar tydligt. Diagram eller figurer utan förklaringar ger inga
poäng. (15 poäng)
Solution: This task can be answered in many different ways. A complete solution of the
task is therefore omitted. Instead, we just give a few comments:
• When describing replacement policies, it is important to relate this to LRU. A good
solution explains this by using an example.
• The code examples should show temporal and spatial locality in the data cache (the
exercise concerns the data cache; the old solution text talked about the instruction cache).
Temporal locality can be shown by reading from the same data element several
times, for instance in a loop. Spatial locality can be shown by, for instance, reading
from an array.
• In the hardware solution, it is important to show how the tag in the address is compared
with the tag stored in the cache, and how the result is combined with the validity bit. This is usually
done with an AND gate. Note also that we need to use a multiplexer to select the
correct output from the cache.
• It is important to explain all the terms (see item 1 in the question).
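The hit detection and LRU replacement described in these comments can be illustrated with a small software model. The C sketch below is not the hardware solution the task asks for and does not replace the assembly examples. It only mirrors the logic (valid AND tag match per way, LRU victim selection) for a hypothetical geometry of 4 sets, 2 ways, and 16-byte blocks, and all names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry: 16-byte blocks, 4 sets, 2 ways. */
enum { OFFSET_BITS = 4, SET_BITS = 2, NUM_SETS = 1 << SET_BITS, NUM_WAYS = 2 };

struct cache_line {
    bool valid;      /* validity bit */
    uint32_t tag;    /* tag stored with the block */
};

struct cache {
    struct cache_line line[NUM_SETS][NUM_WAYS];
    int lru[NUM_SETS];   /* per set: the way that was used least recently */
};

/* Look up addr. Returns true on a hit; on a miss the block is loaded into
 * an invalid way if one exists, otherwise into the LRU way. The cache must
 * start zero-initialized (all validity bits 0). */
bool cache_access(struct cache *c, uint32_t addr)
{
    uint32_t set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + SET_BITS);

    for (int way = 0; way < NUM_WAYS; way++) {
        struct cache_line *l = &c->line[set][way];
        if (l->valid && l->tag == tag) {   /* valid AND tags equal: hit */
            c->lru[set] = 1 - way;         /* the other way is now LRU */
            return true;
        }
    }

    int victim = c->lru[set];
    for (int way = 0; way < NUM_WAYS; way++) {
        if (!c->line[set][way].valid) {    /* prefer an empty way */
            victim = way;
            break;
        }
    }
    c->line[set][victim].valid = true;
    c->line[set][victim].tag = tag;
    c->lru[set] = 1 - victim;
    return false;
}
```

Starting from an empty cache, an access to 0x1000 misses, a following access to 0x1004 hits because it lies in the same block (spatial locality), and a repeated access to 0x1000 hits again (temporal locality). After 0x2000 fills the second way of the same set, an access to 0x3000 evicts whichever of the two blocks was used least recently.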
2. English: There are many ways to make use of parallelism to achieve better performance
in a computer system. Three important concepts/techniques are SIMD instructions (for
instance multimedia extensions), instruction level parallelism (ILP), and multicore. In
this task, you should analyze, discuss, and compare these different techniques. In what
way are they similar? In what way are they different? Which are their pros and cons?
How do they affect the programmer or the compiler? Where are the limitations? Your
answer should consist of a comprehensive and well-thought-out discussion. (10 points)
Swedish: Det finns många olika sätt att använda parallellism för att uppnå bättre prestanda i ett datorsystem. Tre viktiga koncept/tekniker är SIMD-instruktioner (t.ex. multimedia-utökningar), instruktions-nivå parallellism (instruction level parallelism, ILP), och
multicore (flera kärnor). I denna uppgift ska du analysera, diskutera, och jämföra dessa
olika tekniker. På vilka sätt är de lika? På vilka sätt är de olika? Vilka är deras fördelar och nackdelar? På vilket sätt påverkar de programmeraren eller kompilatorn? Vilka
är begränsningarna? Ditt svar ska innefatta en utförlig och genomtänkt diskussion. (10
poäng)
Solution: This task does not have a single correct solution. Completely different answers
can still give the same number of points. In the following, we list some important aspects
that can be included in an answer.
• SIMD makes use of data-level parallelism. Hence, there are limitations on what kind
of programs can actually utilize this kind of parallelism: the same instruction
needs to be applied to different data.
• ILP can take the form of static or dynamic multiple issue. Dynamic multiple issue is
very common in modern processors. The main benefit is that ILP does not affect the
programmer: parallelism “comes for free”, even for a sequential program. However, the
amount of parallelism is limited by dependencies between instructions.
• Multicore processors can give parallelism with the help of the programmer. Typically,
for a shared memory multiprocessor (SMP), multithreaded programming can be used
to achieve parallelism. This is a form of task parallelism (compared to data-level
parallelism for SIMD).
• These techniques do not have to work in isolation. Instead, a program can
make use of all these concepts and techniques to achieve speedups.
• A similarity between multicore and SIMD is that both techniques typically
need help from the programmer to actually work. Compiler-supported programming
interfaces exist, e.g., OpenMP, that can be used to program multicore systems in an easy way.
• ILP does not need support from the programmer, but if the programmer
programs in a special way, more ILP can be exploited. One such technique is loop
unrolling, which exposes more independent instructions that the processor can use to
keep the pipeline filled.
• A problem with multicore programming is that it is quite hard, unless there are few
dependencies between different tasks. Threads need to be synchronized, for instance
using semaphores.
• SIMD is today typically programmed with special platform-dependent libraries.
• A problem with SMP multicore systems is cache coherence. If the programmer is
not aware of how the communication and access to memory affect the caches, the
performance improvements may be much less than expected.
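The remark on loop unrolling can be made concrete with a short C sketch. The function names `sum_simple` and `sum_unrolled` are hypothetical, and the choice of four independent accumulators is an assumption made for illustration. Whether the unrolled version actually runs faster depends on the compiler and on the processor's issue width.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward sum: every addition depends on the previous one,
 * so the additions form one long dependency chain. */
long sum_simple(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled four times with independent accumulators. The four additions
 * in the loop body do not depend on each other, which gives a
 * multiple-issue processor more instructions to execute in parallel. */
long sum_unrolled(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result. Only the dependency structure of the loop body differs, which is where the additional ILP comes from.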
3. English: Explain in detail the meaning of the following terms and concepts and their
relationships: Execution time of programs, Cycles Per Instruction (CPI), Clock frequency,
Clock period, Power, and Energy. Explain also how computer architects and processor
manufacturers try to decrease energy consumption and why it is difficult to do (10 points).
Swedish: Förklara i detalj betydelsen av följande termer och koncept, samt deras relationer: Exekveringstid för program, Cykler per instruktion (CPI), Klockfrekvens, Klockperiod, Effekt och Energi. Förklara också hur datorarkitekter och processortillverkare
försöker att minska energiåtgången, samt varför detta är svårt (10 poäng).
Solution: This task does not have a single correct solution. Completely different answers
can still give the same number of points. In the following, we list some important aspects
that can be included in an answer.
• A program's execution time depends on several factors, where the most important
ones are i) the number of executed instructions, ii) the cycles per instruction (CPI),
and iii) the clock period. The execution time is the product of these three factors.
• If a processor has a pipeline, the clock frequency can be increased, which gives better
performance. On the other hand, a pipeline can also introduce pipeline hazards that
can result in stalls. This will increase the CPI.
• Clock frequency is defined as one divided by the clock period.
• For over 30 years, processor manufacturers constantly increased the clock rate
of processors, and thereby also the power. As a consequence, the so-called power
wall was reached around 2006. Instead of increasing the clock frequency of the
processor further, manufacturers started to add processor cores.
• The dominating source of energy consumption is the dynamic energy, which is consumed when each transistor is switching. The dynamic energy for switching a transistor is proportional to the capacitive load (the number of transistors connected to
an output and the manufacturing technology), and the square of the voltage. Since
the voltage is the dominating factor (the term is squared), processors are today using
much lower voltage for their power supply, compared to just a few years ago.
• The dynamic power is proportional to the energy used for a transistor transition,
multiplied by the switching frequency. This means that power increases when frequency increases, but the voltage is still dominating due to the square term. Note
that the energy for computing a task does not decrease just because the frequency
is lower, but the power becomes lower. Although efforts are made to lower the voltage, there is a limit for how low the voltage can be without increasing the leakage. In server systems, the static power dissipation due to leakage can be as high as
40%. Static energy is consumed (or transformed to other forms of energy) in CMOS
technology even if a transistor is off. As a consequence, increasing the number of
transistors (for instance, increasing the number of cores in a processor) increases the
static energy consumption, even if the transistors are not switching. It is therefore
hard to decrease the energy because of the voltage limitation and the increased demand for more cores and larger caches. There are also techniques for switching off
parts of a chip during execution to decrease energy further.
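The relationships in the list above can be written down as a few C functions. This is a sketch under stated assumptions: the function names are hypothetical, the proportionality constants (for instance the switching activity) are set to one, and static power is ignored.

```c
#include <assert.h>

/* Execution time in seconds = instruction count * CPI * clock period,
 * where the clock period is one divided by the clock frequency. */
double exec_time(double instr_count, double cpi, double clock_hz)
{
    assert(clock_hz > 0.0);
    double clock_period = 1.0 / clock_hz;
    return instr_count * cpi * clock_period;
}

/* Dynamic power is proportional to capacitive load * voltage^2 * frequency.
 * The proportionality constant is assumed to be one in this sketch. */
double dynamic_power(double cap_load, double voltage, double freq_hz)
{
    return cap_load * voltage * voltage * freq_hz;
}

/* Energy in joules = power in watts * time in seconds. */
double energy(double power_w, double time_s)
{
    return power_w * time_s;
}
```

In this model, halving the clock frequency halves the dynamic power but doubles the execution time, so the dynamic energy for the task stays the same. Lowering the voltage from 1.0 V to 0.8 V at the same frequency scales the dynamic power by 0.64, which is the effect of the squared term.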
4. English: Carefully analyze the partially commented MIPS assembler program on the
next page. Construct a C program that generates the same result in memory as the MIPS
program. Note that there are two memory arrays. A 16x16 word matrix starting at address
0x1001 0000 and an output array of size 16 words (address computed and stored in
$s3 in part 2 of the MIPS code). Your C program should start with the variable declarations shown below. Assume that there exists code at the end of the program that prints
out the arrays to standard output. Explain in detail how your program works and what the
expected result of the program is, i.e., you should explain what the program is actually
trying to accomplish (15 points).
Swedish: Analysera noggrant det partiellt kommenterade MIPS-assembler-programmet
på nästa sida. Konstruera ett C-program som genererar samma resultat i minnet som
MIPS-programmet. Notera att det finns två stycken minnesarrayer. En 16x16-ords matris
som startar på adress 0x1001 0000 och en “output”-array av storlek 16 ord (adressen
beräknas och sparas i $s3 i del 2 av MIPS-koden). Ditt C-program ska börja med de
variabeldeklarationer som finns nedan. Anta att det finns kod i slutet av programmet som
skriver ut arrayerna till standard output. Förklara i detalj hur ditt program fungerar och
vad det förväntade resultatet av programmet är, dvs. du ska förklara vad programmet
egentligen försöker att göra (15 poäng).
int main(){
  const int maxval = 16;
  int matrix[maxval*maxval];
  int output[maxval];
  // Insert your code here.
  // Assume that there is code here that prints
  // out the arrays to standard output.
}
###### PART 1 #####
        addi  $s0, $0, 16
        sll   $s1, $s0, 4
        lui   $s2, 0x1001             # Address to a matrix that is 16x16 word
        addi  $t1, $0, 0              # counter i
loop1_outer:
        slt   $t3, $t1, $s0
        beq   $t3, $0, done1_outer
        addi  $t2, $0, 0              # counter j
loop1_inner:
        slt   $t3, $t2, $s0
        beq   $t3, $0, done1_inner
        sll   $t3, $t1, 4             # compute address
        add   $t3, $t3, $t2
        sll   $t3, $t3, 2
        add   $t3, $t3, $s2
        addi  $t4, $t1, 2             # compute element value
        addi  $t5, $t2, 2
        mul   $t4, $t4, $t5
        sw    $t4, 0($t3)             # store result in matrix
        addi  $t2, $t2, 1             # increase and loop
        j     loop1_inner
done1_inner:
        addi  $t1, $t1, 1
        j     loop1_outer
done1_outer:

###### PART 2 #####
        sll   $s3, $s1, 2             # address to output array
        add   $s3, $s3, $s2
        addi  $t1, $0, 2              # counter i
loop2_outer:
        slt   $t3, $t1, $s0
        beq   $t3, $0, done2_outer
        addi  $t6, $0, 1
        addi  $t2, $0, 0              # counter j
loop2_inner:
        slt   $t3, $t2, $s1
        beq   $t3, $0, done2_inner
        sll   $t4, $t2, 2             # get element from matrix
        add   $t4, $t4, $s2
        lw    $t5, 0($t4)
        bne   $t1, $t5, no_match      # check if elements match
        addi  $t6, $0, 0
no_match:
        addi  $t2, $t2, 1
        j     loop2_inner             # next
done2_inner:
        beq   $t6, $0, no_output      # check if write output
        sw    $t1, 0($s3)
        addi  $s3, $s3, 4
no_output:
        addi  $t1, $t1, 1
        j     loop2_outer             # next
done2_outer:
Solution: The program computes all prime numbers between 1 and 16 (2, 3, 5, 7, 11,
and 13) and stores them in the output array. The first part of the program computes a
multiplication matrix, which is then used in the second part to search for prime numbers,
i.e., numbers that do not occur in the matrix. This is a simple, but not very efficient,
way of computing prime numbers.
An example C code is shown below.
int main(){
  const int maxval = 16;
  int matrix[maxval*maxval];
  int output[maxval];

  // Compute the multiplication matrix
  int i,j;
  for(i=0; i<maxval; i++){
    for(j=0; j<maxval; j++){
      matrix[i*maxval + j ] = (i+2) * (j+2);
    }
  }

  // Extract the prime numbers
  int k = 0;
  for(i=2; i<maxval; i++){
    int isprime = 1;
    for(j=0; j<maxval*maxval; j++){
      if(matrix[j] == i)
        isprime = 0;
    }
    if(isprime)
      output[k++] = i;
  }

  // Assume that there is code here that prints
  // out the arrays to standard output.
}
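To check the expected result, the same algorithm can be wrapped in a function and compared against plain trial division. This is a sketch only: `find_primes`, `is_prime`, and the constant `MAXVAL` are hypothetical names introduced for this check and are not part of the exam solution.

```c
#include <assert.h>
#include <stddef.h>

enum { MAXVAL = 16 };

/* Same algorithm as the example solution, wrapped in a function.
 * Writes the primes found to output and returns how many were written. */
int find_primes(int output[MAXVAL])
{
    int matrix[MAXVAL * MAXVAL];
    assert(output != NULL);

    /* Part 1: multiplication matrix with entries (i+2)*(j+2). */
    for (int i = 0; i < MAXVAL; i++)
        for (int j = 0; j < MAXVAL; j++)
            matrix[i * MAXVAL + j] = (i + 2) * (j + 2);

    /* Part 2: a number that never occurs in the matrix is prime. */
    int k = 0;
    for (int i = 2; i < MAXVAL; i++) {
        int isprime = 1;
        for (int j = 0; j < MAXVAL * MAXVAL; j++)
            if (matrix[j] == i)
                isprime = 0;
        if (isprime)
            output[k++] = i;
    }
    return k;
}

/* Independent reference check by trial division. */
int is_prime(int n)
{
    if (n < 2)
        return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}
```

Calling `find_primes` on an array of 16 integers fills the first six entries with 2, 3, 5, 7, 11, and 13, and each of them passes `is_prime`.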
MIPS Reference Sheet

INSTRUCTION SET (SUBSET)

Name (format, op, funct)                    Syntax            Operation
add (R,0,32)                                add rd,rs,rt      reg(rd) := reg(rs) + reg(rt);
add immediate (I,8,na)                      addi rt,rs,imm    reg(rt) := reg(rs) + signext(imm);
add immediate unsigned (I,9,na)             addiu rt,rs,imm   reg(rt) := reg(rs) + signext(imm);
add unsigned (R,0,33)                       addu rd,rs,rt     reg(rd) := reg(rs) + reg(rt);
and (R,0,36)                                and rd,rs,rt      reg(rd) := reg(rs) & reg(rt);
and immediate (I,12,na)                     andi rt,rs,imm    reg(rt) := reg(rs) & zeroext(imm);
branch on equal (I,4,na)                    beq rs,rt,label   if reg(rs) == reg(rt) then PC = BTA else NOP;
branch on not equal (I,5,na)                bne rs,rt,label   if reg(rs) != reg(rt) then PC = BTA else NOP;
jump and link register (R,0,9)              jalr rs           $ra := PC + 4;  PC := reg(rs);
jump register (R,0,8)                       jr rs             PC := reg(rs);
jump (J,2,na)                               j label           PC := JTA;
jump and link (J,3,na)                      jal label         $ra := PC + 4;  PC := JTA;
load byte (I,32,na)                         lb rt,imm(rs)     reg(rt) := signext(mem[reg(rs) + signext(imm)]_7:0);
load byte unsigned (I,36,na)                lbu rt,imm(rs)    reg(rt) := zeroext(mem[reg(rs) + signext(imm)]_7:0);
load upper immediate (I,15,na)              lui rt,imm        reg(rt) := concat(imm, 16 bits of 0);
load word (I,35,na)                         lw rt,imm(rs)     reg(rt) := mem[reg(rs) + signext(imm)];
multiply, 32-bit result (R,28,2)            mul rd,rs,rt      reg(rd) := reg(rs) * reg(rt);
nor (R,0,39)                                nor rd,rs,rt      reg(rd) := not(reg(rs) | reg(rt));
or (R,0,37)                                 or rd,rs,rt       reg(rd) := reg(rs) | reg(rt);
or immediate (I,13,na)                      ori rt,rs,imm     reg(rt) := reg(rs) | zeroext(imm);
set less than (R,0,42)                      slt rd,rs,rt      reg(rd) := if reg(rs) < reg(rt) then 1 else 0;
set less than unsigned (R,0,43)             sltu rd,rs,rt     reg(rd) := if reg(rs) < reg(rt) then 1 else 0;
set less than immediate (I,10,na)           slti rt,rs,imm    reg(rt) := if reg(rs) < signext(imm) then 1 else 0;
set less than immediate unsigned (I,11,na)  sltiu rt,rs,imm   reg(rt) := if reg(rs) < signext(imm) then 1 else 0;
shift left logical (R,0,0)                  sll rd,rt,shamt   reg(rd) := reg(rt) << shamt;
shift left logical variable (R,0,4)         sllv rd,rt,rs     reg(rd) := reg(rt) << reg(rs)_4:0;
shift right arithmetic (R,0,3)              sra rd,rt,shamt   reg(rd) := reg(rt) >>> shamt;
shift right logical (R,0,2)                 srl rd,rt,shamt   reg(rd) := reg(rt) >> shamt;
shift right logical variable (R,0,6)        srlv rd,rt,rs     reg(rd) := reg(rt) >> reg(rs)_4:0;
store byte (I,40,na)                        sb rt,imm(rs)     mem[reg(rs) + signext(imm)]_7:0 := reg(rt)_7:0;
store word (I,43,na)                        sw rt,imm(rs)     mem[reg(rs) + signext(imm)] := reg(rt);
subtract (R,0,34)                           sub rd,rs,rt      reg(rd) := reg(rs) - reg(rt);
subtract unsigned (R,0,35)                  subu rd,rs,rt     reg(rd) := reg(rs) - reg(rt);
xor (R,0,38)                                xor rd,rs,rt      reg(rd) := reg(rs) ^ reg(rt);
xor immediate (I,14,na)                     xori rt,rs,imm    reg(rt) := reg(rs) ^ zeroext(imm);

Definitions
• Jump to target address: JTA = concat((PC + 4)_31:28, address(label), 00_2)
• Branch target address: BTA = PC + 4 + imm * 4

Clarifications
• All numbers are given in decimal form (base 10).
• Function signext(x) returns a 32-bit sign extended value of x in two’s complement form.
• Function zeroext(x) returns a 32-bit value, where zeros are added to the most significant side of x.
• Function concat(x, y, …, z) concatenates the bits of expressions x, y, …, z.
• Subscripts, for instance X_8:2, mean that bits with index 8 to 2 are spliced out of the integer X.
• Function address(x) means the address of label x.
• NOP and na mean “no operation” and “not applicable”, respectively.
• shamt is an abbreviation for “shift amount”, i.e. how much bit shifting should be done.

INSTRUCTION FORMAT

R-Type
  Bits:   31-26    25-21    20-16    15-11    10-6     5-0
  Field:  op       rs       rt       rd       shamt    funct
  Width:  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-Type
  Bits:   31-26    25-21    20-16    15-0
  Field:  op       rs       rt       immediate
  Width:  6 bits   5 bits   5 bits   16 bits

J-Type
  Bits:   31-26    25-0
  Field:  op       address
  Width:  6 bits   26 bits

REGISTERS

Name    Number   Description
$zero   0        constant value 0
$at     1        assembler temp
$v0     2        function return
$v1     3        function return
$a0     4        argument
$a1     5        argument
$a2     6        argument
$a3     7        argument
$t0     8        temporary value
$t1     9        temporary value
$t2     10       temporary value
$t3     11       temporary value
$t4     12       temporary value
$t5     13       temporary value
$t6     14       temporary value
$t7     15       temporary value
$s0     16       saved temporary
$s1     17       saved temporary
$s2     18       saved temporary
$s3     19       saved temporary
$s4     20       saved temporary
$s5     21       saved temporary
$s6     22       saved temporary
$s7     23       saved temporary
$t8     24       temporary value
$t9     25       temporary value
$k0     26       reserved for OS
$k1     27       reserved for OS
$gp     28       global pointer
$sp     29       stack pointer
$fp     30       frame pointer
$ra     31       return address

MIPS Reference Sheet
By David Broman
KTH Royal Institute of Technology
If you find any errors or have any feedback on this document, please send me an email: [email protected]
Version 1.01, January 9, 2015
Excerpt from the Nios II Processor Reference Handbook
IS1200/IS1500 2014-05-22 – page 11