Written Exam∗
Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp
Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp
KTH Royal Institute of Technology
2015-01-14

Examiner: David Broman
Teacher on duty: David Broman, [email protected], +46 73 765 20 44

Instructions
• Allowed aids: One sheet of A4 paper with handwritten notes. You may write on both sides of the paper.
• Explicitly forbidden aids: Textbooks, electronic equipment, calculators, mobile phones, machine-written pages, photocopied pages, pages of a different size than A4.
• Please write and draw carefully. Unreadable text may lead to zero points.
• You do not need to return these exam papers when you hand in your exam solutions.
• You may write your answers in either Swedish or English.

The exam consists of two parts:
• Part I: Fundamentals: The maximal number of points for Part I is 48 points (for IS1500) and 40 points (for IS1200). There are 8 points for each of the six course modules. All questions in Part I expect only short answers. At most a few sentences are needed.
• Part II: Advanced: The maximal number of points for Part II is 50 points. The answers require the student to discuss, analyze, or construct. Furthermore, answers to these questions require clear motivations.

True/False/Don’t know Questions in Part I
These questions can give between 0 and 4 points. Each question consists of 4 statements. For each statement, you should answer either true, false, or “don’t know”. The points are calculated as follows: Each correct answer (true or false) gives one point. Each incorrect answer (true or false) gives minus one point. If you answer “don’t know” for a statement, it neither adds nor removes points. The rationale for introducing “don’t know” answers is to discourage guessing.
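The scoring rule above can be sketched in C. This is only an illustration: the enum, the function name question_points, and the way answers are encoded are assumptions made for this sketch and are not part of the exam.

```c
#include <assert.h>

/* Hypothetical encoding of one answer (not part of the exam text). */
enum answer { ANS_FALSE = 0, ANS_TRUE = 1, ANS_DONT_KNOW = 2 };

/* Points for one question with n statements: +1 per correct answer,
 * -1 per incorrect answer, 0 for "don't know", never below zero. */
int question_points(const enum answer *key, const enum answer *given, int n)
{
    int points = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        assert(key[i] != ANS_DONT_KNOW);  /* the key is always true or false */
        if (given[i] == ANS_DONT_KNOW)
            continue;                     /* neither adds nor removes points */
        points += (given[i] == key[i]) ? 1 : -1;
    }
    return points < 0 ? 0 : points;       /* a question never goes negative */
}
```

With the key true, false, true, false and the answers true, true, don’t know, false, the function returns 1; if all four answers are wrong, it returns 0.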
Example: Assume that the correct answers to four statements are: true, false, true, false. Assume that the student answered: true, true, don’t know, false. The total number of points is then: 1 - 1 + 0 + 1 = 1 point. Note that even if the answers to all four statements are wrong, the points for the whole question can never be negative, that is, the final points will always be between 0 and 4.

∗ This version of the exam paper was updated on February 6, 2015. The updates include changes of the pass criteria (the pass criterion was lowered from 36 to 33 points in the fundamental part for IS1500 and from 30 to 27 for IS1200). We also updated the criterion for FX (only 11 points are needed on the advanced part) and added a few clarifications to the solution suggestions.

Grades
To get a pass grade (A, B, C, D, or E), it is required to pass Part I of the exam. For IS1500 students, it is required to get 33 points or more on Part I to pass the exam. IS1200 students should not answer question 1 in Part I. For IS1200 students, it is required to have 27 points or more in total for questions 2-6 on Part I.

Grading scale (for both IS1200 and IS1500):
• A: 41-50 points on Part II
• B: 31-40 points on Part II
• C: 21-30 points on Part II
• D: 11-20 points on Part II
• E: 0-10 points on Part II
• FX: 30-32 points (IS1500) or 24-26 (IS1200) on Part I, and 11-50 points on Part II.
• F: otherwise

Results
• The result will be announced on 2015-02-04 at the latest.
• If a student receives grade FX, it is possible to request a complementary oral examination. Such a complementary oral examination must be requested by the student. Please send an email to [email protected] by 2015-02-25 at the latest.

Part I: Fundamentals

1. Module 1: Logic Design (for IS1500 only)

(a) Consider the figure below. Assume that the register initially has the value zero and that it is triggered on a rising clock edge. What are then the values of signals A, B, Y0, Y1, Y2, and Y3 after the first rising clock edge, when all signals have stabilized? (4 points)

[Figure: circuit with an adder that adds the constant $3_{10}$, a register clocked by CLK, a four-input multiplexer (inputs 00, 01, 10, 11), and a decoder with outputs Y0, Y1, Y2, and Y3; signals A and B are 2 bits wide.]

Solution: A = $10_2$, B = $11_2$, Y0 = 0, Y1 = 0, Y2 = 0, and Y3 = 1.
Explanation: The initial value of the register is zero. Hence, after the first rising clock edge (and after stabilization) the register receives the value 3 + 0 = 3, and therefore B = $11_2$ = $3_{10}$. The adder then adds 3 plus 3, and since the signal is only two bits wide, the sum wraps around and A = $10_2$. The multiplexer selects its third input (select value 10), which has value 1. Hence, the signal that is decoded is 11, which means that only Y3 is 1, and the rest of the output signals from the decoder are zero.
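The explanation above can be replayed with a small C model of the circuit. This is a sketch under stated assumptions, since the figure is not fully reproduced here: the multiplexer data inputs are taken to be 1, 0, 1, 0 for select values 00, 01, 10, 11, and the decoder input is taken to be a constant 1 as the upper bit with the multiplexer output as the lower bit. The names circuit_signals, mux4, and signals_after are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t a;     /* adder output A (2 bits) */
    uint8_t b;     /* register output B (2 bits) */
    uint8_t y[4];  /* decoder outputs Y0..Y3 (one-hot) */
} circuit_signals;

/* 4-to-1 multiplexer: returns the data input chosen by the 2-bit select. */
static uint8_t mux4(const uint8_t data[4], uint8_t select)
{
    assert(select < 4);
    return data[select];
}

/* Stabilized signal values after `edges` rising clock edges. */
circuit_signals signals_after(unsigned edges)
{
    /* Assumed from the figure: mux data inputs for select 00, 01, 10, 11. */
    static const uint8_t mux_data[4] = { 1, 0, 1, 0 };
    circuit_signals s = { 0, 0, { 0, 0, 0, 0 } };
    uint8_t reg = 0;                         /* register starts at zero */

    for (unsigned i = 0; i < edges; i++)
        reg = (uint8_t)((reg + 3u) & 0x3u);  /* register loads adder output */

    s.b = reg;
    s.a = (uint8_t)((reg + 3u) & 0x3u);      /* 2-bit adder wraps around */

    /* Assumed wiring: constant 1 is the upper decoder bit, the mux output
     * is the lower decoder bit. */
    uint8_t decoder_in = (uint8_t)(0x2u | mux4(mux_data, s.a));
    s.y[decoder_in] = 1;
    return s;
}
```

Under these assumptions, signals_after(1) gives A = 2, B = 3, and only Y3 set, in agreement with the solution.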

(b) For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points)
• Statement 1: Proof by perfect induction means that a theorem in Boolean algebra can be proven correct by exhaustively showing all possible combinations in a truth table.
Solution: True. Proof by perfect induction is the same as proof by exhaustion (see the lecture slides). Proof by perfect induction (note the word perfect) should not be confused with standard mathematical induction proofs.
• Statement 2: When a tristate buffer is disabled, the output is said to be floating.
Solution: True.
• Statement 3: The main difference between an SR latch and a D latch is that a D latch is triggered on the edge of a clock signal, whereas an SR latch is transparent.
Solution: False. Neither SR latches nor D latches are edge triggered. The main difference between an SR latch and a D latch is that a D latch is clocked, whereas an SR latch is not clocked.
• Statement 4: In a synchronous sequential circuit, the propagation delay of the combinational logic in the circuit is an important factor when determining the maximal clock frequency.
Solution: True. Several delays in the circuit determine the maximal clock frequency. The delay in the combinational logic is one of these important delays.

2. Module 2: C and Assembly Programming

(a) What is the binary machine code representation of the MIPS instruction
lb $s1,-67($t2)
Answer as a 32-bit binary number. Note that a page with the structure of the MIPS encoding is available at the end of the exam. (4 points)
Solution: 1000 0001 0101 0001 1111 1111 1011 1101

(b) For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points)
• Statement 1:
int x = 5;
int y = 6;
int *p = &x;
*p = x + y;
p = &y;
*p = x + y;
printf("%d,%d", x, y);
When executing the C code above, the string 11,17 is printed to the standard output.
Solution: True.
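Both Module 2 answers above can be checked mechanically. The sketch below packs the I-type fields of lb $s1,-67($t2) (opcode 0x20, rs = $t2 = register 10, rt = $s1 = register 17, 16-bit two's-complement immediate) and replays the pointer statements from Statement 1. The function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a MIPS I-type instruction: opcode (6 bits), rs (5), rt (5), imm (16). */
uint32_t encode_itype(uint32_t opcode, uint32_t rs, uint32_t rt, int16_t imm)
{
    assert(opcode < 64 && rs < 32 && rt < 32);
    /* The cast keeps only the low 16 bits of the signed immediate. */
    return (opcode << 26) | (rs << 21) | (rt << 16) | (uint16_t)imm;
}

/* lb $s1,-67($t2): opcode 0x20, base register $t2 = 10, target $s1 = 17. */
uint32_t encode_lb_example(void)
{
    return encode_itype(0x20, 10, 17, -67);
}

/* Replays the statements from Statement 1 and reports the final x and y. */
void pointer_trace(int *x_out, int *y_out)
{
    int x = 5;
    int y = 6;
    int *p = &x;
    *p = x + y;   /* writes through p to x: x becomes 11 */
    p = &y;
    *p = x + y;   /* writes through p to y: y becomes 11 + 6 = 17 */
    *x_out = x;
    *y_out = y;
}
```

encode_lb_example() yields 0x8151FFBD, which is the bit pattern 1000 0001 0101 0001 1111 1111 1011 1101 given in the solution, and pointer_trace reports x = 11 and y = 17.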
• Statement 2: The data structure that is used for storing local variables and additional arguments during function calls stores the values as a first-in first-out (FIFO) queue in memory.
Solution: False. The data structure is a stack, which is last-in first-out (LIFO).
• Statement 3: In the MIPS ISA, a calling function (called the caller) does not have to save the registers $s0 to $s7 because the callee (the function that is called) is responsible for saving these registers, if they are used in the called function.
Solution: True.
• Statement 4: A benefit of PC-relative addressing is that not all bits of the absolute address need to be stored in the instruction.
Solution: True. Parts of the current PC address are used when computing the branch target address (BTA), that is, the absolute address that is used when updating the PC and executing the branch.

3. Module 3: Processor Design

(a) Consider the figure below, which shows the datapath for a single-cycle MIPS processor. Assume that the instruction currently executing is slt $t0,$t1,$t3. What are then the values of signals MemToReg, ALUSrc, Branch, and A3? (4 points)

[Figure: datapath of a single-cycle MIPS processor with PC, Instruction Memory, register file (A1, A2, A3, WD3, WE3, RD1, RD2), Sign Extend, ALU, and Data Memory, and with control signals MemToReg, MemWrite, ALUSrc, ALUControl, Branch, RegWrite, RegDst, and Jump.]

Solution: MemToReg = 0, ALUSrc = 0, Branch = 0, and A3 = $01000_2$

(b) For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points)
• Statement 1: In a five-stage MIPS pipelined datapath, the execute stage is usually used for reading out the values from the register file.
Solution: False. It is done in the decode stage.
• Statement 2: A benefit of an arithmetic logic unit (ALU) is that the same hardware can be used for different arithmetic operations.
Solution: True.
• Statement 3: A pipelined datapath may, compared to a single-cycle datapath, increase the average cycles per instruction (CPI).
Solution: True.
Compared to a single-cycle datapath (where each instruction takes one clock cycle), a pipeline may have hazards, which can result in stalling. Hence, the average CPI may increase when a pipelined datapath is used.
• Statement 4: Instructions in a CISC instruction set architecture (ISA) can in general perform more complex operations than instructions in a RISC ISA.
Solution: True. CISC stands for “Complex Instruction Set Computing”. An example of CISC is x86, where an instruction can perform several tasks, for instance both load from memory and perform an addition. ISAs that are based on RISC, which stands for “Reduced Instruction Set Computing”, in general have a small number of simple instructions.

4. Module 4: Memory Hierarchy

(a) Assume that we have a 2-way set associative cache for a processor that uses 32 bits for addressing. The cache has 1024 cache blocks in total and the block size is 16 bytes. How many bits are then used for the tag field, the set field (also called the index), and the byte offset field, and how many validity bits does the cache contain in total? (4 points)
Solution: There are in total 1024/2 = 512 sets (Swedish: rader). Hence, the set field is 9 bits. The block size is 16 bytes, making the byte offset field 4 bits. As a consequence, the tag is 32 − 9 − 4 = 19 bits. Finally, there are in total 1024 validity bits, the same number as blocks.
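The field-width calculation in the solution above can be written as a small C helper. This is a minimal sketch that assumes the number of sets and the block size are powers of two; the struct and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned tag_bits;
    unsigned set_bits;
    unsigned offset_bits;
    unsigned valid_bits;  /* one validity bit per cache block */
} cache_fields;

/* Base-2 logarithm of a value that must be a power of two. */
static unsigned log2_exact(unsigned value)
{
    unsigned bits = 0;
    assert(value != 0 && (value & (value - 1)) == 0);
    while (value > 1) {
        value >>= 1;
        bits++;
    }
    return bits;
}

/* Address field widths for a set associative cache. */
cache_fields cache_layout(unsigned address_bits, unsigned blocks,
                          unsigned ways, unsigned block_bytes)
{
    cache_fields f;
    assert(ways != 0 && blocks % ways == 0);
    f.set_bits = log2_exact(blocks / ways);   /* sets = blocks / ways */
    f.offset_bits = log2_exact(block_bytes);
    assert(address_bits >= f.set_bits + f.offset_bits);
    f.tag_bits = address_bits - f.set_bits - f.offset_bits;
    f.valid_bits = blocks;
    return f;
}
```

For the cache in the question, cache_layout(32, 1024, 2, 16) gives a 19-bit tag, a 9-bit set field, a 4-bit byte offset, and 1024 validity bits. A direct mapped cache corresponds to ways = 1.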

(b) For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points)
• Statement 1: A common replacement policy for direct mapped caches is Least Recently Used (LRU), since a set does not always have to be replaced directly, even if the validity bit is 1.
Solution: False. A direct mapped cache does not need a specific replacement policy; if the set is in use, its block must always be replaced.
• Statement 2: One benefit of virtual memory is that each process (each running program) can have its own virtual memory space, which is protected from other concurrently running processes in the operating system.
Solution: True.
• Statement 3: Assume that we have a direct mapped cache with block size 16 bytes and 256 blocks in total. If a load byte instruction reads from address 0xff20 215e, followed by another load byte instruction that reads from address 0x1000 5150, the second load instruction will result in a cache miss.
Solution: True. Both instructions access the same set (0x15), but have different tags.
• Statement 4: A modern processor, such as the Intel Core i7, has only one large cache because one large cache is typically faster than several small caches.
Solution: False. A modern processor has several caches in the memory hierarchy (e.g., L1, L2, and L3 caches).

5. Module 5: I/O Systems

(a) Assume that an external interrupt occurs, a program A is preempted, and the program counter is changed so that an interrupt handling routine is executed. Explain briefly how it is possible to continue executing program A correctly at the program point where it was interrupted. (4 points)
Solution: When the external interrupt occurs, the processor must automatically save the program counter before jumping to the interrupt handling routine. For instance, in MIPS, the PC is saved in a register called EPC. The interrupt handling routine must save all registers on the stack before performing its task. When the interrupt handling routine has finished, it restores all registers and returns to the original program by using the saved PC.

(b) For each of the following statements, answer if the statement is true, false, or “don’t know”.
Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points)
• Statement 1: SPI and UART use synchronous and asynchronous serial communication, respectively.
Solution: True.
• Statement 2: A memory-mapped general-purpose I/O (GPIO) pin can be configured to be either an output or an input port by writing a configuration parameter to a specific memory address that is dedicated to configuring this GPIO pin.
Solution: True.
• Statement 3: Direct memory access (DMA) is a good alternative to instruction level parallelism (ILP) because it enables multiple instructions to be fetched by the datapath and executed in parallel since the code is read directly from memory.
Solution: False. DMA is not related to ILP and is not used for fetching several instructions in parallel.
• Statement 4: Declaring a pointer volatile in C (as in the example below) means that the pointer itself is volatile and may change to point to a different address at any point in time.
volatile int* x = (volatile int*) 0xff88;
Solution: False. The pointer itself is not declared volatile and cannot change unexpectedly, but the value that the pointer points to may change at any time.

6. Module 6: Parallel Processors and Programs

(a) A program consists of two parts, part A and part B. Part A is trivial to parallelize, whereas part B is not possible to parallelize at all and consists only of sequential code. Assume that the amount of improvement that can be achieved by parallelizing part A is proportional to the number of cores, i.e., using 4 cores, we achieve 4 times improvement on part A. The theoretical maximal speedup is 5, assuming that we have an infinite number of cores. Our competitor can run a sequential version of the program in 16 s, whereas we can, using our parallel implementation, run the program in 20 s on one core. What are then the relative speedup and the true speedup of our implementation when executing the program on 4 cores? Hint: recall that true speedup compares with the fastest available sequential implementation, whereas relative speedup compares with your own implementation running sequentially. (4 points)
Solution: Since we know that the theoretical speedup limit is 5, $\frac{1}{5}$ of the execution time is due to part B, the sequential code. For our implementation, which runs in 20 s on one core, part B therefore takes 20/5 = 4 s and part A takes 16 s. By applying Amdahl’s law, we get
$\text{speedup}_{\text{relative}} = \frac{20}{16/4 + 4} = \frac{20}{8} = 2.5$ and $\text{speedup}_{\text{true}} = \frac{16}{16/4 + 4} = \frac{16}{8} = 2.0$.
Hence, the true speedup is, as expected, somewhat lower than the relative speedup.

(b) For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points)
• Statement 1: A good example of MIMD is a modern multicore processor.
Solution: True.
• Statement 2: MapReduce can be based on message passing techniques and is today used extensively in warehouse-scale computers.
Solution: True.
• Statement 3: If a semaphore is used for mutual exclusion (mutex), it means the programmer can define critical sections by first locking a mutex before entering the critical section, and then unlocking the mutex when exiting the critical section.
Hence, a mutex can be used to control access to shared resources in a concurrent environment.
Solution: True.
• Statement 4: Assume we have a shared memory processor (SMP) consisting of 4 cores with separate L1 caches and an L2 cache that is shared among the cores. Assume further that two of the cores frequently read or write to the same address in memory. As a consequence, we easily get inconsistencies in the L1 caches. This phenomenon is called false sharing.
Solution: False. The problem described is related to cache coherence, but not directly to false sharing. If it were about false sharing, the cores would write to the same cache block, but not to the exact same address.

Part II: Advanced

1. Explain in detail how a 2-way set associative data cache works. Your solution should include the following:
• Sketch a hardware solution that can handle hit and miss detection. The solution should also return the correct cache value for a cache hit. This illustration should include and explain the terms tag, set, byte offset, validity bit, and way.
• Explain the concept of replacement policy. In particular, explain the meaning of Least Recently Used (LRU).
You do not have to provide the hardware solution for the LRU.
• Show assembly code examples and step-by-step guides for how the execution of the example affects the cache. Your code examples must illustrate both temporal locality and spatial locality. Your example code does not have to show the effects of the replacement policy. You may write NIOS II or MIPS assembly code.
Clearly describe and motivate your answers. Diagrams or figures without explanations will not give any points. (15 points)
Solution: This task can be answered in many different ways. A complete solution of the task is therefore omitted. Instead, we just give a few comments:
• When describing replacement policies, it is important to relate this to LRU. A good solution explains this by using an example.
• The code examples should show temporal and spatial locality in the data cache (the data cache is part of the exercise; the old solution text talked about the instruction cache).
Temporal locality can be shown by reading from the same data element several times, for instance in a loop. Spatial locality can be shown by, for instance, reading from an array.
• In the hardware solution, it is important to show how the tag in the address is compared with the tag stored in the cache, and how the validity bit is checked. The two results are usually combined with an AND gate. Note also that we need to use a multiplexer to select the correct output from the cache.
• It is important to explain all the terms (see item 1 in the question).

2. There are many ways to make use of parallelism to achieve better performance in a computer system. Three important concepts/techniques are SIMD instructions (for instance multimedia extensions), instruction level parallelism (ILP), and multicore. In this task, you should analyze, discuss, and compare these different techniques. In what way are they similar? In what way are they different? What are their pros and cons? How do they affect the programmer or the compiler? What are the limitations? Your answer should consist of a comprehensive and well-thought-out discussion. (10 points)
Solution: This task does not have a single solution. Completely different answers can still give the same number of points. In the following, we list some important aspects that can be included in an answer.
• SIMD makes use of data-level parallelism. Hence, there are limitations on what kinds of programs can actually utilize this kind of parallelism. The same instruction needs to be applied to different data.
• ILP can take the form of static and dynamic multiple issue. Dynamic multiple issue is very common in modern processors. The main benefit is that ILP does not affect the programmer; parallelism “comes for free” even for a sequential program. However, the amount of parallelism is limited due to dependencies between instructions.
• Multicore processors can give parallelism with the help of the programmer. Typically, for a shared memory multiprocessor (SMP), multithreaded programming can be used to achieve parallelism. This is a form of task parallelism (compared to data-level parallelism for SIMD).
• These techniques do not have to work in isolation. Instead, a program can make use of all these concepts and techniques to achieve speedups.
• A similarity between multicore and SIMD is that both these techniques typically need help from the programmer to actually work. Compiler-supported programming interfaces exist, e.g., OpenMP, that can be used to program multicore systems in an easy way.
• ILP does not need support from the programmer, but if the programmer programs in a special way, more ILP can be exploited. One such technique is loop unrolling, which makes it possible for the processor to fetch more independent instructions and thereby keep the pipeline filled.
• A problem with multicore programming is that it is quite hard, unless there are few dependencies between different tasks. Programs need to be synchronized using, for instance, semaphores.
• SIMD is today typically programmed with special platform-dependent libraries.
• A problem with SMP multicore systems is cache coherence. If the programmer is not aware of how the communication and access to memory affects the caches, the performance improvements may be much less than expected.

3.
English: Explain in detail the meaning of the following terms and concepts and their relationships: Execution time of programs, Cycles Per Instruction (CPI), Clock frequency, Clock period, Power, and Energy. Explain also how computer architects and processor manufacturers try to decrease energy consumption and why it is difficult to do. (10 points)

Solution: This task does not have a single solution. Completely different answers can still give the same number of points. In the following, we list some important aspects that can be included in an answer.
• A program's execution time depends on several factors, where the most important ones are i) the number of executed instructions, ii) the cycles per instruction (CPI), and iii) the clock period.
• If a processor has a pipeline, this can increase the clock frequency, which gives better performance. On the other hand, a pipeline can also introduce pipeline hazards that can result in stalls. This will increase the CPI.
• Clock frequency is defined as one divided by the clock period.
• For over 30 years, processor manufacturers constantly increased the clock rate of processors, and thus also the power. As a consequence, the so-called power wall was reached around year 2006. Instead of increasing the clock frequency of the processor, manufacturers started to add processor cores.
• The dominating source of energy consumption is the dynamic energy, which is consumed when each transistor is switching. The dynamic energy for switching a transistor is proportional to the capacitive load (which depends on the number of transistors connected to an output and on the manufacturing technology), and the square of the voltage.
Since the voltage is the dominating factor (the term is squared), processors today use a much lower supply voltage than just a few years ago.
• The dynamic power is proportional to the energy used for a transistor transition, multiplied by the switching frequency. This means that power increases when frequency increases, but the voltage is still dominating due to the square term. Note that the energy for computing a task does not decrease just because the frequency is lower, but the power becomes lower. Although efforts are made to lower the voltage, there is a limit for how low the voltage can be without increasing the leakage. In server systems, the static power dissipation due to leakage can be as high as 40%. Static energy is consumed (or transformed to other forms of energy) in CMOS technology even if a transistor is off. As a consequence, increasing the number of transistors (for instance, increasing the number of cores in a processor) increases the static energy consumption, even if the transistors are not switching. It is therefore hard to decrease the energy because of the voltage limitation and the increased demand for more cores and larger caches. There are also techniques for switching off parts of a chip during execution to decrease energy further.

4. English: Carefully analyze the partially commented MIPS assembler program on the next page. Construct a C program that generates the same result in memory as the MIPS program. Note that there are two memory arrays: a 16x16 word matrix starting at address 0x1001 0000 and an output array of size 16 words (its address is computed and stored in $s3 in part 2 of the MIPS code). Your C program should start with the variable declarations shown below. Assume that there exists code at the end of the program that prints out the arrays to standard output.
Explain in detail how your program works and what the expected result of the program is, i.e., you should explain what the program is actually trying to accomplish. (15 points)

int main(){
  const int maxval = 16;
  int matrix[maxval*maxval];
  int output[maxval];

  // Insert your code here.

  // Assume that there is code here that prints
  // out the arrays to standard output.
}

###### PART 1 #####
        addi  $s0, $0, 16
        sll   $s1, $s0, 4
        lui   $s2, 0x1001            # Address to a matrix that is 16x16 word
        addi  $t1, $0, 0             # counter i
loop1_outer:
        slt   $t3, $t1, $s0
        beq   $t3, $0, done1_outer
        addi  $t2, $0, 0             # counter j
loop1_inner:
        slt   $t3, $t2, $s0
        beq   $t3, $0, done1_inner
        sll   $t3, $t1, 4            # compute address
        add   $t3, $t3, $t2
        sll   $t3, $t3, 2
        add   $t3, $t3, $s2
        addi  $t4, $t1, 2            # compute element value
        addi  $t5, $t2, 2
        mul   $t4, $t4, $t5
        sw    $t4, 0($t3)            # store result in matrix
        addi  $t2, $t2, 1            # increase and loop
        j     loop1_inner
done1_inner:
        addi  $t1, $t1, 1
        j     loop1_outer
done1_outer:

###### PART 2 #####
        sll   $s3, $s1, 2
        add   $s3, $s3, $s2          # address to output array
        addi  $t1, $0, 2             # counter i
loop2_outer:
        slt   $t3, $t1, $s0
        beq   $t3, $0, done2_outer
        addi  $t6, $0, 1
        addi  $t2, $0, 0             # counter j
loop2_inner:
        slt   $t3, $t2, $s1
        beq   $t3, $0, done2_inner
        sll   $t4, $t2, 2            # get element from matrix
        add   $t4, $t4, $s2
        lw    $t5, 0($t4)
        bne   $t1, $t5, no_match     # check if elements match
        addi  $t6, $0, 0
no_match:
        addi  $t2, $t2, 1
        j     loop2_inner            # next
done2_inner:
        beq   $t6, $0, no_output     # check if write output
        sw    $t1, 0($s3)
        addi  $s3, $s3, 4
no_output:
        addi  $t1, $t1, 1
        j     loop2_outer            # next
done2_outer:

Solution: The program computes all prime numbers between 1 and 16 (the candidates 2 to 15 are tested) and stores them in the output array. The first part of the program computes a multiplication matrix, which is then used in the second part to search for numbers that are prime (i.e., numbers that do not exist in the matrix). This is a simple, but not very efficient, way of computing prime numbers. Example C code is shown below.
int main(){
  const int maxval = 16;
  int matrix[maxval*maxval];
  int output[maxval];

  // Compute the multiplication matrix
  int i,j;
  for(i=0; i<maxval; i++){
    for(j=0; j<maxval; j++){
      matrix[i*maxval + j] = (i+2) * (j+2);
    }
  }

  // Extract the prime numbers
  int k = 0;
  for(i=2; i<maxval; i++){
    int isprime = 1;
    for(j=0; j<maxval*maxval; j++){
      if(matrix[j] == i)
        isprime = 0;
    }
    if(isprime)
      output[k++] = i;
  }

  // Assume that there is code here that prints
  // out the arrays to standard output.
}

MIPS Reference Sheet

INSTRUCTION SET (SUBSET)

Name (format, op, funct)                    Syntax            Operation
add (R,0,32)                                add rd,rs,rt      reg(rd) := reg(rs) + reg(rt);
add immediate (I,8,na)                      addi rt,rs,imm    reg(rt) := reg(rs) + signext(imm);
add immediate unsigned (I,9,na)             addiu rt,rs,imm   reg(rt) := reg(rs) + signext(imm);
add unsigned (R,0,33)                       addu rd,rs,rt     reg(rd) := reg(rs) + reg(rt);
and (R,0,36)                                and rd,rs,rt      reg(rd) := reg(rs) & reg(rt);
and immediate (I,12,na)                     andi rt,rs,imm    reg(rt) := reg(rs) & zeroext(imm);
branch on equal (I,4,na)                    beq rs,rt,label   if reg(rs) == reg(rt) then PC = BTA else NOP;
branch on not equal (I,5,na)                bne rs,rt,label   if reg(rs) != reg(rt) then PC = BTA else NOP;
jump and link register (R,0,9)              jalr rs           $ra := PC + 4; PC := reg(rs);
jump register (R,0,8)                       jr rs             PC := reg(rs);
jump (J,2,na)                               j label           PC := JTA;
jump and link (J,3,na)                      jal label         $ra := PC + 4; PC := JTA;
load byte (I,32,na)                         lb rt,imm(rs)     reg(rt) := signext(mem[reg(rs) + signext(imm)]_7:0);
load byte unsigned (I,36,na)                lbu rt,imm(rs)    reg(rt) := zeroext(mem[reg(rs) + signext(imm)]_7:0);
load upper immediate (I,15,na)              lui rt,imm        reg(rt) := concat(imm, 16 bits of 0);
load word (I,35,na)                         lw rt,imm(rs)     reg(rt) := mem[reg(rs) + signext(imm)];
multiply, 32-bit result (R,28,2)            mul rd,rs,rt      reg(rd) := reg(rs) * reg(rt);
nor (R,0,39)                                nor rd,rs,rt      reg(rd) := not(reg(rs) | reg(rt));
or (R,0,37)                                 or rd,rs,rt       reg(rd) := reg(rs) | reg(rt);
or immediate (I,13,na)                      ori rt,rs,imm     reg(rt) := reg(rs) | zeroext(imm);
set less than (R,0,42)                      slt rd,rs,rt      reg(rd) := if reg(rs) < reg(rt) then 1 else 0;
set less than unsigned (R,0,43)             sltu rd,rs,rt     reg(rd) := if reg(rs) < reg(rt) then 1 else 0;
set less than immediate (I,10,na)           slti rt,rs,imm    reg(rt) := if reg(rs) < signext(imm) then 1 else 0;
set less than immediate unsigned (I,11,na)  sltiu rt,rs,imm   reg(rt) := if reg(rs) < signext(imm) then 1 else 0;
shift left logical (R,0,0)                  sll rd,rt,shamt   reg(rd) := reg(rt) << shamt;
shift left logical variable (R,0,4)         sllv rd,rt,rs     reg(rd) := reg(rt) << reg(rs)_4:0;
shift right arithmetic (R,0,3)              sra rd,rt,shamt   reg(rd) := reg(rt) >>> shamt;
shift right logical (R,0,2)                 srl rd,rt,shamt   reg(rd) := reg(rt) >> shamt;
shift right logical variable (R,0,6)        srlv rd,rt,rs     reg(rd) := reg(rt) >> reg(rs)_4:0;
store byte (I,40,na)                        sb rt,imm(rs)     mem[reg(rs) + signext(imm)]_7:0 := reg(rt)_7:0;
store word (I,43,na)                        sw rt,imm(rs)     mem[reg(rs) + signext(imm)] := reg(rt);
subtract (R,0,34)                           sub rd,rs,rt      reg(rd) := reg(rs) - reg(rt);
subtract unsigned (R,0,35)                  subu rd,rs,rt     reg(rd) := reg(rs) - reg(rt);
xor (R,0,38)                                xor rd,rs,rt      reg(rd) := reg(rs) ^ reg(rt);
xor immediate (I,14,na)                     xori rt,rs,imm    reg(rt) := reg(rs) ^ zeroext(imm);

Definitions
• Jump to target address: JTA = concat((PC + 4)_31:28, address(label), 00_2)
• Branch target address: BTA = PC + 4 + imm * 4

Clarifications
• All numbers are given in decimal form (base 10).
• Function signext(x) returns a 32-bit sign extended value of x in two’s complement form.
• Function zeroext(x) returns a 32-bit value, where zeros are added to the most significant side of x.
• Function concat(x, y, …, z) concatenates the bits of expressions x, y, …, z.
• Subscripts (written here after an underscore), for instance X_8:2, mean that bits with index 8 to 2 are spliced out of the integer X.
• Function address(x) means the address of label x.
• NOP and na mean “no operation” and “not applicable”, respectively.
• shamt is an abbreviation for “shift amount”, i.e. how much bit shifting should be done.

INSTRUCTION FORMAT

R-Type
Bits    31-26    25-21    20-16    15-11    10-6     5-0
Field   op       rs       rt       rd       shamt    funct
Size    6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-Type
Bits    31-26    25-21    20-16    15-0
Field   op       rs       rt       immediate
Size    6 bits   5 bits   5 bits   16 bits

J-Type
Bits    31-26    25-0
Field   op       address
Size    6 bits   26 bits

REGISTERS

Name    Number   Description
$zero   0        constant value 0
$at     1        assembler temp
$v0     2        function return
$v1     3        function return
$a0     4        argument
$a1     5        argument
$a2     6        argument
$a3     7        argument
$t0     8        temporary value
$t1     9        temporary value
$t2     10       temporary value
$t3     11       temporary value
$t4     12       temporary value
$t5     13       temporary value
$t6     14       temporary value
$t7     15       temporary value
$s0     16       saved temporary
$s1     17       saved temporary
$s2     18       saved temporary
$s3     19       saved temporary
$s4     20       saved temporary
$s5     21       saved temporary
$s6     22       saved temporary
$s7     23       saved temporary
$t8     24       temporary value
$t9     25       temporary value
$k0     26       reserved for OS
$k1     27       reserved for OS
$gp     28       global pointer
$sp     29       stack pointer
$fp     30       frame pointer
$ra     31       return address

MIPS Reference Sheet
By David Broman, KTH Royal Institute of Technology
If you find any errors or have any feedback on this document, please send me an email: [email protected]
Version 1.01, January 9, 2015

Excerpt from the Nios II Processor Reference Handbook, IS1200/IS1500, 2014-05-22, pages 11 to 18
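As a complement to the solution comments for Part II, question 1, the following is a minimal C sketch of how hit and miss detection and LRU replacement could behave in a 2-way set-associative data cache. It is a software model, not the hardware solution or the assembly example the question asks for. The geometry (4 sets, 16-byte blocks), the type names, and the function names cache_init and cache_access are all assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical geometry, chosen only for illustration. */
#define NUM_WAYS    2    /* 2-way set associative */
#define NUM_SETS    4    /* number of sets (rows) */
#define BLOCK_BYTES 16   /* 4 words per block */

typedef struct {
    int      valid;      /* validity bit */
    uint32_t tag;        /* address tag stored with the block */
} CacheLine;

typedef struct {
    CacheLine way[NUM_WAYS];
    int       lru;       /* index of the least recently used way */
} CacheSet;

typedef struct {
    CacheSet set[NUM_SETS];
    int hits;
    int misses;
} Cache;

/* Clears all validity bits and counters. */
void cache_init(Cache *c) {
    assert(NUM_WAYS == 2);   /* the LRU bookkeeping below assumes two ways */
    memset(c, 0, sizeof *c);
}

/* Models one data access. Returns 1 on a hit and 0 on a miss. */
int cache_access(Cache *c, uint32_t addr) {
    uint32_t block = addr / BLOCK_BYTES;   /* drop the byte offset */
    uint32_t index = block % NUM_SETS;     /* set number */
    uint32_t tag   = block / NUM_SETS;     /* remaining upper address bits */
    CacheSet *s = &c->set[index];

    for (int w = 0; w < NUM_WAYS; w++) {
        /* hit = validity bit AND (stored tag == address tag) */
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->lru = 1 - w;                /* the other way becomes LRU */
            c->hits++;
            return 1;
        }
    }

    /* Miss: fill an invalid way if there is one, else replace the LRU way. */
    int victim = s->lru;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!s->way[w].valid) {
            victim = w;
            break;
        }
    }
    s->way[victim].valid = 1;
    s->way[victim].tag = tag;
    s->lru = 1 - victim;
    c->misses++;
    return 0;
}
```

With these assumed parameters, accessing the (hypothetical) addresses 0x1000, 0x1004, 0x1000, 0x1040, 0x1080, 0x1040, 0x1000 in that order gives miss, hit, hit, miss, miss, hit, miss. The hit on 0x1004 illustrates spatial locality (same block as 0x1000), the second access to 0x1000 illustrates temporal locality, and the access to 0x1080 evicts the block holding 0x1000 because it is the least recently used block in that set.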
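The relationships listed in the solution comments for Part II, question 3 can be summarized as follows. The symbol names and the numbers in the two examples are assumptions introduced here for illustration; they do not come from the exam.

```latex
\begin{align*}
T_{\text{exec}} &= N_{\text{instr}} \cdot \text{CPI} \cdot T_{\text{clk}}
                 = \frac{N_{\text{instr}} \cdot \text{CPI}}{f_{\text{clk}}},
  \qquad f_{\text{clk}} = \frac{1}{T_{\text{clk}}} \\
E_{\text{dyn}}  &\propto C \cdot V^{2} \\
P_{\text{dyn}}  &\propto C \cdot V^{2} \cdot f_{\text{switch}} \\[4pt]
\text{Example 1:}\quad
T_{\text{exec}} &= \frac{10^{9} \cdot 2}{10^{9}\,\text{Hz}} = 2\,\text{s} \\
\text{Example 2:}\quad
\frac{P_{\text{new}}}{P_{\text{old}}}
  &= \frac{C \,(0.85\,V)^{2}\,(0.85\,f)}{C \, V^{2} \, f}
   = 0.85^{3} \approx 0.61
\end{align*}
```

Example 1 assumes a program of 10^9 instructions with CPI 2 on a 1 GHz clock. Example 2 assumes that both voltage and switching frequency are reduced by 15% while the capacitive load C is unchanged; the squared voltage term accounts for most of the reduction in dynamic power.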