Design of Arbitrary-way Superscalar Processor based on R10K Architecture A Final Report for EECS 470 Group 3: Chuyi Jiang, Di Hu, Jingyuan Sun, Yongyi Lu, Yuanlang Song Department of EECS, University of Michigan {chuyi, hudi, sunjy, luyongyi, ylsong}@umich.edu

Abstract—Current trends in industry have been increasingly demanding high-efficiency and high-performance processors, urging hardware structures to be utilized as efficiently as possible. To achieve this, computer architects continue to implement different features, among which exploiting instruction-level parallelism (ILP) is one of the most effective methods to improve hardware performance. This article presents an implementation of an out-of-order processor based on the R10K architecture that features an arbitrary-way superscalar pipeline, a local branch predictor, a branch target buffer, a load-store queue, a four-way associative non-blocking data cache, a blocking instruction cache, multiple function units with varying latencies, and an advanced data forwarding scheme. This design significantly decreases CPI while maintaining a reasonable clock period, thereby enhancing overall performance.

Keywords—arbitrary-way superscalar; R10K architecture; out-of-order pipeline; load-store queue; four-way associative non-blocking cache

I. INTRODUCTION

Pipelining is a technique that breaks large tasks into multiple smaller sections so that several tasks can be executed in parallel. In computer architecture, instruction pipelining is now very common and significantly increases the parallelism of a processor. However, it is still restricted by the inherent dependencies between instructions. To resolve this problem, out-of-order pipelining has been introduced and widely used. In this paper, we present a more advanced out-of-order (OoO) pipeline with several features that further increase the performance of our design.
Basic features include a data cache and an instruction cache, multiple functional units with varying latencies, out-of-order execution, branch prediction with target-address prediction, and the ability to process two load misses in parallel. The advanced features implemented are the arbitrary-way superscalar pipeline, the memory hierarchy (including a four-way associative non-blocking L1 cache), data forwarding from stores to loads, and the ability to access memory out of order. In the remainder of this report, Section II discusses the motivation for designing and implementing the presented processor with certain advanced features; Section III gives the design overview; Section IV discusses the feature implementation in detail; Section V shows the test results and performance analysis; Section VI shows the contribution of each group member; and Section VII concludes.

II. MOTIVATION

This section discusses the motivation for designing and implementing the proposed processor. Three main features are discussed: out-of-order implementation with arbitrary-way superscalar, memory enhancement, and branch control.

A. Out-of-order Implementation and Arbitrary-way Superscalar

Dependency is a typical problem in programs, referring to the fact that an instruction requires the results of previous instructions in order to execute. There are three main types of dependencies: read after write (RAW), write after read (WAR), and write after write (WAW), among which only RAW is a true dependency. The false dependencies, which are in fact naming conflicts between registers, can be solved by renaming. The true dependency, however, forces the program to wait for the previous instructions to complete, and such stalling seriously harms performance. Earlier in-order pipelined machines with data forwarding and stalling can partially mitigate this issue, but still cannot make full use of the available resources.
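The three dependency classes can be made concrete with a small sketch. This is an illustrative helper (not part of our Verilog design) that classifies the hazards between an older and a younger instruction from their destination and source registers.

```python
# Hypothetical sketch: classify the dependencies between an older and a
# younger instruction from their destination and source registers.
def classify_hazards(older_dest, older_srcs, younger_dest, younger_srcs):
    hazards = []
    if older_dest is not None and older_dest in younger_srcs:
        hazards.append("RAW")           # true dependency: must wait for the value
    if younger_dest is not None and younger_dest in older_srcs:
        hazards.append("WAR")           # false dependency: removed by renaming
    if older_dest is not None and older_dest == younger_dest:
        hazards.append("WAW")           # false dependency: removed by renaming
    return hazards

# addq r1, r2 -> r3   followed by   mulq r3, r4 -> r1
print(classify_hazards("r3", {"r1", "r2"}, "r1", {"r3", "r4"}))  # ['RAW', 'WAR']
```

Renaming gives the younger instruction's destination a fresh register, which removes the WAR and WAW cases but leaves the RAW case, so only true dependencies remain to be enforced.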
Multi-cycle instructions will occupy the entire execution stage and stall later instructions. In practice, younger instructions are often ready to execute before older ones. A solution is to execute out of order (OoO), allowing ready instructions to proceed while the instructions before them are stalled. Therefore, an R10K scheme based on OoO execution was chosen in order to better utilize the hardware resources and decrease cycles per instruction (CPI). To further exploit ILP, the microprocessor was built as a superscalar CPU. For this specific design, an arbitrary-way superscalar was implemented with a unified algorithm and Verilog implementation. Our superscalar is designed to be flexible for various requirements and limitations, as well as to provide an abstraction of superscalar implementation.

B. Memory Enhancement

Accessing main memory is typically slow due to hardware limitations. In order to fetch data more efficiently and further improve performance, an instruction cache and a data cache have been implemented; both are four-way associative. Normally, when there is a cache miss, the memory stage stalls to wait for data. In order to still access the cache when previous loads or stores have missed, the L1 data cache is designed to be non-blocking. A load-store queue is also designed to accelerate programs by forwarding data from the store queue to the load queue and by executing memory accesses out of order.

Fig. 1 Architecture of the Presented Processor

C. Branch Control

The purpose of the branch predictor is to improve the flow of the instruction pipeline. Branch predictors play a critical role in achieving efficient performance in many modern pipelined microprocessor architectures such as x86. In this microprocessor, a branch predictor predicts whether a branch is taken or not, and a branch target buffer (BTB) predicts which address to branch to. The branch predictor is a local predictor.
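As a rough behavioral sketch (not RTL) of such a two-level local scheme, the model below uses the sizes adopted in our design, described in Section IV: a 16-entry branch history table (BHT) indexed by PC bits [5:2] selects a 4-bit history, which indexes a 16-entry pattern history table (PHT) of 2-bit saturating counters starting at weakly-not-taken.

```python
# Behavioral sketch of a 2-level local predictor (class name is illustrative).
class LocalPredictor:
    def __init__(self):
        self.bht = [0] * 16          # 4-bit local histories
        self.pht = [1] * 16          # 2-bit counters, 1 = weakly-not-taken

    def _bht_index(self, pc):
        return (pc >> 2) & 0xF       # PC bits [5:2]

    def predict(self, pc):
        counter = self.pht[self.bht[self._bht_index(pc)]]
        return (counter >> 1) == 1   # bit 1 of the counter is the direction

    def update(self, pc, taken):     # called when the branch commits in ROB
        i = self._bht_index(pc)
        h = self.bht[i]
        if taken:
            self.pht[h] = min(self.pht[h] + 1, 3)   # saturate upward
        else:
            self.pht[h] = max(self.pht[h] - 1, 0)   # saturate downward
        self.bht[i] = ((h << 1) | taken) & 0xF      # shift in the outcome

# A branch at PC 0x40 that is always taken trains quickly:
bp = LocalPredictor()
for _ in range(8):
    bp.update(0x40, True)
print(bp.predict(0x40))   # True
```

Because each static branch trains its own history and counters, one branch's outcome cannot pollute another's state the way a shared global history can.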
The reason for choosing a local predictor is discussed in Section V.

III. DESIGN OVERVIEW

Fig. 1 shows the architecture of the presented processor; all major modules and their Verilog module names appear in the figure. The processor features an arbitrary-way superscalar pipeline, allowing up to N instructions to be processed at a time, where N is the number of superscalar ways specified in the Verilog .vh header file. This requires all modules to be able to process N instructions in one cycle. There are five main stages in this microprocessor. N instructions are first fetched from the ICache and memory. The BTB determines whether an instruction is a branch based on its PC; if a branch is detected, the BTB provides its target PC and the local predictor predicts whether the branch is taken. In the dispatch stage, N decoders decode the N instructions, complete renaming, and pass the relevant information to the issue stage. An N-way reservation station (RS) and a reorder buffer (ROB) receive this information, issuing instructions out of order and committing them in order, respectively. In the execution stage, N function units, each supporting two types of operations (ALU and MULT), process N instructions and produce the results. If an instruction involves a memory load or store, the DCache together with the load queue and store queue receives it and processes it out of order. Table I shows the configuration of the core modules.

TABLE I. SIZES OF MAJOR MODULES
RS  ROB  BTB  SQ/LQ  PRF  RAT  DCache  ICache
16  32   64   8      64   32   32      32

IV. FEATURE IMPLEMENTATION

This section discusses the implementation details of each feature, stage by stage. Some modules do not belong to a single stage; each of these is discussed under the stage whose key modules it is most closely related to.

A. Fetch Stage

The fetch stage reads in the instructions from the ICache.
Each cycle it fetches up to N instructions from the ICache and sends them, along with valid signals, to the IF/ID register. The valid number of instructions is the minimum of the number of valid instructions from the ICache and the available entries of the reorder buffer (ROB) and the reservation station (RS). The next PC (NPC) is determined by the current PC, the valid number of instructions, the branch predictor, the BTB, and the misprediction signal. If a misprediction is resolved, NPC is the correct target PC from the ROB. If there is no misprediction, NPC is the predicted target PC when a branch is predicted; otherwise, NPC is the current PC plus the valid number of instructions.

Instruction Cache (ICache)

The instruction cache is designed to be four-way associative and blocking. Every cycle the instruction cache receives up to N requests. If there is no cache miss, it sends the requested number of instructions to the fetch stage. When there is a miss, the pipeline stops reading instructions in, stalls, and waits until the data is returned from memory. Due to the memory bandwidth limitation, the instruction cache can only get two instructions from memory per cycle. The main controller in the instruction cache is the request queue; Fig. 2 shows how the request queue and the instruction cache work. If a request hits, the data is fetched directly from the ICache. If a request misses in the cache memory, the request queue takes in this request, then sends one request to memory per cycle together with the address of the missed instruction, and the memory starts to look for that instruction. A request entry is evicted when the data is found and given to the fetch stage. Every cycle the memory provides three kinds of information to the request queue: data, response, and tag. The tag is matched with data, meaning that the request holding this tag should take the data. The memory also gives a response to every request sent by the request queue; this response is the key for a request to match a tag and obtain its data.

Fig. 2 Illustration of Request Queue Working Scheme for ICache

Branch Target Buffer (BTB)

The BTB is a 64-entry direct-mapped register file that stores previous branch target addresses. It is indexed by the lower 6 bits (bits 7 to 2) of the PC. Each BTB entry stores a 32-bit instruction tag and a 32-bit target PC. An instruction in the fetch stage uses its PC to locate its own entry in the BTB. If the instruction tag does not match, the BTB predicts a non-branch instruction, and the output target PC and the prediction from the local predictor are not used in the fetch stage. After a mispredicted branch is committed in the ROB, the BTB updates the table with the committed PC and its target address.

Branch Predictor

A 2-level local predictor is used as the direction predictor, consisting of a branch history table (BHT) and a pattern history table (PHT). The BHT is a 16-entry direct-mapped register file, indexed by the lower 4 bits (bits 5 to 2) of the branch PC. Each entry has 4 bits representing the PHT index. The PHT is also a 16-entry direct-mapped register file, and each entry has 2 bits. It outputs a 1-bit taken-or-not prediction using bit 1 of the 2-bit saturating counter, which starts at weakly-not-taken. When a branch is committed in the ROB, the counter in the PHT is updated based on the result of the branch and its PHT index, incrementing if taken and decrementing if not taken. Also, the BHT entry is shifted left by one bit to store the result of the branch. The direction predictor produces a prediction for every incoming instruction and sends it back to the fetch stage; the BTB outputs a signal indicating whether the instruction associated with the current PC is a branch, which controls whether the prediction is actually used.

B. Dispatch Stage

The dispatch stage consists of N independent decoders, each of which decodes an instruction from the IF/ID register in each cycle. This stage handles register renaming, inserts the decoded instructions into the ROB in their original order, and outputs signals to the ID/ISS pipeline registers. The key structures used in this stage are the register map table, the physical register files, the free list, and the reorder buffer. An architectural register is mapped to a new physical register taken from the free list whenever the architectural register is the destination of an instruction. The new physical register is marked not free until the instruction is committed, after which it can be returned to the free list. The instructions are committed in order by the ROB.

Physical Register Files (PRF)

The PRF consists of 64 entries corresponding to 64 physical registers. Each entry stores the value of the architectural register to which this physical register is mapped, a free signal indicating whether it is in use, and a valid signal indicating whether it has obtained its final result from the CDB. The PRF monitors the CDB and ROB retire signals and updates the value, valid, and free signals in each entry accordingly. When a renaming request comes from a decoder, the PRF outputs a free tag according to the free list, and this entry remains not free until the instruction retires from the ROB. The PRF also outputs the operand values of the instruction to the RS: if the physical register is valid, it outputs the stored value; otherwise, it outputs the physical register index. The valid bit is transferred into the RS along with the operand values, so that the RS can distinguish whether an incoming number is a real value or a PRF index.

Renaming Table & Retire Renaming Table (RAT & RRAT)

Both the RAT and the RRAT have thirty-two entries. To obtain operand values, the RAT outputs the tags to the PRF, and the PRF generates outputs as described above. Fig. 3 shows the renaming scheme of our design.
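As a rough software model of this rename step (a list-based RAT and free list with illustrative names, not our Verilog implementation), the per-cycle behavior might look like:

```python
# A minimal sketch of per-cycle renaming with intra-group forwarding.
def rename_group(rat, free_list, group):
    """group: list of (dest_arch_reg, [src_arch_regs]); mutates rat/free_list."""
    renamed = []
    for dest, srcs in group:
        # Sources read the RAT *after* earlier instructions in the same
        # cycle have updated it, which forwards the newest physical index.
        phys_srcs = [rat[s] for s in srcs]
        new_preg = free_list.pop(0)      # priority-select a free physical reg
        old_preg = rat[dest]             # freed once the instruction retires
        rat[dest] = new_preg
        renamed.append((new_preg, phys_srcs, old_preg))
    return renamed

rat = list(range(32))                    # arch reg i initially maps to preg i
free = [32, 33, 34]
out = rename_group(rat, free, [(1, [2]), (1, [1])])   # back-to-back writes to r1
print(out)   # [(32, [2], 1), (33, [32], 32)] -- second instr sees preg 32
```

The second instruction's source reads physical register 32 rather than the stale mapping, which is exactly the intra-group dependency forwarding described below.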
Take a three-way superscalar machine as an example: in each cycle, the priority selector selects three free physical registers from the PRF and sends them to the RAT. Based on the renaming information from the decoders, some architectural registers are renamed accordingly, and the physical registers that were previously mapped to those architectural registers are then set free in the PRF. To support superscalar execution, the renaming logic must handle data dependencies among instructions that enter in the same cycle. A common concern is that a later instruction depends on an earlier one dispatched in the same clock cycle. To ensure that the later instruction always gets the most up-to-date PRF index, the RAT checks for this kind of dependency and forwards the correct PRF index internally. The RRAT is updated only when an instruction retires from the ROB, recording the physical register committed for each architectural register. Therefore, the RRAT stores the renaming of each architectural register in actual completion order. If a misprediction happens, the RAT copies the whole table from the RRAT. Fig. 3 shows an example of renaming in a 3-way superscalar machine. At the beginning of every cycle, the decoders send the indices of the architectural registers to be renamed to the RAT; in this example, 10, 18, and 29 are sent. At the same time, a priority selector selects three free physical registers from the PRF and sends their indices to the RAT; as shown in Fig. 3, 1, 57, and 62 are chosen. Architectural registers 10, 18, and 29 are then re-mapped to physical registers 1, 57, and 62, and the physical registers previously mapped to those architectural registers are freed in the PRF.

Fig. 3 Example of Renaming of a 3-way Superscalar Machine

Reorder Buffer (ROB)

The reorder buffer, which is responsible for holding and retiring the instructions in order, is set to have 32 entries.
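Its in-order retirement can be modeled as a circular buffer with head and tail pointers. The sketch below is a simplified software model with illustrative method names, not our Verilog module:

```python
# Illustrative sketch of the ROB's circular head/tail management,
# assuming a fixed 32-entry buffer (squash resets the tail to the head).
class ROB:
    SIZE = 32

    def __init__(self):
        self.entries = [None] * self.SIZE
        self.head = 0        # oldest instruction (next to retire)
        self.tail = 0        # next free slot
        self.count = 0

    def dispatch(self, instr):
        assert self.count < self.SIZE, "ROB full: fetch must stall"
        idx = self.tail
        self.entries[idx] = instr
        self.tail = (self.tail + 1) % self.SIZE   # wrap around at the end
        self.count += 1
        return idx           # sent to the RS as the instruction tag

    def retire(self):
        instr = self.entries[self.head]
        self.head = (self.head + 1) % self.SIZE
        self.count -= 1
        return instr

    def squash(self):        # misprediction: drop everything younger
        self.tail = self.head
        self.count = 0

rob = ROB()
tags = [rob.dispatch(f"i{n}") for n in range(3)]
print(tags, rob.retire())   # [0, 1, 2] i0
```

The modulo wrap-around and the tail-to-head reset on a squash mirror the pointer behavior described in the following paragraphs.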
After an instruction is dispatched, the architectural and physical indices of the destination register and the instruction PC are stored in the entry pointed to by the tail pointer. The indices of the entries used to store instructions are sent to the reservation station as instruction tags. Each entry has two 1-bit signals indicating whether the instruction is a branch or a halt, respectively. For a branch instruction, signals including the branch misprediction flag and the branch target PC are stored. Once a branch instruction is detected as mispredicted at commit, the ROB squashes the entries behind it and commits this exception in the same clock cycle; the tail pointer is moved to point to the same entry as the head pointer. The misprediction signal is also output to the pipeline for the purpose of updating the predictors and fetching the target PC. Each time instructions are retired, the head pointer increments. When new instructions are added to the ROB, the tail pointer increments and the new entries are marked as not yet executed. Since the fetch stage has already fetched an appropriate number of instructions, as described previously, the instructions passed into the ROB always fit into the available ROB entries. When a pointer reaches the end of the ROB and is incremented, it automatically wraps around to the beginning of the ROB due to overflow. The ROB also has an executed signal to indicate whether an instruction stored in the ROB has finished execution.

C. Issue Stage

The issue stage waits for data dependencies to be cleared and issues the ready-to-execute instructions to the ISS/EX registers in each cycle. Our issue stage receives instructions from the ID/ISS registers and stores them in the reservation station (RS). True dependencies cannot be eliminated, and instructions that truly depend on others have to wait in the RS until all of their operands have valid values. False dependencies are eliminated by register renaming. Structural hazards in this stage relate to function unit availability.
If no functional unit is available, the issue stage holds all the instructions.

Reservation Station (RS)

The RS is a key module in the OoO processor design because it holds instructions until they have no data dependencies or structural hazards. It is designed to have 32 entries. Each entry records the operation type, the operand values or PRF indices, the operand ready bits, the destination register PRF index, the ROB index, the next PC, and, for branch instructions, the predicted target PC. Our RS contains a free-list module that generates the free indices for holding new instructions and updates the list after sending instructions to the execute stage. By monitoring the common data bus (CDB), the RS updates the operand ready bits and values when it receives signals indicating that an execution or a memory load is done. The computed or loaded value can be forwarded directly to execution in the same cycle. Such forwarding improves CPI by reducing the waiting time of instructions in the RS. As shown in Fig. 4, a dependency chain consisting of Instructions A, B, and C enters the RS in the same cycle; B depends on A, and C depends on B. In our design, the RS selects Instruction A in Cycle 2, and Instruction A is executed in Cycle 3. The RS then selects Instruction B in Cycle 3 and forwards the execution result as B's operand, allowing B to be executed in Cycle 4. Similarly, Instruction C is selected at the same time B completes execution in Cycle 4 and finishes execution in Cycle 5.

Fig. 4 Example of Data Forwarding of a 3-way Superscalar Machine

For the N sets of function units in the execute stage, each containing an ALU and a multiplier (MULT), two priority selectors select N ALU operations and N multiplications in each cycle. Each pair of an ALU operation and a multiplication selected from the RS is directly mapped to the corresponding function unit in the execute stage. We send ALU and/or multiplication instructions to the execute stage only if the corresponding function unit is available.
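The same-cycle wakeup-and-forward behavior illustrated in Fig. 4 can be modeled roughly as follows. This is a behavioral sketch with illustrative data structures, not the RTL:

```python
# Sketch of RS wakeup/forwarding: a CDB broadcast of (tag, value) fills
# matching operands in the same cycle, so a dependent instruction becomes
# selectable without waiting an extra cycle.
def cdb_broadcast(rs_entries, tag, value):
    for entry in rs_entries:
        for i, (ready, operand) in enumerate(entry["ops"]):
            if not ready and operand == tag:      # operand holds a PRF index
                entry["ops"][i] = (True, value)   # now holds the real value

def select_ready(rs_entries):
    return [e for e in rs_entries if all(r for r, _ in e["ops"])]

rs = [{"name": "B", "ops": [(True, 7), (False, 42)]}]   # waits on PRF tag 42
cdb_broadcast(rs, 42, 100)        # A finishes; result forwarded immediately
print([e["name"] for e in select_ready(rs)])   # ['B']
```

The ready flag plays the role of the valid bit from the PRF, which is what lets the RS tell a real value apart from a PRF index in the same operand field.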
Between ALU operations and multiplications, we select multiplications before ALU operations because multiplications take more cycles to execute. The RS outputs all the fields of an entry to the execute stage and clears the entry after sending. Fig. 5 shows an example of the RS selection scheme for a three-way superscalar machine. At the beginning of a certain cycle, the operation type and valid status of each RS entry are as shown in Fig. 5. The priority selector for ALU operations selects 3 valid ALU operations. Because there are only two valid multiplications, the other priority selector selects only two and leaves its third slot invalid. After selection, the selector issues instructions in the order of multiplications first and ALU operations after. Since the first way has both its ALU and multiplier available and both an ALU operation and a multiplication valid, it sends the multiplication to the execution stage. The second way also has both an ALU operation and a multiplication valid, but its multiplier is not ready, so the ALU operation is sent out on this way. On the third way, even though both the ALU and the multiplier are available, the ALU operation is sent out, since the priority selector for multiplications does not have a valid instruction on this way.

Fig. 5 Illustration of RS Selection Scheme for a 3-way Superscalar Machine

D. Execute Stage

The execute stage carries out the instructions from the ISS/EX registers with N sets of function units. This stage is purely combinational, and it always executes the current instructions regardless of whether they are valid. If a current instruction is invalid, an invalid signal becomes true and is sent to the other related modules through the CDB. Each function unit contains one ALU and one multiplier, and all function units output their results to the CDB. The ALU is non-clocked, which means that any instruction fed into an ALU finishes within the same cycle. The multiplication module is pipelined in 4 stages and takes 4 cycles to finish. Because the CDB is shared by the ALU and the multiplier, in the cycle when a multiplier finishes its computation, its corresponding ALU cannot process a valid instruction, to avoid a structural hazard. In other cycles, while a multiplier is executing, the ALU on the same way can still take in new instructions.

E. Memory Stage

The memory stage deals with the load and store instructions passed through the pipeline. In order to improve the efficiency of communication with memory, a DCache, a load queue, and a store queue are designed.

Load Queue

The load queue handles load instructions and can take in at most N instructions per cycle. Its size is subject to change according to the number of superscalar ways. Once a load instruction has been decoded in the dispatch stage, the load queue gets its information from the ID/ISS register. If a load instruction loads data from the same address to which a previous store instruction has stored data, there is a dependency, which may cause mistakes and reduce performance. In order to solve this problem, once the destination register of a load instruction is valid, the information of this load is sent to the store queue to check whether there is a dependent earlier store, which is realized by comparing the ROB indices and the destination registers of the stores and loads. If there is a dependent earlier store and the address and value of this store are known, the store queue directly forwards the data to the load queue and sends a ready signal; the load queue then loads the data into the PRF. If no dependent earlier store is found in the store queue, the store queue still raises the ready signal but sends no data, and the load queue then sends a load request to the data cache. After the load queue gets the value from the data cache, it sends the value to the PRF and tells the ROB that this load instruction has completed. The corresponding entries in the load queue are then evicted.

Store Queue

The store queue handles store instructions and can take in at most N instructions per cycle. Its size is also subject to change according to the number of superscalar ways. Once a store instruction has been decoded in the dispatch stage, the store queue gets its information from the ID/ISS register. The store queue communicates with the DCache, which will be discussed later. When a load is sent to the store queue to check dependencies, the store queue is traversed and gives back the relevant information. Because memory cannot recover from a store issued under a misprediction, entries in the store queue are cleared only when the store instructions they hold are committed in the ROB.

Data Cache (DCache)

The data cache is designed to be four-way associative and non-blocking. It takes read or write requests directly from the load queue and the store queue. Each cycle, the data cache takes at most N loads and N stores and outputs at most N lines in response to loads. The data cache is mainly composed of the cache memory, the action queue, and the request queue; Fig. 6 shows its working scheme. In order to achieve non-blocking behavior, an action queue is built to hold previously missed load requests. It is filled when new load requests come in and miss in the data cache, and drained when previously unresolved load requests are resolved. The request queue is responsible for communication with memory. When a load request misses in the cache memory and there is no other load request with the same address in the request queue, the request queue takes in this new load request. When a store request misses in the cache memory, if there is a store request with the same address in the request queue, the data in the old request is updated to the new value.
If there is no store request with the same address in the request queue, the request queue takes in the new store request. Only one request can be sent to memory per cycle, and while the data cache is sending a request to memory, the instruction cache must not send one. After a request has been sent, the request queue stores the response from memory. Each cycle, if the memory tag is nonzero and matches one of the responses stored in the request queue, the memory data is taken in and written into the cache memory. Once a request has been resolved in the request queue, the address and data are broadcast to all entries in the action queue. In later clock cycles, if one or more load requests from the load queue miss, an empty slot can be found in the action queue to return the data of a resolved load request.

Fig. 6 Working Scheme of DCache

V. TEST & ANALYSIS

This section covers the analysis of the design and its performance tests.

A. Choice of Predictor

The local predictor scheme was chosen after comparison with a Gshare predictor, which XORs part of the branch PC with the global branch history register and then looks into the pattern history table to find the prediction. The local predictor allows the processor to better predict the pattern of a branch by remembering its most recent outcomes, indexed by the branch PC. The performance of the local predictor is better than that of the Gshare predictor because the global history register may supply wrong information when one or two branches are still unresolved; this happens because branches are resolved in a later stage. So if two branches share the same predictor index, the later one will not see the result updated by the earlier one and is likely to be mispredicted. The local predictor avoids this problem, leaving only tightly-spaced branches prone to such faults. The local predictor is also simple to implement compared to more advanced alternatives such as a tournament predictor. Table II shows the hit rates of different configurations of BTB, PHT, and BHT. As shown in Table II, there is a performance threshold for the BTB size between 16 and 32 entries. In addition, the hit rate is positively related to the sizes of the BTB, PHT, and BHT. Since none of these modules is on the critical path of the processor, all are set to their maximum sizes in the final design.

TABLE II. HIT RATE OF LOCAL PREDICTOR
BTB Size  PHT Size  BHT Size  Hit Rate
64        16        16        81%
32        16        16        80%
16        16        16        78%
64        8         16        71%

B. Cache Design

At the beginning, a fully associative instruction cache with a true LRU algorithm was designed. During synthesis, the clock period for the instruction cache was found to be more than 25 ns. Analysis showed that the traversal through all 32 entries to find the appropriate position for new data was the critical path. A pseudo-LRU algorithm and 4-way associativity were considered as solutions. The 4-way associative cache achieved a 20 ns reduction in clock period with little impact on the hit/miss rate of the instruction cache. The pseudo-LRU algorithm showed little reduction in clock period, so true LRU remained unchanged.

C. Reservation Station Design

One-way RS v.s. N-way RS

In the final design, one RS is implemented which is able to receive and send multiple instructions in one cycle. The advantage of this implementation is that information about the RS can be unified; for example, the fetch stage can read a single value of the RS occupancy to determine the number of instructions to fetch. In earlier designs, a multi-RS implementation was attempted but rejected. The multi-RS implementation aims to simplify the output logic, where each RS is mapped to a fixed set of function units (an idea that is also applied in our current single-RS design). However, this design was unable to inform the fetch stage about its availability. Without knowing how many instructions the RS could store, fetching became problematic and might require recovery in later stages.
Therefore, the multi-RS design was discarded.

Selection Logic

Another possible selection logic for the RS (which was once implemented) is demonstrated in Fig. 7, where no forwarding logic is used. Dependent instructions such as Instruction B can no longer be selected in Cycle 3, and the same dependency chain requires 2 more cycles before Instruction C completes execution.

Fig. 7 Example of an Alternative RS Selection Scheme of a 3-way Superscalar Machine

This selection logic was competitive with the logic applied in the final design because it eliminates all forwarding and greatly reduces the clock period of the issue stage. However, when this logic was implemented in the three-way superscalar design, the overall clock period was not improved, because the issue stage was not on the critical path. Thus, the forwarding logic was chosen in our three-way superscalar design. For superscalar designs with more than three ways, the forwarding logic did become the critical path; this change was caused by the increased sizes of other modules such as the LSQ, which required more time to obtain their results.

A third selection logic was proposed but not implemented. Stark, a researcher from Intel Corporation, introduced this logic in his paper. It applies a dynamic scheduling technique, which allows the RS to speculate on the execution order of instructions without having their operand values ready [1]. To explain this logic, we use the same example with Instructions A, B, and C, as shown in Fig. 8. In the dynamic scheduling scheme, Instruction A broadcasts its PRF entry number at the same time it is selected, which also wakes up Instruction B. Instruction B is selected for execution in Cycle 3 and wakes up Instruction C. Without having the result ready, Instruction B takes another cycle to read the execution result (Cycle 4). By applying the dynamic scheduling scheme, the forwarding logic can be simplified, thus reducing the delay.
In addition, the number of cycles used to complete execution decreases.

Fig. 8 Dynamic Scheduling Scheme in RS

However, the dynamic scheduling scheme introduced complicated recovery logic and required intensive modification to the pipeline, which eventually forced us to give up this scheme.

D. Out-of-order Implementation

This section demonstrates the out-of-order algorithm in our design with an assembly code segment. The segment listed below is taken from one of the instructor test cases.

mulq $r12,$r2,$r1
addq $r1,$r3,$r1
srl  $r10,32,$r10
srl  $r11,32,$r11
srl  $r12,32,$r12

In this segment, addq is dependent on the result of mulq (both write architectural register 1). At the same time, all srl instructions are independent of the preceding instructions. Provided that all five instructions are stored in the RS of a three-way superscalar machine, as shown in Fig. 9, the selection logic will first select mulq and two srl instructions to send to execution. In this example, assume that the RS selects the first two srl instructions.

Fig. 9 Initial State of Instructions in RS

In the next cycle, only two instructions are in the RS, as shown in Fig. 10. The selection logic selects srl, because addq has one operand not ready (it depends on the result of the previous mulq). Therefore, only srl is sent to execution in the next cycle.

Fig. 10 RS after First Cycle

After three cycles, the multiplier finishes execution and broadcasts the result. The addq instruction in the RS receives the value and changes its ready bit to 1, as shown in Fig. 11. The selection logic then selects this addq, and it finishes execution in the next cycle. Thus, we finish executing the code segment with the out-of-order algorithm.

Fig. 11 RS after Multiplier Finishes Execution

E. Test Results

In our design, the number of superscalar ways and the sizes of the modules can be specified at will. In order to find the best combination of module sizes and way counts, a series of testbenches was run. The following four tables (Tables III, IV, V, and VI) present collections of sizes of the core modules (RS, ROB, PRF, LQ, and SQ) with respect to the number of ways, and Fig. 12 demonstrates the test results in terms of CPI.

TABLE III. DIFFERENT CONFIGURATIONS OF 2-WAY SUPERSCALAR MACHINE
      2-way #1  2-way #2  2-way #3
RS    32        16        16
ROB   32        32        16
PRF   64        64        64
LQ    8         8         8
SQ    8         8         8

TABLE IV. DIFFERENT CONFIGURATIONS OF 3-WAY SUPERSCALAR MACHINE
      3-way #1  3-way #2  3-way #3
RS    16        32        16
ROB   32        32        32
PRF   64        64        64
LQ    8         8         16
SQ    8         8         16

TABLE V. DIFFERENT CONFIGURATIONS OF 4-WAY SUPERSCALAR MACHINE
      4-way #1  4-way #2  4-way #3  4-way #4  4-way #5  4-way #6
RS    64        32        64        32        32        16
ROB   64        32        32        64        64        32
PRF   96        64        64        96        96        64
LQ    16        16        16        16        8         16
SQ    16        16        16        16        8         16

TABLE VI. DIFFERENT CONFIGURATIONS OF 5-WAY SUPERSCALAR MACHINE
      5-way #1  5-way #2  5-way #3  5-way #4
RS    64        64        64        128
ROB   64        64        128       64
PRF   96        96        160       96
LQ    16        24        16        16
SQ    16        24        16        16

Fig. 12 Normalized CPI vs. N-Way Superscalar with Different Configurations

All the module sizes were determined through our design and debugging process. The design was initially implemented as a 3-way superscalar machine; during development, we found that a 32-entry ROB seldom got full, while a 16-entry ROB was frequently filled. Note that the CPI was measured, for brief analysis, at a clock period of 12.5 ns, which is the clock period of 3-way #1; the effect of memory latency is considered later. In Fig. 12, several combinations of module sizes provided promising CPI. By examining the results closely, some conclusions were made:

1. Generally, the CPI decreases as the number of ways goes up. This result meets our expectation, because a higher way count allows more instructions to be executed in parallel. However, none of the 5-way superscalars shows improvement over the 4-way superscalars.
Considering the test sizes (lines of instructions in a test case), this may result from an insufficient number of instructions; i.e., the tests are not large enough to fully utilize the hardware resources provided by a 5-way superscalar.

2. Comparing module sizes among configurations with the same scalar count, a larger ROB is found to be the primary factor for better CPI. A larger ROB gives the pipeline a bigger instruction window, allowing the processor to execute more instructions before committing. As the scalar count rises, more instructions can be fetched in one cycle, so the ROB size should also increase.

3. The size of the LQ/SQ becomes another restriction as the scalar count and ROB size rise. We believe a larger ROB brings more loads and stores into the instruction window; in our design, if the LQ or SQ fills up, the pipeline must stall to wait for memory accesses. Therefore, the LSQ size should grow along with the ROB size.

To determine the final configuration of our design, specifically the scalar count and module sizes, the clock period also had to be tested. Among the configurations listed above, 2-way #1; 3-way #1, #2, and #3; and 4-way #1 and #4 were chosen as candidates. Each candidate was synthesized to measure its actual clock period; for the synthesis setup, we set the map effort to "medium" in the .tcl file. The clock periods are recorded in Table VII.

TABLE VII. CLOCK PERIOD OF DIFFERENT CONFIGURATIONS

      2-way #1   3-way #1   3-way #2   3-way #3   4-way #1   4-way #4
clk   12 ns      12.5 ns    18 ns      20 ns      22 ns      22 ns

The testbenches were then re-run at the post-synthesis clock periods to calibrate the normalized CPI. The results are shown in Table VIII, together with the product of clock period and CPI for each configuration; this product represents the relative Tcpu of each configuration.

TABLE VIII. TCPU OF DIFFERENT CONFIGURATIONS

       2-way #1   3-way #1   3-way #2   3-way #3   4-way #1   4-way #4
Clk    12 ns      12.5 ns    18 ns      20 ns      22 ns      22 ns
CPI    1.11       1          0.972      0.90       0.826      0.826
Prod   13.32      12.5       17.50      18         18.17      18.17

Based on the product of clock period and calibrated CPI, configuration 3-way #1 was chosen as our final design.

F. Forwarding and Performance

In our design, data forwarding logic was implemented extensively to shrink CPI. Forwarding was applied in the RS, PRF, and LSQ. By forwarding execution results, instructions do not need an extra cycle to read register values. For example, if an instruction in the RS requires data from Reg A and the value of Reg A is computed in the same cycle, the forwarding logic will detect the PRF entry index (tag) broadcast on the CDB and allow the instruction to be sent to execution in that same cycle. Without forwarding, the instruction would have to wait in the RS for its ready bits to be updated, which costs another cycle.

However, the forwarding in the RS and LSQ introduces a large delay into the pipeline. This delay is not exposed in the final design, where the critical path lies between the ICache and the fetch stage. In the other designs synthesized in the previous section, however, the RS appeared in every critical path of the post-synthesis reports. To confirm the relation between the forwarding logic and the delay, we removed the forwarding logic from the RS of configuration 3-way #2. The clock period fell to 12.5 ns, the same as configuration 3-way #1, while the normalized CPI rose to 1.06, so overall performance did not improve. This result clearly shows that forwarding in the RS affects the clock period. A similar result appeared after we removed the forwarding in the LSQ: the CPI increased and the overall performance became slightly worse. For higher scalar counts, we recommend disabling all forwarding logic to achieve a better clock period. If a refined pipeline structure were applied, we believe our pipeline could perform better with a scalar count greater than 4.

VI. CONTRIBUTION

Table IX shows the contribution of each group member. The module implementation was divided evenly among the group members; in addition, Jingyuan Sun and Yuanlang Song spent more effort testing the high-level design.

TABLE IX. CONTRIBUTION OF EACH GROUP MEMBER

Member          Responsible Work                  Percentage Contribution
Chuyi Jiang     Id_stage, Local_pred, RAT/RRAT    19%
Di Hu           BTB, Ex_stage                     19%
Jingyuan Sun    RS, ROB                           23%
Yongyi Lu       If_stage, PRF                     19%
Yuanlang Song   LQ/SQ, ICache, Dcache             20%

VII. CONCLUSION

This report presents an R10K-based pipelined processor with an arbitrary-way superscalar pipeline, a local branch predictor, a branch target buffer, a load-store queue, a four-way associative non-blocking data cache and an instruction cache, and advanced data forwarding. The design significantly reduces CPI while keeping the clock period at a reasonable level. With a unified algorithm covering all design configurations in our Verilog implementation, the design can be fully customized in terms of scalar count and the size of every module. By securing correctness and achieving this flexibility, we have successfully fulfilled the project requirement of constructing a functional processor. Future work may include optimizing the selection logic of the RS, improving the forwarding scheme, and designing better branch prediction units.

VIII. ACKNOWLEDGEMENT

This work was finished with the help of Prof. Mark Brehob and GSIs Jonathan Beaumont, William Cunningham, and Jason Shintani. We would like to thank them for their generous help and patient guidance.

REFERENCES

[1] J. Stark, M. D. Brown, and Y. N. Patt, "On pipelining dynamic instruction scheduling logic," in Proc. 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), 2000, pp. 57-66.

APPENDIX A
Table of Hit Rate for Each Test Program of Machine Configured in 3-way #1

APPENDIX B
Table of CPI for Each Test Program of Machine with Different Configurations
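As a closing cross-check of the Tcpu analysis in Section V, the products in Table VIII can be recomputed directly from the reported clock periods and normalized CPIs. The sketch below copies those figures; the helper name `best_config` is our own and not part of the report's tooling.

```python
# Recomputing Table VIII: Tcpu is proportional to clock period x CPI,
# so the configuration with the smallest product is the best choice.
# Clock periods (ns) and normalized CPIs are copied from Table VIII.

CONFIGS = {
    "2-way #1": (12.0, 1.11),
    "3-way #1": (12.5, 1.0),
    "3-way #2": (18.0, 0.972),
    "3-way #3": (20.0, 0.90),
    "4-way #1": (22.0, 0.826),
    "4-way #4": (22.0, 0.826),
}

def best_config():
    """Return (name, clk * CPI) of the configuration with minimal Tcpu."""
    products = {name: clk * cpi for name, (clk, cpi) in CONFIGS.items()}
    winner = min(products, key=products.get)
    return winner, round(products[winner], 2)
```

Running `best_config()` reproduces the report's choice of 3-way #1, with a clock-CPI product of 12.5.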
© Copyright 2024