Pipelining The Nios II

Learning Goal: Processor pipeline.
Requirements: Quartus, ModelSim, FPGA4U, Nios2Sim.
Version 1.6 of 9th May 2015, EPFL ©2015

1 Introduction

During this lab, you will create a relatively simple pipelined version of the Nios II. Start from the multi-cycle version (template), reorganize the components, add some register stages, and modify the Program Counter and the Controller.

2 The CPU Pipeline

You must implement a relatively simple 5-stage pipeline.
• Harvard architecture.
• 5 stages (Fetch, Decode, Execute, Memory, and Writeback).
• All instructions go through these 5 stages, even if some of them are not used.
• There are no forwarding paths, stalls or flushes.
• The branch instructions (including br) have 2 delay slots.
• The jump instructions (e.g., jmp, call, ret) have a single delay slot.

In the multi-cycle version of the Nios II (shown in the following figure), instruction and data memory accesses use the same memory port, since these accesses never occur at the same time.

[Figure: multi-cycle system. The CPU (clk, reset_n, read, write, 16-bit address, 32-bit rddata and wrdata) shares a single memory port with the ROM (4KB), the RAM (4KB), the LEDs and the Buttons. A Decoder generates ROM_cs, RAM_cs, LEDs_cs and Buttons_cs. The ROM and the RAM are addressed by address11..2, the LEDs by address3..2 and the Buttons by address2. LEDs_out is 96 bits wide and Buttons_in is 4 bits wide.]

In the pipelined version of the Nios II, the processor reads the instruction memory at every cycle. Therefore, it is not possible to use a single memory port without stalling the pipeline when a data access occurs. To simplify the implementation and avoid stalling, we will switch to a Harvard architecture (i.e., separate ports for instructions and data). The original single memory port of the multi-cycle processor (i.e., the one connected to the ROM, the RAM and the peripherals) will become the data port.
An instruction port should be introduced and connected to a duplicate copy of the ROM.

[Figure: Harvard system. The instruction port of the CPU (I_addr, I_rddata) is connected to the Instruction ROM (4KB). The data port (D_addr, D_read, D_write, D_rddata, D_wrdata) is connected through the Decoder to the Data ROM (4KB), the RAM (4KB), the LEDs and the Buttons.]

The following figure illustrates the new CPU entity.

[Figure: CPU entity with ports clk, reset_n, I_addr (16 bits), I_rddata (32 bits), D_addr (16 bits), D_read, D_write, D_rddata (32 bits) and D_wrdata (32 bits).]

The following figure shows what the architecture of your pipeline could be. You can see that most of the components implemented for the multi-cycle processor can be reused without any modification. The function of each of the five stages is very close to the states of the multi-cycle processor.
• Fetch: The next instruction and the next instruction address (i.e., PC+4) are fetched into the first register stage.
• Decode: The instruction is decoded by the Controller, which generates the control signals for the next stages and also for the PC, so that in the case of a jump instruction, the PC is directly updated during the Decode stage. Additionally, the register operands are read from the Register File and stored in the next register stage.
• Execute: This is where the ALU operations occur. It is during this cycle that the decision whether to take the branch is made. The next instruction address and the immediate value are sent to the PC. The memory control signals are sent to the memory port, so that in the case of a read, the data is received during the next stage. The result of the ALU and some of the control signals are stored in the next register stage.
• Memory: In the case of a ldw instruction, the data from memory is stored in the next stage.
• Writeback: The control signals are sent to the Register File to write the result of the instruction.

[Figure: possible architecture of the pipelined CPU. The PC, the Register File, the Extend unit, the ALU and the Controller are separated by register stages. Instruction bits 31..27 and 26..22 address the Register File (aa, ab), bits 21..6 provide imm16, bits 16..11 provide opx and bits 5..0 provide op. The instruction port and the data port leave the CPU separately.]

The Program Counter and the Controller are the only modules that need to be modified. The following subsections describe what must be done.

2.1 The Program Counter

• In this pipelined version of the Nios II, the PC is always enabled and loads a new address at every cycle. Therefore, you must remove the en input signal. The following figure shows the modified entity.

[Figure: modified PC entity with ports clk, reset_n, sel_a, sel_imm, branch, a, d_imm, e_imm, pc_addr, next_addr and addr. The a, d_imm, e_imm, pc_addr, next_addr and addr ports are 16 bits wide.]

• The a and d_imm input values must be provided by the Decode stage with their corresponding control signals (i.e., sel_a and sel_imm).
• Reduce the latency of the pipeline by providing the addr signal to the ROM directly after the next address selection, and before the counter register (see the following figure).
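The next-address selection described in this subsection can be cross-checked against a small behavioural model when reading ModelSim waveforms. The Python sketch below is one plausible reading of the PC, not the reference design. The function name next_fetch_address, the argument pc_reg (the value held in the PC register), the address 0 presented during reset and the priority of branch over sel_imm over sel_a are assumptions. The signal names, the shift of d_imm by 2, the constant 4 and the sum of pc_addr and e_imm are read from the lab text and figures.

```python
MASK16 = 0xFFFF  # all PC address ports are 16 bits wide


def next_fetch_address(pc_reg, reset_n=True, branch=False, sel_imm=False,
                       sel_a=False, a=0, d_imm=0, e_imm=0, pc_addr=0):
    """Model of the address sent to the instruction ROM in one cycle.

    pc_reg is the address currently held in the PC register (hypothetical
    name); the other arguments mirror the ports of the PC entity.
    """
    if not reset_n:
        # Assumption: address 0 is presented while reset is asserted.
        return 0
    if branch:
        # Taken branch resolved in Execute: next address of the branch
        # (pc_addr) plus the immediate. The 16-bit wrap handles negative
        # offsets given in two's complement.
        return (pc_addr + e_imm) & MASK16
    if sel_imm:
        # Jump to an immediate target from Decode (d_imm shifted by 2).
        return (d_imm << 2) & MASK16
    if sel_a:
        # Jump to a register value read in Decode (e.g., jmp, ret).
        return a & MASK16
    # Default: sequential fetch.
    return (pc_reg + 4) & MASK16
```

For example, a taken branch with pc_addr = 0x0024 and e_imm = 0xFFF8 (that is, -8) yields 0x001C, whereas with no control signal asserted and pc_reg = 0x0010 the model returns 0x0014.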
[Figure: internal structure of the modified PC. The register is always enabled (En tied to '1'). The addr output is taken from the next-address selection in front of the register. Recoverable elements: an adder with the constant 0x0004, an adder fed by e_imm, a left shift by 2 on d_imm, a three-input multiplexer (00, 01, 10) controlled by sel_imm and sel_a, and multiplexers controlled by branch and reset_n.]

2.2 The Controller

• Now that we have a Harvard architecture, the sel_addr control signal is no longer needed.
• Since the PC is always enabled, the pc_en signal is no longer needed either.
• The pc_add_imm signal cannot be provided directly by the Controller. Instead, it is generated in the Execute stage. In the new Controller, the pc_add_imm signal corresponds to the branch signal.
• The Controller becomes asynchronous (purely combinational). You have to remove the state machine and compute the control signals combinationally.
• You can ignore the break instruction.

The following figure shows the modified controller entity.

[Figure: modified Controller entity with ports op, opx, op_alu (6 bits each), imm_signed, sel_b, read, write, sel_pc, branch_op, sel_mem, pc_sel_imm, pc_sel_a, sel_ra, rf_retaddr (5 bits), sel_rC and rf_wren.]

2.3 The System

You have to modify the System to provide separate memory storage for the data and the instructions.
• Connect the data port of the CPU to the ROM, the RAM and the peripherals.
• Make a copy of the ROM module and connect it to the instruction port.
• For simplicity, and because it is the only module connected to the instruction port, we remove the cs input signal of the instruction ROM. The read input can also be removed, because the ROM is read during each cycle.

3 Exercise

• Make a copy of your multi-cycle processor project. You will use this project to implement the pipelined processor. If you don’t have a version of the multi-cycle processor, you can use the provided project template, which includes a complete multi-cycle processor without any interrupt signals.
• Modify the PC and the Controller, and create the necessary modules and register stages.
• To verify your design, write a simple program in Nios2Sim.
This program should call a procedure, do some branches, and give some feedback (through the LEDs, for example).
• Don’t forget that the current pipeline does not handle data hazards, and that there are delay slots after the branch and jump instructions. Insert nop instructions where necessary.
• Generate the hex file. Compile your design and program your FPGA.
• If the program is not executed properly, simulate your design with ModelSim. You only have to provide the reset_n and clk signals to your system. The processor will execute the instructions defined in the hex file of the ROM.
• Would it be difficult to flush the first stages of the pipeline in the case of a jump or a branch? Think of a solution and propose it to an assistant. If you have the time, try to implement it.

4 Submission

To get points for this lab, you should submit your files (CPU.vhd, PC.vhd and controller.vhd) and demonstrate your work to one of the assistants. The names of the ports should be the same as the ones on the schema in Section 2.
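When writing the test program for Section 3, it can help to check which instructions actually execute around a branch or a jump. The Python sketch below applies the delay-slot rules of Section 2 (2 slots after a branch, 1 after a jump) to a list of instructions. The function executed_order, the tuple layout and the sample program are illustrative assumptions. Every branch is treated as taken, and control transfers placed inside delay slots are not modelled.

```python
DELAY_SLOTS = {"branch": 2, "jump": 1}  # rules stated in Section 2


def executed_order(program, max_steps=32):
    """Return the instructions in the order in which they execute.

    program is a list of (text, kind, target) tuples, where kind is
    "op", "branch" or "jump" and target is an index into the list
    (None for "op"). This layout is hypothetical.
    """
    order = []
    pc = 0
    pending = None  # (delay slots left, redirect target) or None
    while pc < len(program) and len(order) < max_steps:
        text, kind, target = program[pc]
        order.append(text)
        if pending is not None:
            slots_left, dest = pending
            slots_left -= 1
            if slots_left == 0:
                # Last delay slot done: fetch continues at the target.
                pc, pending = dest, None
                continue
            pending = (slots_left, dest)
        elif kind in DELAY_SLOTS:
            pending = (DELAY_SLOTS[kind], target)
        pc += 1
    return order


# Sample program: the two nops fill the delay slots of br.
PROGRAM = [
    ("addi t0, zero, 1", "op", None),
    ("br skip", "branch", 5),
    ("nop", "op", None),
    ("nop", "op", None),
    ("addi t0, zero, 2", "op", None),
    ("skip: addi t1, zero, 3", "op", None),
]

print(executed_order(PROGRAM))
```

With the sample program, the model lists addi t0, zero, 1, then br skip and the two nop instructions, and then continues at skip, so addi t0, zero, 2 is never executed. Replacing a nop with a useful instruction shows that it still runs after the br.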