Pipelining The Nios II - Processor Architecture Laboratory

Learning Goal: Processor pipeline.
Requirements: Quartus, ModelSim, FPGA4U, Nios2Sim.
1 Introduction
During this lab, you will create a relatively simple pipelined version of the Nios II. Start from the multi-cycle version (template), reorganize the components, add some register stages, and modify the Program Counter and the Controller.
2 The CPU Pipeline
You must implement a relatively simple 5-stage pipeline.
• Harvard architecture.
• 5 stages (Fetch, Decode, Execute, Memory, and Writeback).
• All instructions go through these 5 stages, even if some of them are not used.
• There are no forwarding paths, stalls or flushes.
• The branch instructions (including br) have 2 delay slots.
• The jump instructions (e.g., jmp, call, ret) have a single delay slot.
In the multi-cycle version of the Nios II (shown in the following figure), instruction and data memory
accesses use the same memory port, since these two kinds of access never occur at the same time.
[Figure: multi-cycle Nios II system. The single memory port of the CPU (address, rddata, wrdata, read, write) is connected through an address Decoder (ROM_cs, RAM_cs, LEDs_cs, Buttons_cs) to the ROM (4KB), the RAM (4KB), the LEDs and the Buttons.]
Version 1.6 of 9th May 2015, EPFL ©2015
In the pipelined version of the Nios II, the processor reads the instruction memory at every cycle.
Therefore, it is not possible to use a single memory port without stalling the pipeline when a data access
occurs.
To simplify the implementation and avoid stalling, we will switch to a Harvard architecture (i.e.,
separate ports for instructions and data). The original single memory port of the multi-cycle
processor (i.e., the one connected to the ROM, the RAM and the peripherals) becomes the data port.
An instruction port must be added and connected to a duplicate copy of the ROM.
[Figure: pipelined Nios II system. The instruction port of the CPU (I_addr, I_rddata) is connected to the Instruction ROM (4KB). The data port (D_addr, D_rddata, D_wrdata, D_read, D_write) is connected through the address Decoder to the Data ROM (4KB), the RAM (4KB), the LEDs and the Buttons.]
The following figure illustrates the new CPU entity.
[Figure: CPU entity with ports clk, reset_n, I_addr (16 bits), I_rddata (32 bits), D_addr (16 bits), D_read, D_write, D_rddata (32 bits) and D_wrdata (32 bits).]
The following figure shows what the architecture of your pipeline could be. You can see that most
of the components implemented for the multi-cycle processor can be reused without any modification.
The function of each of the five stages is very close to that of the states of the multi-cycle processor.
• Fetch: The next instruction is fetched, and it is stored together with the next instruction address
(i.e., PC+4) in the first register stage.
• Decode: The instruction is decoded by the Controller, which generates the control signals for the
next stages and also for the PC, so that in the case of a jump instruction, the PC is directly updated
during the Decode stage. Additionally, the register operands are read from the Register File and
stored in the next register stage.
• Execute: This is where the ALU operations occur. The decision whether to take a branch is made
during this cycle. The next instruction address and the immediate value are sent to the PC.
The memory control signals are sent to the memory port, so that in the case of a read, the data is
received during the next stage. The result of the ALU and some of the control signals are stored in
the next register stage.
• Memory: In the case of a ldw instruction, the data from memory is stored in the next stage.
• Writeback: The control signals are sent to the Register File to write the result of the instruction.
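The stage list above can be visualised with a small behavioural model. The following Python sketch is only an illustration and not part of the lab deliverables: the function name occupancy and the instruction labels are hypothetical, and the model assumes an ideal pipeline in which every instruction advances by one stage per cycle (no stalls or flushes, as stated above).

```python
# Stage names in pipeline order, as listed in this handout.
STAGES = ["F", "D", "E", "M", "W"]

def occupancy(program):
    """Return, for each cycle, which instruction sits in which stage.

    program is a list of instruction labels in fetch order.
    A stage that holds no instruction is reported as None.
    """
    n_cycles = len(program) + len(STAGES) - 1
    table = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - depth  # instruction fetched `depth` cycles ago
            row[stage] = program[idx] if 0 <= idx < len(program) else None
        table.append(row)
    return table

if __name__ == "__main__":
    for cycle, row in enumerate(occupancy(["br", "i1", "i2", "target"])):
        print(cycle, row)
```

In this model, when the first instruction reaches Execute, the two instructions fetched after it are already in Decode and Fetch, which matches the two delay slots of the branch instructions. An instruction resolved in Decode has only one instruction behind it, which matches the single delay slot of the jump instructions.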
[Figure: proposed architecture of the pipelined CPU. The PC, the Controller, the Extend unit, the Register File and the ALU are separated by the register stages F, D, E, M and W. The external ports are I_address, I_rddata, D_address, D_rddata, D_wrdata, D_read and D_write.]
The Program Counter and the Controller are the only modules that need to be modified. The
following subsections describe what must be done.
2.1 The Program Counter
• In this pipelined version of the Nios II, the PC is always enabled and loads a new instruction at
every cycle. Therefore, you must remove the en input signal. The following figure shows the
modified entity.
[Figure: modified PC entity with ports clk, reset_n, sel_a, sel_imm, branch, a (16 bits), d_imm (16 bits), e_imm (16 bits), pc_addr (16 bits), next_addr (16 bits) and addr (16 bits).]
• The a and d_imm input values must be provided by the Decode stage with their corresponding control
signals (i.e., sel_a and sel_imm).
• Reduce the latency of the pipeline by providing the addr signal to the ROM directly after the next
address selection, and before the counter register (see the following figure).
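[Figure: internal structure of the PC. It contains three adders, the constant 0x0004, a shift left by 2, multiplexers controlled by reset_n, branch, sel_imm and sel_a, and the address register, whose enable input is tied to '1'. The inputs are a, d_imm, e_imm and pc_addr; the outputs are next_addr and addr, with addr taken before the address register.]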
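The next-address selection described in this subsection can be checked against a small reference model before writing the VHDL. The Python sketch below is one possible reading, not the official solution. It assumes that a taken branch has priority over the Decode-stage selections, that the branch target is pc_addr + e_imm, that sel_imm selects d_imm shifted left by 2, that sel_a selects a, that next_addr is the registered address plus 4, and that addr is forced to 0 while reset_n is low. The function name pc_next is hypothetical.

```python
MASK16 = 0xFFFF  # the PC signals are 16 bits wide

def pc_next(reg, reset_n, branch, sel_imm, sel_a, a, d_imm, e_imm, pc_addr):
    """Model one cycle of the PC.

    reg is the current content of the address register.
    Returns (addr, next_addr); addr is sent to the instruction ROM
    and is also the value loaded into the register at the clock edge.
    """
    next_addr = (reg + 4) & MASK16           # assumed: registered address + 4
    if not reset_n:
        addr = 0                             # assumed reset behaviour
    elif branch:
        addr = (pc_addr + e_imm) & MASK16    # branch taken (Execute stage)
    elif sel_imm:
        addr = (d_imm << 2) & MASK16         # immediate jump (Decode stage)
    elif sel_a:
        addr = a & MASK16                    # register jump (Decode stage)
    else:
        addr = next_addr                     # sequential fetch
    return addr, next_addr
```

Because addr is computed before the register, the ROM already sees the selected address in the cycle in which the jump or branch is resolved, which is what saves one cycle of latency.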
2.2 The Controller
• Now that we have a Harvard architecture, the sel_addr control signal becomes useless.
• Since the PC is always enabled, the pc_en signal becomes useless as well.
• The pc_add_imm signal cannot be provided directly by the Controller. Instead, it is generated in
the Execute stage. In the new design, the pc_add_imm signal corresponds to the branch signal.
• The Controller becomes asynchronous. You have to remove the state machine and compute the
control signals combinationally.
• You can ignore the break instruction.
The following figure shows the modified controller entity.
[Figure: modified Controller entity with ports op, opx, imm_signed, sel_b, op_alu, read, write, sel_pc, branch_op, sel_mem, pc_sel_imm, pc_sel_a, sel_ra, rf_retaddr, sel_rC and rf_wren.]
2.3 The System
You have to modify the System to provide separate memory storage for the data and the instructions.
• Connect the data port of the CPU to the ROM, the RAM and the peripherals.
• Make a copy of the ROM module and connect it to the instruction port.
• For simplicity, and because it’s the only module connected to the instruction port, we remove the
cs input signal of the instruction ROM. The read input can also be removed, because the ROM is
read during each cycle.
3 Exercise
• Make a copy of your multi-cycle processor project. You will use this project to implement the
pipelined processor. If you don’t have a version of the multi-cycle processor, you can use the
provided project template, which includes a complete multi-cycle processor without any interrupt
signals.
• Modify the PC and the Controller, and create the necessary modules and register stages.
• To verify your design, write a simple program in Nios2Sim. This program should call a procedure,
do some branches, and give some feedback (through the LEDs for example).
• Don’t forget that the current pipeline does not handle data hazards, and that there are delay
slots after the branch and jump instructions. Insert nop instructions where necessary.
• Generate the hex file. Compile your design and program your FPGA.
• If the program is not executed properly, simulate your design with ModelSim. You only have
to provide the reset_n and clk signals to your system. The processor will execute the instructions
defined in the hex file of the ROM.
• Would it be difficult to flush the first stages of the pipeline in the case of a jump or a branch? Think
of a solution and propose it to an assistant. If you have the time, try to implement it.
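The delay-slot rule from the bullets above (two slots after a branch, one after a jump) can be applied mechanically to a test program. The following Python sketch is a hypothetical helper, not a tool provided with the lab. It assumes one instruction per line with no label on the same line, it fills every delay slot with a nop rather than with useful instructions, and the two mnemonic sets are an assumption about which instructions your CPU implements. It does not deal with data hazards.

```python
# Assumed mnemonic sets; adjust them to the instructions your CPU supports.
BRANCHES = {"br", "beq", "bne", "bge", "bgeu", "blt", "bltu"}
JUMPS = {"jmp", "jmpi", "call", "callr", "ret"}

def pad_delay_slots(lines):
    """Return a copy of the program with nop-filled delay slots.

    Two nops are inserted after each branch and one after each jump.
    """
    out = []
    for line in lines:
        out.append(line)
        words = line.split()
        mnemonic = words[0] if words else ""
        if mnemonic in BRANCHES:
            out.extend(["nop", "nop"])   # branch resolved in Execute
        elif mnemonic in JUMPS:
            out.append("nop")            # jump resolved in Decode
    return out

if __name__ == "__main__":
    program = ["call f", "addi t0, zero, 1", "br loop"]
    print("\n".join(pad_delay_slots(program)))
```

Filling the slots with nop is the simplest safe choice. Once the program works, some of the nops can be replaced by instructions that must execute regardless of the branch outcome.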
4 Submission
To get points for this lab, you should submit your files (CPU.vhd, PC.vhd and controller.vhd) and
demonstrate your work to one of the assistants. The port names must match those shown in the
figures of Section 2.