Exercise: OpenRISC Programming
Increasing efficiency of the OpenRISC core with
simple instruction extensions
23.03.2015
Michael Gautschi
Antonio Pullini
Integrated Systems Laboratory
Introduction
• All exercises will be performed on the following configuration:
– 4 OpenRISC or10n cores
• With private instruction cache (I$)
– 8 banks of shared level 1 memory (TCDM)
– Large L2 memory for instructions and data
– Since the focus of this exercise is on the core, only one core is used and
the others are put to sleep.
[Figure: block diagram of the cluster with the or10n processor cores; the cores that are not used are sleeping]
Exercise Overview
1. Introduction example
– Compile & execute Helloworld on the Instruction Set Simulator (ISS)
2. Simulator Basics
– Analyze ISS output and debug
3. Benchmarking
– Analyze performance improvements of the new instructions using the simulator
4. RTL simulations and benchmarking
– Compare to simulator
5. Unaligned memory accesses
– Program a simple stencil and show the benefit of unaligned memory accesses
6. Auto vectorization
– Analyze the improvements of vector operations on a matrix addition
7. Interrupts and events
– Compare interrupts and event response times
Getting Started – 1/2
• Copy data from master account:
$ mkdir 2_OpenRISC
$ cp /home/soc_master/2_OpenRISC/openrisc.tar.gz 2_OpenRISC/.
$ cd 2_OpenRISC
$ tar -xzf openrisc.tar.gz
[Figure: 2_OpenRISC directory]
• Source the instruction set simulator
$ source /home/soc_master/germain/pulp-sdk2014.1-ub14.10_2/env/setup.sh
• We will be working in the software (sw) and
build directories
– sw-dir:
• contains application sourcecode
– build-dir:
• Holds compiler and simulator output
• Simulator and RTL-simulations will be run here
– sim-dir:
• Contains precompiled code for RTL-simulation
Getting Started – 2/2
• We will be working on the scratch disk because we are going to generate some data
1. Create a build directory and set up the compiler
• $ mkdir /scratch/soc_xx/build_id
2. Start the compiler in a new shell:
• $ or1k -v1.7.3 xterm
• $ tcsh (switch to the C shell)
3. Configure the build directory in the new shell
• $ cd /scratch/soc_xx/build_id
• $ cp /home/soc_xx/2_OpenRISC/build/configure_template.tcl /scratch/soc_xx/build_id/configure.tcl
• In the configure script: Set the path to your exercise folder:
• OPENRISC_DIR="/home/soc_xx/2_OpenRISC"
• $ ./configure.tcl
You have successfully set up the build directory!
– Let's get started with exercise 1!
Exercise 1 – Introduction
a) The build directory is created, the compiler is configured, and the simulator is set up. We are ready to start with a simple helloworld application.
b) Compile Helloworld.c
• Helloworld.c is located in sw/apps/sequential_tests/helloworld/.
• To compile the application, enter the build folder and run the makefile:
$ cd build
$ make helloworld : to generate an executable for the simulator
$ make helloworld.read : to generate the assembly
$ make helloworld.slm.cmd : to generate input data for RTL simulations
c) Run Helloworld.c
• The simulator is called "pulp-run" and can be started like this:
$ pulp-run --load-binary=./apps/sequential_tests/helloworld/helloworld:0xf
• The console should output "helloworld"
• Output is also written to the file: stdout/stdout_pe0
Exercise 2 – Simulator basics 1/4

• Simulator commands to run an application
– Run an application:
$ pulp-run --load-binary=./applicationName:0xf
– Get assembly traces:
$ pulp-run --load-binary=./applicationName:0xf --iss-trace
• Enter debug mode
– Force entering the debugger in the beginning:
$ pulp-run --load-binary=./applicationName:0xf --pdb-break
– Or press "Ctrl+C" during execution
– Check current status of core:
(Cmd) state
: bootaddress = 0x1c000000
• Breakpoints and inspection of memory (in debug mode)
– Set a breakpoint on memory access:
(Cmd) bkp address region rw
: address = 32 bit memory address
: default region = 0x4 (1 word)
: rw = read or write access (default rw)
– Display all breakpoints:
(Cmd) bkp_list
– Remove a breakpoint:
(Cmd) bkp_dis ID
: ID shown in bkp_list
– Inspect memory:
(Cmd) mem_dump address size
: 32 bit addresses
: default size = 0x4 (1 word)
Exercise 2 – Simulator basics 2/4
• Analyze the rijndael application (AES en/decryption)
– Generate the assembly:
$ make rijndael.read
– Have a look at the assembly file rijndael.read
• The most important functions are:
– main, encrypt, decrypt, compute_aes, encfile, decfile
[Figure: annotated excerpt of rijndael.read showing the function header, the PC, the instruction in big and little endian format, the disassembled instructions, and absolute and relative jump/branch targets]
Exercise 2 – Simulator basics 3/4
• Simulator traces:
[Figure: annotated trace excerpt showing the PC, the latency of the instruction, new register values and effective addresses, and the disassembled instructions]
• To trace an application:
– Run the application with the option --iss-trace
– Four traces are generated, one for each core
• We are actually using a four-core configuration of which three cores are sleeping
– trace_cluster0_core0.log shows the interesting traces!
Exercise 2 – Simulator basics 4/4
• Tasks:
– Run the rijndael application with traces and make use of the debugger and breakpoints
– Which instructions take more than one cycle to execute?
– Why are some load/store instructions (l.lwz, l.sw) taking more than one cycle?
– Do you see an example of a data hazard?
– How many times is address 0x1c008000 read?
– What is the encryption key?
Exercise 3 – Benchmarking 1/5
• Intro:
– The timer can be used to count execution time (in cycles)
– Check matrixMul.c for an example
– Include <timer.h> and use the functions:
• reset_timer()
• start_timer()
• stop_timer()
• get_time()
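The measurement pattern with these four functions can be sketched as follows. This is a host-side illustration only: the slides do not show the signatures in <timer.h>, so the stand-in functions below, the `tick()` helper, and the `measure_sum()` wrapper are assumptions made so the sketch compiles without the SDK. On the target, the real <timer.h> would be included instead.

```c
#include <assert.h>

/* Host-side stand-ins for the <timer.h> functions named above.
   Their real signatures are not shown in the exercise, so these
   are assumptions; on the target, include <timer.h> instead. */
unsigned int fake_cycles;   /* hypothetical software cycle counter */
int timer_running;

void reset_timer(void) { fake_cycles = 0; }
void start_timer(void) { timer_running = 1; }
void stop_timer(void)  { timer_running = 0; }
unsigned int get_time(void) { return fake_cycles; }

/* Pretend that one loop iteration costs one cycle while the timer runs. */
void tick(void) { if (timer_running) fake_cycles++; }

/* Kernel under test: sum of n integers. */
int sum(const int *v, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += v[i];
        tick();
    }
    return acc;
}

/* Measurement pattern: reset, start, run the kernel, stop, read. */
unsigned int measure_sum(const int *v, int n, int *result)
{
    assert(v != 0 && result != 0 && n >= 0);
    reset_timer();
    start_timer();
    *result = sum(v, n);
    stop_timer();
    return get_time();
}
```

Only the code between start_timer() and stop_timer() is counted, so setup and printing should stay outside that region.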
• Hardware loops
– To run the application with different hardware loop settings, the compiler needs to be reconfigured
– There are two options:
1. Create a new build folder with a new configure.tcl script
2. Keep the current folder and modify its configuration script configure.tcl
– To support different numbers of hardware loops, add the following to TARGET_C_FLAGS in configure.tcl:
-mmax-hwloops=<number_of_hwlp>
: number_of_hwlp in [0,4]
: remove the option -mno-hwlp to enable hwlps
– Reconfigure build folder:
$ ./configure.tcl
• The compiler will generate the following hwloop instructions to produce efficient loops:
- lp.start
- lp.end
- lp.count
- lp.counti
- lp.setup
- lp.setupi
Exercise 3 – Benchmarking 2/5
• Tasks:
– Check if hardware loops are generated (in the matrixMul.read file)
– How many instructions are actually saved? Compare the matrixMul.read with and w/o hwloops.
– Estimate the speedup with 1, 2, 4 hardware loops
– Do your measurements match your estimations?
• Why is the benefit of the first one higher?
Execution time: (# cycles/ % improvement)
Baseline
Hardware loop (1 register set)
2 register sets
3 register sets
4 register sets
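To fill in the "% improvement" entries, the cycle counts can be converted with a small helper such as the sketch below. The function names are hypothetical, and the numbers in the usage note are illustrative, not measurements.

```c
#include <assert.h>

/* Percentage improvement of a configuration over the baseline:
   positive when the new configuration needs fewer cycles. */
double improvement_percent(unsigned long baseline_cycles,
                           unsigned long new_cycles)
{
    assert(baseline_cycles > 0);
    return 100.0 * ((double)baseline_cycles - (double)new_cycles)
                 / (double)baseline_cycles;
}

/* Speedup factor: how many times faster the new configuration runs. */
double speedup(unsigned long baseline_cycles, unsigned long new_cycles)
{
    assert(new_cycles > 0);
    return (double)baseline_cycles / (double)new_cycles;
}
```

For example, a made-up baseline of 1000 cycles against 750 cycles with hardware loops gives an improvement of 25 %.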
Exercise 3 – Benchmarking 3/5
• Pre/post increment immediate:
– Activated by default!
– Deactivate with -mno-idxls
• Pre/post increment register:
– From a hardware perspective, what is the drawback of this instruction?
– Deactivate with -mno-rrls
• Multiply-accumulate instruction:
– Old architecture:
• Accumulation register stored in a special register which is not in the register file
• Accumulation result can be accessed in two cycles
– New architecture:
• Accumulates directly on the register file
• Enable new MAC with -mmac3
[Figure: old MAC and new MAC]
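As an illustration of the kind of loop these extensions target, here is a minimal dot product in C. Each iteration loads two operands while advancing the pointers (a candidate for post-increment loads) and accumulates a product (a candidate for a MAC). Whether the compiler actually emits those instructions for this exact code is an assumption that should be checked in the generated .read file; the function name is hypothetical.

```c
#include <assert.h>

/* Dot product of two vectors of short integers.
   The body "acc += (*a++) * (*b++)" combines two loads with pointer
   post-increment and one multiply-accumulate per iteration. */
int dot_product(const short *a, const short *b, int n)
{
    assert(a != 0 && b != 0 && n >= 0);
    int acc = 0;
    while (n-- > 0) {
        acc += (*a++) * (*b++);
    }
    return acc;
}
```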
Exercise 3 – Benchmarking 4/5
• Vector Instructions:
– Add, sub, mul, mac, comparisons are all supported in vector mode
– In parallel one can compute:
• One word
• Two halfwords, or
• Four bytes
– Enable the vector extensions with the compiler options:
• -mlv32 for vector generation
• -munaligned-ls for unaligned memory accesses
– Check in the .read file whether vector code is generated. Vector instructions have the format:
• lv.{mul,mac,sub,add,…}
• Tasks:
– Run the matrixMul application with the different compiler options
– Summarize your results in the first column on the next page
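What "four bytes in parallel" means can be emulated in plain C. The sketch below adds four unsigned bytes packed into one 32-bit word, lane by lane, without letting a carry cross into the neighbouring lane. It is not the lv.add instruction itself: the wrap-around behaviour per lane and the function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Lane-wise add of four unsigned bytes packed into 32-bit words.
   Each lane wraps around modulo 256 (assumed behaviour). */
uint32_t add4_bytes(uint32_t x, uint32_t y)
{
    /* Add the low 7 bits of every lane; carries cannot leave a lane. */
    uint32_t low7 = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);
    /* Fold in the top bit of every lane without propagating a carry. */
    return low7 ^ ((x ^ y) & 0x80808080u);
}

/* Straightforward reference: extract, add, and repack each lane. */
uint32_t add4_bytes_ref(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (x >> (8 * lane)) & 0xffu;
        uint32_t b = (y >> (8 * lane)) & 0xffu;
        r |= ((a + b) & 0xffu) << (8 * lane);
    }
    return r;
}

/* Cross-check the two implementations on one input pair. */
void add4_self_check(void)
{
    assert(add4_bytes(0x01ff7f80u, 0x01017f80u) ==
           add4_bytes_ref(0x01ff7f80u, 0x01017f80u));
}
```

A vector unit performs the lane-wise operation in a single instruction, which is where the expected speedup over four scalar byte additions comes from.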
Exercise 3 – Benchmarking 5/5
Execution time: (# cycles / % improvement)
                        Simulator:    RTL-SIM:
Baseline
Hardware-loop
Pre/post incr. imm.
Pre/post incr. reg
mac
Vector
Unaligned access
• Can you explain your results?
• Are the results you obtained optimal?
• What can be done better?
– Try to improve the matrix multiplication such that vectors are used more
efficiently
Exercise 4 – RTL Simulation 1/2
• The simulator is not 100% accurate because:
– It is still under development
– Not all components are modeled (no caches, no memory contentions, simplified DMA)
• But it is a lot faster than RTL simulation
• Start RTL simulations with the Makefile in your build folders:
– $ make testName_vsim
• Rerun the matrixMul in the RTL simulation and complete the table
– If you have a build folder for each configuration, you can run the
simulations in parallel (each RTL-sim will take ~5min)
– Are you able to reproduce the previously obtained results? Do you
observe any differences?
– Do you have an explanation?
Exercise 4 – RTL Simulation 2/2
• In matrixMul.c the function computeGold(), which actually computes the multiplication, is called twice with the same inputs and outputs!
[Figure: code snippet of matrixMul.c]
• Why do you think this is the case?
– Compare the execution time of the first iteration between the RTL simulation and the simulator!
– What do you observe?
Exercise 5 – Unaligned Memory Access
• Unaligned memory access
– If data is not aligned, for example when computing the stencil over an image, many memory accesses and shifts are required.
– If data can be accessed in an unaligned fashion, the shifting is not required anymore.
– The or10n core supports unaligned accesses in two consecutive cycles!
• Tasks:
– Complete the stencil template program (sw/apps/sequential_tests/stencil/stencil.c) which computes four stencils in parallel over a 32x32 pixel image (pixel = 1 char)
– Do you observe a speedup with unaligned memory access enabled?
– Compare to a vector-only implementation
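The cost of missing unaligned support can be illustrated in C. The first function below fetches four consecutive pixels starting at an arbitrary byte offset using only aligned word loads plus shifts; the second shows the value a single unaligned word load would deliver. The sketch assumes pixels are packed with pixel 0 in the most significant byte of each word (big-endian order, as on OpenRISC), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Four pixels starting at byte offset `off`, using aligned loads only.
   Needs two loads, two shifts, and an OR whenever `off` is not a
   multiple of 4. The caller must make sure row[off / 4 + 1] exists
   in that case. */
uint32_t load4_aligned_only(const uint32_t *row, unsigned off)
{
    assert(row != 0);
    unsigned word  = off / 4;
    unsigned shift = 8 * (off % 4);
    if (shift == 0)
        return row[word];                      /* already aligned */
    return (row[word] << shift) | (row[word + 1] >> (32 - shift));
}

/* Value that one unaligned word load at `off` would return
   (big-endian byte order assumed), built from bytes for reference. */
uint32_t load4_unaligned_ref(const uint8_t *pix, unsigned off)
{
    assert(pix != 0);
    return ((uint32_t)pix[off]     << 24) |
           ((uint32_t)pix[off + 1] << 16) |
           ((uint32_t)pix[off + 2] <<  8) |
            (uint32_t)pix[off + 3];
}
```

With hardware support, the whole first function collapses into one load instruction, which is the saving the stencil exercise is meant to expose.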
[Figure: image with 1 stencil, and 4 parallel stencils computed with the vectorial unit]
Exercise 6 – Auto Vectorization
• In the matrixMul example, the automatic vector support did not bring the desired speedup
• If the matrices are accessed in a more regular fashion, this is expected to change!
• Run the matrixAdd{8,16,32} applications in RTL simulation and measure the speedup.
– matrixAdd8 is based on characters
– matrixAdd16 on short integers
– matrixAdd32 on integers
• Run the motion_detection application and compare first the execution time of the baseline and the optimized architecture without vector support.
– And then add first unaligned memory access, and then vector instructions
Execution time: (# cycles / % improvement)
                    matrixAdd32    matrixAdd16    matrixAdd8    motion_detection
Baseline
Without vector
Unaligned access
With vector
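A "regular" access pattern looks like the element-wise addition below: all three arrays are walked with stride one, so neighbouring elements can be grouped into one vector operation. This is only a sketch in the spirit of matrixAdd8; the actual source of the exercise applications is not shown here, so the function name and signature are assumptions.

```c
#include <assert.h>

/* Element-wise addition of two n x n matrices of characters,
   stored row by row. Every array is accessed with stride one. */
void matrix_add8(const signed char *a, const signed char *b,
                 signed char *c, int n)
{
    assert(a != 0 && b != 0 && c != 0 && n >= 0);
    for (int i = 0; i < n * n; i++) {
        c[i] = (signed char)(a[i] + b[i]);
    }
}
```

In a matrix multiplication, by contrast, one operand is typically traversed column-wise with a stride of n elements, which makes it harder to pack neighbouring elements into a vector.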
Exercise 7 – Interrupts & Events 1/2
• Core 3 is used to model an external interface in the following
• Receiving and processing of data is completed as follows:
1. Core 0 configures its interrupt or event mask before completing other tasks/going to sleep
2. Core 3 generates data and sends/places it in a buffer (here in L2 memory)
3. Core 3 sends commands to global event unit, to generate an event for core 0
4. Core 0 receives an interrupt/wakes up and processes the data in L2 memory
5. Continue from beginning
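The five steps above can be modelled on a host machine as plain C, which makes the role of the mask visible. Everything in this sketch is hypothetical: the variable and function names, the buffer size, and the event line number are invented for illustration and do not correspond to the real event unit interface of the platform.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_WORDS 4u   /* hypothetical buffer size */
#define EVT_LINE  0u   /* hypothetical event line for core 0 */

uint32_t core0_mask;            /* 1 = event is not masked */
uint32_t pending;               /* events raised by the event unit */
int receive_buffer[BUF_WORDS];  /* stands in for the buffer in L2 memory */

/* Step 1: core 0 enables the event line it wants to react to. */
void core0_configure_mask(void)
{
    core0_mask |= 1u << EVT_LINE;
}

/* Steps 2 and 3: core 3 places data in the buffer and raises the event. */
void core3_send(const int *data)
{
    assert(data != 0);
    for (unsigned i = 0; i < BUF_WORDS; i++)
        receive_buffer[i] = data[i];
    pending |= 1u << EVT_LINE;
}

/* Step 4: core 0 processes the data only if an unmasked event is pending.
   Returns 1 if data was copied, 0 if the core would stay asleep. */
int core0_wait_and_process(int *received_data)
{
    assert(received_data != 0);
    if ((pending & core0_mask) == 0)
        return 0;
    for (unsigned i = 0; i < BUF_WORDS; i++)
        received_data[i] = receive_buffer[i];
    pending &= ~(1u << EVT_LINE);   /* acknowledge the event */
    return 1;
}
```

If core 0 never sets its mask, the event raised by core 3 stays pending and the data is not processed, which is why step 1 has to happen before core 0 goes to sleep.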
Exercise 7 – Interrupts & Events 2/2
• Task 1: Interrupts
– Check out the interrupt/event template: sw/apps/sequential_tests/event_interrupt.c
– Define an interrupt handler
– Initialize interrupts
– Add interrupt handler on interrupt GP0 (see sw/libs/sys_lib/src/int.c)
– Set interrupt mask (1 = interrupt is not masked)
• Task 2: Events
– In the function wait_and_proc_data():
• Copy data from receive_buffer to your memory (received_data)
• Go to sleep, wait for event
• When an event is received: copy data from receive_buffer to your memory (received_data)
• Task 3: Comparison
– Compare the two solutions:
• Execution time
• Measure the time required to process one packet in modelsim
$ make event_interrupt_vsim_debug
• Hints to measure time:
– Events: observe the clock of core 0
– Interrupts: observe the state of the exception routine (exc_running_p)
• What are the benefits of using events?
Questions & Answers
• You have successfully completed the exercise
• If you are interested in a mini-project we can offer you:
– Implementing and optimizing a benchmark using the multicore pulp environment
• See last exercise about the pulp architecture
– Compare simulator and RTL simulations and come up with examples of how to improve the simulator
– Add performance counters on RTL level
– Compare the OR10N core to a state-of-the-art microcontroller with DSP functionalities like the ARM Cortex M4
– Add a small unit which observes the stack pointer and issues an exception if a stack overflow has been detected
– We are open to your own ideas!