Exercise: OpenRISC Programming
Increasing efficiency of the OpenRISC core with simple instruction extensions
23.03.2015
Michael Gautschi, Antonio Pullini
Integrated Systems Laboratory

Introduction
• All exercises will be performed on the following configuration:
  – 4 OpenRISC or10n cores, each with a private instruction cache (I$)
  – 8 banks of shared level 1 memory (TCDM)
  – Large L2 memory for instructions and data
• Since the focus of this exercise is on the core, only one core is used and the others are put to sleep.
[Figure: system overview; one or10n processor core is active, the other cores are not used / sleeping]

Exercise Overview
1. Introduction example
   – Compile & execute Helloworld on the Instruction Set Simulator (ISS)
2. Simulator Basics
   – Analyze ISS output and debug
3. Benchmarking
   – Analyze performance improvements of the new instructions using the simulator
4. RTL simulations and benchmarking
   – Compare to simulator
5. Unaligned memory accesses
   – Program a simple stencil and show the benefit of unaligned memory accesses
6. Auto vectorization
   – Analyze the improvements of vector operations on a matrix addition
7. Interrupts and events
   – Compare interrupt and event response times

Getting Started – 1/2
• Copy data from master account:
  $ mkdir 2_OpenRISC
  $ cp /home/soc_master/2_OpenRISC/openrisc.tar.gz 2_OpenRISC/.
  $ tar -xzf openrisc.tar.gz        (in the 2_OpenRISC directory)
• Source the instruction set simulator:
  $ source /home/soc_master/germain/pulp-sdk2014.1-ub14.10_2/env/setup.sh
• We will be working in the software (sw) and build directories
  – sw-dir: contains the application source code
  – build-dir: holds compiler and simulator output; simulator and RTL simulations will be run here
  – sim-dir: contains precompiled code for RTL simulation

Getting Started – 2/2
• We will be working on the scratch because we are going to generate some data
1.
Create a build directory and set up the compiler:
   $ mkdir /scratch/soc_xx/build_id
2. Start the compiler in a new shell:
   $ or1k –v1.7.3 xterm
   $ tcsh        (switch to C shell)
3. Configure the build directory in the new shell:
   $ cd /scratch/soc_xx/build_id
   $ cp /home/soc_xx/2_OpenRISC/build/configure_template.tcl /scratch/soc_xx/build_id/configure.tcl
   In the configure script, set the path to your exercise folder:
   OPENRISC_DIR="/home/soc_xx/2_OpenRISC"
   $ ./configure.tcl
• You have successfully set up the build directory!
  – Let's get started with exercise 1!

Exercise 1 – Introduction
a) The build directory is created, the compiler is configured, and the simulator is set up. We are ready to start with a simple helloworld application.
b) Compile Helloworld.c
   • Helloworld.c is located in sw/apps/sequential_tests/helloworld/.
   • To compile the application, enter the build folder and run the makefile:
     $ cd build
     $ make helloworld            : to generate an executable for the simulator
     $ make helloworld.read       : to generate the assembly
     $ make helloworld.slm.cmd    : to generate input data for RTL simulations
c) Run Helloworld.c
   • The simulator is called "pulp-run" and can be started like this:
     $ pulp-run --load-binary=./apps/sequential_tests/helloworld/helloworld:0xf
   • Console should output "helloworld"
   • Output is also written to the file: stdout/stdout_pe0

Exercise 2 – Simulator basics 1/4
Simulator commands to run an application
  Run an application:
    $ pulp-run --load-binary=./applicationName:0xf
  Get assembly traces:
    $ pulp-run --load-binary=./applicationName:0xf --iss-trace
Enter debug mode
  Force entering the debugger in the beginning:
    $ pulp-run --load-binary=./applicationName:0xf --pdb-break
  Enter the debugger later:
    press "Ctrl+C" during execution
  Check current status of core:
    (Cmd) state
      : bootaddress = 0x1c000000
Breakpoints and inspection of memory (in debug mode)
  Set a breakpoint on memory access:
    (Cmd) bkp address
region rw
      : address = 32-bit memory address
      : default region = 0x4 (1 word)
      : rw = read or write access (default rw)
  Display all breakpoints:
    (Cmd) bkp_list
  Remove a breakpoint:
    (Cmd) bkp_dis ID
      : ID shown in bkp_list
  Inspect memory:
    (Cmd) mem_dump address size
      : 32-bit addresses
      : default size = 0x4 (1 word)

Exercise 2 – Simulator basics 2/4
• Analyze the rijndael application (AES en/decryption)
  – Generate the assembly: $ make rijndael.read
  – Have a look at the assembly file rijndael.read
• The most important functions are:
  – main
  – encrypt
  – decrypt
  – compute_aes
  – encfile
  – decfile
[Figure: annotated excerpt of rijndael.read showing the function header, the PC, the instruction in big and little endian format, the disassembled instructions, and absolute and relative jump/branch targets]

Exercise 2 – Simulator basics 3/4
• Simulator traces:
[Figure: annotated trace excerpt showing the PC, the latency of the instruction, new register values and effective addresses, and the disassembled instructions]
• To trace an application:
  – Run the application with the option --iss-trace
  – Four traces are generated, one for each core
    • We are actually using a four-core configuration of which three cores are sleeping
  – trace_cluster0_core0.log shows the interesting traces!

Exercise 2 – Simulator basics 4/4
• Tasks:
  – Run the rijndael application with traces and make use of the debugger and breakpoints
  – Which instructions take more than one cycle to execute?
  – Why are some load/store instructions (l.lwz, l.sw) taking more than one cycle?
  – Do you see an example of a data hazard?
  – How many times is address 0x1c008000 read?
  – What is the encryption key?

Exercise 3 – Benchmarking 1/5
• Intro: The timer can be used to count execution time (in cycles)
  – Check matrixMul.c for an example
  – Include <timer.h> and use the functions:
    • reset_timer()
    • start_timer()
    • stop_timer()
    • get_time()
• Hardware loops
  – To run the application with different hardware loop settings, the compiler needs to be reconfigured
  – There are two options:
    1. Create a new build folder with a new configure.tcl script
    2. Modify the current folder by editing its configuration script configure.tcl
  – To support different numbers of hardware loops, edit TARGET_C_FLAGS in configure.tcl:
    • remove the option -mno-hwlp to enable hwlps
    • add -mmax-hwloops=<number_of_hwlp>    : number_of_hwlp in [0,4]
  – Reconfigure build folder: $ ./configure.tcl
• The compiler will generate the following hwloop instructions to produce efficient loops:
  – lp.start    – lp.end
  – lp.count    – lp.counti
  – lp.setup    – lp.setupi

Exercise 3 – Benchmarking 2/5
• Tasks:
  – Check if hardware loops are generated (in the matrixMul.read file)
  – How many instructions are actually saved? Compare the matrixMul.read with and w/o hwloops.
  – Estimate the speedup with 1, 2, 4 hardware loops
  – Do your measurements match your estimations?
    • Why is the benefit of the first one higher?

Execution time: (# cycles / % improvement)
  Baseline:
  Hardware loop (1 register set):
  2 register sets:
  3 register sets:
  4 register sets:

Exercise 3 – Benchmarking 3/5
• Pre/post increment immediate:
  – Activated by default!
  – Deactivate with -mno-idxls
• Pre/post increment register:
  – From a hardware perspective, what is the drawback of this instruction?
  – Deactivate with -mno-rrls
• Multiply-accumulate instruction:
  – Old architecture:
    • Accumulation register stored in a special register which is not in the register file
    • Accumulation result can be accessed in two cycles
  – New architecture:
    • Accumulates directly on the register file
    • Enable new MAC with -mmac3
[Figures: Old MAC, New MAC]

Exercise 3 – Benchmarking 4/5
• Vector Instructions:
  – Add, sub, mul, mac, comparisons are all supported in vector mode
  – In parallel one can compute:
    • One word
    • Two halfwords, or
    • Four bytes
  – Enable the vector extensions with the compiler options:
    • -mlv32             for vector generation
    • -munaligned-ls     for unaligned memory accesses
  – Check in the .read file if vector code is generated. Vector instructions have the format:
    • lv.{mul,mac,sub,add,…}
• Tasks:
  – Run the matrixMul application with the different compiler options
  – Summarize your results in the first column on the next page

Exercise 3 – Benchmarking 5/5
Execution time: (# cycles / % improvement)
                          Simulator:      RTL-SIM:
  Baseline
  Hardware-loop
  Pre/post incr. imm.
  Pre/post incr. reg
  mac
  Vector
  Unaligned access
• Can you explain your results?
• Are the results you obtained optimal?
• What can be done better?
  – Try to improve the matrix multiplication such that vectors are used more efficiently

Exercise 4 – RTL Simulation 1/2
• The simulator is not 100% accurate because:
  – It is still under development
  – Not all components are modeled (no caches, no memory contentions, simplified DMA)
  – But it is a lot faster than RTL simulation
• Start RTL simulations with the Makefile in your build folders:
  – $ make testName_vsim
• Rerun the matrixMul in the RTL simulation and complete the table
  – If you have a build folder for each configuration, you can run the simulations in parallel (each RTL-sim will take ~5 min)
  – Are you able to reproduce the previously obtained results?
    Do you observe any differences?
  – Do you have an explanation?

Exercise 4 – RTL Simulation 2/2
• In matrixMul.c the function computeGold(), which actually computes the multiplication, is called twice with the same inputs and outputs!
[Figure: code snippet of matrixMul.c]
• Why do you think this is the case?
  – Compare the execution time of the first iteration between the RTL simulation and the simulator!
  – What do you observe?

Exercise 5 – Unaligned Memory Access
• Unaligned memory access
  – If data is not aligned, for example when computing a stencil over an image, many memory accesses and shift operations are required.
  – If data can be accessed in an unaligned fashion, the shifting is no longer required.
  – The or10n core supports unaligned accesses in two consecutive cycles!
• Pixel = 1 char
[Figure: image with 1 stencil; 4 parallel stencils computed with the vectorial unit]
• Tasks:
  – Complete the stencil template program (sw/apps/sequential_tests/stencil/stencil.c), which computes four stencils in parallel over a 32x32 pixel image
  – Do you observe a speedup with unaligned memory access enabled?
  – Compare to a vector-only implementation

Exercise 6 – Auto Vectorization
• In the matrixMul example, the automatic vector support did not bring the desired speedup
• If the matrices are accessed in a more regular fashion, this is expected to change!
• Run the matrixAdd{8,16,32} applications in RTL simulation and measure the speedup.
  – matrixAdd8 is based on characters
  – matrixAdd16 on short integers
  – matrixAdd32 on integers
• Run the motion_detection application and compare first the execution time of the baseline and the optimized architecture without vector support.
  – Then add unaligned memory access first, and vector instructions after that

Execution time: (# cycles / % improvement)
                      matrixAdd32    matrixAdd16    matrixAdd8    motion_detection
  Baseline
  Without vector
  Unaligned access
  With vector

Exercise 7 – Interrupts & Events 1/2
• In the following, core 3 is used to model an external interface
• Receiving and processing of data is completed as follows:
  1. Core 0 configures its interrupt or event mask before completing other tasks / going to sleep
  2. Core 3 generates data and places it in a buffer (here in L2 memory)
  3. Core 3 sends commands to the global event unit to generate an event for core 0
  4. Core 0 receives an interrupt / wakes up and processes the data in L2 memory
  5. Continue from the beginning

Exercise 7 – Interrupts & Events 2/2
• Task 1: Interrupts
  – Check out the interrupt/event template: sw/apps/sequential_tests/event_interrupt.c
  – Define an interrupt handler
    • Copy data from receive_buffer to your memory (received_data)
  – Initialize interrupts
  – Add the interrupt handler on interrupt GP0 (see sw/libs/sys_lib/src/int.c)
  – Set the interrupt mask (1 = interrupt is not masked)
• Task 2: Events
  – In the function wait_and_proc_data():
    • Go to sleep, wait for an event
    • When an event is received: copy data from receive_buffer to your memory (received_data)
• Task 3: Comparison
  – Compare the two solutions:
    • Execution time
    • Measure the time required to process one packet in ModelSim:
      $ make event_interrupt_vsim_debug
  – Hints to measure time:
    • Events: observe the clock of core 0
    • Interrupts: observe the state of the exception routine (exc_running_p)
  – What are the benefits of using events?
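
The receive path of Exercise 7 can be modeled on a host machine as follows. This is only an illustrative sketch, not the solution of the template: wait_and_proc_data(), receive_buffer and received_data are the names used above, whereas PACKET_WORDS, event_pending, core3_send_packet(), wait_for_event(), copy_packet() and gp0_interrupt_handler() are hypothetical stand-ins. The real event unit and the handler registration from sw/libs/sys_lib/src/int.c are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define PACKET_WORDS 4 /* hypothetical packet size, not taken from the template */

/* Buffer written by the external interface (core 3); lives in L2 on the real system. */
volatile uint32_t receive_buffer[PACKET_WORDS];
/* Private copy owned by core 0. */
uint32_t received_data[PACKET_WORDS];
/* Host stand-in for the global event unit (assumption: one pending flag). */
volatile bool event_pending = false;

/* Stand-in for core 3: place a packet in the buffer, then raise the event. */
void core3_send_packet(uint32_t seed)
{
    for (int i = 0; i < PACKET_WORDS; i++)
        receive_buffer[i] = seed + (uint32_t)i;
    event_pending = true;
}

/* Stand-in for "go to sleep, wait for event". On the real core the clock
 * would be gated here; on the host we simply poll the flag and clear it. */
void wait_for_event(void)
{
    while (!event_pending) {
        /* sleeping */
    }
    event_pending = false;
}

/* Shared by both variants: copy the packet into core 0's memory. */
void copy_packet(void)
{
    for (int i = 0; i < PACKET_WORDS; i++)
        received_data[i] = receive_buffer[i];
}

/* Event variant (Task 2): sleep until the event arrives, then copy. */
void wait_and_proc_data(void)
{
    wait_for_event();
    copy_packet();
}

/* Interrupt variant (Task 1): body of a handler that would be attached to GP0. */
void gp0_interrupt_handler(void)
{
    copy_packet();
}
```

In this host model nothing runs concurrently, so core3_send_packet() has to be called before wait_and_proc_data(); on the hardware, core 0 sleeps until core 3 triggers the event.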

Questions & Answers
• You have successfully completed the exercise
• If you are interested in a mini-project we can offer you:
  – Implementing and optimizing a benchmark using the multicore pulp environment
    • See last exercise about the pulp architecture
  – Compare simulator and RTL simulations and come up with examples of how to improve the simulator
  – Add performance counters on RTL level
  – Compare the OR10N core to a state-of-the-art microcontroller with DSP functionalities like the ARM Cortex M4
  – Add a small unit which observes the stack pointer and issues an exception if a stack overflow has been detected
  – We are open to your own ideas!
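
Returning to the result tables of Exercises 3 to 6: each entry asks for "# cycles / % improvement". A minimal sketch of that arithmetic, assuming the improvement is meant relative to the baseline cycle count; the function names improvement_percent() and speedup() are hypothetical helpers, not part of the exercise code.

```c
#include <assert.h>

/* Percentage of cycles saved relative to the baseline run.
 * Assumption: "% improvement" = (baseline - optimized) / baseline * 100. */
double improvement_percent(unsigned long baseline_cycles,
                           unsigned long optimized_cycles)
{
    assert(baseline_cycles > 0);
    return 100.0 * ((double)baseline_cycles - (double)optimized_cycles)
                 / (double)baseline_cycles;
}

/* Speedup factor of the optimized run over the baseline run. */
double speedup(unsigned long baseline_cycles, unsigned long optimized_cycles)
{
    assert(optimized_cycles > 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}
```

With made-up numbers, a baseline of 1000 cycles and an optimized run of 750 cycles give improvement_percent(1000, 750) = 25.0, and speedup(1000, 500) = 2.0.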