How to port embedded Windows CE 6.0 R2
to OMAP-L138 (Part 2)
Know the pros and cons of OMAP’s programmable real-time unit.
By Artsiom Staliarou and Denis Mihaevich
Founders
Axonim Devices Company
In addition to the capabilities discussed in part 1, an important feature of the OMAP-L138 SoC family that is of
enormous benefit to a developer is a separate sub-system, the Programmable Real-Time Unit (PRU). The PRU is based
on two 32-bit cores, each with its own memory for storage of commands and data. Applications of this sub-system
(PRUSS) are diverse: it can implement additional interfaces, or service interfaces that require specific
protocols, acting as an auxiliary core alongside the DSP or ARM core.
Figure 1 shows the general structure of the PRU sub-system (PRUSS). This sub-system contains two independent 32-bit
cores with an instruction set of their own that is independent of both the DSP and ARM cores.
Figure 1: PRU sub-system structure.
These cores have a simplified RISC architecture that supports 40 instructions with a deterministic execution time
(one cycle each) and offers extended facilities for manipulating bits in registers. The cores have no instruction
pipeline, no interrupt vectors, and no hardware multiplication or division instructions. All interrupts are
handled by polling a flag in one of the registers.
The PRUSS also has its own interrupt controller, which combines events from the peripherals, the ARM and DSP cores,
and the PRU cores. This controller can handle 32 events in two directions: from the PRUSS to the ARM and DSP cores,
and from the ARM and DSP cores to the PRUSS. The interrupt controller can thus pass any event to the corresponding
interrupt controller of the ARM or DSP core, which in turn invokes the interrupt handler of that core if the
interrupt is enabled in the respective registers. Using the interrupt controller of the PRU module, it is possible
to implement simple interaction not only between the PRU cores, but also between the ARM and DSP cores.
EE Times-India | eetindia.com
Copyright © 2012 eMedia Asia Ltd.
Page 1 of 6
Figure 2 shows the structure of a PRU core. Each PRU core contains 32 registers, an execution unit, a table with
29 constants, and 4 KB of command RAM. Independent fast input/output ports (GPIOs) associated with each core are
connected directly to two registers, allowing the developer to use either the core's own communications interfaces
or the GPIOs to implement standard interfaces such as UART, CAN, or ProfiBus.
Figure 2: PRU core structure.
Command RAM is located in the core itself, which allows each instruction to execute in a single cycle. All four
SoC cores (ARM, DSP, PRU0, and PRU1) have access to the data RAM, but each PRU core can execute code only from its
own command RAM, even though both PRU cores have access to all peripherals via the central bus. Having dedicated
command and data RAM offloads the SoC central bus and allows interaction with the peripherals, mDDR/DDR2 memory,
and the ARM/DSP cores with minimal load.
Optionally, the system can manage the power and clocking of the PRU sub-system. The sub-system is clocked at half
of the ARM core frequency. This means that when the ARM core is operating at 450MHz, the PRU cores run at 225MHz
(about 4.4 ns per instruction). The power manager allows the PRU sub-system to be stopped or disabled when it is
not needed, thus reducing the SoC's overall power consumption.
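The clock arithmetic above can be checked with a few lines of C. This is only an illustrative sketch: the function names are hypothetical and do not belong to any TI or Windows CE API, and the only facts taken from the text are the divide-by-two relationship and the 450MHz example.

```c
#include <assert.h>
#include <stdint.h>

/* The PRU sub-system is clocked at half of the ARM core frequency. */
static uint32_t pru_clock_hz(uint32_t arm_clock_hz)
{
    return arm_clock_hz / 2u;
}

/* Duration of one PRU cycle in picoseconds (integer division truncates). */
static uint32_t pru_cycle_ps(uint32_t pru_hz)
{
    assert(pru_hz != 0u);
    return (uint32_t)(1000000000000ull / pru_hz);
}

/* Convert a cycle count into nanoseconds at the given PRU clock.
 * A 32-bit cycle count cannot overflow the 64-bit intermediate product. */
static uint64_t pru_cycles_to_ns(uint32_t cycles, uint32_t pru_hz)
{
    assert(pru_hz != 0u);
    return (uint64_t)cycles * 1000000000ull / pru_hz;
}
```

With a 450MHz ARM clock, pru_clock_hz gives 225MHz and pru_cycle_ps gives 4444 ps, which matches the figure of about 4.4 ns per instruction. Because every instruction takes one cycle, the same conversion turns an instruction count into an execution time.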
There is no official C compiler for the PRUSS, nor any official support in TI's Code Composer Studio that we were
able to find. Despite that, it is possible to configure the Code Composer Studio environment to build the PRU
module code automatically, which makes program development more convenient and keeps everything in one project.
To build the executable code, a specialised version of the open source PASM compiler is used, which takes assembly
language as its source. An example of the code for the PRU0 core is shown below:
.setcallreg r28.w2
.origin 0
.entrypoint PRU0_AUDIO_PROCESS_CODE
#include "PRU0.hp"
PRU0_AUDIO_PROCESS_CODE:
    // Store zero to CTPPR_1 and CTPPR_0 (ST32 is assumed to be a store macro from PRU0.hp)
    MOV r0, 0x00000000
    MOV r1, CTPPR_1
    ST32 r0, r1
    MOV r0, 0x00000000
    MOV r1, CTPPR_0
    ST32 r0, r1
    // Load the EDMA ICR register addresses
    MOV32 regEDMA_2_ICR, 0x01C02470
    MOV32 regEDMA_3_ICR, 0x01C02670
    // Initialise pointer to INTC registers
    MOV32 regOffset, 0x00000000
    // Clear SYS_EVT
    MOV32 r31, 0x00000000
    // Global enable of all host interrupts
    LDI regVal.w0, 0x0001
    SBCO regVal, CONST_PRUSSINTC, GER_OFFSET, 2
The PASM compiler supports several types of output file: binary, C array, HEX file, and others (including an
annotated listing). An example of an output file in the form of a C array is shown below:
const unsigned int PRU0_Code[] =
{
0x240000e0,
0x24702ce1,
0xe1002180,
0x240000e0,
0x247028e1,
0xe1002180,
0x24247095,
0x2401c0d5,
0x24267096,
...
};
The compiler places the code starting at address zero of the command RAM. This allows the C file to be included
in the main program, for example the one running on the ARM core, which can then copy the data from the array
directly into the command RAM of the appropriate core.
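The copy step can be sketched in C as follows. This is a minimal illustration rather than the authors' driver code: pru_load_code and PRU_IRAM_BYTES are hypothetical names, the array reproduces only the first three words of the listing above, and the destination pointer is assumed to be an already mapped address of the command RAM of a halted core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First words of the PASM output shown above; the real array is longer. */
static const uint32_t PRU0_Code[] = {
    0x240000e0, 0x24702ce1, 0xe1002180,
};

#define PRU_IRAM_BYTES 4096u   /* 4 KB of command RAM per PRU core */

/*
 * Copy a code image, word by word, to the start of a PRU command RAM.
 * 'iram' is assumed to point at offset 0 of the command RAM of a halted
 * core. Returns 0 on success, or -1 if the image does not fit or its
 * size is not a whole number of 32-bit words.
 */
static int pru_load_code(volatile uint32_t *iram,
                         const uint32_t *code, size_t code_bytes)
{
    assert(iram != NULL && code != NULL);
    if (code_bytes > PRU_IRAM_BYTES || (code_bytes % 4u) != 0u)
        return -1;
    for (size_t i = 0; i < code_bytes / 4u; ++i)
        iram[i] = code[i];   /* 32-bit writes; the image starts at address 0 */
    return 0;
}
```

A caller would pass sizeof PRU0_Code as the length. The size check matters because anything beyond 4 KB would land outside the command RAM of the core.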
For environments other than Code Composer Studio, TI supports the use of Notepad++ or TextPad for convenient code
development with syntax highlighting. Ready-made setup files with syntax definitions for PRU module code are
provided for these editors.
The BSP for Windows CE 6.0 for the OMAP-L138 has no support for the PRU sub-system. Officially, a code loader
driver exists only for Linux, and only with a specialised patch. For this reason, during our projects we developed
a monolithic driver for the PRU module, with added support for hardware interrupts from the PRU sub-system. This
driver is configured to deliver a continuous stream of data during a specific interval of time between interrupts.
Figure 3 shows the driver subroutines needed for interaction with the OS and user applications. The PRU_Init
subroutine performs the primary initialisation of the driver and translates the physical addresses of the memory
allocated to the PRU sub-system into virtual ones for further use.
Figure 3: PRU Cores.dll driver functions.
The PRU_Deinit subroutine releases resources when the code loader driver is unloaded. The PRU_PreDeinit and
PRU_PreClose subroutines are used as stubs. The rest of the subroutines serve the software/hardware interface.
The PRU_Open subroutine returns the device handle that is later passed to the DeviceIoControl function.
PRU_Close, in turn, cleans up the context and is executed when CloseHandle is called on the device handle.
The PRU_PowerUp and PRU_PowerDown subroutines notify the PRU sub-system when the system resumes from the Suspend
state and when it enters that state. The PRU_IOControl subroutine contains the whole functional implementation of
the driver. The following operations are available through PRU_IOControl:
• IOCTL_PRU_REQ_INT returns the system interrupt number that corresponds to a specific event number (3…10) of the
  ARM-core interrupt controller;
• IOCTL_PRU_RELESE_INT releases the system interrupts allocated using IOCTL_PRU_REQ_INT;
• IOCTL_PRU_INT_INIT links a system interrupt to an event handle obtained from the CreateEvent API function, so
  that the user application can then wait for it with WaitForSingleObject;
• IOCTL_PRU_INT_DONE signals the kernel that the user application has processed the interrupt from the PRU core
  (an analogue of InterruptDone);
• IOCTL_PRU_LOAD_CODE loads code into the command RAM of the PRU core (with a mandatory halt of the core). This
  operation also handles tasks such as powering up the PRU sub-system through the PSC (Power and Sleep
  Controller);
• IOCTL_PRU_MAKE_SINGLESTEP starts single-stepping through the program (for debugging);
• IOCTL_PRU_RUN starts the PRU core for free-running execution of the program in the command RAM;
• IOCTL_PRU_STOP stops the PRU core;
• IOCTL_PRU_WAIT_FOR_HALT waits for the PRU core to execute a HALT command;
• IOCTL_PRU_SET_PC_STARTUP_POINT sets the program start-up point;
• IOCTL_PRU_SLEEP switches the PRU core into sleep mode, with the option of returning to normal mode on various
  events;
• IOCTL_PRU_ENABLE_COUNTER enables the PRU core cycle counter;
• IOCTL_PRU_GET_PC_COUNTER returns the current address of the command under execution;
• IOCTL_PRU_GET_CYCLE_COUNT returns the cycle counter value;
• IOCTL_PRU_SET_CYCLE_COUNT writes a new value to the cycle counter;
• IOCTL_PRU_GET_STALL_COUNT returns the number of cycles lost because no command was available for execution;
• IOCTL_PRU_WRITE_GP writes to general-purpose registers (for debugging);
• IOCTL_PRU_READ_GP reads from general-purpose registers (for debugging);
• IOCTL_PRU_GET_DRAM_PTR returns a pointer to the data RAM area of the PRU core, translated into the memory space
  of the user application.
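The way such a control-code interface is typically dispatched can be sketched in portable C. This is not the authors' driver: the numeric values of the control codes are not given in the article, so an enum stands in for them, the hardware is replaced by a simulated struct pru_core, and only four of the listed operations are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical control codes; the real numeric values are not published. */
enum pru_ioctl {
    IOCTL_PRU_LOAD_CODE,
    IOCTL_PRU_RUN,
    IOCTL_PRU_STOP,
    IOCTL_PRU_GET_CYCLE_COUNT
};

/* Simulated state of one PRU core (stands in for the mapped hardware). */
struct pru_core {
    uint32_t iram[1024];   /* 4 KB of command RAM */
    int      running;
    uint32_t cycle_count;
};

/* Returns 1 on success and 0 on failure, like a stream driver IOControl. */
static int pru_iocontrol(struct pru_core *core, enum pru_ioctl code,
                         const void *in, size_t in_len,
                         void *out, size_t out_len)
{
    assert(core != NULL);
    switch (code) {
    case IOCTL_PRU_LOAD_CODE:
        if (in == NULL || in_len == 0 || in_len > sizeof core->iram)
            return 0;
        core->running = 0;              /* mandatory halt before loading */
        memcpy(core->iram, in, in_len);
        return 1;
    case IOCTL_PRU_RUN:
        core->running = 1;
        return 1;
    case IOCTL_PRU_STOP:
        core->running = 0;
        return 1;
    case IOCTL_PRU_GET_CYCLE_COUNT:
        if (out == NULL || out_len < sizeof core->cycle_count)
            return 0;
        memcpy(out, &core->cycle_count, sizeof core->cycle_count);
        return 1;
    }
    return 0;                           /* unknown control code */
}
```

Checking the buffer pointers and lengths before touching them is the important design point, because the buffers come from a user application. In a real Windows CE driver the equivalent function would receive these buffers through its IOControl entry point.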
The API software driver routine detailed above is a link between the device and the user application and is used
to simplify development and speed up delivery of the final product. For debugging of applications developed for
the PRU sub-system, several methods are used. The one we prefer is to expose checkpoints via data RAM and
general-purpose registers with the help of the following hardware operations:
• Interrupts of the ARM/DSP cores;
• Use of the fast input/output port (R30 register); and
• Use of infinite loops and storage in a register to indicate the current address of the executed command.
However, in general the choice of the debugging method depends upon the application type and convenience of the
method (several methods may be combined).
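The data RAM variant of this technique can be sketched from the ARM side in C. The layout is an assumption made for illustration: the PRU program is taken to store the number of the last checkpoint it passed in the first word of its data RAM, and the function names are hypothetical. In a real application the pointer would be the one returned by IOCTL_PRU_GET_DRAM_PTR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: word 0 of data RAM holds the last checkpoint number. */
#define CHECKPOINT_WORD 0u

/* Read the most recent checkpoint number from mapped PRU data RAM. */
static uint32_t pru_read_checkpoint(const volatile uint32_t *dram)
{
    assert(dram != NULL);
    return dram[CHECKPOINT_WORD];
}

/*
 * Poll until the PRU reports at least 'wanted', giving up after
 * 'max_polls' reads. Returns 1 if the checkpoint was reached, else 0.
 */
static int pru_wait_checkpoint(const volatile uint32_t *dram,
                               uint32_t wanted, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; ++i) {
        if (pru_read_checkpoint(dram) >= wanted)
            return 1;
    }
    return 0;
}
```

The volatile qualifier forces a fresh read on every poll, which is needed because the value is changed by another core. The poll limit keeps the application from hanging if the PRU program never reaches the expected point.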
DSPLink communications/loading options
When developing electronic devices, there are a number of ways to lower the cost and simplify the whole system.
But in the majority of cases, the use of multiple cores offers many advantages, especially for media-intensive
embedded designs.
In the design we were considering, the multi-core approach gave us two or more independent cores that share a
common multi-layer bus, multiple DMA channels, and a common set of peripherals, together with a communications
channel between the cores.
The DSP sub-system in the OMAP-L138 SoC includes a TMS320C674x processor core that can operate at a frequency of
450MHz, cache memory (instructions and data), L2 RAM, and integrated debugging tools (Advanced Event Triggering -
AET). This computing capability allows image processing algorithms to be implemented.
From the point of view of the Windows CE 6.0 OS, the DSP core itself is of no interest, since Windows CE 6.0 code
cannot run on the DSP core. Instead, it is more beneficial to use the DSP core for separate operations, such as
complex tasks involving video data decoding and encoding.
The DSP core can perform floating-point mathematical operations, which spares the developer from converting the
DSP algorithms generated by a Matlab program. Otherwise, it would be necessary to have them executed in software
by the main ARM processor.
For example, if signal processing were limited to fixed-point mathematical calculations, it would be necessary to
translate the floating-point algorithm into fixed-point code first, debug it, and then, observing numerous
restrictions, port it to the DSP core.
Beyond floating-point operations, applications such as video decoding and encoding give rise to two different
tasks. The first is to encode and decode video and audio data streams. The second is to implement an independent
algorithm on the DSP core. The latter task can in turn be approached in two ways: the first is to use the DSP/BIOS
OS from the SoC manufacturer or some other OS; the second is to develop a program that does not depend on an OS
(bare-metal code).
Windows CE 6.0 allows implementation of either or both variants, especially as TI provides for the OMAP a
ready-made communications channel between the cores: DSPLink, a library that arranges interaction between the
processors (ARM<->DSP) through an existing API. However, DSPLink presupposes the use of TI's DSP/BIOS OS on the
DSP side.
Figure 4 shows the structure of the interaction between the ARM and DSP sub-systems using the DSPLink library.
The OMAP-L138 SoC has no Inter-Processor Communication (IPC) module. Instead, the device's DSP L2 RAM, shared RAM,
or mDDR/DDR2 RAM is used for information exchange between the cores.
Figure 4: Interaction of ARM and DSP sub-systems with the use of the DSPLink library.
With the DSPLink library it is possible to load code into the DSP core and execute it and, of course, to set up a
channel for data exchange with the ARM core using the ready-made API. This API not only controls the execution
state of the DSP core, but is also used to exchange messages between the cores (MSGQ), exchange streaming data
(CHNL), and create circular buffers (RingIO).
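The circular buffer idea can be illustrated with a generic ring buffer in C. This sketch does not use the DSPLink RingIO API and is not modelled on its interface: struct ring, ring_write, ring_read, and the 8-byte capacity are all hypothetical, and the code shows only the wrap-around indexing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 8u   /* illustrative size */

struct ring {
    uint8_t data[RING_CAPACITY];
    size_t  head;    /* next position to write */
    size_t  tail;    /* next position to read  */
    size_t  count;   /* bytes currently stored */
};

/* Append up to 'len' bytes; returns the number actually written. */
static size_t ring_write(struct ring *r, const uint8_t *src, size_t len)
{
    size_t n = 0;
    assert(r != NULL && src != NULL);
    while (n < len && r->count < RING_CAPACITY) {
        r->data[r->head] = src[n++];
        r->head = (r->head + 1u) % RING_CAPACITY;   /* wrap at the end */
        r->count++;
    }
    return n;
}

/* Remove up to 'len' bytes; returns the number actually read. */
static size_t ring_read(struct ring *r, uint8_t *dst, size_t len)
{
    size_t n = 0;
    assert(r != NULL && dst != NULL);
    while (n < len && r->count > 0u) {
        dst[n++] = r->data[r->tail];
        r->tail = (r->tail + 1u) % RING_CAPACITY;   /* wrap at the end */
        r->count--;
    }
    return n;
}
```

A buffer shared between two cores in shared RAM or DDR2 memory would also need cache maintenance and synchronisation of the indices, which a single-threaded sketch like this one leaves out.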
The basic purpose of these mechanisms is to create an ecosystem for working with encoding and decoding of audio
and video data. TI provides codecs (in the form of binary libraries) for decoding and encoding of audio data (AAC,
MP3 - decoding only, WMA), voice data (G.711, G.722, G.726), video data (H.264, MPEG2 - decoding only, MPEG4),
and images (JPEG).
About the authors
Artsiom Staliarou and Denis Mihaevich are founders of the AXONIM Devices Company, a Microsoft Embedded Partner and
an independent embedded electronics system design centre and system integrator with 25 engineers, based in Minsk,
Belarus.
Artsiom has a degree in Radiophysics and more than 10 years' experience in embedded system design based on
ARM/Blackfin/TI DSP (C2x/C5x/C6x)/x86 devices, using Embedded Linux/Windows Embedded OSes.
Denis also has a degree in Radiophysics, more than 12 years' experience in embedded system design and video
analysis algorithm development, and a certificate in optoelectronics.