On-chip system level visibility
How to optimise ARM platforms & shorten time to market?
Serge Poublan
ARM CoreSight product manager
Javier Orensanz
ARM tools product manager

DRIVERS & USAGE CASES

Higher level of integration
Optimized ARM Smartphone Block Diagram

Software cost explodes

Many users need on-chip visibility

Visibility for all with CoreSight™

Wide industry support

CoreSight™: an overview
[Figure: example ARM SoC. Labels: AMBA AXI; Cortex-A9; Cortex-R4; DSP; APB bridge; shared peripherals; port]

CoreSight & real-time trace
[Figure: example ARM SoC with trace sources. Labels: AMBA AXI; APB bridge; Cortex-A9 and Cortex-R4 with CPU trace; DSP with DSP ETM; bus trace; system trace; shared peripherals; port]
Processor Trace:
• Source and associated instructions
• Code coverage
• Cycles per instruction
• Interlock information

CoreSight & multi-core debug
[Figure: example ARM SoC for cost effective debug. Labels: AMBA AXI; Cortex-A9, Cortex-R4 and DSP, each with a Cross Trigger Interface; cross trigger matrix; DAP with SWD; APB bridge; debug bus (APB), the debug control bus; shared peripherals; port; RealView ICE; RealView Trace]

CoreSight & trace management
[Figure: example ARM SoC for cost effective debug and trace. Labels: AMBA AXI; Cortex-A9 and Cortex-R4 with CPU trace; DSP with DSP ETM; bus trace; system trace; Cross Trigger Interfaces and cross trigger matrix; DAP with SWD; APB bridge; debug bus (APB), the debug control bus; trace bus (ATB), also carrying system trace; funnel; trace collection strategies; shared peripherals; port; RealView ICE; RealView Trace]

DEBUG INTERFACE

Debug… It is all about cost!
• A mature, simple and low-cost 2-pin debug interface
  - 2 pins (clock & data), a simple protocol
  - Optimised to access memory-mapped debug devices
  - Fully synchronous for high performance (100 MHz) & synthesis
• Over 20 tool vendors support SWD
• Multi-drop support with SWD v2
• Inter-operable with IEEE 1149.7
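
To illustrate how simple the 2-pin protocol is, here is a minimal sketch of how a debugger might build the 8-bit request header that opens every SWD transfer. It assumes the request layout of the ARM Debug Interface v5 specification (Start, APnDP, RnW, A[2:3], Parity, Stop, Park, sent LSB first). The function name `swd_request` and its parameters are hypothetical and are not taken from the slides or from any ARM tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the SWD request header for one transfer.
 * ap_n_dp: true to address an Access Port, false for the Debug Port.
 * read:    true for a read, false for a write.
 * addr:    register address (0x0, 0x4, 0x8 or 0xC); only bits [3:2] are sent.
 * Bit 0 of the returned byte is the first bit on the wire. */
uint8_t swd_request(bool ap_n_dp, bool read, uint8_t addr)
{
    assert((addr & 0x3u) == 0 && addr <= 0xCu);

    unsigned a2 = (addr >> 2) & 1u;
    unsigned a3 = (addr >> 3) & 1u;
    /* Even parity over the four payload bits. */
    unsigned parity = ((unsigned)ap_n_dp ^ (unsigned)read ^ a2 ^ a3) & 1u;

    return (uint8_t)((1u << 0)                    /* Start = 1 */
                   | ((unsigned)ap_n_dp << 1)     /* APnDP     */
                   | ((unsigned)read << 2)        /* RnW       */
                   | (a2 << 3)                    /* A[2]      */
                   | (a3 << 4)                    /* A[3]      */
                   | (parity << 5)                /* Parity    */
                   | (0u << 6)                    /* Stop = 0  */
                   | (1u << 7));                  /* Park = 1  */
}
```

For example, `swd_request(false, true, 0x0)` yields 0xA5, the request for a Debug Port read at address 0x0. The turnaround, 3-bit acknowledge and 32-bit data phases that follow the request are left out of this sketch.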

SYSTEM TRACE

System level visibility up to the final product
• CoreSight System Trace Macrocell
• For system & application level debug & optimisation
• Deployable up to end product at very low cost
• A single resource for
  - High level application software view (Apps, kernel, firmware)
  - Tuning of system performance
  - Tracing of internal SoC signals (e.g. IRQ, DMA, …)
[Figure: funnel]

High level view for software developers
• High performance “hardware printf” with minimum intrusion
  - Focus on “key part” of your s/w code
  - Monitor the whole system, not only the CPU
• Comply with MIPI® STPv2
• Linux drivers & libraries
• Decrease tooling cost
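
As a rough illustration of the “hardware printf” idea, the sketch below sends a string by storing it, four bytes at a time, to a single memory-mapped stimulus port. Everything here is an assumption made for illustration: the `stm_port_t` type, the `stm_puts` name and the one-word port model are hypothetical, and a real System Trace Macrocell exposes many stimulus ports and adds timestamps and packet framing that are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 32-bit stimulus port. On silicon the address
 * would come from the SoC memory map; here the caller passes a pointer, so
 * the sketch also runs against an ordinary variable. */
typedef volatile uint32_t stm_port_t;

/* Pack msg four bytes per word (first character in the low byte) and store
 * each word to the port. Returns the number of stores issued. */
size_t stm_puts(stm_port_t *port, const char *msg)
{
    size_t stores = 0;

    assert(port != NULL && msg != NULL);
    while (*msg != '\0') {
        uint32_t word = 0;
        for (unsigned i = 0; i < 4 && *msg != '\0'; ++i, ++msg) {
            word |= (uint32_t)(unsigned char)*msg << (8u * i);
        }
        *port = word;   /* the only cost to the software is this store */
        ++stores;
    }
    return stores;
}
```

The point of the design is that the instrumented code pays for a handful of stores rather than for formatting and driving a UART, which is what keeps the intrusion low. Calling `stm_puts(&port, "boot ok")` on a plain `volatile uint32_t port` issues two stores.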

TRACE MANAGEMENT

Cost effective trace collection
• CoreSight merges trace sources to reduce packaging cost
• Export trace to
  - Debug pins (2 pins)
  - Dedicated trace port (parallel or serial high speed trace)
  - Existing functional links
• Capture trace in
  - System memory with OS management
  - Dedicated trace buffer (SRAM)
[Figure: two CPU trace sources, system trace and bus trace merged by a funnel onto the trace bus (ATB)]

Trace export through 2-pin SWD
• Export trace with 2 pins (Serial Wire Debug)
[Figure: PTM and STM trace sources; trace bus (ATB); funnel; TMC used as a buffer; debug bus (APB); DAP; SWD]

Reduce trace port size with FIFO mode
• Re-use SRAM for FIFO mode
• Export trace with 2 pins (Serial Wire Debug)
• Average out the trace bandwidth to allow fitting of a narrower trace port
[Figure: PTM and STM trace sources; trace bus (ATB); funnel; TMC used as a buffer or FIFO; TPIU and trace port; debug bus (APB); DAP; SWD; graph axis: bits / cycle]
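
The bandwidth-averaging argument can be checked with a toy model. The sketch below, under simplifying assumptions, steps through a per-cycle trace load and reports the peak FIFO occupancy for a given port width, which is the amount of SRAM the FIFO would need to avoid overflow. The function `fifo_peak_bits` and its parameters are hypothetical; the model ignores formatter overhead and any flow control in the real TMC and TPIU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* in[i] is the number of trace bits produced in cycle i; up to drain_bits
 * leave through the trace port in each cycle. Returns the largest FIFO
 * occupancy, in bits, seen at the end of any cycle. */
uint64_t fifo_peak_bits(const uint32_t *in, size_t cycles, uint32_t drain_bits)
{
    uint64_t level = 0;
    uint64_t peak = 0;

    assert(in != NULL || cycles == 0);
    for (size_t i = 0; i < cycles; ++i) {
        level += in[i];                                       /* fill  */
        level -= (level < drain_bits) ? level : drain_bits;   /* drain */
        if (level > peak) {
            peak = level;
        }
    }
    return peak;
}
```

For a load of 32 bits in each of two cycles followed by six idle cycles, the average is 8 bits per cycle. In this model a 32-bit port needs no buffering, while an 8-bit port keeps up as long as the FIFO can hold 48 bits.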
Route trace to existing SoC resources
• Route trace to IO controller or system memory
[Figure: PTM and STM trace sources; trace bus (ATB); funnel; TMC used as a buffer, FIFO or router; AMBA AXI to IO controller and system memory; TPIU and trace port; debug bus (APB); DAP; SWD]

SUMMARY

CoreSight™ visibility affordable for ALL
• Silicon vendors
  - Reduce cost of implementing trace
  - Visibility of signals and software with STM
  - Output performance and power profiling data
• OEMs
  - High level application software view (Apps, OS, firmware) with STM
  - Equip more s/w developers with on-chip visibility (STM) at lower cost (TMC)
  - Optimize software stack on real product
• Tool vendors
  - Deliver on-chip visibility to more users
  - Complement processor trace

Use of CoreSight™ enabled hardware for software optimisation
Case Study with the ARM Profiler
Optimizing Android Media Framework

Optimization Iterative Process
[Figure: optimization loop. Source code is built by the development tools (compiler, linker, assembler, libraries) into machine code and sent to the target; profile analysis then drives automatic or manual optimization of the source]

Trace Capture and Analysis
[Figure: SoC (processor, ETM, trace port) sends compressed trace to a trace port analyzer, which applies further compression and connects over USB to the host PC]
• Trace port analyzer compresses and streams trace info
• Host PC decompresses and analyzes the trace stream
• Profiler displays results of analysis
  - Profiling and memory usage information
  - Code coverage
  - At thread, function and instruction level
  - Non-intrusive and long-running

Case Study – GStreamer
• FFmpeg execution under Android
  - Google Android OS: www.android.com
  - GStreamer multimedia framework: gstreamer.freedesktop.org
  - FFmpeg audio and video library plug-in: www.ffmpeg.org
• Target: Mistral EVM OMAP35xx
  - Based on Cortex-A8 processor

Hardware Set Up

ARM Profiler Set Up

Live Update – Running Android
[Screenshot callouts: progress of trace collection and analysis; current processes; current threads; processor load over time; processor exceptions over time]

ARM Profiler: Top Level Report
[Screenshot callouts: top 5 processes, top 5 threads and top 5 functions; time line of instructions, exceptions and memory accesses; time selector with details for the selected time]

Detailed Views
• Look at top functions by time, memory access and delay
• yuv420p_to_rgb565 is the function to optimize
• Analyze in code view, change and profile again!

Example: optimise yuv420_to_rgb565
Original structure:
yuv420_to_rgb565 (image *dst, image *src, int width, int height)
    for (; height >= 2; height -= 2)
        for (; width >= 2; width -= 2)
            Process 4 pixels (2x2 square)
        if odd width → handle last column
    if odd height → handle last row

Restructured, with one specialised function per parity case:
yuv420_to_rgb565 (image *dst, image *src, int width, int height)
    if odd height, odd width → call yuv420_to_rgb565_1
    if odd height, even width → call yuv420_to_rgb565_2
    if even height, odd width → call yuv420_to_rgb565_3
    if even height, even width → call yuv420_to_rgb565_4

yuv420_to_rgb565_4 (image *__restrict dst, image *__restrict src,
                    int width, int height)          //independent pointers
    for (; height > 0; height -= 2)                 //comparison with 0
        for (; width > 0; width -= 2)               //comparison with 0
            Process 4 pixels (2x2 square)           //no checking if odd width
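
The restructuring above can be sketched in compilable C. This is an illustration of the technique only, not the FFmpeg code: the pixel maths is replaced by a hypothetical stand-in (`grey565`, luma only, chroma ignored), the images are plain arrays, and only two variants are shown (an even/even specialisation and a generic fallback) rather than the four on the slide. All function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real conversion: grey level to RGB565. */
uint16_t grey565(uint8_t y)
{
    return (uint16_t)(((y >> 3) << 11) | ((y >> 2) << 5) | (y >> 3));
}

/* Generic version: 2x2 blocks, then the leftover column and row. */
void to_rgb565_generic(uint16_t *dst, const uint8_t *src, int width, int height)
{
    int x, y;

    for (y = 0; y + 2 <= height; y += 2) {
        for (x = 0; x + 2 <= width; x += 2) {
            dst[y * width + x]           = grey565(src[y * width + x]);
            dst[y * width + x + 1]       = grey565(src[y * width + x + 1]);
            dst[(y + 1) * width + x]     = grey565(src[(y + 1) * width + x]);
            dst[(y + 1) * width + x + 1] = grey565(src[(y + 1) * width + x + 1]);
        }
        if (width & 1) {                 /* odd width: x is the last column */
            dst[y * width + x]       = grey565(src[y * width + x]);
            dst[(y + 1) * width + x] = grey565(src[(y + 1) * width + x]);
        }
    }
    if (height & 1) {                    /* odd height: y is the last row */
        for (x = 0; x < width; ++x) {
            dst[y * width + x] = grey565(src[y * width + x]);
        }
    }
}

/* Even width and height only. __restrict (standard C99 spelling: restrict)
 * promises that dst and src do not overlap; both loops count down to 0 and
 * carry no odd-size checks. */
void to_rgb565_even(uint16_t *__restrict dst, const uint8_t *__restrict src,
                    int width, int height)
{
    for (; height > 0; height -= 2) {
        const uint8_t *s0 = src, *s1 = src + width;
        uint16_t *d0 = dst, *d1 = dst + width;

        for (int w = width; w > 0; w -= 2) {
            d0[0] = grey565(s0[0]);
            d0[1] = grey565(s0[1]);
            d1[0] = grey565(s1[0]);
            d1[1] = grey565(s1[1]);
            s0 += 2; s1 += 2; d0 += 2; d1 += 2;
        }
        src += 2 * width;
        dst += 2 * width;
    }
}

/* Dispatch once on parity instead of testing inside the loops. */
void to_rgb565(uint16_t *dst, const uint8_t *src, int width, int height)
{
    if (((width | height) & 1) == 0) {
        to_rgb565_even(dst, src, width, height);
    } else {
        to_rgb565_generic(dst, src, width, height);
    }
}
```

Both paths produce the same pixels for even-sized images, so the specialisation changes only how the loops are expressed to the compiler. Whether it pays off depends on the compiler and target, which is why the slides recommend profiling again after each change.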

Results
• Hot function: 5% higher performance
• Whole application: 3% higher performance

Conclusion
• CoreSight accelerates SoC development & reduces time to market
• CoreSight is available now on all major open platforms
  - e.g. TI OMAP3, Freescale i.MX51, STE Nomadik
  - And on many ASSPs & ASICs
• New CoreSight IP makes on-chip visibility affordable for more developers