Armstrong Usecases (.ppt)

Measuring the Whole System
Holistic Profiling of CPU and GPU for Optimal Vision
Applications on ARM Platforms
Tim Hartley
COPYRIGHT © 2015 ARM
The Evolution of Mobile GPU Compute
[Timeline figure; axis marks: 2007, 2009, 2010, 2012, 2013-2015]
API milestones, oldest first:
• OpenGL® ES 1.1: fixed pipeline
• OpenGL ES 2.0: programmable pipeline
• OpenCL™ Full Profile / RenderScript: portable heterogeneous parallel computation
• OpenGL ES 3.1 compute shaders: GPU compute within the OpenGL ES API
GPU generations, oldest first: ARM® Mali™-55; Mali-200 and Mali-300; Mali-400 MP and Mali-450 MP; Mali-T600 Series; Mali-T700 & T800 Series
Measuring the Whole System
• Computer vision will, for some time, succeed in using every drop of processing power we give it
  - And techniques in computer vision are still evolving rapidly
  - New, complex, sustained low power use cases
• Building computer vision applications is an ever more complex process
  - The availability of more processors and processor types makes this even more so
Capturing and analyzing accurate and effective measurements from platforms plays a vital role in achieving optimal performance
Modern Computer Vision Applications
[Diagram: a vision application spread across four CPU cores (each with NEON), a GPU and a DSP]
Inside an ARM Mali Midgard Core
• SIMD: Several components per operation
  - 128-bit registers
• VLIW: Several operations per instruction word
  - Some operations are “free”
• Built in function library
  - Accelerated in hardware
T = max(A0, A1, LS, Tex)
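The relation above reads as: the arithmetic pipes (A0, A1), the load/store pipe (LS) and the texture pipe (Tex) work in parallel, so the cycle count T is set by whichever pipe is busiest. A minimal Python sketch of that model follows. The function name `estimate_cycles` and the cycle figures in the usage lines are hypothetical, chosen only for illustration; they are not measurements.

```python
def estimate_cycles(a0, a1, ls, tex):
    """Return (T, limiting_pipe) for the model T = max(A0, A1, LS, Tex).

    Each argument is the number of cycles that pipe needs for the workload.
    """
    pipes = {"A0": a0, "A1": a1, "LS": ls, "Tex": tex}
    # The busiest pipe bounds the total; ties resolve to the first pipe listed.
    limiting = max(pipes, key=pipes.get)
    return pipes[limiting], limiting


if __name__ == "__main__":
    # Hypothetical per-pipe cycle counts for one kernel.
    cycles, pipe = estimate_cycles(a0=12, a1=10, ls=20, tex=4)
    print(cycles, pipe)
```

Under this model, work added to a pipe that is not the limiting one costs no extra cycles until it overtakes the busiest pipe.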
Hardware Counters
• Counters per core
  - Active cycles
  - Pipe activity
  - L1 cache
• Counters for the GPU
  - Active cycles
  - L2 caches
  - MMU
• Accessed through DS-5 Streamline
  - Timeline of all hardware counters, and more
  - Explore the execution of the full application
  - Zoom in on details
DS-5 Streamline
Identify hotspots and system bottlenecks at a glance
Select from CPU/GPU counters, OS level and custom data sources
Select one or more tasks to isolate their contribution
Accumulate counters, measure time and find instant hotspots
Combined task switching trace and sample-based profile
Example: Complex Computer Vision Application
Lane and Car Detection
Streamline
Streamline: OpenCL Timeline
Optimisation Overview
[Flowchart, reconstructed as a decision list]
• Limited by kernel execution time or mem management?
  - Memory: ensure you are not copying memory unnecessarily
  - Kernel: limited by Arith ops or Mem ops?
• Arithmetic: limited to 64 threads? Large no. of register bank conflicts? Large no. of instruction cache misses?
  - Yes: reduce register pressure. Simplify or shorten kernels.
  - No: vectorise the kernel if possible. Decrease the arith work if possible.
• Mem ops: limited to 64 threads? Large no. of instruction cache misses?
  - Yes: reduce register pressure. Simplify or shorten kernels.
  - No: high number of instruction re-issues?
    Yes: improve memory access pattern to improve cache efficiency.
    No: vectorise the LS operations if possible. Decrease mem accesses if possible.
• After each change, measure again: limited by same factors? Reiterate until done optimising.
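The decision flow on this slide can be sketched as a small Python function. This is one reading of the flowchart, not ARM's tooling: the function name `next_step`, its parameter names and the order of the checks inside each branch are assumptions made for illustration.

```python
def next_step(limited_by, thread_limited=False, bank_conflicts=False,
              icache_misses=False, ls_reissues=False):
    """Return the suggested action for one profiling pass.

    limited_by: "memory_management", "arithmetic" or "load_store"
    (hypothetical labels for the flowchart's three branches).
    """
    reduce_pressure = "Reduce register pressure; simplify or shorten kernels"
    if limited_by == "memory_management":
        return "Ensure you are not copying memory unnecessarily"
    if limited_by == "arithmetic":
        if thread_limited or bank_conflicts or icache_misses:
            return reduce_pressure
        return "Vectorise the kernel; decrease the arithmetic work"
    if limited_by == "load_store":
        if thread_limited or icache_misses:
            return reduce_pressure
        if ls_reissues:
            return "Improve the memory access pattern for cache efficiency"
        return "Vectorise the LS operations; decrease memory accesses"
    raise ValueError(f"unknown limiting factor: {limited_by!r}")


if __name__ == "__main__":
    print(next_step("load_store", ls_reissues=True))
```

In practice the inputs would come from a fresh profile after every change, since fixing one bottleneck often exposes another.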
Deriving Meaning from Hardware Counters
• Counters on their own usually don’t mean a huge amount
• Combining counters is more useful
  - Comparing values to determine limiting pipes
  - Calculating more meaningful values from multiple values
• New graph traces can be added from these counters
  - …and become an integral part of the timeline
Custom Charts: Bringing Counters Together
100 * $MaliCoreCyclesTripipeCycles / $MaliJobManagerCyclesGPUCycles
100 * $MaliArithmeticPipeAInstructions / $MaliCoreCyclesTripipeCycles
100 * $MaliLoadStorePipeLSInstructionIssues / $MaliCoreCyclesTripipeCycles
100 * $MaliLoadStorePipeLSInstructions / $MaliLoadStorePipeLSInstructionIssues
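To make the four expressions concrete, the Python sketch below evaluates them on one set of counter values. The counter names are the ones on the slide (without the `$` prefix); the chart labels and the sample values are hypothetical. The interpretations in the comments are a plausible reading of the ratios, not definitions taken from the Streamline documentation.

```python
def derived_charts(c):
    """Compute the slide's four percentages from a dict of raw counters.

    Assumes every denominator is non-zero.
    """
    tripipe = c["MaliCoreCyclesTripipeCycles"]
    ls_issues = c["MaliLoadStorePipeLSInstructionIssues"]
    return {
        # Share of GPU cycles in which the tri-pipe was active.
        "tripipe_active_pct": 100 * tripipe / c["MaliJobManagerCyclesGPUCycles"],
        # Arithmetic instructions per tri-pipe cycle, as a percentage.
        "arith_pct": 100 * c["MaliArithmeticPipeAInstructions"] / tripipe,
        # Load/store issues per tri-pipe cycle, as a percentage.
        "ls_issue_pct": 100 * ls_issues / tripipe,
        # Instructions per issue: a low value suggests many re-issues.
        "ls_instr_per_issue_pct": 100 * c["MaliLoadStorePipeLSInstructions"] / ls_issues,
    }


if __name__ == "__main__":
    sample = {  # hypothetical counter values for one time slice
        "MaliJobManagerCyclesGPUCycles": 1000,
        "MaliCoreCyclesTripipeCycles": 800,
        "MaliArithmeticPipeAInstructions": 400,
        "MaliLoadStorePipeLSInstructionIssues": 600,
        "MaliLoadStorePipeLSInstructions": 300,
    }
    for name, value in derived_charts(sample).items():
        print(f"{name}: {value:.1f}")
```

Comparing the arithmetic and load/store percentages side by side is one way to see which pipe is closer to saturation in a given time slice.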
ALU Bound kernel
• One load
• One store
• “n” ALU operations
__kernel void kernel_alu_bound(global float* arr, uint n)
{
    // One load per work-item
    float value = arr[get_global_id(0)];
    // "n" dependent ALU operations
    for (uint i = 0; i < n; i++)
    {
        value += sin(value);
    }
    // One store per work-item
    arr[get_global_id(0)] = value;
}
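When experimenting with a kernel like this one, a host-side reference helps confirm that the GPU output is still correct after each optimisation. The sketch below mirrors the kernel's arithmetic in plain Python. The function name `alu_bound_reference` is hypothetical, and because Python computes in double precision while the kernel uses `float`, results should be compared with a tolerance rather than for exact equality.

```python
import math


def alu_bound_reference(arr, n):
    """Apply value += sin(value) n times to every element, like the kernel."""
    out = []
    for value in arr:
        for _ in range(n):
            value += math.sin(value)
        out.append(value)
    return out


if __name__ == "__main__":
    print(alu_bound_reference([0.5, 1.0, 2.0], 10))
```

For inputs between 0 and 2π the iteration converges quickly to π, which gives an easy sanity check on the output buffer.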
L/S Bound kernel
• One load
• One store
• No ALU operation
__kernel void kernel_memcpy(global float *a, global float *b)
{
    // Each work-item copies one float4; load and store use the same offset
    float4 v = vload4(get_global_id(0), a);
    vstore4(v, get_global_id(0), b);
}
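For a load/store-bound kernel such as this one, a useful derived figure is the effective memory bandwidth. The sketch below assumes each work-item reads one `float4` (16 bytes) and writes one `float4`. The function name `memcpy_bandwidth_gbps` and the numbers in the usage lines are hypothetical; the kernel time would have to come from a real measurement, for example the OpenCL timeline or OpenCL event profiling.

```python
def memcpy_bandwidth_gbps(work_items, seconds, bytes_per_item=16):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes).

    Counts bytes read plus bytes written: 2 * work_items * bytes_per_item.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total_bytes = 2 * work_items * bytes_per_item
    return total_bytes / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: one million work-items in 4 ms.
    print(f"{memcpy_bandwidth_gbps(1_000_000, 0.004):.2f} GB/s")
```

Comparing this figure with the platform's peak memory bandwidth shows how close the kernel is to the limit of the memory system.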
Cache misses
• One byte read every 64 bytes
• One byte written every 64 bytes
• Really bad cache utilisation!
__kernel void kernel_cache_misses(global uchar *a,
                                  global uchar *b)
{
    b[64 * get_global_id(0)] = a[64 * get_global_id(0)];
}
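The cost of this access pattern can be put into numbers. The sketch below estimates what fraction of each fetched cache line a strided kernel actually uses. It assumes a 64-byte cache line, which matches the 64-byte stride on the slide but is not stated there, and it assumes accesses are aligned and never straddle two lines. The function name `cache_line_utilisation` is hypothetical.

```python
def cache_line_utilisation(bytes_used, stride_bytes, line_bytes=64):
    """Fraction (0..1) of fetched cache-line bytes that are actually used.

    bytes_used: bytes touched per access; stride_bytes: distance between
    consecutive accesses; line_bytes: assumed cache line size.
    """
    # With a stride of at least one line, every access pulls in a whole line.
    # With a smaller stride, neighbouring accesses share the same line.
    span = min(stride_bytes, line_bytes)
    return min(1.0, bytes_used / span)


if __name__ == "__main__":
    # The kernel above: 1 byte used out of every 64-byte line.
    print(f"{100 * cache_line_utilisation(1, 64):.2f}%")
    # A contiguous byte copy for comparison.
    print(f"{100 * cache_line_utilisation(1, 1):.2f}%")
```

Under these assumptions the kernel above uses 1/64 of every line it fetches, roughly 1.6%, so almost all of the memory traffic is wasted.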
What does good whole-system optimisation look like?
Conclusions
• Computer vision applications need careful optimisation
  - Understanding your system as a whole is a vital first step
  - Understanding each individual processor core type is the next
• Use tools to measure hardware counters across the entire platform
  - Whole-system views of the relative performance of heterogeneous architectures are invaluable
  - They allow you to decide where there is capacity to move workloads
  - And how to target optimisations by exposing the limiting component within individual cores
  - Ideally, use these tools throughout the development process, not just at the end
• The Mali Ecosystem is making GPU Compute a reality today
  - ARM enables developers with platforms, drivers, tools and support
  - Industry leaders take advantage of ARM Mali GPU capabilities to innovate and deliver
  - Be one of them!
Tomorrow at the EVA Summit, 4pm:
“Understanding the Role of Integrated GPUs in Vision Applications”, Roberto Mijat
Ecosystem Resources
• www.malideveloper.com
  - Download guides, papers, tools (including DS-5 Streamline), etc.
• http://community.arm.com/welcome
  - Community forums, blogs and more
• [email protected]
  - Graphics and GPU Compute developer support
• http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/
  - A range of video and written tutorials for GPU Compute
• http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/
  - ARM Mali-T600 series GPU user-space binary drivers available for download
  - Linaro BSP now available with Mali-T600 series GPU support