How to Optimize Your Mobile Game with ARM Tools and Practical

How to Optimize Your Mobile Game with
ARM Tools and Practical Examples
Lorenzo Dal Col
Product Manager
Development Solutions Group
1
Agenda
1.
Introduction to ARM tools for developers
2.
DS-5 Streamline Performance Analyzer
 Cocos2Dx Case Study
3.
Mali Graphics Debugger
4.
Introduction to ARM® Mali™ GPU ‘Midgard’
Architecture
5.
Analyse the performance of Epic Citadel
 GPU profiling with Streamline
 Find the bottleneck with GPU hardware counters
 Overdraw and frame analysis
 Bandwidth
 Texture compression
6.
2
Q &A
Resources
Enhancing Your Mobile Games
ARM Guide to Unity
malideveloper.arm.com
Video Tutorials
youtube.com/user/ARMflix
Tools
malideveloper.arm.com/tools
3
Importance of Analysis & Debug
Mobile Platforms
 Expectation of amazing console-like graphics and playing experience
 Screen resolution beyond HD
 Limited power budget
Solution
 ARM Cortex® CPUs and Mali GPUs are designed for low power
whilst providing innovative features to keep up performance
 Software developers can be “smart” when developing apps
 Good tools can do the heavy lifting
4
Mali GPU Software Tools
Performance Analysis, Debug, and Software Development
Mali Graphics Debugger
• API Trace & Debug
• OpenGL® ES,
OpenCL™
• Debug and improve
performance at frame
level
OpenGL ES Emulator
• Emulate OpenGL ES
2.0 and 3.0
• Windows and Linux
• Khronos Conformant
5
ARM DS-5 Streamline
Mali GPU
- Timeline
- HW Counters
- OpenCL visualizer
Texture Compression Tool
• Command line and GUI
• ETC, ETC2, ASTC
• 3D textures
Mali Offline Compiler
• Analyze shader
performance
• Generates binary
shaders
• Command line tool
Third Party Tools
Integration with
partners’ tools
Performance Analysis and Debug with Mali GPUs
Debug
Mali Graphics Debugger
6
Analyze
Optimize
DS-5 Streamline
Mali Offline Compiler
ARM DS-5
Streamline Performance Analyzer
7
Timeline: Heat Map
Identify hotspots and system bottlenecks at a glance
Select from CPU/GPU counters
OS level and custom data sources
Select one or more tasks to
isolate their contribution
Accumulate counters, measure time
and find instant hotspots
Combined task switching trace and
sample-based profile
8
Timeline: Core and Cluster Maps
Optimize scheduling and parallelization on multi-core and big.LITTLE™ systems
Thread-based core or cluster mapping
9
Timeline: Mali GPU Graphics Performance Analysis
CPU and GPU fragment and
vertex processing activity
Mali GPU hardware and OpenGL driver counters,
including samples screenshots
Process heatmap focused on ‘GPU Fragment’ activity
10
Profiling Reports
Analysis of call paths, source code and generated assembly
11
(A Very Simple) Case Study
12
The Setup
 Hardware
 Single-core Cortex-A8 CPU + Mali-400 GPU MP4
 Application: Cocos2D TestCpp
 PerformanceTest  PerformanceNodeChildrenTest
 Test suite by Cocos2d-x team
FPS
13
Looking for the Bottleneck
As expected, CPU is maxed out
by the benchmark
Simple profiling shows
memcpy as hostspot
14
Optimizing
52%
faster
15
Mali Graphics Debugger
16
Investigation with the ARM Mali Graphics Debugger
Assets View
Frame Statistics
Frame Outline
Frame Capture:
Framebuffers
States
Uniforms
Vertex Attributes
Buffers
API Trace
Textures
Shaders
Dynamic Help
17
Mali Graphics Debugger
Trace API calls
 Trace OpenGL ES API calls with
arguments, time, process ID, thread ID
OpenGL ES 1.1, 2.0, 3.0, 3.1
EGL
OpenCL
Android Extensions
 Full trace from the start
 Automatic check for GL errors
 Search capabilities
18
Mali Graphics Debugger
Outline view
 Quick access to key graphics events
 Frames
 Render targets
 Draw calls
 Investigate at frame level
 Find what draw calls have higher geometry
impact
 Jump between the outline view and the
trace view seamlessly
19
Mali Graphics Debugger
Target state
 Shows the current state of the API at any
point of the trace
 Every time a new API call is selected in the
trace the state is updated
 Useful to debug problems and understand
causes for performance issues
 Discover when and how a certain state
was changed
20
Mali Graphics Debugger
Trace statistics and analysis
Statistics
High level statistics about the trace
 Per frame
 Per draw call
Trace Analysis
Analyze the trace to find common issues
 Performance warnings
 API misusage
 Highlight API calls causing a problem in the
trace view
21
Mali Graphics Debugger
Uniforms and vertex attributes for each draw call
 s
When a draw call is selected, all the
associated data is available
Uniforms
 Show uniform values, including samplers,
matrices and arrays
Vertex Attributes
 Show all the vertex attributes, their name
and their position
This can be useful to debug graphics issues
22
Mali Graphics Debugger
All the heavy assets are available for debugging, including data buffers and textures
 .
Buffers
 Client and server side buffers are captured
every time they change
 See how each API call affects them
Textures
 All the textures being uploaded are
captured at native resolution
 Check their size, format and type
23
Mali Graphics Debugger
Shaders reports and statistics
All the shaders being used by the application are
reported
Shader statistics
Each shader is compiled with the Mali Offline
Compiler and is statically analyzed to display:
 number of instructions for each GPU pipeline
 number of work registers and uniform registers
Additionally, for each draw call MGD knows how
many times that shader has been executed (i.e.
the number of vertices) and overall statistics are
calculated
24
Mali Graphics Debugger
Frame capture and analysis
Frames can be fully captured to analyze the
effect of each draw call
 A native resolution snapshot of each
framebuffer is captured after every draw call
 The capture happens on target, so target
dependent bugs or precision issues can be
investigated
All the images can be exported and analyzed
separately.
25
Mali Graphics Debugger
Alternative drawing modes
Different drawing modes can be forced and
used both for live rendering and frame
captures
Native mode
 Frames are rendered with the original shaders
Overdraw mode
 Highlights where overdraw happens (ie.
objects are drawn on top of each other)
Shader map mode
 Native shaders are replaced with different
solid colors
Fragment count mode
 All the fragments that are processed by each
frame are counted
26
Mali Graphics Debugger
Daemon application for Android OS
Android visual application that can be
easily installed and runs MGD daemon




27
Start/stop daemon
List all the debuggable processes
Launch application to debug
Display OpenGL ES “strings” and
extensions
ARM Mali GPU Rendering
28
Main Bottlenecks
 CPU
CPU
 Too many draw calls
 Complex physics
 Vertex processing
Vertices
Textures
Uniforms
 Too many vertices
 Too much computation per vertex
 Fragment processing
 Too many fragments, overdraw
 Too much computation per fragment
 Bandwidth
 Big and uncompressed textures
 High resolution framebuffer
 Battery life
 Energy consumption strongly affects User
Experience
29
Memory
Vertices
Uniforms
Triangles
Varyings
Vertex
Shader
Textures
Uniforms
Varyings
Pixels
Fragment
Shader
ARM Mali GPU Rendering
 Mali is a tile-based deferred rendering
architecture
30
ARM Mali GPU Rendering
 Mali is a tile-based deferred rendering
architecture
 The framebuffer is divided into tiles
 The GPU renders the framebuffer tile
by tile
31
ARM Mali GPU Rendering
 Mali is a tile-based deferred rendering
architecture
 The framebuffer is divided into tiles
 The GPU renders the framebuffer tile
by tile
 The tile size on Mali is 16x16 pixels,
stored in fast, on-chip memory
 Color
 Depth
 Stencil
32
JS0 cycles
JS0 jobs and tasks
Mali arithmetic pipe
Mali load/store pipe
Mali texture pipe
Tile-Based Rendering
Primitives loaded
Primitives dropped
Mali Fragment Quads
Quads rasterized
Quads doing early ZS
Quads killed early Z
JS1 cycles
JS1 jobs and tasks
Arithmetic pipe
ARM Mali load/store pipe
Vertex
Shader
Vertices
Polygon List
Builder
Rasterizer
Mali Fragment Tasks
Tiles rendered
Tile writes killed by TE
Mali Core Threads
Frag threads doing late ZS
…
Fragment
Shader
Polygon
Lists
Textures
L2 GPU Cache
Mali L2 Cache
External write/read
beats
External bus stalls
Memory
33
Z-Test
Blend &
Resolve
Framebuffer
ARM Mali-T860 GPU Tripipe
Tripipe Cycles
 Arithmetic instructions
 Load & Store instructions
 Texture instructions
 Instructions can run in parallel
 Each one can be a bottleneck
 There are two arithmetic pipelines so we should
aim to increase the arithmetic workload
34
Epic Citadel
35
CPU Activity ➞ User 24%
GPU Fragment ➞ Activity 99%
GPU Vertex➞ Activity 44%
36
The Application is GPU Bound
The CPU has to wait until the fragment processing has finished
12ms
28ms
12ms
While rendering the most complicated scene, the application is capable of 36 fps (29ms/frame)
37
CPU Bound
CPU
Memory
Vertex
Shader
38
Fragment
Shader
CPU Bound
Synchronous Rendering
 Mali GPU is a deferred architecture
 Do not force a pipeline flush by reading
back data (glReadPixels, glFinish, etc.)
 Reduce the amount of draw calls
 Try to combine your draw calls together
 Offload some of the work to the GPU
 Move physics from CPU to GPU
 Avoid unnecessary OpenGL ES calls
(glGetError, redundant stage changes,
etc.)
39
Deferred Rendering
GPU Activity Analysis
40
Inspect the Tripipe Counters
Reduce the load on the L/S pipeline
GPU Cycles 448m
Tripipe Cycles 444m
Load & Store 408m
Texture 197m
Arithmetic 105m
41
Tripipe Counters
Cycles per instruction metrics
 It’s easy to calculate a couple of CPI (cycles per instruction) metrics:
 For the load/store pipeline we have:
408m (Mali Load/Store Pipe ➞ LS instruction issues)
/ 195m (Mali Load/Store Pipe ➞ LS instructions)
= 2.09 cycles/instruction
100%
80%
60%
40%
 For the texture pipeline we have:
197m (Mali Texture Pipe ➞ T instruction issues)
/ 169m (Mali Texture Pipe ➞ T instructions)
= 1.16 cycles/instruction
42
Stalls
Instructions
20%
0%
Load &
Store
Pipeline
Texture
Pipeline
Vertex Bound
CPU
Memory
Vertex
Shader
43
Fragment
Shader
Vertex Bound
 Get your artist to remove unnecessary
vertices
 LOD switching
 Only objects near the camera need to be
in high detail
 Use culling
 Too many cycles in the vertex shader
44
Vertex Count and Shader Optimizations
Identify the top heavyweight vertex shaders
Vertex Cycles Per Program
19%
35%
13%
33%
45
Program 175
Program 280
Program 187
Others
Fragment Bound
CPU
Memory
Vertex
Shader
46
Fragment
Shader
Fragment Bound
 Render to a smaller framebuffer
 Move computation from the fragment
to the vertex shader (use HW
interpolation)
 Drawing your objects front to back
instead of back to front reduces
overdraw
 Reduce the amount of transparency in
the scene
47
Overdraw
 This is when you draw to each pixel on the screen
more than once
 Drawing your objects front to back instead of back to
front reduces overdraw
 The transparent objects must be drawn back to
front for a correct ordering
 Limiting the amount of transparency in the scene can
help
 For OpenGL ES 3.0, avoid modifying the depth
buffer from the fragment shader or you will not
benefit from the early Z testing, making the front
to back sorting useless
48
Overdraw Factor
 We divide the number of output pixels
by the number of fragments, each
rendered fragment corresponds to one
fragment thread and each tile is 16x16
pixels, thus in our case:
49
90.7m (Mali Core Threads ➞ Fragment threads)
/ 143K (Mali Fragment Tasks ➞ Tiles rendered) x 256
= 2.48 threads/pixel
Frame Capture
50
Frame Analysis
Check the overdraw factor
1x
3-5x
8x
3-5x
5-7x
2x
51
Shader Map and Fragment Count
Identify the top heavyweight fragment shaders
Fragment Count Per Program
~10m instances
/ (2560×1600) pixel
= 2.44
4%7%
Program 175
14%
Program 280
Program 181
75%
52
Others
Shader Optimization
 Since the arithmetic workload is not
very big, we should reduce the number
of uniforms and varyings and calculate
them on-the-fly
 Reduce their size
 Reduce their precision: is highp always
necessary?
 Use the Mali Offline Shader Compiler!
http://malideveloper.arm.com/develop-formali/tools/analysis-debug/mali-gpu-offline-shadercompiler/
53
Bandwidth Bound
CPU
Memory
Vertex
Shader
54
Fragment
Shader
Bandwidth
 When creating embedded graphics
 The application is not bandwidth bound as
applications, bandwidth is a scarce
resource
it performs, over a period of one second:
 A typical embedded device can handle 5.0
Gigabytes a second of bandwidth
 A typical desktop GPU can do in excess of
100 Gigabytes a second
(96m (Mali L2 Cache ➞ External read beats) +
90.7m (Mali L2 Cache ➞ External write beats)) x 16
~= 2.9 GB/s
 Since bandwidth usage is related to energy
consumption it’s always worth optimizing it
55
Bandwidth Bound
Vertices
 Reduce the number of vertices and
varyings
 Interleave vertices, normals, texture
coordinates
 Use Vertex Buffer Objects
Fragments
 Use texture compression
 Enable texture mipmapping
This will also cause a better cache
utilization.
56
Texture Compression Formats
 ETC
 4bpp
 RGB no alpha channel
 ETC2
 4bpp
 Backward compatible
 RGB also handles alpha & punch through
 ASTC
 0.8bpp to 8bpp
 Supports RGB, RGB alpha, luminance,
luminance alpha, normal maps
 Also supports HDR
 Also supports 3D textures
57
Textures Analysis in Epic Citadel
Mali Graphics Debugger shows size and compression format
Uncompressed
Texture Weight by Dimension
(Uncompressed RGBA)
3% 3%
ETC1
ASTC 5x5
944 MB
Other
256 x 256
512 x 384
28%
512 x 512
1024 x 1024
2%
64%
2560 x 1504
2048 x 2048
236 MB
151 MB
Total Texture Memory
58
Resources
Enhancing Your Mobile Games
ARM Guide to Unity
http://malideveloper.arm.com/develop-formali/tutorials-developer-guides/developerguides/arm-guide-unity-enhancing-mobilegames/
Video Tutorials
http://www.youtube.com/user/ARM
flix
Tools
http://malideveloper.arm.com/devel
op-for-mali/tools/
59
To Find Out More….
 ARM Booth #1624 on Expo Floor
 Live demos of tools covered in this session
 In-depth Q&A with ARM engineers
 More tech talks at the ARM Lecture Theatre
 http://malideveloper.arm.com/GDC2015
 Revisit this talk in PDF and video format post GDC
 Download the tools and resources
60
Enhancing your Unity mobile game
Thurs 4PM; West Hall 3014
Learn how to get the most out of Unity when developing under the unique
challenges of mobile platforms.
61