How to Optimize Your Mobile Game with ARM Tools and Practical Examples
Lorenzo Dal Col
Product Manager, Development Solutions Group

Agenda
1. Introduction to ARM tools for developers
2. DS-5 Streamline Performance Analyzer
  • Cocos2d-x Case Study
3. Mali Graphics Debugger
4. Introduction to ARM® Mali™ GPU ‘Midgard’ Architecture
5. Analyse the performance of Epic Citadel
  • GPU profiling with Streamline
  • Find the bottleneck with GPU hardware counters
  • Overdraw and frame analysis
  • Bandwidth
  • Texture compression
6. Q&A

Resources
• Enhancing Your Mobile Games (ARM Guide to Unity): malideveloper.arm.com
• Video Tutorials: youtube.com/user/ARMflix
• Tools: malideveloper.arm.com/tools

Importance of Analysis & Debug
Mobile Platforms
• Expectation of amazing console-like graphics and playing experience
• Screen resolution beyond HD
• Limited power budget
Solution
• ARM Cortex® CPUs and Mali GPUs are designed for low power whilst providing innovative features to keep up performance
• Software developers can be “smart” when developing apps
• Good tools can do the heavy lifting

Mali GPU Software Tools
Performance Analysis, Debug, and Software Development
Mali Graphics Debugger
• API Trace & Debug
• OpenGL® ES, OpenCL™
• Debug and improve performance at frame level
OpenGL ES Emulator
• Emulate OpenGL ES 2.0 and 3.0
• Windows and Linux
• Khronos Conformant
ARM DS-5 Streamline
• Mali GPU: Timeline, HW Counters, OpenCL visualizer
Texture Compression Tool
• Command line and GUI
• ETC, ETC2, ASTC
• 3D textures
Mali Offline Compiler
• Analyze shader performance
• Generates binary shaders
• Command line tool
Third Party Tools
• Integration with partners’ tools

Performance Analysis and Debug with Mali GPUs
• Debug, Analyze, Optimize
• Mali Graphics Debugger, DS-5 Streamline, Mali Offline Compiler

ARM DS-5 Streamline Performance Analyzer

Timeline: Heat Map
• Identify hotspots and system bottlenecks at a glance
• Select from CPU/GPU counters, OS level and custom data sources
• Select one or more tasks to isolate their contribution
• Accumulate counters, measure time and find instant hotspots
• Combined task switching trace and sample-based profile

Timeline: Core and Cluster Maps
• Optimize scheduling and parallelization on multi-core and big.LITTLE™ systems
• Thread-based core or cluster mapping

Timeline: Mali GPU Graphics Performance Analysis
• CPU and GPU fragment and vertex processing activity
• Mali GPU hardware and OpenGL driver counters, including samples screenshots
• Process heatmap focused on ‘GPU Fragment’ activity

Profiling Reports
• Analysis of call paths, source code and generated assembly

(A Very Simple) Case Study

The Setup
Hardware
• Single-core Cortex-A8 CPU + Mali-400 GPU MP4
Application: Cocos2D TestCpp
• PerformanceTest
• PerformanceNodeChildrenTest
• Test suite by Cocos2d-x team
FPS

Looking for the Bottleneck
• As expected, the CPU is maxed out by the benchmark
• Simple profiling shows memcpy as the hotspot

Optimizing
• 52% faster

Mali Graphics Debugger

Investigation with the ARM Mali Graphics Debugger
[Screenshot labels: Assets View, Frame Statistics, Frame Outline, Frame Capture: Framebuffers, States, Uniforms, Vertex Attributes, Buffers, API Trace, Textures, Shaders, Dynamic Help]

Mali Graphics Debugger: Trace API calls
• Trace OpenGL ES API calls with arguments, time, process ID, thread ID
  • OpenGL ES 1.1, 2.0, 3.0, 3.1
  • EGL
  • OpenCL
  • Android Extensions
• Full trace from the start
• Automatic check for GL errors
• Search capabilities

Mali Graphics Debugger: Outline view
• Quick access to key graphics events
  • Frames
  • Render targets
  • Draw calls
• Investigate at frame level
• Find which draw calls have the highest geometry impact
• Jump between the outline view and the trace view seamlessly

Mali Graphics Debugger: Target state
• Shows the current state of the API at any point of the trace
• Every time a new API call is selected in the trace the state is updated
• Useful to debug problems and understand causes for performance issues
• Discover when and how a certain state was changed

Mali Graphics Debugger: Trace statistics and analysis
Statistics
• High level statistics about the trace
• Per frame
• Per draw call
Trace Analysis
• Analyze the trace to find common issues
  • Performance warnings
  • API misuse
• Highlight API calls causing a problem in the trace view

Mali Graphics Debugger: Uniforms and vertex attributes for each draw call
• When a draw call is selected, all the associated data is available
Uniforms
• Show uniform values, including samplers, matrices and arrays
Vertex Attributes
• Show all the vertex attributes, their name and their position
• This can be useful to debug graphics issues

Mali Graphics Debugger: Buffers and textures
• All the heavy assets are available for debugging, including data buffers and textures
Buffers
• Client and server side buffers are captured every time they change
• See how each API call affects them
Textures
• All the textures being uploaded are captured at native resolution
• Check their size, format and type

Mali Graphics Debugger: Shaders reports and statistics
• All the shaders being used by the application are reported
Shader statistics
• Each shader is compiled with the Mali Offline Compiler and is statically analyzed to display:
  • number of instructions for each GPU pipeline
  • number of work registers and uniform registers
• Additionally, for each draw call MGD knows how many times that shader has been executed (i.e. the number of vertices) and overall statistics are calculated

Mali Graphics Debugger: Frame capture and analysis
• Frames can be fully captured to analyze the effect of each draw call
• A native resolution snapshot of each framebuffer is captured after every draw call
• The capture happens on target, so target dependent bugs or precision issues can be investigated
• All the images can be exported and analyzed separately

Mali Graphics Debugger: Alternative drawing modes
• Different drawing modes can be forced and used both for live rendering and frame captures
Native mode
• Frames are rendered with the original shaders
Overdraw mode
• Highlights where overdraw happens (i.e. objects are drawn on top of each other)
Shader map mode
• Native shaders are replaced with different solid colors
Fragment count mode
• All the fragments that are processed by each frame are counted

Mali Graphics Debugger: Daemon application for Android OS
• Android visual application that can be easily installed and runs the MGD daemon
• Start/stop daemon
• List all the debuggable processes
• Launch application to debug
• Display OpenGL ES “strings” and extensions

ARM Mali GPU Rendering

Main Bottlenecks
CPU
• Too many draw calls
• Complex physics
Vertex processing
• Too many vertices
• Too much computation per vertex
Fragment processing
• Too many fragments, overdraw
• Too much computation per fragment
Bandwidth
• Big and uncompressed textures
• High resolution framebuffer
Battery life
• Energy consumption strongly affects User Experience
[Diagram: CPU, Memory, Vertex Shader and Fragment Shader. Labelled data: Vertices, Textures, Uniforms, Triangles, Varyings, Pixels]

ARM Mali GPU Rendering
• Mali is a tile-based deferred rendering architecture
• The framebuffer is divided into tiles
• The GPU renders the framebuffer tile by tile
• The tile size on Mali is 16x16 pixels, stored in fast, on-chip memory
  • Color
  • Depth
  • Stencil

Tile-Based Rendering
[Diagram: Vertex Shader, Polygon List Builder, Rasterizer, Fragment Shader, Z-Test, Blend & Resolve, Framebuffer, with Vertices, Polygon Lists and Textures passing through the L2 GPU Cache to Memory. Each stage is annotated with its hardware counters:]
• JS0 cycles, JS0 jobs and tasks
• JS1 cycles, JS1 jobs and tasks
• Mali arithmetic pipe, Mali load/store pipe, Mali texture pipe
• Primitives loaded, Primitives dropped
• Mali Fragment Quads: Quads rasterized, Quads doing early ZS, Quads killed early Z
• Mali Fragment Tasks: Tiles rendered, Tile writes killed by TE
• Mali Core Threads: Frag threads doing late ZS, …
• Mali L2 Cache: External write/read beats, External bus stalls

ARM Mali-T860 GPU Tripipe
Tripipe Cycles
• Arithmetic instructions
• Load & Store instructions
• Texture instructions
• Instructions can run in parallel
• Each one can be a bottleneck
• There are two arithmetic pipelines so we should aim to increase the arithmetic workload

Epic Citadel

• CPU Activity ➞ User: 24%
• GPU Fragment ➞ Activity: 99%
• GPU Vertex ➞ Activity: 44%

The Application is GPU Bound
• The CPU has to wait until the fragment processing has finished
[Timeline annotations: 12ms, 28ms, 12ms]
• While rendering the most complicated scene, the application is capable of 36 fps (29ms/frame)

CPU Bound
[Diagram: CPU, Memory, Vertex Shader, Fragment Shader]

CPU Bound
[Figure label: Synchronous Rendering]
• Mali GPU is a deferred architecture
  • Do not force a pipeline flush by reading back data (glReadPixels, glFinish, etc.)
• Reduce the number of draw calls
  • Try to combine your draw calls together
• Offload some of the work to the GPU
  • Move physics from CPU to GPU
• Avoid unnecessary OpenGL ES calls (glGetError, redundant state changes, etc.)
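As a rough illustration of the "combine your draw calls" advice on the CPU Bound slide, the sketch below groups draw items that share the same render state so that each group could be submitted as one call. It is not taken from the talk: `DrawItem`, its fields and `batch_draws` are hypothetical names, and no real OpenGL ES call is made.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class DrawItem:
    """One object to draw. All fields are hypothetical, for illustration only."""
    shader: str
    texture: str
    vertex_count: int


def state_key(item):
    """Render state that must match for two items to share a draw call."""
    return (item.shader, item.texture)


def batch_draws(items):
    """Sort items by render state, then merge items with identical state.

    Returns one (shader, texture, total_vertex_count) tuple per draw call.
    """
    batches = []
    for (shader, texture), group in groupby(sorted(items, key=state_key), key=state_key):
        batches.append((shader, texture, sum(i.vertex_count for i in group)))
    return batches


scene = [
    DrawItem("lit", "stone", 300),
    DrawItem("lit", "wood", 120),
    DrawItem("lit", "stone", 450),
    DrawItem("unlit", "sky", 6),
    DrawItem("lit", "wood", 80),
]
batches = batch_draws(scene)
print(f"{len(scene)} draw calls before batching, {len(batches)} after")
for shader, texture, vertices in batches:
    print(f"  shader={shader} texture={texture} vertices={vertices}")
```

Sorting purely by state suits opaque geometry only; transparent objects still need back-to-front ordering, so a real renderer would batch within those constraints.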
[Figure label: Deferred Rendering]

GPU Activity Analysis

Inspect the Tripipe Counters
• Reduce the load on the L/S pipeline
  GPU Cycles       448m
  Tripipe Cycles   444m
  Load & Store     408m
  Texture          197m
  Arithmetic       105m

Tripipe Counters: Cycles per instruction metrics
It’s easy to calculate a couple of CPI (cycles per instruction) metrics:
• For the load/store pipeline we have:
  408m (Mali Load/Store Pipe ➞ LS instruction issues) / 195m (Mali Load/Store Pipe ➞ LS instructions) = 2.09 cycles/instruction
• For the texture pipeline we have:
  197m (Mali Texture Pipe ➞ T instruction issues) / 169m (Mali Texture Pipe ➞ T instructions) = 1.16 cycles/instruction
[Bar chart: Stalls and Instructions as a share (0% to 100%) for the Load & Store Pipeline and the Texture Pipeline]

Vertex Bound
[Diagram: CPU, Memory, Vertex Shader, Fragment Shader]

Vertex Bound
• Get your artist to remove unnecessary vertices
• LOD switching
  • Only objects near the camera need to be in high detail
• Use culling
• Too many cycles in the vertex shader

Vertex Count and Shader Optimizations
• Identify the top heavyweight vertex shaders
[Pie chart: Vertex Cycles Per Program. Legend: Program 175, Program 280, Program 187, Others. Slice values: 19%, 35%, 13%, 33%]

Fragment Bound
[Diagram: CPU, Memory, Vertex Shader, Fragment Shader]

Fragment Bound
• Render to a smaller framebuffer
• Move computation from the fragment to the vertex shader (use HW interpolation)
• Drawing your objects front to back instead of back to front reduces overdraw
• Reduce the amount of transparency in the scene

Overdraw
• This is when you draw to each pixel on the screen more than once
• Drawing your objects front to back instead of back to front reduces overdraw
• The transparent objects must be drawn back to front for a correct ordering
• Limiting the amount of transparency in the scene can help
• For OpenGL ES 3.0, avoid modifying the depth buffer from the fragment shader or you will not benefit from early Z testing, making the front to back sorting useless

Overdraw Factor
We divide the number of fragment threads by the number of output pixels. Each rendered fragment corresponds to one fragment thread and each tile is 16x16 pixels, thus in our case:
  90.7m (Mali Core Threads ➞ Fragment threads) / (143K (Mali Fragment Tasks ➞ Tiles rendered) x 256) = 2.48 threads/pixel

Frame Capture

Frame Analysis
• Check the overdraw factor
[Figure: captured frame with regions annotated by overdraw factor: 1x, 3-5x, 8x, 3-5x, 5-7x, 2x]

Shader Map and Fragment Count
• Identify the top heavyweight fragment shaders
• Fragment Count Per Program: ~10m instances / (2560×1600) pixels = 2.44
[Pie chart: Fragment Count Per Program. Legend: Program 175, Program 280, Program 181, Others. Slice values: 4%, 7%, 14%, 75%]

Shader Optimization
• Since the arithmetic workload is not very big, we should reduce the number of uniforms and varyings and calculate them on-the-fly
• Reduce their size
• Reduce their precision: is highp always necessary?
• Use the Mali Offline Shader Compiler!
  http://malideveloper.arm.com/develop-for-mali/tools/analysis-debug/mali-gpu-offline-shader-compiler/

Bandwidth Bound
[Diagram: CPU, Memory, Vertex Shader, Fragment Shader]

Bandwidth
• When creating embedded graphics applications, bandwidth is a scarce resource
  • A typical embedded device can handle 5.0 Gigabytes a second of bandwidth
  • A typical desktop GPU can do in excess of 100 Gigabytes a second
• The application is not bandwidth bound as it performs, over a period of one second:
  (96m (Mali L2 Cache ➞ External read beats) + 90.7m (Mali L2 Cache ➞ External write beats)) x 16 ~= 2.9 GB/s
• Since bandwidth usage is related to energy consumption it’s always worth optimizing it

Bandwidth Bound
Vertices
• Reduce the number of vertices and varyings
• Interleave vertices, normals, texture coordinates
• Use Vertex Buffer Objects
Fragments
• Use texture compression
• Enable texture mipmapping
This also improves cache utilization.
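The counter arithmetic used in the analysis slides (cycles per instruction, overdraw factor, external bandwidth) can be reproduced in a few lines. This is a minimal sketch using the rounded counter values quoted in the slides; the function and constant names are illustrative, and reading the "x 16" factor as 16 bytes per bus beat is an assumption, since the slides do not state the unit.

```python
TILE_PIXELS = 16 * 16   # tile size stated in the slides: 16x16 pixels
BYTES_PER_BEAT = 16     # assumption: the "x 16" factor is bytes per bus beat


def cycles_per_instruction(issue_cycles, instructions):
    """Pipeline issue cycles divided by instructions executed."""
    return issue_cycles / instructions


def overdraw_factor(fragment_threads, tiles_rendered):
    """Fragment threads per output pixel, with pixels = tiles x 256."""
    return fragment_threads / (tiles_rendered * TILE_PIXELS)


def external_bandwidth_bytes(read_beats, write_beats):
    """Bytes moved over the external bus during the sampled period."""
    return (read_beats + write_beats) * BYTES_PER_BEAT


# Counter values as quoted in the slides ("m" = millions, "K" = thousands).
ls_cpi = cycles_per_instruction(408e6, 195e6)
tex_cpi = cycles_per_instruction(197e6, 169e6)
overdraw = overdraw_factor(90.7e6, 143e3)
bandwidth = external_bandwidth_bytes(96e6, 90.7e6)

print(f"load/store CPI: {ls_cpi:.2f} cycles/instruction")
print(f"texture CPI:    {tex_cpi:.2f} cycles/instruction")
print(f"overdraw:       {overdraw:.2f} threads/pixel")
print(f"bandwidth:      {bandwidth / 1e9:.2f} GB over one second")
```

Because the inputs are rounded to three significant figures, the last digit can differ slightly from the figures on the slides.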

Texture Compression Formats
ETC
• 4bpp
• RGB, no alpha channel
ETC2
• 4bpp
• Backward compatible
• RGB, also handles alpha & punch through
ASTC
• 0.8bpp to 8bpp
• Supports RGB, RGB alpha, luminance, luminance alpha, normal maps
• Also supports HDR
• Also supports 3D textures

Textures Analysis in Epic Citadel
• Mali Graphics Debugger shows size and compression format
[Pie chart: Texture Weight by Dimension (Uncompressed RGBA). Legend: Other, 256 x 256, 512 x 384, 512 x 512, 1024 x 1024, 2560 x 1504, 2048 x 2048. Slice values: 3%, 3%, 28%, 2%, 64%]
[Bar chart: Total Texture Memory. Uncompressed 944 MB, ETC1 236 MB, ASTC 5x5 151 MB]

Resources
• Enhancing Your Mobile Games (ARM Guide to Unity): http://malideveloper.arm.com/develop-for-mali/tutorials-developer-guides/developer-guides/arm-guide-unity-enhancing-mobile-games/
• Video Tutorials: http://www.youtube.com/user/ARMflix
• Tools: http://malideveloper.arm.com/develop-for-mali/tools/

To Find Out More….
• ARM Booth #1624 on Expo Floor
  • Live demos of tools covered in this session
  • In-depth Q&A with ARM engineers
• More tech talks at the ARM Lecture Theatre
• http://malideveloper.arm.com/GDC2015
  • Revisit this talk in PDF and video format post GDC
  • Download the tools and resources

Enhancing your Unity mobile game
Thurs 4PM; West Hall 3014
Learn how to get the most out of Unity when developing under the unique challenges of mobile platforms.
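
Returning to the Total Texture Memory figures from the Epic Citadel texture analysis, the bits-per-pixel numbers on the Texture Compression Formats slide translate into memory roughly as sketched below. The helper names are illustrative. The sketch assumes a 32 bpp RGBA baseline and relies on the fact that every ASTC block is 128 bits regardless of its footprint. The 236 MB ETC1 figure is consistent with 8 bpp, which would fit alpha being stored in a second ETC1 texture; that reading is an assumption, not something the slides state.

```python
ASTC_BLOCK_BITS = 128   # every ASTC block is 128 bits, whatever its footprint
RGBA8888_BPP = 32       # assumption: the uncompressed baseline is 32-bit RGBA


def astc_bpp(block_width, block_height):
    """Bits per pixel of a 2D ASTC block footprint."""
    return ASTC_BLOCK_BITS / (block_width * block_height)


def compressed_size_mb(uncompressed_mb, bpp, uncompressed_bpp=RGBA8888_BPP):
    """Scale an uncompressed size by the ratio of bits per pixel."""
    return uncompressed_mb * bpp / uncompressed_bpp


UNCOMPRESSED_MB = 944   # total uncompressed texture memory quoted in the slides

print(f"ASTC 4x4:   {astc_bpp(4, 4):.2f} bpp")
print(f"ASTC 5x5:   {astc_bpp(5, 5):.2f} bpp")
print(f"ASTC 12x12: {astc_bpp(12, 12):.2f} bpp")
print(f"ASTC 5x5 total: {compressed_size_mb(UNCOMPRESSED_MB, astc_bpp(5, 5)):.0f} MB")
print(f"8 bpp total:    {compressed_size_mb(UNCOMPRESSED_MB, 8):.0f} MB")
```

The 4x4 and 12x12 footprints bracket the "0.8bpp to 8bpp" range quoted for ASTC, and the 5x5 result lines up with the 151 MB bar in the chart.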
© Copyright 2024