Optimizing Memory Bandwidth in OpenVX Graph

Accelerating OpenVX Applications
on
Embedded Many-Core Accelerators
Giuseppe Tagliavini, DEI (University of Bologna)
Germain Haugou, IIS-ETHZ
Andrea Marongiu, DEI (University of Bologna) & IIS-ETHZ
Luca Benini, DEI (University of Bologna) & IIS-ETHZ
Outline
 Introduction
 OpenVX acceleration
 Work in progress
OpenVX overview
 Foundational API for vision
acceleration
– Focus on mobile and
embedded systems
 Stand-alone or
complementary to other
libraries
 Enable efficient
implementations on
different devices
– CPUs, GPUs, DSPs, manycore accelerators
Accelerator template
Multi-core
processor
Cluster-based
design
Cluster memory
(optional)
Host
L3
Cluster
MPMD
Processing
Elements
PE
PE
…
Cluster
L2
Many-core accelerator
PE
PE
PE
PE
…
PE
DDR3 memory
SoC design
(optional)
CC
Cluster
controller
(optional)
L1
Low latency
shared TCDM
memory
DMA
HWS
DMA
engine
(L1 ↔L3)
HW
synchronizer
PULP
Parallel Ultra-Low-Power platform
SRAM VOLTAGE DOMAIN (0.5V – 0.8V)
SRAM
#0
SRAM
#1
SCM
#0
SCM
#1
RMU
RMU
...
...
SRAM
#M-1
SCM
#M-1
RMU
LOW LATENCY INTERCONNECT
...
to RMUs
INSTRUCTION BUS
PE
#N-1
I$
PE
#1
I$
PE
#0
I$
PERIPHERALS
BRIDGES
BRIDGE
DMA
CLUSTER BUS
PERIPHERALS
BRIDGE
L2
MEMORY
CLUSTER
VOLTAGE
DOMAIN
(0.5V-0.8V)
PERIPHERAL
INTERCONNECT
SoC
VOLTAGE
DOMAIN
(0.8V)
OpenVX programming model
 The OpenVX model is based on a directed acyclic graph of
nodes (kernels), with data (images) as linkage
vx_image imgs[] = {
vxCreateImage(ctx, width, height, VX_DF_IMAGE_RGB),
vxCreateVirtualImage(graph, 0, VX_DF_IMAGE_U8),
… images are not required to actually reside in main memory
Virtual
vxCreateImage(ctx,
width,
height,
VX_DF_IMAGE_U8),
They
define a data dependency
between
kernels,
but they cannot be read/written
};
They
are the
main target
of our optimization efforts
An OpenVX
program
must
vx_node
nodes[]
= { be verified to guarantee some mandatory properties:
Inputs
and outputs compliant to theimgs[0],
node interface
vxColorConvertNode(graph,
imgs[1]),
No
cycles in the graph
vxSobel3x3Node(graph,
imgs[1], imgs[2], imgs[3]),
Only
a single writer node to anyimgs[2],
data objectimgs[3],
is allowed imgs[4]),
vxMagnitudeNode(graph,
Writes
have higher priorities than
reads. thresh, imgs[5]),
vxThresholdNode(graph,
imgs[4],
Virtual image must be resolved into concrete types
};
vxVerifyGraph(graph);
vxProcessGraph(graph);
Outline
 Introduction
 OpenVX acceleration
 Work in progress
A first solution: using OpenCL
to accelerate OpenVX kernels
 OpenCL is a commonly used programming model
for many-core accelerators
 First solution: OpenVX kernel == OpenCL kernel
– When a node is selected for execution, the related
OpenCL kernel is enqueued on the device
 Main limiting factor: memory bandwidth
OpenCL bandwidth
 Experiments performed on the STHORM evaluation board
10000
1391
922
1000
307
MB/s
290
100
71
199
71
38
31
15
10
1
OpenCL
Available BW
779
Our solution
 We realized an OpenVX framework for many-core
accelerators coupling a tiling approach with
algorithms for graph partition and scheduling
 Main goals:
– Reducing the memory bandwidth
– Maximize the accelerator efficiency
 Several steps are required:
–
–
–
–
–
Tile size propagation
Graph partitioning
Node scheduling
Buffer allocation
Buffer sizing
Common access patterns
for image processing kernels
(D) GLOBAL OPERATORS
Compute the value of a point in the
output image using the whole input
image
Support: Host exec / Graph partitioning
(A) POINT OPERATORS
Compute the value of each output point
from the corresponding input point
(E) GEOMETRIC OPERATORS
Compute the value of a point in the
output image using a non-rectangular
input window
Support: Host exec / Graph partitioning
(B) LOCAL NEIGHBOR OPERATORS
Compute the value of a point in the
output image that corresponds to the
input window
Support: Tile overlapping
(F) STATISTICAL OPERATORS
Compute any statistical functions of the
image points
(C) RECURSIVE NEIGHBOR OPERATORS
Like the previous ones, but also
consider the previously computed
values in the output window
Support: Persistent buffer
Support: Graph partitioning
Support: Basic tiling
Example (1)
MEMORY DOMAINS
NESTED GRAPH
L3
ACCELERATOR SUB-GRAPH
L1
L3
ADAPTIVE TILING
HOST NODE
Example (2)
PEs
CC
DMAin
DMAout
B5
B4
B3
B2
B1
B0
L3 access
S
a
b
i1
N
P
c
i2
NM
d
S
N
P
NM
…
e
i3
i4
o1
o2
time
L1 memory buffers
B0 and B5 use double buffering
Bandwidth reduction
10000
1000
290
MB/s
1391
922
779
307
307
34
10
359
71
71
100
199
38
36
24
44
31
15
8
8
1
OVX
OpenCL
Available BW
15
18
22
Speed-up w.r.t. OpenCL
12,00
10,00
9,61
8,00
6,73
5,64
6,00
5,04
4,00
3,86
3,46
3,50
2,81
3,12
2,92
2,00
1,00
0,00
Random
Edge
Object
Super
FAST9
graph detector detection resolution
Disparity Pyramid
Optical
Canny
Retina Disparity
preproc.
S4
Outline
 Introduction
 OpenVX acceleration
 Work in progress
OpenVX + Virtual Platform
Application
mapping
Virtual Platform
PE0
…
TestApplications
PEn
i
G
Mem
g
x
S
y
ADRENALINE
Platform
configuration
Run-time
support
Run-time policies
http://www-micrel.deis.unibo.it/adrenaline/
THANKS!!!
Work supported by EU-funded research projects