
BIG DATA STRATEGY ON GPU
[email protected]
Dec 2014
Danny Hillis, The Pattern on the Stone
"The greatest achievement of our technology may well be the creation of tools that allow us to go beyond engineering – that allow us to create more than we can understand."
Computation: 3rd Pillar of Scientific Research
1,000 years ago (Experimental): description of natural phenomena; experimental methods and quantification
Last 500 years (Theoretical): formulation of Newton's laws, Maxwell's equations, ...
Last 50 years (Computational): simulation of complex phenomena
Today (Data): distributed communities unifying theory, experiment, and simulation with massive data sets from multiple sources and disciplines
Scientific simulations can require quadrillions of parallel computations per second. Computer graphics require billions to trillions of parallel computations per second.
Big Data
Synapse Scale:
• Neural Nets: 1~10M
• Google Brain: 1B ~ 10^10
• Adult: 100T (10^14)
• Infant: 1Q (10^15)
Iterative Algorithms:
• Biology
• Cellular Automata
• Genetic Programming
• Neural Networks
• Quantum Computers
• Wisdom of Crowds
What is Big Data
Big Data Market Size, 2015 ($Billion)
Big Data Compute: 6.7
Enterprise Search: 2.4
Source: Wikibon and Frost & Sullivan
Big Data Market Size by segment, '13-'17 ($Billion)

Segment            2013  2014  2015  2016  2017
Compute               4     5     7     8     8
Application           2     3     5     6     7
Everything Else      13    19    26    30    33

Note: For data on other segments, see the Appendix.
Source: Wikibon, Wikibon.org
Big Data Market 2013-2017, by segment (in billion U.S. dollars)

Segment          2013   2014   2015   2016   2017   CAGR
Services         6.07   9.24  12.31  14.06  15.30  26.00%
Compute          3.64   5.23   6.70   7.50   8.06  21.99%
Storage          2.88   4.39   5.85   6.68   7.27  26.05%
Application      1.77   3.24   4.94   6.05   6.89  40.46%
Database         1.84   2.73   3.62   4.15   4.53  25.26%
XaaS             1.03   1.71   2.43   2.87   3.19  32.66%
Infra Software   0.42   0.67   0.93   1.08   1.19  29.74%
Networking       0.44   0.67   0.89   1.02   1.11  26.03%
Total           18.09  27.88  37.67  43.41  47.54  27.32%

Source: Wikibon, Wikibon.org
Segment in '13 ($B): Hardware 6.96, Software 4.03, Service 7.10
Segment in '17 ($B): Hardware 16.44, Software 12.61, Service 18.49
Vertical Big Data Spending (WW, 2016, $M)
Government (U.S.): 1,400
Financial Service: 3,960
Social Media & Entertainment: 1,590
Healthcare (& Life Sciences): 1,070
Retail (& Ecommerce): 440
Energy (& Utilities): 377
Others: 363
Fast Growing GTC topics
GOOGLE BRAIN
1,000 CPU Servers
2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000

STANFORD AI LAB
3 GPU-Accelerated Servers
12 GPUs • 18,432 cores
4 kWatts
$33,000

"Now You Can Build Google's $1M Artificial Brain on the Cheap" (Wired): an artificial neural network at a fraction of the cost with GPUs.
Neural Networks: unsupervised learning
Machine Learning
Deep Learning
Deep learning is machine learning that teaches a computer to reason the way people do: a family of algorithms that distill the essential content or features of large volumes of data or complex material through combinations of nonlinear transformations.
For images, pixel information is represented as column vectors, a form a computer can process.
Techniques such as Neural Networks, Convolutional Deep Neural Networks, and Deep Belief Networks are applied in computer vision, speech recognition, natural language processing, and audio/signal processing.
A deep learning project run jointly by Stanford's Andrew Ng and Google demonstrated the use of neural networks and DNNs (Deep Neural Networks).
Selected by MIT as one of the 10 Breakthrough Technologies of 2013.
In 1989, a deep neural network based on the backpropagation algorithm was introduced that recognized handwritten ZIP codes on mail.
Gartner predicts that by 2017, 10% of computers will be learning via deep learning rather than just processing data.
Machine Learning
Why deep learning again?
1. Algorithm: the weaknesses of earlier artificial neural network models have been overcome: getting stuck in local minima, convergence or oscillation depending on the choice of initial state in discontinuous simulation, and overfitting have all been addressed.
2. GPU: powerful GPUs cut workloads that mix complex matrix and vector computations from weeks down to days.
3. Big data: data pouring in at scale, and the effort put into collecting it, especially SNS (social media) data.
GPU Momentums in Big Data
Data Analytics: Salesforce, Shazam, Cortexica, MapD, GeoInt
Machine Learning / Data Mining: Nuance, Google, Baidu, Yandex, Microsoft
Search: academic research
Database and Data Warehouse: academic research
Graph Analytics
1 Big Data Market Size and Segmentation
2 Hadoop Framework and Opportunities
3 Next Step
Hadoop Framework
Basic Platform: HDFS, MapReduce
Tools and Applications:
• Machine Learning: Mahout
• Data Mining / Search / Indexing: Solr & Lucene
• Real-time stream processing: Storm
• SQL / Data Warehouse: Hive
• Graph Analytics: Giraph
• Scientific Computing: Hama
Sample Customers (Customer Applications): NSA, JPMC, eBay, Facebook, MTV Network, Chevron
Key Features for Applications

Search (Solr & Lucene): Full-text Search, Faceted Search/Filter, Near Real-time Indexing, Distributed Search, Indexing
Real-time Stream Processing (Storm): Real-time Analytics, Online ML, Continuous Computing, Fraud Detection
Machine Learning (Mahout): Recommendation, Classification, Clustering
Graph Analytics (Giraph): Social Graph Analytics, Large Graph Processing, Rankings
Data Warehouse (Hive): Metadata Storage, Operating on Compressed Data, SQL-like Queries
Key Algorithms for Applications

Search (Solr & Lucene): String Match, Similarity Score, Inverted Index
Machine Learning (Mahout): Recommenders, K-Means, Fuzzy K-Means, Canopy, Nearest Neighbor, Naïve Bayes Classifier, Decision Forests, Frequent Itemset Mining, Apriori, Collocations
Graph Analytics (Giraph): Page Rank, Depth-first Search, Shared Connections, Personalization-based Popularity, Priority-queue-based Traversals, Bellman-Ford
Scientific Computing (Hama): Matrix Multiplication, Sparse Matrix-Vector Multiplication, N-Body Simulation, Linear Regression
Data Warehouse (Hive): Word Count, Page View Rank/Count, Relational Algebra
Compression codecs: Bzip2, Gzip, Snappy

Slide legend: Not Fit with GPU / Not Sure / Computing Intensive
Why Solr with GPU
Most popular independent search engine:
• Over 4K companies/sites in 2011
• Top 5 Apache project
• 100~200K downloads/month
Solr has pain points on indexing:
• Natural language analysis
• High-throughput applications
GPU is good at indexing:
• Compression/tokenizing: LRC, BWT, VarInt, inverted index (see the sketch below)
• Michael Frumkin's paper; Nankai-Baidu Lab's paper
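As a concrete illustration of the VarInt point, here is a hypothetical sketch of decoding VByte-compressed posting lists in parallel on the GPU. It assumes postings are compressed in fixed-count blocks with a precomputed byte offset per block, so blocks decode independently; none of this code is from the cited papers.

    #include <cuda_runtime.h>

    // One thread decodes one block of VByte-compressed integers.
    // High bit of each byte is a continuation flag; low 7 bits carry data.
    __global__ void varint_decode(const unsigned char *in,  // compressed stream
                                  const int *block_start,   // byte offset per block
                                  int *out,                 // decoded output
                                  int num_blocks, int vals_per_block)
    {
        int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= num_blocks) return;

        const unsigned char *p = in + block_start[b];
        for (int i = 0; i < vals_per_block; ++i) {
            unsigned int v = 0;
            int shift = 0;
            unsigned char byte;
            do {
                byte = *p++;
                v |= (unsigned int)(byte & 0x7F) << shift;
                shift += 7;
            } while (byte & 0x80);
            out[b * vals_per_block + i] = (int)v;
        }
    }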
Enterprise Search vs. HPC Server Market ($Billion)

                    2013  2014  2015  2016  2017
HPC Server          11.9  12.8  13.8  14.6  15.7
Enterprise Search    1.7   2.0   2.4   2.8   3.3

CAGR: HPC Server 5.70%, Enterprise Search 14.2%
Source: Frost & Sullivan, IDC
GPU Fits on MapReduce & Applications
MapReduce stages and GPU porting effort: MAP (low effort), SORT (hard), REDUCE (low effort); a sketch of a GPU map stage follows the list below.
Applications on top of MapReduce: Apache Storm, Apache Mahout, Apache Hive, Apache Pig, Apache HBase
• MITHRA (2009): uses Hadoop as a backbone to cluster GPU nodes for MapReduce applications, achieving a 254X speedup
• GPMR (2011): a MapReduce framework designed for a GPU cluster, relying on simple clustering
• PMGMR (2013): Pipelined Multi-GPU MapReduce for big-data processing (ongoing)
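Why MAP (and REDUCE) rate as "low effort": the map stage is embarrassingly data-parallel, so it translates almost one-to-one into a CUDA kernel. A minimal hypothetical sketch, not taken from any of the frameworks above:

    #include <cuda_runtime.h>

    // Hypothetical map stage: one thread per input record emits a (key, value)
    // pair; here the key is a trivial partition of the record and the value a
    // count contribution, mimicking a filter/count job.
    __global__ void map_stage(const int *records, int *keys, int *vals, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            keys[i] = records[i] % 16;  // partitioning key
            vals[i] = 1;                // count contribution
        }
    }

    // The hard part the slide flags is SORT: grouping the emitted keys needs a
    // GPU-wide sort (e.g. radix sort), the step MITHRA/GPMR/PMGMR optimize.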
eBay Case Study
Part of eBay Classifieds Group (#1 online classifieds company in the world)
Business Requirement:
• Simple location-based search, deep faceted navigation
• Site metrics: 3.2M active ads, 16M~24M PVs/day, peak hour ~500 PVs/s (in 2011)
• Request metrics: 60M requests/day, peak hours ~1,500 requests/s (in 2011)
Solr Solution:
• Geographic proximity, context-specific relevancy
• Faceting filter
• Great scalability
Example Solr Features:
• Faceting
• Language-specific stemming
• More Like This
• Auto-Suggest
• Spellchecking
• Synonyms
• Dynamic fields
1 Big Data Market Size and Segmentation
2 Hadoop Framework and Opportunities
3 Next Step
Tesla K40 vs. K80

                            K40                         K80
GPU                         GK110B                      2x GK210
Peak SP (per board, base)   4.29 TFLOPS                 ~5.6 TFLOPS
Peak DP (per board)         1.43 TFLOPS (base)          ~1.87 TFLOPS (base)
                            1.68 TFLOPS (boost)         ~2.7 TFLOPS (boost)
# of GPUs                   1                           2
CUDA cores per board        2880                        4992
PCIe Gen                    Gen 3                       Gen 3
GDDR5 memory (per board)    12 GB                       24 GB (12 GB per GK210)
Memory bandwidth            288 GB/s                    ~480 GB/s
GPU Boost                   2 levels                    >10 levels
Power                       235 W                       300 W
Form factors                PCIe Active, PCIe Passive   PCIe Passive

K80 board layout: two GK210 GPUs, each with 12 GB of GDDR5, behind a PCIe switch attached to the PCIe connector.
Public launch @ SC14; systems availability ~Nov 2014.
OEM availability: QS 6/6; production samples 7/16, for OEM qual; production early Sept '14.
Introducing cuDNN
NVIDIA cuDNN is a GPU-accelerated library of primitives for DNNs. It provides tuned implementations of routines that arise frequently in DNN applications, such as:
• Convolution
• Pooling
• Softmax
• Neuron activations, including: sigmoid, rectified linear (ReLU), hyperbolic tangent (TANH)
Of course these functions all support the usual forward and backward passes. cuDNN's convolution routines aim for performance competitive with the fastest GEMM-based (matrix multiply) implementations of such routines while using significantly less memory.
Ease of use: the cuDNN library is targeted at developers of DNN frameworks (e.g., Caffe, Torch). However, it is easy to use directly, and you do not need to know CUDA in order to use it. The example code below shows how to allocate storage for an input batch of images and a convolutional filter in cuDNN, and how to run the batch in the forward direction through a convolutional layer.
cuDNN: CUDA Deep Neural Network
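The slide's code listing did not survive extraction. Below is a minimal sketch of the sequence it describes, written against the later cuDNN descriptor API (the 2014 v1 names differed slightly); the batch size, image/filter dimensions, and algorithm choice are illustrative assumptions:

    #include <cudnn.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudnnHandle_t h;
        cudnnCreate(&h);

        // Input batch: 128 images, 3 channels, 32x32 (assumed sizes).
        cudnnTensorDescriptor_t in_d, out_d;
        cudnnCreateTensorDescriptor(&in_d);
        cudnnSetTensor4dDescriptor(in_d, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   128, 3, 32, 32);

        // Filter bank: 64 filters of 3x5x5.
        cudnnFilterDescriptor_t filt_d;
        cudnnCreateFilterDescriptor(&filt_d);
        cudnnSetFilter4dDescriptor(filt_d, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   64, 3, 5, 5);

        // Convolution: pad 2, stride 1, no dilation.
        cudnnConvolutionDescriptor_t conv_d;
        cudnnCreateConvolutionDescriptor(&conv_d);
        cudnnSetConvolution2dDescriptor(conv_d, 2, 2, 1, 1, 1, 1,
                                        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

        // Query the output shape and describe the output tensor.
        int n, c, ho, wo;
        cudnnGetConvolution2dForwardOutputDim(conv_d, in_d, filt_d, &n, &c, &ho, &wo);
        cudnnCreateTensorDescriptor(&out_d);
        cudnnSetTensor4dDescriptor(out_d, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   n, c, ho, wo);

        // Allocate device storage for the input batch, filters, and output.
        float *x, *w, *y;
        cudaMalloc((void **)&x, 128 * 3 * 32 * 32 * sizeof(float));
        cudaMalloc((void **)&w, 64 * 3 * 5 * 5 * sizeof(float));
        cudaMalloc((void **)&y, (size_t)n * c * ho * wo * sizeof(float));

        // Run the batch forward through the convolutional layer
        // (IMPLICIT_GEMM needs no extra workspace).
        float alpha = 1.0f, beta = 0.0f;
        cudnnConvolutionForward(h, &alpha, in_d, x, filt_d, w, conv_d,
                                CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                                NULL, 0, &beta, out_d, y);

        cudaFree(x); cudaFree(w); cudaFree(y);
        cudnnDestroy(h);
        return 0;
    }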
GPU Cloud Platform
Clients: Mac and PC, tablet & smartphone
• Professional work on virtual machines: iray, Vray, Octane, ..., CAD, CAE; delivers performance comparable to a personal workstation or a supercomputer
• Real-time work on a render-farm system: a render farm built with the renderers used by CAD programs enables real-time work; project deliverables can be reviewed and evaluated on a variety of systems (various displays, high-resolution 4K display, 3D projector, 3D printer)
• Hands-on training environment: CUDA training
GPU Cloud Platform
GRID Software
Scalability across Tesla, Quadro, GRID, and VCA systems and cloud software
Systematic training programs run with participating vendors
Hands-on training through the NVIDIA GPU Cloud Platform:
• Programming: broaden the supercomputer user base through training in GPU parallel programming (OpenACC and CUDA)
• System deployment and maintenance: acquire deployment and operations know-how on site; systematic consulting knowledge; operation and service of commercial software for supercomputers
• Learn to use a variety of programs through hands-on, practical training
• Promote interdisciplinary projects and develop convergent thinking
• Promote industry-academia projects and strengthen practical skills
http://nvidiakoreapsc.com
CUDA everywhere
CUDA tour (2007-2014): Yonsei Univ., KAIST, Seoul National Univ., Korea Univ., Kyungpook National Univ., GIST, POSTECH, Dong-eui Univ., Inje Univ., UNIST, Pukyong National Univ., Hanyang Univ., Univ. of Seoul, Chungnam National Univ., Tongmyong Univ.
Round Table meeting @ Yangjae
CUDA workshops: Gangchon, Anmyeondo, Deoksan, Seoul
CUDA contest
CUDA trainings: Gonjiam, KISTI
http://nvidiakoreapsc.com
Thank you
HCI 2015
CUDA Programming
Hyungon Ryu
Introduction to OpenACC
Hyungon Ryu
Parallel Model
CPU: a master thread (C0) forks CPU threads C0, C1, C2 (CPU thread fork) and later joins them (CPU thread join).
GPU: the master CPU thread (C0) forks GPU threads G0, G1, G2 (GPU thread fork) and joins them when the kernel completes (GPU thread join).
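A minimal sketch of the two fork/join models side by side (a hypothetical example, compiled with nvcc plus an OpenMP host-compiler flag):

    #include <stdio.h>
    #include <omp.h>

    __global__ void gpu_work(int *out)  // GPU fork: one thread per element
    {
        out[threadIdx.x] = threadIdx.x;
    }

    int main(void)
    {
        // CPU fork/join: the master thread forks OpenMP workers;
        // there is an implicit join at the end of the parallel region.
        #pragma omp parallel num_threads(3)
        printf("CPU thread %d\n", omp_get_thread_num());

        // GPU fork/join: a kernel launch forks GPU threads;
        // cudaDeviceSynchronize() is the join.
        int *d_out;
        cudaMalloc((void **)&d_out, 3 * sizeof(int));
        gpu_work<<<1, 3>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }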
OpenACC
http://www.openacc.org
The PGI, Cray, and HMPP compilers support OpenACC.
Current GCC and ICC compilers ignore OpenACC directives.
Without the OpenACC option, PGCC also ignores OpenACC directives.
OpenACC Directives
Fortran
!$acc kernels
!$acc kernels loop
!$acc data
!$acc end data
C/C++
#pragma acc kernels
#pragma acc kernels for
#pragma acc data
Directives        OpenMP (CPU)                OpenACC (GPU)
Loop              #pragma omp parallel        #pragma acc kernels
                  #pragma omp for             #pragma acc kernels for
Data              shared/private/reduction    #pragma acc data
Example Code: serial (CPU) vs. OpenACC (GPU)

Serial (CPU):

    void saxpy_serial( int n, float a,
                       float *x,
                       float *restrict y )
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

Compile command: pgcc ./saxpy.c

OpenACC (GPU):

    void saxpy_parallel( int n, float a,
                         float *x,
                         float *restrict y )
    {
    #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

Compile command: pgcc -acc -ta=nvidia ./saxpy.c
Performance Test
DGEMM on the CPU: 0.6440 GFLOPS (A x B = C)

Serial (CPU):

    void simple_dgemm(int n, double alpha, const double *A, const double *B,
                      double beta, double *C)
    {
        int i, j, k;
        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                double inprod = 0;
                for (k = 0; k < n; ++k) {
                    inprod += A[k * n + i] * B[j * n + k];
                }
                C[j * n + i] = alpha * inprod + beta * C[j * n + i];
            }
        }
    }
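For reference on how the GFLOPS figures in this section are obtained: DGEMM performs n multiplies and n adds per output element, about 2n^3 floating-point operations in total. A hypothetical helper:

    /* GFLOPS = 2*n^3 / (seconds * 1e9) */
    double dgemm_gflops(int n, double seconds)
    {
        return 2.0 * n * n * n / (seconds * 1e9);
    }

    /* e.g. n = 1024 finishing in ~3.33 s gives ~0.644 GFLOPS,
       consistent with the serial number above */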
Brainstorming: A x B = C
• How to parallelize the FOR loops
• How to distribute the data
• How to utilize the GPU

    for ( int row=0; row<n; row++ ) {
        for ( int col=0; col<n; col++ ) {
            double val = 0;
            for ( int k=0; k<n; k++ ) {
                val += a[row*n+k] * b[k*n+col];
            }
            c[row*n+col] = val;
        }
    }
OpenACC try 1: 1.1301 GFLOPS parallel (GPU) vs. 0.6440 GFLOPS serial (CPU)
OpenACC try 2: 47.9231 GFLOPS parallel (GPU), with OpenACC configured for the number of blocks and threads (a sketch of the directives follows)
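The deck quotes the numbers but not the directives. A plausible sketch of the two attempts; the exact clauses, gang count, and vector width are assumptions:

    /* try 1: offload the loop nest and let the compiler pick the schedule */
    #pragma acc kernels copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++) {
            double val = 0;
            for (int k = 0; k < n; k++)
                val += a[row*n+k] * b[k*n+col];
            c[row*n+col] = val;
        }

    /* try 2: also pin the gang (block) count and vector (thread) length */
    #pragma acc kernels copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    {
        #pragma acc loop independent gang(64)
        for (int row = 0; row < n; row++) {
            #pragma acc loop independent vector(128)
            for (int col = 0; col < n; col++) {
                double val = 0;
                for (int k = 0; k < n; k++)
                    val += a[row*n+k] * b[k*n+col];
                c[row*n+col] = val;
            }
        }
    }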
CUDA Converting

CPU code:

    for ( int row=0; row<n; row++ ) {
        for ( int col=0; col<n; col++ ) {
            double val = 0;
            for ( int k=0; k<n; k++ ) {
                val += a[row*n+k] * b[k*n+col];
            }
            c[row*n+col] = val;
        }
    }

Steps: use the GPU for this function; place the data in GPU memory; parallelize the FOR loops.

CUDA (GPU) code:

    __global__ void simple_dgemm_GPU(int n, double alpha, const double *A,
                                     const double *B, double beta, double *C)
    {
        // one thread per element of C
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // row
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // col
        if (i >= n || j >= n) return;

        double prod = 0;
        for (int k = 0; k < n; ++k) {
            prod += A[k * n + i] * B[j * n + k];
        }
        C[j * n + i] = alpha * prod + beta * C[j * n + i];
    }
CUDA try 1: 19 GFLOPS, CUDA parallel (GPU) vs. 0.6440 GFLOPS serial; uses the GPU for the function, data in GPU memory, FOR loops parallelized
CUDA try 2: 91.7255 GFLOPS, CUDA parallel (GPU), using shared memory
CUDA try 3: 140 GFLOPS, CUDA parallel (GPU), splitting C into submatrices (a tiled sketch follows)
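A sketch of the shared-memory/submatrix idea behind tries 2 and 3. The tile width, and the assumption that n is a multiple of it, are mine, not the deck's:

    #define TILE 16

    // Tiled DGEMM: each block computes one TILE x TILE submatrix of C,
    // staging tiles of A and B through shared memory so every global load
    // is reused TILE times. Assumes n is a multiple of TILE.
    __global__ void dgemm_tiled(int n, double alpha, const double *A,
                                const double *B, double beta, double *C)
    {
        __shared__ double As[TILE][TILE];
        __shared__ double Bs[TILE][TILE];

        int i = blockIdx.x * TILE + threadIdx.x;   // row (column-major, as above)
        int j = blockIdx.y * TILE + threadIdx.y;   // col
        double prod = 0;

        for (int t = 0; t < n; t += TILE) {
            // each thread loads one element of the A tile and one of the B tile
            As[threadIdx.y][threadIdx.x] = A[(t + threadIdx.y) * n + i];
            Bs[threadIdx.y][threadIdx.x] = B[j * n + (t + threadIdx.x)];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                prod += As[k][threadIdx.x] * Bs[threadIdx.y][k];
            __syncthreads();
        }
        C[j * n + i] = alpha * prod + beta * C[j * n + i];
    }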
cuBLAS Library: 1,100 GFLOPS, CUDA parallel (GPU)

    /* allocate device matrices (legacy cuBLAS API) */
    cublasAlloc (n2, sizeof(d_A[0]), (void **)&d_A );
    cublasAlloc (n2, sizeof(d_B[0]), (void **)&d_B );
    cublasAlloc (n2, sizeof(d_C[0]), (void **)&d_C );

    /* copy the host matrices to the device */
    cublasSetVector (n2, sizeof(h_A[0]), h_A, 1, d_A, 1 );
    cublasSetVector (n2, sizeof(h_B[0]), h_B, 1, d_B, 1 );
    cublasSetVector (n2, sizeof(h_C[0]), h_C, 1, d_C, 1 );

    /* call the DGEMM subroutine */
    cublasDgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N );

    /* copy the result back to the host */
    cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1 );
Future of CUDA
Performance gap continues to grow
[Two charts, 2008-2014: peak double-precision GFLOPS (axis to ~2,500) and peak memory bandwidth in GB/s (axis to ~600). NVIDIA GPUs: GT200, Fermi, GK110, 2x GK210. x86 CPUs: Nehalem, Sandy Bridge, Haswell.]
Remote Development with Nsight Eclipse Edition
• Local IDE, remote application
• Edit locally, build & run remotely
• Automatic sync via ssh
• Cross-compilation to ARM
• Full debugging & profiling via remote connection
Workflow: Edit → sync → Build → Run → Debug → Profile
Unified Memory
Dramatically lower developer effort
Developer view today: separate system memory and GPU memory
Developer view with Unified Memory: a single unified memory space
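A minimal sketch of what Unified Memory changes in practice (the kernel and sizes are a hypothetical example; cudaMallocManaged is the CUDA 6 API):

    #include <stdio.h>

    __global__ void increment(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main(void)
    {
        int n = 1 << 20;
        int *data;

        // one allocation visible to both CPU and GPU: no cudaMemcpy needed
        cudaMallocManaged(&data, n * sizeof(int));
        for (int i = 0; i < n; ++i) data[i] = i;       // CPU writes

        increment<<<(n + 255) / 256, 256>>>(data, n);  // GPU uses the same pointer
        cudaDeviceSynchronize();                       // wait before CPU reads again

        printf("data[42] = %d\n", data[42]);           // prints 43
        cudaFree(data);
        return 0;
    }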
CUDA Roadmap

CUDA 6 (April 2014):
• Unified Memory: simpler programming & memory model
• Multi-GPU aware libraries: automatic scaling to >1 GPU per node; operate directly on large datasets that reside in CPU memory
• Drop-in FFTW and BLAS libraries: accelerate FFT and BLAS with no code changes
• CUDA tools for Hyper-Q/MPI

CUDA 6.5 (Q3 2014):
• CUDA FORTRAN tools support
• GPUDirect RDMA & OpenMPI optimizations: reduce inter-node latency, improve MPI application scaling
• cuFFT callbacks: improves performance
• Better error detection and reporting (XID 13)
• PGI supported as host C++ compiler
• RDMA/P2P topology viewer
• Hyper-Q for multi-GPU support

CUDA 7 (Q1 2015):
• C++11
• POWER8 support
• CUDA C JIT: compile CUDA kernels at runtime
Questions?