BIG DATA STRATEGY ON GPU
[email protected] | Dec 2014

"The greatest achievement of our technology may well be the creation of tools that allow us to go beyond engineering – that allow us to create more than we can understand."
– Danny Hillis, The Pattern on the Stone

Computation: 3rd Pillar of Scientific Research
- 1,000 years ago – Experimental: description of natural phenomena; experimental methods and quantification
- Last 500 years – Theoretical: formulation of Newton's laws, Maxwell's equations, ...
- Last 50 years – Computational: simulation of complex phenomena
- Today – Data: distributed communities unifying theory, experiment, and simulation with massive data sets from multiple sources and disciplines

Scientific simulations can require quadrillions of parallel computations per second; computer graphics require billions to trillions of parallel computations per second.

Big Data Synopsis
- Scale: neural nets: 1~10M; Google Brain: 1B (10^10); adult brain: 100T (10^12); infant brain: 1Q (10^15)
- Iterative algorithms: biology, cellular automata, genetic programming, neural networks, quantum computers, wisdom of crowds

What is Big Data

Big Data Market Size, 2015 ($ Billion)
- Big Data Compute: 6.7
- Enterprise Search: 2.4
Source: Wikibon and Frost & Sullivan

Big Data Market Size by Segment, '13-'17 ($ Billion)
[Chart: Compute, Application, and Everything Else, 2013-2017; detailed figures in the table below]
Note: for other data-related segments, see the Appendix.
Source: Wikibon, Wikibon.org

Big Data Market 2013-2017, by Segment (billion U.S. dollars)

Segment          2013    2014    2015    2016    2017    CAGR
Services          6.07    9.24   12.31   14.06   15.30   26.00%
Compute           3.64    5.23    6.70    7.50    8.06   21.99%
Storage           2.88    4.39    5.85    6.68    7.27   26.05%
Application       1.77    3.24    4.94    6.05    6.89   40.46%
Database          1.84    2.73    3.62    4.15    4.53   25.26%
XaaS              1.03    1.71    2.43    2.87    3.19   32.66%
Infra Software    0.42    0.67    0.93    1.08    1.19   29.74%
Networking        0.44    0.67    0.89    1.02    1.11   26.03%
Total Value      18.09   27.88   37.67   43.41   47.54   27.32%
Source: Wikibon, Wikibon.org

By category: 2013 – Hardware 6.96, Software 4.03, Service 7.10; 2017 – Hardware 16.44, Software 12.61, Service 18.49.

Vertical Big Data Spending (WW, 2016, $M)
[Chart: Government (U.S.), Financial Service, Social Media & Entertainment, Healthcare & Life Sciences, Retail & Ecommerce, Energy & Utilities, Others; values 3,960 / 1,590 / 1,400 / 1,070 / 440 / 377 / 363]

Fast-Growing GTC Topics
[Chart only]

Artificial Neural Network at a Fraction of the Cost with GPUs
"Now You Can Build Google's $1M Artificial Brain on the Cheap" – Wired
- GOOGLE BRAIN: 1,000 CPU servers; 2,000 CPUs / 16,000 cores; 600 kW; $5,000,000
- STANFORD AI LAB: 3 GPU-accelerated servers; 12 GPUs / 18,432 cores; 4 kW; $33,000

Neural Networks: Unsupervised Learning

Machine Learning / Deep Learning
- Deep learning is a family of machine-learning algorithms that combine nonlinear transformations to distill the essential content or features of large volumes of complex data – in effect, teaching a computer to reason the way people do.
- For images, pixel data is flattened into column vectors, a form a computer can process.
- Techniques such as neural networks, convolutional deep neural networks, and deep belief networks are applied in computer vision, speech recognition, natural language processing, and speech/signal processing.
- Google Brain is the deep-learning project run by Stanford's Andrew Ng together with Google, using neural networks and DNNs (deep neural networks).
- MIT named deep learning one of its 10 breakthrough technologies for 2013.
- In 1989, a deep neural network based on the backpropagation algorithm was introduced to read handwritten ZIP codes on mail.
- Gartner predicts that by 2017, 10% of computers will be learning with deep-learning methods rather than merely processing data.

Machine Learning: Why Deep Learning Again?
1. Algorithm: the weaknesses of earlier neural-network models have been overcome – getting stuck in local minima, convergence or oscillation depending on the initial state in discontinuous simulation, and overfitting.
2. GPU: powerful GPUs cut workloads full of complex matrix and vector computation from weeks to days.
3. Big data: the flood of data now being produced, and the effort to collect it, especially social-network (SNS) data.

GPU Momentum in Big Data
- Data analytics: Salesforce, Shazam, Cortexica, Map-D, GeoInt
- Machine learning / data mining: Nuance, Google, Baidu, Yandex, Microsoft
- Search, database and data warehouse, graph analytics: academic research

Agenda
1. Big Data Market Size and Segmentation
2. Hadoop Framework and Opportunities
3. Next Step

Hadoop Framework
- Basic platform: HDFS and MapReduce
- Tools and applications: Mahout (machine learning), Solr & Lucene (search/indexing), Storm (real-time streams), Hive (SQL), Giraph (graph analytics), Hama (scientific computing)
- Sample customers: NSA, JPMC, eBay, Facebook, MTV Networks, Chevron

Key Features for Applications
- Search (Solr & Lucene): full-text search; faceted search/filter; near-real-time indexing; distributed search; indexing
- Real-time stream processing (Storm): real-time analytics; online ML; continuous computing
- Machine learning (Mahout): recommendation; classification; clustering
- Graph analytics (Giraph): social graph analytics; fraud detection; large graph processing; rankings
- Data warehouse (Hive): metadata storage; operating on compressed data; SQL-like queries

Key Algorithms for Applications
- Search (Solr & Lucene): string match; similarity score; inverted index; page-view rank/count
- Machine learning (Mahout): recommenders; K-Means; fuzzy K-Means; canopy; naïve Bayes classifier; Apriori; decision forests; linear regression; frequent itemset mining; collocations; nearest neighbor
- Graph analytics (Giraph): PageRank; depth-first search; shared connections; personalization-based popularity; priority-queue-based traversals; Bellman-Ford
- Scientific computing (Hama): matrix multiplication; sparse matrix-vector multiplication; N-body simulation
- Data warehouse (Hive): word count; relational algebra; compression (Bzip2, Gzip, Snappy)
(On the original slide, each algorithm was color-coded as computing-intensive, not a fit for GPU, or not sure.)

Why Solr with GPU
- Most popular independent search engine: over 4K companies/sites in 2011; a top-5 Apache project; 100-200K downloads/month
- Solr has pain points in indexing: natural-language analysis; high-throughput applications
- GPU is good at indexing compression/tokenizing (LRC, BWT, VarInt, inverted index); see Michael Frumkin's paper and the Nankai-Baidu Lab's paper

Enterprise Search and HPC ($ Billion)

                    2013   2014   2015   2016   2017
HPC server          11.9   12.8   13.8   14.6   15.7
Enterprise search    1.7    2.0    2.4    2.8    3.3

CAGR: HPC server 5.70%; enterprise search 14.2%
Source: Frost & Sullivan, IDC

GPU Fits MapReduce & Applications
- MapReduce-based tools (Apache Storm, Mahout, Hive, Pig, HBase) decompose work into map and reduce tasks
- Porting effort on a GPU: MAP – low effort; SORT – hard; REDUCE – low effort (see the sketch below)
- MITHRA (2009): uses Hadoop as a backbone to cluster GPU nodes for MapReduce applications, achieving a 254x speedup
- GPMR (2011): a MapReduce framework designed for a GPU cluster, relying on simple clustering
- PMGMR (2013): Pipelined Multi-GPU MapReduce for big-data processing (ongoing)
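To make "MAP is low effort" concrete, here is a minimal illustrative sketch, not from the original deck: a map step written as a CUDA kernel, where every thread transforms one record independently with no inter-thread communication. The record type and the per-record transform are placeholders; the hard SORT/shuffle stage between map and reduce is where frameworks like GPMR and PMGMR spend their effort.

    // Hypothetical map step: one thread per input record.
    // Records are independent, so no synchronization is needed -- the
    // data-parallel pattern GPUs execute with very little porting effort.
    __global__ void map_step(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * in[i];   // placeholder per-record transform
    }

    // Launch with one thread per record, e.g.:
    //   map_step<<<(n + 255) / 256, 256>>>(d_in, d_out, n);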
eBay Case Study
- Part of eBay Classifieds Group, the #1 online classifieds company in the world
- Business requirement: simple location-based search; deep faceted navigation
- Site metrics (2011): 3.2M active ads; 16M-24M page views/day; peak hour ~500 PV/s
- Request metrics (2011): 60M requests/day; peak hours ~1,500 requests/s
- Solr solution: geographic proximity; context-specific relevancy; faceting filter; great scalability
- Example Solr features: faceting, language-specific stemming, More Like This, auto-suggest, spellchecking, synonyms, dynamic fields

Agenda
1. Big Data Market Size and Segmentation
2. Hadoop Framework and Opportunities
3. Next Step

Tesla K40 vs. Tesla K80

                          K40                         K80
GPU                       GK110B                      2x GK210
Peak SP (per board)       4.29 TFLOPS                 ~5.6 TFLOPS (base)
Peak DP (per board)       1.43 TFLOPS (base),         ~1.87 TFLOPS (base),
                          1.68 TFLOPS (boost)         ~2.7 TFLOPS (boost)
Number of GPUs            1                           2
CUDA cores per board      2880                        4992
PCIe generation           Gen 3                       Gen 3
GDDR5 memory per board    12 GB                       24 GB
Memory bandwidth          288 GB/s                    ~480 GB/s
GPU Boost                 2 levels                    >10 levels
Power                     235 W                       300 W
Form factors              PCIe active, PCIe passive   PCIe passive

K80 board layout: two GK210 GPUs, each with 12 GB of GDDR5, behind an on-board PCIe switch that feeds the PCIe connector.

Availability: public launch at SC14; systems availability ~Nov 2014 (board at base clock). OEM availability: QS 6/6; production samples 7/16 for OEM qual; production early Sept '14.

Introducing cuDNN
NVIDIA cuDNN is a GPU-accelerated library of primitives for DNNs. It provides tuned implementations of routines that arise frequently in DNN applications, such as:
- convolution
- pooling
- softmax
- neuron activations, including sigmoid, rectified linear (ReLU), and hyperbolic tangent (tanh)
All of these functions support the usual forward and backward passes. cuDNN's convolution routines aim for performance competitive with the fastest GEMM-based (matrix multiply) implementations of such routines while using significantly less memory.

Ease of use: the cuDNN library is targeted at developers of DNN frameworks (e.g., Caffe, Torch). However, it is easy to use directly, and you do not need to know CUDA in order to use it. The example code below shows how to allocate storage for an input batch of images and a convolutional filter in cuDNN, and how to run the batch in the forward direction through a convolutional layer.

cuDNN: CUDA Deep Neural Network
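The listing itself did not survive extraction. The sketch below reconstructs the described steps against the later cuDNN v7-style API (descriptor names and the dilation arguments differ from the 2014 v1 release); the batch and filter dimensions are illustrative, and error checking and data initialization are omitted.

    #include <cudnn.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudnnHandle_t handle;
        cudnnCreate(&handle);

        // Input batch: 64 three-channel 224x224 images (NCHW, float).
        int n = 64, c = 3, ih = 224, iw = 224;
        cudnnTensorDescriptor_t xDesc;
        cudnnCreateTensorDescriptor(&xDesc);
        cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   n, c, ih, iw);

        // Convolutional filter: 96 output feature maps, 5x5 kernels.
        int k = 96, fh = 5, fw = 5;
        cudnnFilterDescriptor_t wDesc;
        cudnnCreateFilterDescriptor(&wDesc);
        cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   k, c, fh, fw);

        // Convolution: no padding, unit stride, no dilation.
        cudnnConvolutionDescriptor_t convDesc;
        cudnnCreateConvolutionDescriptor(&convDesc);
        cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,
                                        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

        // Let cuDNN compute the output dimensions, then describe the output.
        int on, oc, oh, ow;
        cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                              &on, &oc, &oh, &ow);
        cudnnTensorDescriptor_t yDesc;
        cudnnCreateTensorDescriptor(&yDesc);
        cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   on, oc, oh, ow);

        // Allocate device storage for the batch, the filters, and the output.
        float *x, *w, *y;
        cudaMalloc(&x, sizeof(float) * n * c * ih * iw);
        cudaMalloc(&w, sizeof(float) * k * c * fh * fw);
        cudaMalloc(&y, sizeof(float) * on * oc * oh * ow);

        // Run the batch forward through the convolutional layer.
        float alpha = 1.0f, beta = 0.0f;
        cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                                CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                                NULL, 0,      // implicit GEMM needs no workspace
                                &beta, yDesc, y);

        cudaFree(x); cudaFree(w); cudaFree(y);
        cudnnDestroyTensorDescriptor(xDesc);
        cudnnDestroyTensorDescriptor(yDesc);
        cudnnDestroyFilterDescriptor(wDesc);
        cudnnDestroyConvolutionDescriptor(convDesc);
        cudnnDestroy(handle);
        return 0;
    }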
GPU Cloud Platform
- Clients: Mac and PC; tablet and smartphone
- Expert work on virtual machines; real-time work on a render-farm system; a hands-on training environment
- Renderers: Iray, V-Ray, Octane, ...; CAD and CAE applications; CUDA training
- Output options: a range of displays, high-resolution 4K displays, 3D projectors, 3D printers
- Delivers performance comparable to a personal workstation or a supercomputer
- Review and evaluate project deliverables on a variety of systems
- Real-time work on a render farm preloaded with the renderers used by CAD programs

GPU Cloud Platform: GRID software; Tesla, Quadro, GRID VCA; scalable systems and cloud software
- Systematic training programs run with participating vendors
- Hands-on training through the NVIDIA GPU Cloud Platform
- Growing the supercomputing user base through GPU parallel-programming courses (OpenACC and CUDA)
- Learning system build-out, operations know-how, and maintenance in the field, backed by structured consulting
- Operating and servicing commercial supercomputing software; practical training on a range of programs
- Energizing interdisciplinary projects and convergent thinking; energizing industry-academia projects and strengthening practical skills
http://nvidiakoreapsc.com

CUDA Everywhere
- CUDA tour 2007-2014 across Korean universities: Yonsei, KAIST, Seoul National, Korea, Kyungpook National, GIST, POSTECH, Dong-eui, Inje, UNIST, Pukyong National, Hanyang, University of Seoul, Chungnam National, Tongmyong
- Round-table meeting at Yangjae
- CUDA workshops: Gangchon, Anmyeondo, Deoksan, Seoul, Gonjiam
- CUDA contests and CUDA trainings (with KISTI)
http://nvidiakoreapsc.com

Thank you.

HCI 2015: CUDA Programming – Hyungon Ryu

Introduction to OpenACC – Hyungon Ryu

Parallel Model
- CPU: a master thread (C0) forks CPU worker threads (C0, C1, C2) and joins them
- GPU: the CPU master thread forks GPU threads (G0, G1, G2) and joins them

OpenACC
http://www.openacc.org
- The PGI, Cray, and HMPP compilers support OpenACC
- Current GCC and ICC compilers ignore OpenACC directives
- Without the OpenACC option, PGCC also ignores OpenACC directives

OpenACC Directives

Fortran                  C/C++
!$acc kernels            #pragma acc kernels
!$acc kernels loop       #pragma acc kernels for
!$acc data               #pragma acc data
!$acc end data

Directives: OpenMP (CPU) vs. OpenACC (GPU)

        OpenMP (CPU)                   OpenACC (GPU)
Loop    #pragma omp parallel           #pragma acc kernels
        #pragma omp for                #pragma acc kernels for
Data    shared/private/reduction       #pragma acc data

Example Code: SAXPY

Serial (CPU):

    void saxpy_serial(int n, float a, float *x, float *restrict y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

OpenACC (GPU):

    void saxpy_parallel(int n, float a, float *x, float *restrict y)
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

Compile commands:

    pgcc ./saxpy.c                    (serial)
    pgcc -acc -ta=nvidia ./saxpy.c    (OpenACC)

Performance Test: DGEMM (A x B = C)

Serial (CPU), 0.6440 GFLOPS:

    void simple_dgemm(int n, double alpha, const double *A, const double *B,
                      double beta, double *C)
    {
        int i, j, k;
        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                double inprod = 0;
                for (k = 0; k < n; ++k)
                    inprod += A[k * n + i] * B[j * n + k];
                C[j * n + i] = alpha * inprod + beta * C[j * n + i];
            }
        }
    }

Brainstorming (A x B = C): how to parallelize the FOR loops, how to distribute the data, and how to utilize the GPU.

    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            double val = 0;
            for (int k = 0; k < n; k++)
                val += a[row*n + k] * b[k*n + col];
            c[row*n + col] = val;
        }
    }

OpenACC try 1 – parallelize the loops: 1.1301 GFLOPS on the GPU vs. 0.6440 GFLOPS for the serial CPU version.
OpenACC try 2 – configure the number of blocks and threads: 47.9231 GFLOPS. (A sketch of the annotated code follows.)
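The deck quotes these numbers without showing the annotated source. Here is a minimal sketch of what the first attempt could look like, with assumed clauses, since the exact annotations behind the measured runs are not shown:

    void dgemm_acc(int n, const double *a, const double *b, double *c)
    {
        /* Move a and b to the GPU once; copy c back when the region ends. */
        #pragma acc data copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
        {
            /* Let the compiler map the two outer loops onto GPU threads. */
            #pragma acc kernels loop independent
            for (int row = 0; row < n; row++) {
                #pragma acc loop independent
                for (int col = 0; col < n; col++) {
                    double val = 0;
                    for (int k = 0; k < n; k++)
                        val += a[row*n + k] * b[k*n + col];
                    c[row*n + col] = val;
                }
            }
        }
    }

The second attempt would add explicit mapping clauses, for example #pragma acc loop gang vector(16) on the two outer loops, which is roughly what "configure the number of blocks and threads" refers to.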
CUDA: Converting the CPU Code

The plan: run this function on the GPU, keep the data in GPU memory, and parallelize the FOR loops.

    for (int row = 0; row < n; row++) {
        for (int col = 0; col < n; col++) {
            double val = 0;
            for (int k = 0; k < n; k++)
                val += a[row*n + k] * b[k*n + col];
            c[row*n + col] = val;
        }
    }

CUDA (GPU) kernel:

    __global__ void simple_dgemm_GPU(int n, double alpha, const double *A,
                                     const double *B, double beta, double *C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // row
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // col

        if (i < n && j < n) {            // guard threads outside the matrix
            double prod = 0;
            for (int k = 0; k < n; ++k)
                prod += A[k * n + i] * B[j * n + k];
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }

- CUDA try 1 – naive kernel: 19 GFLOPS (vs. 0.6440 GFLOPS serial)
- CUDA try 2 – use shared memory: 91.7255 GFLOPS
- CUDA try 3 – split into submatrices: 140 GFLOPS (see the tiled sketch below)
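The deck reports the try 2 and try 3 numbers without source. Below is a minimal sketch of the standard shared-memory tiling those steps refer to (the tile size and names are assumptions, not the deck's code): each block stages TILE x TILE submatrices of a and b in on-chip shared memory, so each global-memory element is loaded n/TILE times instead of n times.

    #define TILE 16

    __global__ void dgemm_tiled(int n, const double *a, const double *b, double *c)
    {
        __shared__ double as[TILE][TILE];
        __shared__ double bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        double val = 0;

        // Walk across the k dimension one tile (submatrix) at a time.
        for (int t = 0; t < n / TILE; t++) {
            // Each thread stages one element of the a tile and the b tile.
            as[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
            bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();             // whole tile loaded before use

            for (int k = 0; k < TILE; k++)
                val += as[threadIdx.y][k] * bs[k][threadIdx.x];
            __syncthreads();             // done with tile before reloading
        }
        c[row * n + col] = val;
    }

    // Assumes n is a multiple of TILE. Launch with, e.g.:
    //   dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    //   dgemm_tiled<<<grid, block>>>(n, d_a, d_b, d_c);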
cuBLAS Library: 1,100 GFLOPS

The fastest option replaces the hand-written loop with a cuBLAS call:

    cublasAlloc(n2, sizeof(d_A[0]), (void **)&d_A);
    cublasAlloc(n2, sizeof(d_B[0]), (void **)&d_B);
    cublasAlloc(n2, sizeof(d_C[0]), (void **)&d_C);

    cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
    cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
    cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);

    /* Call the DGEMM subroutine. */
    cublasDgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

    cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

Future of CUDA: The Performance Gap Continues to Grow
[Charts: peak double-precision FLOPS (GFLOPS) and peak memory bandwidth (GB/s), 2008-2014; NVIDIA GPUs (GT200, Fermi, GK110, 2x GK210) pull away from x86 CPUs (Nehalem, Sandy Bridge, Haswell) on both axes]

Remote Development with Nsight Eclipse Edition
- Local IDE, remote application: edit locally; build and run remotely
- Automatic sync via ssh
- Cross-compilation to ARM
- Full debugging and profiling via the remote connection
- Workflow: edit → sync → build → run → debug → profile

Unified Memory: Dramatically Lower Developer Effort
- Developer view today: separate system memory and GPU memory, with explicit copies between them
- Developer view with Unified Memory: a single unified memory space (see the sketch below)
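A minimal sketch, not from the deck, of what this changes in practice, assuming CUDA 6's cudaMallocManaged: one allocation is visible to both processors, so the explicit staging copies disappear.

    #include <cstdio>

    __global__ void inc(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1;
    }

    int main(void)
    {
        const int n = 1 << 20;
        int *data;

        // One allocation, visible to both CPU and GPU; the driver migrates
        // pages on demand instead of the programmer calling cudaMemcpy.
        cudaMallocManaged(&data, n * sizeof(int));

        for (int i = 0; i < n; ++i)             // initialize on the CPU
            data[i] = i;

        inc<<<(n + 255) / 256, 256>>>(data, n); // update on the GPU
        cudaDeviceSynchronize();                // wait before the CPU reads

        printf("data[1] = %d\n", data[1]);      // prints 2
        cudaFree(data);
        return 0;
    }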
CUDA Roadmap

CUDA 6 (April 2014)
- Unified Memory: simpler programming and memory model; operate directly on large datasets that reside in CPU memory
- Multi-GPU-aware libraries: automatic scaling to more than one GPU per node
- Drop-in FFTW and BLAS libraries: accelerate FFT and BLAS with no code changes

CUDA 6.5 (Q3 2014)
- CUDA Fortran tools support
- CUDA tools for Hyper-Q/MPI
- GPUDirect RDMA and OpenMPI optimizations: reduce inter-node latency; improvements for MPI application scaling
- cuFFT callbacks: improves performance
- Better error detection and reporting (XID 13)

CUDA 7 (Q1 2015)
- C++11 support
- Tesla POWER8 support
- CUDA C JIT: compile CUDA kernels at runtime
- PGI supported as host C++ compiler
- RDMA/P2P topology viewer
- Hyper-Q for multi-GPU support

Questions?