How to Avoid Global Synchronization by Domino Scheme Lung-Sheng Chien, NVIDIA [email protected] Outline What should be done in sparse direct solver - what we have now on CPU - what is low hanging fruit on GPU Domino scheme Conclusions We have: Sparse direct solver Work done by professor Timothy A. Davis - UMFPACK: unsymmetric multifrontal sparse LU factorization package - KLU: sparse LU factorization, for circuit simulation - CHOLMOD: supernodal sparse Cholesky factorization and update/downdate. Now it supports GPU - SPQR: Sparse QR Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU Architectures, SIAM CSE 2013 S4204 - Multifrontal Sparse QR Factorization on the GPU, GTC2014 Low hanging fruit Intercept each BLAS call and re-direct it to GPU by - re-designing the code and porting cuBLAS, or - using drop-in library, nvBLAS Pros: easy, not error-prone Cons: PCI bandwidth limits the performance 12GB/s of PCI gen 3 versus 280GB/s of GTX TITAN Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU Architectures, SIAM CSE 2013 Step back and rethink Cholesky Cholesky factorization ( = ) does not need pivoting Symbolic analysis can easily find out zero fill-in and figure out sparsity pattern of Cholesky factor Reform original matrix with sparsity pattern of Cholesky factor Perform Cholesky factorization (the same as incomplete Cholesky factorization) = 0 0 = + () B = 0() ic0 = incomplete Cholesky factorization without zero fill-in Benchmark: spd matrices From Florida Matrix Collection except Laplacian operator Laplacian operator is standard Finite Difference 3-pt, 5-pt and 7pt with Dirichlet boundary condition nnzA is # of nonzeros of A nnzL is # of nonzero of L after symamd levels is # of levels in incomplete Cholesky factorization matrix name m nnzA nnzL nnzA/m nnzL/m levels 1138_bus.mtx 1138 2596 3264 2.3 2.9 40 aft01.mtx 8205 66886 304194 8.2 37.1 699 bcsstk13.mtx 2003 42943 289870 21.4 144.7 673 bcsstk34.mtx 588 11003 55120 18.7 93.7 364 bodyy4.mtx 17546 69742 584769 4.0 33.3 708 ex9.mtx 3363 51417 116942 15.3 34.8 662 LF10000.mtx 19998 59990 69988 3.0 3.5 19997 minsurfo.mtx 40806 122214 1038156 3.0 25.4 1177 nos6.mtx 675 1965 6200 2.9 9.2 101 s1rmt3m1.mtx 5489 112505 543103 20.5 98.9 1133 mhd4800b.mtx 4800 16160 16160 3.4 3.4 1198 Chem97ZtZ.mtx 2541 4951 4951 1.9 1.9 3 bcsstk26.mtx 1922 16129 42243 8.4 22.0 309 plat1919.mtx 1919 17159 66803 8.9 34.8 459 nasa1824.mtx 1824 20516 69926 11.2 38.3 299 ex33.mtx 1733 11961 33167 6.9 19.1 200 Muu.mtx 7102 88618 198606 12.5 28.0 323 sts4098.mtx 4098 38227 144225 9.3 35.2 371 bodyy5.mtx 18589 73935 627327 4.0 33.7 766 gyro_m.mtx 17361 178896 413709 10.3 23.8 489 Pres_Poisson.mtx 14822 365313 2447857 24.6 165.2 2758 Kuu.mtx 7102 173651 385828 24.5 54.3 612 shallow_water1.mtx 81920 204800 2448970 2.5 29.9 1428 finan512.mtx 74752 335872 2028289 4.5 27.1 1681 cant.mtx 62451 2034917 27456636 32.6 439.7 18072 lap1D_3pt_n1000000.mtx 1000000 1999999 1999999 2.0 2.0 1000000 lap2D_5pt_n100.mtx 10000 29800 25116 3.0 2.5 36 lap3D_7pt_n20.mtx 8000 30800 63281 3.9 7.9 141 http://www.cise.ufl.edu/research/sparse/matrices/ lap3D_7pt_n20 Original A symamd(A) L=chol(symamd(A)) domino-Cholesky versus CHOLMOD simplicial Domino-Cholesky is 1.33x faster than CHOLMOD simplicial in average Domino-Cholesky loses when 1) zero fill-in unevenly (load imbalance), 2) too many levels (more like sequential), 3) whole matrix can fit into CPU cache GPU: K40 CPU: i7-3930K CPU @ 3.20GHz 1.33x SuiteSparse-3.6.0 /CHOLMOD/Demo /cholmod_l_demo.c domino-Cholesky versus CHOLMOD Supernodal Domino-Cholesky is 4.7x faster than CHOLMOD supernodal in average Domino-Cholesky wins if there are ONLY few big supernodes or it is NOT sequential GPU: K40 CPU: i7-3930K CPU @ 3.20GHz SuiteSparse-3.6.0 /CHOLMOD/Demo /cholmod_l_demo.c Outline What should be done in sparse direct solver Domino scheme - review scheduler of triangular solve - how to trace DAG by a single kernel Conclusions Triangular solve: global sync CUSPARSE library performs an analysis phase, then a solve phase Maxim Naumov shows how to extract parallelism and how to keep utilization in GTC2012, “On the Parallel Solution of Sparse Triangular Linear System” Global barrier between consecutive two levels Each level running in parallel by atomic operations Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012 Drawback of global synchronization Critical path = max{level 1} + max{level 2} + max{level 3} = row 2 + row 4 + row 9 Avoid global synchronization Critical path = row 1 + row 4 + row 9 Domino scheme Motivation Jonathan Hogg proposed a new trsv (dense triangular solve) in 2012 Applications - csrsv (sparse triangular solve) - csric0 (sparse incomplete Cholesky factorization) - csrilu0 (sparse incomplete LU) - sparse Cholesky - sparse QR by householder reflection or Givens rotation Idea Each row is a domino. A domino waits until parents are done, and triggers next dominos when it is done Requirements - to avoid deadlock, parents of a domino must either be done or be running (a logical CTA id is required) - whole matrix must fit into GPU Jonathan’s trsv DAG (implicit) is described by two constraints - The work associated with a given off-diagonal block cannot begin until the solve for the diagonal block in the column has completed - The diagonal solve in a given row cannot commence until all matrix-vector multiplies for blocks in that row have completed One semaphore (lock) is used to schedule the DAG - the semaphore is an integer indicating which row is done One semaphore is used to query logical CTA id to avoid deadlock CTA 0 CTA 0 time S CTA 1 CTA 1 mv CTA 2 CTA 2 mv mv CTA 3 mv mv CTA 3 S S mv S Extend to sparse linear algebra D1, D2 and D3 run in parallel and other rows are waiting Once D1 is done, it will trigger its children, D4 and D5. Other rows are still waiting Each domino needs a semaphore (lock) Pros and cons of Domino scheme Pros - hide global barrier inside the kernel - overlap write phase and read phase of different dominos - can run without levels - reproducibility (determinism) - quick analysis phase for csrsv/csric0/csrilu0 Cons - load imbalance - long latency of L2 cache - cannot adjust computational resources during the execution - low utilization due to insufficient parallelism Conclusions Domino scheme can trace dependence graph, it works for dense/sparse linear algebra Global synchronization is painful, there are several workarounds - system level (across multiple nodes) overlap MPI transfer and computation - system level (single node) overlap PCI transfer and computation S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD - GPU level domino-scheme Domino scheme still suffers - long latency of L2 cache (hardware issue) - not enough parallelism (characteristic of a matrix) - load imbalance (big trouble) Penalty of domino is significant on csrsv Batched domino is much faster (9x) because - more work to do for each level - penalty of domino becomes small New domino-based functions in r6.0 - csrsv2 - csric02 - csrilu02 - bsrsv2 - bsric02 - bsrilu02 domino Non-domino Reproducibility v x Quick analysis phase v x Memory efficient v x Report structural zero or numerical zero v x Buffer provided by users v x Disable level scheduling v x Sparse direct solver: QR Symbolic analysis to find zero fill-in and dependence graph, then do left-looking householder QR by domino scheme QR has 9x more zero fill-in than Cholesky and is 14x slower than Cholesky 161 37 GPU: K40 Future work for sparse direct solver Sparse QR, including householder and Givens rotation - symbolic analysis on GPU (now on CPU) - improve numerical factorization Sparse LU with partial pivoting Thank you ! [1] umfpack, cholmod, KLU, http://www.cise.ufl.edu/research/sparse/ [2] Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012 [3] J. D. Hogg, A Fast Dense Triangular Solve in CUDA, SIAM Journal on Scientific Computing 2013, Vol. 35, No. 3, pp. C303-C322 [4] Maxim Naumov, What Does It Take to Accelerate SPICE on the GPU?, GTC2013 Related talks GTC2014 S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD Wednesday, 03/26, 14:00 - 14:50, Room LL20D S4243 - A GPU Sparse Direct Solver for AX=B Wednesday, 03/26, 09:00 - 09:25, Room LL20D S4524 - Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Wednesday, 03/26, 09:30 - 09:55, Room LL20D Domino-csrsv versus MKL Domino-csrsv is 4.2x faster than MKL in average 4.2x GPU: K40 CPU: i73930K CPU @ 3.20GHz Benchmark: unsymmetric matrices From Florida Matrix Collection except dense2.mtx dense2.mtx is 2000x2000 dense matrix nnz is # of nonzeros of A levels is # of levels in incomplete LU factorization matrix name cage14.mtx shermanACb.mtx Chebyshev4.mtx dense2.mtx webbase-1M.mtx cage12.mtx lung2.mtx cryg10000.mtx ASIC_680ks.mtx FEM_3D_thermal2.mtx thermomech_dK.mtx ASIC_320ks.mtx cage13.mtx atmosmodd.mtx m nnz nnz/m levels 1505785 27130349 18.0 100 18510 145149 7.8 34 68121 5377761 78.9 17424 2000 4000000 2000.0 2000 1000005 3105536 3.1 512 130228 2032536 15.6 66 109460 492564 4.5 479 10000 49699 5.0 198 682712 2329176 3.4 49 147900 3489300 23.6 3757 204316 2846228 13.9 644 321671 1827807 5.7 35 445315 7479343 16.8 83 1270432 8814880 6.9 352 http://www.cise.ufl.edu/research/sparse/matrices/ No zero fill-in domino-ilu0 versus MKL Domino-ilu0 is 5.4x faster than MKL in average (exclude dense2.mtx) GPU: K40 CPU: i73930K CPU @ 3.20GHz 5.4x Batched domino-ilu0 Same sparsity pattern but different value Batch size is 32 Speedup = (32 * single ilu0)/(batched ilu0) Batched domino-ilu0 is 9x faster than 32 domino-ilu0 in average GPU: K40