Slides

Outline

1.) N-Body Methods

2.) Dynamic Programming

3.) Sparse Linear Algebra

4.) Unstructured Grids

5.) Conclusion
1
N-Body Methods
Overview
N-Body Methods
• Particle simulation
– Large number of simple entities
– Each entity is dependent on each other
– Regularly structured FP computations
• Xeon Phi offers acceptable performance
– Regularity leads to good auto-vectorization
– Cache misses result in very high memory latency
• L1 misses mostly lead to L2 misses
N-Body platform comparison
Source: [3]
Dynamic Programming
Overview
Dynamic Programming
• Problem is divided into smaller problems
– Each smaller (and simpler) problem is solved first
– Results are combined to compute larger problems
– Often used for sequence alignment algorithms
• Example: DNA Sequence Alignment
– Xeon Phi 5110P vs. 2 GPU configurations
( GeForce GTX 480, K20c )
• Outperforms GTX 480
• Reaches 74,3% performance compared to the K20c
– Computation heavily based on peak integer performance
• Xeon Phi has ~ 57,3 % peak integer performance of the K20c
Sparse Linear Algebra
Overview
Sparse Linear Algebra
• Matrices with scattered non-zero entries
• Linear system solvers, Eigen solvers
• Different matrix structures
– Patterns can arise
– Irregularities
• False predictions
• Vector registers contain zero entries
Case Study:
Sparse MatrixVector Operations
• Large sparse matrix A
• Dense vector x
• Multiply-Add
• Simple instructions, large quantitiy
• Parallel execution
• Access patterns
Image Source: [2]
Bottleneck Considerations
Bandwidth limits
Memory bandwidth
Computation bandwidth
Memory latency
SIMD efficiency
Vector register utilization
Core utilization
Algorithm Showcase
• CSR / SELLPACK
format
– Column blocks
– Finite-Window
sorting
• Adaptive load
balancing
Image Source: [1]
Evaluation:
Effects of single improvements
Source: [1]
Evaluation:
Platform comparison
Source: [1]
[Gflops/s]
Evaluation:
Platform comparison
Matrices from the UFL Sparse Matrix Collection (/20)
Matrix 8 has much more non-zero entries than the rest.
Based on data from [2]
Conclusion:
Sparse Linear Algebra on Xeon Phi
• High potential in sparse linear algebra
– Wide vector registers
– High number of cores
• Main problem points:
– Irregular access patterns
– Sparse data structures
– Memory latency high on L2 cache misses
• Performance extremely dependent on data
Unstructured Grids
Image Source: [4]
1
6
Image Source: [5]
Image Source: [6]
Unstructured Grids
●
●
●
●
Partitioning the grid (few connections)
Irregular acces-patterns
=> bad for auto-vectorization
Software prefetching grid cells
Ideal prefetch amount:
MM latency * MM bandwidth
1
9
Unstructured Grids


Data races can not be determined
statically
„loop over edges and accessing data on
edges“ => data races
2
0
Airfoil Benchmark
●
2D inviscid airfoil code
●
considered bandwith bound
●
2,800,000 cell mesh
Image Source: [7]
2
1
Competitors
3500
2 × Xeon E5-2680
Xeon Phi 5110P
Tesla K40
3000
2500
2000
1500
1000
500
0
Price $
# Cores
Mem-Bandwidth
double GFlops
Last Level Cache
2
2
Airfoil on Xeon Phi


At the highest level:
Message Passing Protocol (MPI)
At the lower level:
OpenMP + Vector intrinsics
2
3
Xeon Phi Benchmark graph
Image Source: [8]
2
4
Benchmark graph
Image Source: [8]
2
5
Unstructured Grids conclusion



Benchmark probably not fair (price)
DO manual Vectorization:
Speedup 1.7-1.82
Use MPI
+
OpenMP
2
6
Who did what



Philipp Bartels:
Architecture introduction: slide 18 and 1 to 8
Presentation team x2: slide 1 and 16 to 27
Eugen Seljutin:
Presentation team x2: slide 2 to 15
2
7
Credits



[1] Xing Liu et al.: Efficient Sparse Matrix-Vector Multiplication on
x86-Based Many-Core Processors
[2] Erik Saule et al.: Performance Evaluation of Sparse Matrix
Multiplication Kernels on Intel Xeon Phi
[3] Konstantinos Krommydas et al.: On the Characterization of
OpenCL Dwarfs on Fixed and Reconfigurable Platforms

[4] http://en.wikipedia.org/wiki/Unstructured_grid

[5] http://view.eecs.berkeley.edu/wiki/Unstructured_Grids

[6] https://cfd.gmu.edu/~jcebral/gallery/vis04/index.html
2
8
Credits


[7] http://en.wikipedia.org/wiki/File:Airfoil_with_flow.png
[8] I. Z. Reguly et al.: Vectorizing Unstructured Mesh Computations
for Many-core Architecture
2
9