MPI-3 Is Here: Optimize and Perform with Intel MPI Tools

Gergana Slavova
[email protected]
Technical Consulting Engineer
Intel Software and Services Group
Faster Code Faster
Intel® Parallel Studio XE 2015

Faster Code
- Explicit vector programming speeds more code
- Optimizations for Intel® Xeon Phi™ coprocessor, Skylake, and Broadwell microarchitectures
- MPI library now supports the latest MPI-3 standard
- Faster processing of small matrices
- Parallel direct sparse solvers for clusters

Code Faster
- Comprehensive compiler optimization reports
- Analyze Windows* or Linux* profile data on a Mac*

Latest standards support
- MPI-3, OpenMP 4, full C++11, and Fortran 2003
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Optimization Notice
How Intel® Parallel Studio XE 2015 Helps Make Faster Code Faster for HPC

HPC Cluster - Cluster Edition
- Multi-fabric MPI library
- MPI error checking and tuning

MPI Messages - Professional Edition
- Threading design & prototyping
- Parallel performance tuning
- Memory & thread correctness

Vectorized & Threaded Node - Composer Edition
- Intel® C++ and Fortran compilers
- Parallel models (e.g., OpenMP*)
- Optimized libraries
Achieve High Performance for MPI Cluster Applications with Intel® MPI Library
Available standalone or in the Cluster suite

MPI Library - MPICH based
- MPI 3.0 standard
- Benchmarks & tuning

Elevate development tools to a comprehensive shared, distributed & hybrid application development suite:
- Compilers
- Performance libraries
- Analysis tools

Also available in Intel® Parallel Studio XE 2015 Cluster Edition+
+Available for Windows* and Linux*
Intel® MPI Library Overview

Optimized MPI application performance
- Application-specific tuning
- Automatic tuning

Lower latency and multi-vendor interoperability
- Industry-leading latency
- Performance-optimized support for the latest OFED capabilities through DAPL 2.x

Faster MPI communication
- iWARP
- Optimized collectives

Sustainable scalability beyond 150K cores
- Native InfiniBand* interface support allows for lower latencies, higher bandwidth, and reduced memory requirements

More robust MPI applications
- Seamless interoperability with Intel® Trace Analyzer and Collector
Reduced Latency Means Faster Performance
Intel® MPI Library

Superior performance with Intel® MPI Library 5.0

[Charts: relative (geomean) MPI latency benchmarks across message sizes from 4 bytes to 4 Mbytes; higher is better. Left: 64 processes, 8 nodes (InfiniBand + shared memory), Linux* 64 - Intel MPI 5.0 shows speedups of up to 2.9x over Platform MPI 9.1.2 CE and MVAPICH2 2.0rc2. Right: 192 processes, 8 nodes (InfiniBand + shared memory), Linux* 64 - Intel MPI 5.0 shows speedups of up to 3.4x over OpenMPI 1.7.3 and MVAPICH2 2.0rc2.]

Configuration (left): Hardware: CPU: Dual Intel® Xeon E5-2680 @ 2.70GHz; 64 GB RAM. Interconnect: Mellanox Technologies* MT27500 Family [ConnectX*-3] FDR. Software: RedHat* RHEL 6.2; OFED 3.5-2; Intel® MPI Library 5.0; Intel® MPI Benchmarks 3.2.4 (default parameters; built with Intel® C++ Compiler XE 13.1.1 for Linux*).

Configuration (right): Hardware: Intel® Xeon® CPU E5-2680 @ 2.70GHz, RAM 64 GB. Interconnect: InfiniBand, ConnectX adapters, FDR. MIC: C0-KNC 1238095 kHz; 61 cores; RAM: 15872 MB per card. Software: RHEL 6.2, OFED 1.5.4.1, MPSS Version 3.2, Intel® C/C++ Compiler XE 13.1.1, Intel® MPI Benchmarks 3.2.4.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. *Other brands and names are the property of their respective owners. Benchmark source: Intel Corporation.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.
Intel® MPI Library 5.0 - What's New

MPI-3 Standard Support
- Non-blocking collectives
- Fast RMA
- Large counts

MPICH ABI Compatibility
- Compatibility with MPICH* v3.1, IBM* MPI v1.4, Cray* MPT v7.0

Performance & Scaling
- Memory consumption optimizations
- Scaling up to 150K ranks
- Up to 35% reduction in collective operation times
- Hydra now the default job manager on Windows*
What is in MPI-3?

Topic                  | Motivation                    | Main Result                                 | Supported in Intel® MPI Library 5.0?
Collective Operations  | Collective performance        | Non-blocking & sparse collectives           | Yes
Remote Memory Access   | Cache coherence, PGAS support | Fast RMA                                    | Yes
Backward Compatibility | Buffers > 2 GB                | Large buffer support, const buffers         | Yes, partial
Fortran Bindings       | Fortran 2008                  | Fortran 2008 bindings; removed C++ bindings | No support in MPICH 3.0
Tools Support          | PMPI limitations              | MPIT interface                              | Yes
Hybrid Programming     | Core count growth             | MPI_Mprobe, shared memory windows           | Yes
Fault Tolerance        | Node count growth             | None. Next time?                            | N/A
I want a complete comm/comp overlap

Problem
- Computation/communication overlap is not possible with the blocking collective operations

Solution: Non-blocking Collectives
- Add non-blocking equivalents for the existing blocking collectives
- Do not mix non-blocking and blocking collectives on different ranks in the same operation

Example (C)
// Start synchronization
MPI_Ibarrier(comm, &req);
// Do extra computation
…
// Complete synchronization
MPI_Test(&req, …);
I have a sparse communication network

Problem
- Neighbor exchanges are poorly served by the current collective operations (memory and performance losses)

Solution: Sparse Collectives
- Add blocking and non-blocking Allgather* and Alltoall* collectives based on neighborhoods

Example (Fortran)
call MPI_NEIGHBOR_ALLGATHER(&
& sendbuf, sendcount, sendtype,&
& recvbuf, recvcount, recvtype,&
& graph_comm, ierror)
I want to use one-sided calls to reduce sync overhead

Problem
- MPI-2 one-sided operations are too general to work efficiently on cache-coherent systems and to compete with PGAS languages

Solution: Fast Remote Memory Access
- Eliminate unnecessary overheads by adding a 'unified' memory model
- Simplify the usage model by supporting MPI_Request-based non-blocking calls, extra synchronization calls, relaxed restrictions, shared memory, and much more

Example (Fortran)
call MPI_WIN_GET_ATTR(win, MPI_WIN_MODEL, &
memory_model, flag, ierror)
if (memory_model .eq. MPI_WIN_UNIFIED) then
! private and public copies coincide
I'm sending *very* large messages

Problem
- Original MPI counts are limited to 2 gigaunits, while applications want to send much more

Solution: Large Buffer Support
- "Hide" the long counts inside derived MPI datatypes
- Add new datatype query calls to manipulate long counts

Example (C)
/* mpi_count may be, e.g., 64-bit long */
MPI_Get_elements_x(&status, datatype, &mpi_count);
Achieve High Performance for MPI Cluster Applications with Intel® Trace Analyzer and Collector
A component of Intel® Parallel Studio XE Cluster Edition

A comprehensive shared, distributed & hybrid application development suite:
- Compilers
- Analysis tools
- Performance libraries
Intel® Trace Analyzer and Collector Overview

Intel® Trace Analyzer and Collector helps the developer:
- Visualize and understand parallel application behavior
- Evaluate profiling statistics and load balancing
- Identify communication hotspots

Intel® Trace Collector features
- Event-based approach
- Low overhead
- Excellent scalability
- Powerful aggregation and filtering functions
- Idealizer

Workflow: source code is compiled and linked with -tcollect (or instrumented via the API), or an existing binary is simply run with -trace; the instrumented run produces a trace file (.stf) that Intel® Trace Analyzer then displays.
Strengths of Event-based Tracing

Collect / Record / Predict:
- Detailed MPI program behavior
- Exact sequence of program states - keeps timing consistent
- Information about the exchange of messages: at what times and in which order

An event-based approach is able to detect temporal dependencies!
Using the Intel® Trace Analyzer and Collector is … Easy!

Step 1: Run your binary and create a tracefile
$ mpirun -trace -n 2 ./test

Step 2: View the results
$ traceanalyzer &
Intel® Trace Analyzer and Collector

Key GUI elements:
- Summary page
- Time interval shown
- Aggregation of shown data
- Tagging & filtering
- Imbalance diagram
- Settings
- Compare: view the event timelines of two communication profiles side by side
- Idealizer
- Performance Assistant
- Chart showing how the MPI processes interact (blue = computation, red = communication)
Intel® Trace Analyzer and Collector 9.0 - What's New

MPI Communications Profile Summary overview

Expanded standards support with MPI 3.0

Automatic Performance Assistant
- Detect common MPI performance issues
- Automated tips on potential solutions
- Automatically reports detected performance issues and their impact on runtime
MPI-3 Supported in Intel® Trace Analyzer and Collector 9.0

Support for major MPI-3.0 features:
- Non-blocking collectives (e.g., non-blocking Allreduce, MPI_Iallreduce)
- Fast RMA
- Large counts
Call to Action

Explore MPI-3 for your application today
- Download the Intel® Parallel Studio XE 2015 Cluster Edition
- Tell us about your experiences
  - Intel® Clusters and HPC Technology forums: software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
  - Intel® MPI Library product page (LEARN tab): www.intel.com/go/mpi
Learn More for your Customers - Intel SW Developer Webinars

Topic                                                                                        | Date & Time (Pacific)
Succeed by "Modernizing Code": Parallelize, Vectorize; Examples, Tips, Advice                | Sep 9, 2014, 1:15-2:15 PM
Knights Corner: Your Path to Knights Landing                                                 | Sep 17, 2014, 9:00-10:00 AM
Update Now: What's New in Intel® Compilers and Libraries                                     | Sep 23, 2014, 9:00-10:00 AM
Got Errors? Need to Thread? Intel® Software Dev Tools to the Rescue                          | Sep 30, 2014, 9:00-10:00 AM
MPI-3 is Here: Optimize and Perform with Intel MPI Tools                                     | Oct 7, 2014, 9:00-10:00 AM
How an Oil and Gas Application Gained 10x Performance                                        | Oct 14, 2014, 9:00-10:00 AM
Accelerate Complex Simulations: An Example from Manufacturing                                | Oct 21, 2014, 9:00-10:00 AM
New Intel® Math Kernel Library Features Boost Performance for Tiny and Gigantic Computations | Nov 4, 2014, 9:00-10:00 AM
Design Code that Scales                                                                      |

Sign up: https://software.intel.com/en-us/articles/intel-software-tools-technical-webinar-series
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND
INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are
trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for
use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804