PGAS BoF: The Partitioned Global Address Space (PGAS) Programming Languages
Organizers: Tarek El-Ghazawi, Lauren Smith, Bill Carlson and Kathy Yelick
www.pgas.org

Agenda
• Welcome – Tarek El-Ghazawi and Lauren Smith (2 min)
• PGAS 2014 Summary – Jeff Hammond (8 min)
• PGAS 2015 Plans – Tarek El-Ghazawi (3 min)
• PGAS Announcements (2 min)
• Quick Updates (20 min)
  – Applications – Kathy Yelick, Irene Moulitsas and Christian Simmendinger
  – Technology – Sameer Shende, Oscar R. Hernandez, Nenad Vukicevic, Yili Zheng, Hitoshi Murai, Vivek Kumar, Jacob Nelson, Deepak Majeti, Olivier Serres, Sung-Eun Choi and Salvatore Filippone
• Questions (25 min)

Summary of PGAS 2014 (one person's impression – Jeff Hammond)
Stuff to read:
• http://www.csm.ornl.gov/OpenSHMEM2014/index.html
• http://nic.uoregon.edu/pgas14/schedule.php
• http://nic.uoregon.edu/pgas14/keynote.php

OpenSHMEM User Group
• New features proposed: thread support, flexible memory allocation, fault tolerance (FT).
• Multiple papers on how to implement them efficiently.
• Major progress towards an open community standardization discussion (it resembled the MPI Forum).

Intel PGAS Tutorial
• Intel: prototype SHMEM over SFI
• Intel: MPI-3 features for PGAS
• LBNL: UPC/GASNet
• OSU: MVAPICH2-X
• Intel: OpenSHMEM over MPI-3
Users are interested in both SFI and MPI-3 as new PGAS runtimes.

PGAS is not just about UPC…
• Day-long OpenSHMEM User Group event
• Full session on OpenSHMEM in PGAS
• Runtimes: SFI, GASNet, MPI-3, OpenCoarrays
• Models: HabaneroUPC++, HPX, Chapel, UPC++
• Applications using MPI+OpenSHMEM, UPC++, CAF
The big winners at PGAS14 were C++ and OpenSHMEM…

Best Paper
"Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi" – Naveen Namashivayam, Sayan Ghosh, Dounia Khaldi, Deepak Eachempati and Barbara Chapman.
(Photo: Steve Oberlin, NVIDIA CTO of Tesla, presenting the Best Paper award to Naveen.)

Panel summary
• Need better hardware support for PGAS in general; 128-bit pointers and X1E behavior desired…
• Questions about heterogeneity (vendor/ISA/core size).
• Debate about active messages.
• Unclear if/how Chapel will ever take off. What is the killer app/library here?
• The Python ecosystem is overwhelming. How do we get this for PGAS?

PGAS 2015 Plans
Tarek El-Ghazawi, General Chair; D.K. Panda, Program Chair
• Tentative dates: September 16–18, 2015, starting on the morning of the 16th and ending around noon on the 18th.
• Location: GW Main Campus, D.C. city center, at the Foggy Bottom Metro stop.
• Metro in from anywhere in Greater DC; walk to the White House and to tens of DC restaurants; visit the Smithsonians; visit Georgetown for fun, or NSF, DARPA, DoE, .. for funding!
• Keep an eye on www.pgas.org for emerging details.

PGAS Announcements
• PGAS booth at SC2014: #2255
• BoFs Wednesday 5:30pm–7pm:
  – Application Experiences with Emerging PGAS APIs: MPI-3, OpenSHMEM and GASPI – room 386
  – Chapel Users Group Meeting – room 383
  – OpenSHMEM: Further Developing the SHMEM Standard for the HPC Community – room 294
• Mailing list: to register, send an empty email to [email protected]
• Announcements from the audience

Low Overhead Atomic Updates Enable a Genomics Assembly Grand Challenge
Meraculous assembly pipeline: reads → k-mers → contigs
• New fast I/O using SeqDB over HDF5.
• New analysis filters errors using a probabilistic Bloom filter.
• Graph algorithm (connected components) scales to 15K cores on NERSC's Edison.
The Meraculous assembler is used in production at the Joint Genome Institute.
• Wheat assembly is a "grand challenge"; the hardest part is contig generation (a large in-memory hash table).
• Human genome: 44 hours down to 20 seconds. Wheat: from "doesn't run" to 32 seconds.
• Ongoing work: scaffolds using scalable alignment.
UPC
• Gives tera- to petabyte "shared" memory.
• Combined with parallel I/O, a new genome mapping algorithm anchors 92% of the wheat chromosome.
• Uses dynamic aggregation.
Kathy Yelick

Dynamic Runtime for Productive Asynchronous Remote Updates Enables Scalable Data Fusion
• Old PGAS: e.g. *p = … or … = a[i];
• New PGAS: asynchronous invocation, e.g. finish { … async f(x) … }
• Uses UPC++ async unpack.
• Seismic modeling for energy applications "fuses" observational data into the simulation.
• The PGAS illusion of scalable shared memory is used to construct the matrix and measure the data "fit".
• The new UPC++ dialect supports PGAS libraries; a distributed data structure library is planned.
(Scaling results shown from 48 to 12K cores.)
Kathy Yelick

Lattice Boltzmann solver using UPC
Dr. Irene Moulitsas ([email protected]), School of Engineering
The LB method is a mesoscopic approach to fluid dynamics. Its governing equation describes how a density distribution function changes in time. Numerically, this is resolved in certain directions and the equation is solved in two steps: a collision step and a streaming step. The method resolves the velocity, pressure and density fields for incompressible flows.
(Figures: validation – the captured von Kármán vortex street behind a cylinder; validation – velocity magnitudes and streamlines in the lid-driven cavity flow; intra-node speedup of the UPC version vs. the serial version on ASTRAL (SGI) on an 800x800 mesh; inter-node speedup of the UPC version vs. the serial version on ASTRAL (SGI) on a 1600x1600 mesh.)
Irene Moulitsas

Navier-Stokes solver using CAF
Dr. Irene Moulitsas ([email protected]), School of Engineering
We solve the compressible Navier-Stokes equations on mixed-type unstructured meshes employing different numerical schemes – First Order, MUSCL-2, MUSCL-3, WENO-3, WENO-4, WENO-5. Validation studies were performed for the RAE 2822 and NACA 0012 aerofoils; experimental data was obtained from the NPARC Alliance Verification and Validation archive.
(Figures: speedup of the Coarray Fortran version vs. the MPI version on ASTRAL (SGI) using the Intel compiler, and on ARCHER (Cray XC-30) using the Cray compiler.)
Irene Moulitsas

GASPI/GPI2
GASPI – a failure-tolerant PGAS API for asynchronous dataflow on heterogeneous architectures.
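The core primitive behind this asynchronous dataflow model is a one-sided write that raises a notification at the target, which the receiver waits on instead of posting a matching receive. Below is a minimal sketch of one such notified halo write; it is not taken from GPI2 or CFD-Proxy, and the segment layout, sizes, ring neighbor and queue are illustrative assumptions.

  // Sketch only: one notified halo write in the GASPI style. The segment
  // layout (send halo at offset 0, receive halo at offset halo_bytes), the
  // ring neighbor, the queue and the sizes are illustrative assumptions.
  #include <GASPI.h>

  int main(int argc, char *argv[]) {
    gaspi_proc_init(GASPI_BLOCK);

    gaspi_rank_t rank, nprocs;
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&nprocs);

    // One symmetric segment holding the send halo and the receive halo.
    const gaspi_segment_id_t seg = 0;
    const gaspi_size_t halo_bytes = 1024;
    gaspi_segment_create(seg, 2 * halo_bytes, GASPI_GROUP_ALL,
                         GASPI_BLOCK, GASPI_MEM_INITIALIZED);

    const gaspi_rank_t right = (gaspi_rank_t)((rank + 1) % nprocs); // assumed neighbor
    const gaspi_queue_id_t queue = 0;

    // Write my send halo into the neighbor's receive halo and raise a
    // notification there, all in one asynchronous call.
    gaspi_write_notify(seg, 0, right,
                       seg, halo_bytes, halo_bytes,
                       (gaspi_notification_id_t)rank, 1, queue, GASPI_BLOCK);

    // The receiver waits for the notification instead of a receive.
    gaspi_notification_id_t got;
    gaspi_notification_t old_val;
    gaspi_notify_waitsome(seg, 0, nprocs, &got, GASPI_BLOCK);
    gaspi_notify_reset(seg, got, &old_val);

    gaspi_wait(queue, GASPI_BLOCK);   // local send buffer reusable again
    gaspi_proc_term(GASPI_BLOCK);
    return 0;
  }

A full halo exchange, such as the one in the CFD-Proxy benchmark described next, would typically post one such write per neighbor and overlap the notification waits with computation on interior points.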
GPI2 1.1.1
• Support for GPU/Xeon Phi.
• Minor fixes.

PGAS community benchmarks – CFD-Proxy
• Multithreaded OpenMP/MPI/GASPI calculation of Green-Gauss gradients for a 2-million-point aircraft mesh with halo (ghost cell) exchange.
• Strong scaling benchmark; we aim for ~100 points per thread/core.
• See the BoF "Application Experiences with Emerging PGAS APIs: MPI-3, OpenSHMEM and GASPI", Wednesday, 17:30.
• https://github.com/PGAS-community-benchmarks/CFD-Proxy
Christian Simmendinger

TAU Performance System®
• Parallel profiling and tracing toolkit; supports UPC, SHMEM and Co-Array Fortran.
• Cray CCE compiler support for instrumentation of UPC:
  – Notify, fence, barrier, loops and forall instrumented.
  – Compiler-based instrumentation.
  – Runtime layer instrumentation, DMAPP layer.
  – 3D communication matrix, trace-based views.
• Added support for sampling; initial support for rewriting binaries.
• Other compilers supported: Berkeley UPC, IBM XLUPC, GNU UPC.
• Planned: PDT update with EDG 4.9 UPC.
• Support for OpenSHMEM, Mellanox OpenSHMEM, Cray SHMEM and SGI SHMEM.
• HPCLinux LiveDVD/OVA [http://www.hpclinux.org].
• Please stop by the PGAS booth (#2255) for more information.
http://tau.uoregon.edu
Sameer Shende

OpenSHMEM Highlights
Oscar Hernandez, Pavel Shamis, Manju Venkata – ORNL, UH & UTK team
Overview
• ORNL and the University of Houston are driving the OpenSHMEM specification.
• The OpenSHMEM 1.1 specification was ratified in June 2014.
• We have defined a new roadmap for OpenSHMEM, versions 1.1, 1.2, 1.5 and 2.0, with community input.
• Recent work includes building a community and a tools eco-system for OpenSHMEM.
Accomplishments
• The OpenSHMEM 1.1 specification was released; we are working with the community on the OpenSHMEM 1.2 specification.
• The OpenSHMEM reference implementation is integrated with UCCS and runs on InfiniBand and uGNI.
• OpenSHMEM User Group Meeting (OUG14): http://www.csm.ornl.gov/OpenSHMEM2014/
• The programming environment was enhanced with new tools (Vampir/Score-P, OpenSHMEM Analyzer, TAU).

OpenSHMEM Roadmap
• OpenSHMEM v1.1 (June 2014) – ratified (100+ tickets resolved)
  – Errata, bug fixes
• OpenSHMEM v1.2 (November 2014) (20+ tickets)
  – API naming convention
  – finalize(), global_exit()
  – Consistent data type support
  – Version information
  – Clarifications: zero-length, wait
• OpenSHMEM v1.5 (late 2015)
  – shmem_local_ptr()
  – Non-blocking communication semantics (RMA, AMO)
  – Thread safety
  – Active set + memory context
  – Exit codes
  – Locality
• OpenSHMEM v1.6 – non-blocking collectives
• OpenSHMEM v1.7 – thread safety update
• OpenSHMEM Next Generation (2.0) – let's go wild!!! (Exascale!)
  – Teams, groups
  – Fault tolerance
  – I/O
White paper: OpenSHMEM Tools API
OpenSHMEM 2014 user group meeting: http://www.csm.ornl.gov/OpenSHMEM2014/
Oscar R. Hernandez

OpenSHMEM User Meeting 2014
• Co-located with PGAS'14 in Eugene, Oregon – October 7th, 2014.
• Two invited speakers: NVIDIA and Mellanox.
• Working meeting:
  – Updates from the vendors/community
  – OpenSHMEM work-in-progress papers (10 short papers)
  – OpenSHMEM 1.2 ratification and roadmap discussions
• Call for WIP (work-in-progress) papers: an addendum to the PGAS proceedings. Goal: the community presented extension proposals and discussed their OpenSHMEM experiences.
• Website: www.csm.ornl.gov/workshops/oug2014
Oscar R. Hernandez
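As background for the roadmap items above (the shmem_init/shmem_finalize naming introduced in 1.2, symmetric allocation, one-sided puts and remote atomics), here is a minimal, purely illustrative OpenSHMEM sketch using 1.2-style names; it is not drawn from any of the talks.

  // Minimal OpenSHMEM sketch (1.2-style names). Every PE atomically
  // increments a counter on PE 0 and writes its rank into a symmetric
  // array on its right neighbor. Purely illustrative.
  #include <shmem.h>
  #include <stdio.h>

  int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    // Symmetric allocations: the same object exists on every PE.
    int *counter = (int *) shmem_malloc(sizeof(int));
    int *ranks   = (int *) shmem_malloc(npes * sizeof(int));
    *counter = 0;
    shmem_barrier_all();

    // Low-overhead remote atomic: fetch-and-add on PE 0's counter.
    int ticket = shmem_int_fadd(counter, 1, 0);

    // One-sided put: write my rank into my slot on the right neighbor.
    int right = (me + 1) % npes;
    shmem_int_put(&ranks[me], &me, 1, right);

    shmem_barrier_all();
    if (me == 0)
      printf("counter = %d (expected %d), my ticket was %d\n",
             *counter, npes, ticket);

    shmem_free(counter);
    shmem_free(ranks);
    shmem_finalize();
    return 0;
  }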
GNU/Clang UPC – Intrepid Technology
• Clang UPC 3.4.1 released (clangupc.github.io)
  – UPC 1.3 specification compliant
  – Integrated into the latest Berkeley UPC toolset (2.20.0)
  – SMP and InfiniBand Portals4 runtime (clangupc.github.io/portals4)
• Clang UPC2C translator (clangupc.github.io/clang-upc2c)
  – Berkeley UPC translator compatible
  – Integrated into the latest Berkeley UPC toolset (2.20.0)
• Clang UPC with remote pointers in LLVM (clangupc.github.io/clang-upc-ir)
  – Experimental branch with UPC shared pointers in the LLVM IR
  – Passes all UPC tests; full integration expected in 2015
• Clang UPC with libfabric runtime (clangupc.github.io/libfabric)
  – InfiniBand-based runtime that supports the OFIWG libfabric specification
• GNU UPC 5.0 (www.gccupc.org)
  – Soon to be released; planned to merge into the GCC main trunk
Nenad Vukicevic

UPC++ Adds Data Structures, Hierarchy and Async to Traditional (UPC) PGAS
Yili Zheng and Amir Kamil, LBNL, leads
• Traditional (UPC) PGAS: a PGAS address space with put/get; default SPMD – all parallel, all the time.
• UPC++:
  – Hierarchical parallelism (on-node, …)
  – Remote async (dedicate some threads to handling it)
  – Distributed data structure library in progress (multi-D arrays, hash tables)
  – Programmer-selected runtime (task queue, DAG, etc.)
• Hierarchical teams, locales and coarrays support portable code for deep memory hierarchies.
• PGAS locality and lightweight communication match NUMA, small memory per core, and software-managed memories.
(Figure: strawman exascale node architecture – memory stacks on package (low capacity, high bandwidth), latency-optimized wide cores, DRAM/DIMMs (high capacity, low bandwidth), NVRAM burst buffers / rack-local storage, and a "NIC" on the board.)
• Domain-specific data structures (arrays, hash tables, etc.) and one-sided communication improve programmability: two-sided message passing with buffers requires four steps, whereas a UPC++ 3D-array halo copy takes one step, with the details handled by the library. A sketch of the UPC++ style follows the HabaneroUPC++ item below.
(Figure: halo copy between a local box and a remote box with unit-stride data.)
Yili Zheng

XcalableMP (www.xcalablemp.org)
A directive-based PGAS extension for Fortran and C.
• Proposed by the XMP Spec WG of the PC Cluster Consortium.
• Ver. 1.2.1 of the spec is available; XMP/C++ is now on the table.
• Adopted by the Post-T2K and Post-K projects in Japan.
• Supports two parallelization paradigms: global-view (with HPF-like data/work mapping directives) and local-view (with coarrays).
• Allows mixing with MPI and/or OpenMP.
Example (data mapping, work mapping and stencil communication):
  !$xmp nodes p(2,2)
  !$xmp template t(n,n)
  !$xmp distribute t(block,block) onto p
        real a(n,n)
  !$xmp align a(i,j) with t(i,j)
  !$xmp shadow a(1,1)
  !$xmp reflect (a)
  !$xmp loop (i,j) on t(i,j)
        do j = 2, n-1
          do i = 2, n-1
            w = a(i-1,j) + a(i+1,j) + ...
            ...
Hitoshi Murai

Omni XMP compiler (omni-compiler.org)
• Reference implementation being developed by RIKEN and the University of Tsukuba.
• The latest version, 0.9.0, is now released as open source.
• Platforms: the K computer, Cray, IBM Blue Gene, NEC SX, Hitachi SR, Linux clusters, etc.
• HPC Challenge class 2 award in 2013.
• Applications: plasma (3D fluid), seismic imaging (3D stencil), fusion (particle-in-cell), etc.
(Figure: HPL performance in TFLOPS on the K computer, XMP vs. peak, versus number of nodes, 2013.)
Hitoshi Murai

HabaneroUPC++: a Compiler-free PGAS Library
Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlić, Vivek Sarkar
(Figure: LULESH performance for HabaneroUPC++.)
See the PGAS 2014 paper and the SC14 PGAS booth poster for more details!
GitHub site: http://habanero-rice.github.io/habanero-upc/
Vivek Kumar
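The UPC++ slide above contrasts traditional blocking puts/gets with asynchronous remote invocation (finish/async). The sketch below, referenced there, illustrates the same ideas with today's UPC++ API (upcxx::rpc, upcxx::rput, upcxx::dist_object); the 2014 dialect shown at the BoF spelled these constructs differently, so treat this as an illustration of the style rather than the exact API presented.

  // One-sided put plus an asynchronous remote invocation in UPC++.
  // Written against the current UPC++ API; all variable names are
  // illustrative and nothing here is taken from the BoF slides.
  #include <upcxx/upcxx.hpp>
  #include <iostream>

  int main() {
    upcxx::init();
    int me = upcxx::rank_me();
    int n  = upcxx::rank_n();

    // Each rank owns one shared integer; dist_object lets others find it.
    upcxx::global_ptr<int> mine = upcxx::new_<int>(0);
    upcxx::dist_object<upcxx::global_ptr<int>> slot(mine);

    // One-sided put into the right neighbor's integer (no matching receive).
    int right = (me + 1) % n;
    upcxx::global_ptr<int> dest = slot.fetch(right).wait();
    upcxx::rput(me, dest).wait();
    upcxx::barrier();

    // Asynchronous remote invocation: run a lambda on rank 0 ("async f(x)").
    upcxx::future<> f = upcxx::rpc(0,
        [](int from, int seen) {
          std::cout << "rank 0: rank " << from
                    << " stores value " << seen << "\n";
        },
        me, *mine.local());
    f.wait();

    upcxx::barrier();
    upcxx::delete_(mine);
    upcxx::finalize();
    return 0;
  }

The rput corresponds to the one-step copy argument on the slide: the initiator names the remote location directly and the library handles the transfer, with no receive-side buffering or unpacking.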
Grappa: latency-tolerant PGAS (http://grappa.io)
Can a PGAS system like Grappa be the platform for in-memory "big data" frameworks?
Model, lines of code and speedup over the compared framework:
• In-memory MapReduce – 152 lines, 10x faster
• Vertex-centric graph API – 60 lines, 1.3x faster
• Relational query execution – 700 lines, 12.5x faster
Example vertex program (global view, parallel loops; Grappa context-switches on long-latency remote operations and migrates computation when the return value is not needed):

  Graph<VertexData, EdgeData> g;
  while (g->active_vertices > 0) {
    // gather values from neighboring vertices
    forall(g, [=](Vertex& v){
      if (!v->delta_cache_enabled) {
        forall<async>(v.in_edges(), [=](Edge& e){
          v->value += on(e.source, [=](Vertex& src){
            return gather_edge(src, e);
          });
        });
      }
    });
    // apply phase
    forall(g, [=](Vertex& v){ apply_vertex(v); });
    // scatter phase (also updates cached gather value)
    forall(g, [=](Vertex& v){
      if (v->active) {
        forall<async>(v.out_edges(), [=](Edge& e){
          on<async>(e.target, [=](Vertex& tgt){
            tgt.delta_cache += scatter_edge(e);
          });
        });
      }
    });
  }

Jacob Nelson

Chapel on HSA + XTQ
• Current Chapel framework: local tasks via threads; remote tasks via GASNet Active Messages.
• Proposed Chapel framework (Chapel on HSA + XTQ): local tasks via HSA; remote tasks via XTQ.
(Figure: software stacks of the two frameworks – Chapel apps compile to a C program (current) or a C program plus HSAIL (proposed), running on the task runtime over threads and GASNet on CPUs, or over HSA and GASNet-X with XTQ on APUs.)
Contact: Mauricio Breternitz ([email protected]), AMD; Deepak Majeti ([email protected]), Rice University
Deepak Majeti

Hardware Support for Efficient PGAS Programming
• Mapping of the PGAS memory model to virtual memory through hardware support and an instruction-set extension: low latency; handles local accesses and random access patterns.
• Prototype hardware using FPGAs; full-system simulation in gem5.
• Prototype compiler support based on Berkeley UPC over GASNet.
• 5.5x performance increase without the need for manual optimizations; within 10% (faster or slower) of manually optimized code.
References:
• O. Serres, A. Kayi, A. Anbar, and T. El-Ghazawi, "Hardware support for address mapping in PGAS languages; a UPC case study," in CF '14: Proceedings of the ACM Conference on Computing Frontiers, pp. 22:1–22:2, ACM, 2014.
• O. Serres, A. Kayi, A. Anbar, and T. El-Ghazawi, "Enabling PGAS productivity with hardware support for shared address mapping; a UPC case study," in HPCC 2014: the 16th IEEE International Conference on High Performance Computing and Communications, 2014.
Olivier Serres

Chapel: A Condensed Digest
Past: a DARPA HPCS-funded research prototype
• Designed from first principles rather than by extension:
  – Supports modern features while avoiding sequential baggage.
  – Not SPMD; global multitasking is more general and suitable for next-generation architectures.
  – Namespace defined by lexical scoping: more natural and intuitive.
• Open source and collaborative.
Present: towards production grade
• Performance improvements and language features.
• Next-generation architectures (e.g., massively multithreaded, accelerators) – come hear more at the Emerging Technologies booth (#233).
• New collaborations (e.g., Colorado State, AMD, ETH Zurich).
• Other application areas (e.g., Big Data, interpreted environments).
• Increasing the user base (e.g., co-design centers, universities, industry).
Future: the Chapel Foundation
• An independent entity to oversee the language specification and the open-source implementation.
• Come hear more at the CHUG BoF (5:30–7pm, room 383-84-85).
Sung-Eun Choi

OpenCoarrays: Coarrays in GNU Fortran
• OpenCoarrays is a free and efficient transport layer that supports coarray Fortran compilers; the GNU Fortran compiler already uses it as an external library.
• There are currently two versions: MPI-based and GASNet-based.
• GNU Fortran + OpenCoarrays performance:
  – gfortran is better than the Intel compiler in almost every coarray transfer.
  – gfortran is better than Cray for some common communication patterns.
• OpenCoarrays is distributed under the BSD-3 license and is scheduled for GCC 5.0. For more information, please visit opencoarrays.org.
Salvatore Filippone

PGAS
• PGAS booth at SC2014: #2255
• BoFs Wednesday 5:30pm–7pm:
  – Application Experiences with Emerging PGAS APIs: MPI-3, OpenSHMEM and GASPI – room 386
  – Chapel Users Group Meeting – room 383
  – OpenSHMEM: Further Developing the SHMEM Standard for the HPC Community – room 294
• Mailing list: to register, send an empty email to [email protected]
• Website: www.pgas.org
Any other PGAS announcements?