Shoal: smart allocation and replication of memory for parallel programs
Stefan Kaestle (student, presenting), Reto Achermann (student), Timothy Roscoe, Tim Harris†
Systems Group, Dept. of Computer Science, ETH Zurich
† Oracle Labs, Cambridge, UK
Modern NUMA multicore machines exhibit complex latency and throughput characteristics, making it hard to allocate memory optimally for a given program's access patterns. However, good placement of and access to program data is crucial for application performance and, if not done carefully, can significantly impact scalability [1, 3]. Although there is research (e.g. [1, 2]) on how to adapt software to the concrete characteristics of such machines, many programmers struggle to apply these techniques to their applications.
It is unclear which NUMA optimization to apply in which
setting. For example, some programs have good access locality and it is worth co-locating code with the data it is
accessing. Other programs exhibit random accesses and it
is best to spread code and data across the entire machine
to maximize total bandwidth to memory. The situation becomes even more complex as we consider large page sizes
(where the data on a large page must be co-located physically), non-cache-coherent memory with different characteristics, or heterogeneous and specialized cores. With rapidly
evolving and diversifying hardware, programmers must repeatedly make manual changes to their software to keep up
with new hardware performance properties.
One solution to achieve better data placement and faster
data access is to rely on automatic online monitoring of program performance to decide how to migrate data [3]. However, the program's semantics then have to be inferred in retrospect from a narrow set of available information (i.e., sampled performance-counter data). Such
approaches are also limited to a relatively small number of
optimizations. For example, it is hard to incrementally activate large pages or to dynamically enable the use of DMA hardware for data copies.
We present Shoal, a system that abstracts memory access and provides a rich programming interface that accepts
hints on memory access patterns at runtime. These hints can
either be manually written or automatically derived from
high-level descriptions of parallel programs. Shoal includes
a machine-aware runtime that selects optimal implementations for this memory abstraction dynamically based on the
hints and the concrete combination of machine and workload. Where the hardware provides them, Shoal exploits not only NUMA properties but also features such as large and huge pages and DMA copy engines.
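To make this interface concrete, a hint-annotated array allocation in C++ might look roughly like the sketch below; the namespace shl, the array template, and the hint flags are hypothetical names chosen for this illustration and do not reproduce the actual Shoal API.

// Hypothetical sketch of a hint-based array abstraction; names and flags
// are illustrative only, not the actual Shoal interface.
#include <cstddef>

namespace shl {

enum access_hint : unsigned {
    READ_MOSTLY   = 1u << 0,  // rarely written after initialization
    RANDOM_ACCESS = 1u << 1,  // accesses are not index-local
    SEQUENTIAL    = 1u << 2,  // threads scan contiguous index ranges
};

template <typename T>
class array {
public:
    // The runtime would pick an implementation (replicated, distributed,
    // partitioned; small or huge pages) from the hints and the machine.
    array(std::size_t n, unsigned hints);
    T& operator[](std::size_t i);
    std::size_t size() const;
};

} // namespace shl

// Usage: a PageRank-style rank vector that is mostly read, with random
// accesses to neighbours, could be allocated as
//   shl::array<double> pg_rank(num_nodes,
//                              shl::READ_MOSTLY | shl::RANDOM_ACCESS);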
Our contributions are as follows: Firstly, we provide
higher-level memory abstractions to allow Shoal to change
memory allocation and access at runtime. In our prototype,
we provide an array-like abstraction, designed for use in
C/C++ code. Secondly, we introduce techniques for automatically selecting among several highly tuned array implementations based on access patterns and machine characteristics. Finally, we modified Green-Marl [4], a graph analytics language, to show how Shoal can extract access patterns automatically from high-level descriptions. We demonstrate significant performance benefits for unmodified Green-Marl programs when they are compiled to use Shoal.
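As one example of how such a selection could work, the decision function below combines access-pattern hints with basic machine parameters; the thresholds, names, and policy are assumptions for illustration, not Shoal's actual selection logic.

// Illustrative selection logic only; thresholds and policy are assumptions.
#include <cstddef>

enum class layout { single_node, replicated, distributed, partitioned };

struct machine_info {
    int         numa_nodes;    // number of NUMA nodes (e.g. from libnuma)
    std::size_t mem_per_node;  // bytes of memory available per node
};

struct array_hints {
    std::size_t bytes;         // total size of the array
    bool        read_mostly;   // few writes after initialization
    bool        random_access; // accesses are not index-local
};

// Choose an array implementation from hints and machine characteristics.
layout choose_layout(const array_hints& a, const machine_info& m) {
    if (m.numa_nodes <= 1)
        return layout::single_node;
    // Read-mostly data that fits on every node benefits from replication:
    // each thread then reads a node-local copy.
    if (a.read_mostly && a.bytes < m.mem_per_node / 2)
        return layout::replicated;
    // Randomly accessed, written data is spread across nodes to use the
    // aggregate memory bandwidth of the whole machine.
    if (a.random_access)
        return layout::distributed;
    // Otherwise partition by index range and co-locate each range with
    // the threads that scan it.
    return layout::partitioned;
}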
[1] Appavoo, J., Silva, D. D., Krieger, O., Auslander, M., Ostrowski, M., Rosenburg, B., Waterland, A., Wisniewski, R. W., Xenidis, J., Stumm, M., and Soares, L. Experience Distributing Objects in an SMMP OS. ACM Transactions on Computer Systems 25, 3 (Aug. 2007).
[2] Baumann, A., Barham, P., Dagand, P.-E., Harris, T., Isaacs, R., Peter, S., Roscoe, T., Schüpbach, A., and Singhania, A. The Multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, ACM, pp. 29–44.
[3] Dashti, M., Fedorova, A., Funston, J., Gaud, F., Lachaize, R., Lepers, B., Quema, V., and Roth, M. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM, pp. 381–394.
[4] Hong, S., Chafi, H., Sedlar, E., and Olukotun, K. Green-Marl: A DSL for Easy and Efficient Graph Analysis. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2012), ASPLOS XVII, ACM, pp. 349–362.
Shoal: self-tuning of memory allocation and access based on access patterns and hardware characteristics helps to tackle hardware complexity
Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris
Problem
Multicores: complex memory subsystems
• non-uniform memory access (NUMA)
• complex interconnects
• features (DMA, large/huge pages)
Future:
• global address space?
• cache coherence?
→ Hard to program
Ideas
High-level program (Green-Marl PageRank example):
Procedure pagerank(G: Graph, e, d: Double,
                   max: Int; pg_rank: Node_Prop<Double>) {
    // […] Initialization
    Do {
        diff = 0.0;
        Foreach (t: G.Nodes) {
            Double val = (1 - d) / N + d *
                Sum(w: t.InNbrs) { w.pg_rank / w.OutDegree() };
            diff += | val - t.pg_rank |;
            t.pg_rank <= val @ t;
        }
        cnt++;
    } While ((diff > e) && (cnt < max));
}
[Toolchain diagram: high-level program → high-level compiler → access patterns + low-level code (C/C++) using the Shoal abstraction; program → compiler → Shoal library, which combines the access patterns with hardware characteristics.]
Auto-Tune Memory Layout
• memory abstraction: annotate memory allocation
• memory access patterns
  • from high-level languages
  • manually
• use hardware characteristics
• memory abstraction (arrays)
  • specialized implementations (replication, distribution), as sketched below
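A minimal sketch of how the replicated and distributed implementations could be backed by libnuma on Linux follows; this is an assumption about one possible realization, not Shoal's actual allocator (build with -lnuma, and a real allocator would check numa_available() first).

// Sketch only: one possible libnuma-based backing for the two layouts.
#include <numa.h>   // numa_alloc_interleaved, numa_alloc_onnode, numa_max_node
#include <cstddef>
#include <vector>

// Distributed array: interleave pages round-robin across all NUMA nodes
// to maximize aggregate memory bandwidth for random accesses.
double* alloc_distributed(std::size_t n) {
    return static_cast<double*>(numa_alloc_interleaved(n * sizeof(double)));
}

// Replicated array: one full copy per NUMA node. Readers use the copy on
// their local node; writes must be applied to every copy.
std::vector<double*> alloc_replicated(std::size_t n) {
    std::vector<double*> copies;
    for (int node = 0; node <= numa_max_node(); ++node)
        copies.push_back(static_cast<double*>(
            numa_alloc_onnode(n * sizeof(double), node)));
    return copies;
}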
Abstraction
• decouples memory from program logic
  → exchange array implementation online
• high-level operations (copy, memset, ...)
  → transparent use of large page sizes and DMA engines, as sketched below
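As a sketch of what hiding these operations behind the abstraction could look like, the code below uses Linux-specific assumptions: mmap with MAP_HUGETLB for large pages and plain memcpy as the copy fallback; a DMA-engine backend would sit behind the same call.

// Sketch only: hide large pages and the copy backend behind two calls.
#include <sys/mman.h>
#include <cstddef>
#include <cstring>

// Allocate 'bytes' (assumed to be a multiple of the huge-page size),
// preferring huge pages and falling back to normal pages.
void* shl_alloc(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p;
}

// High-level copy: the caller never sees whether memcpy, per-node copies,
// or an asynchronous DMA transfer is used underneath.
void shl_copy(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);  // fallback path; DMA backend omitted
}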
Conclusion
• automatic scalability
• good performance across various NUMA machines
• no programmer effort
Future work
• synchronization
• scheduling
• accelerators (Xeon Phi)
• multiple address spaces