2008 11th IEEE International Conference on Computational Science and Engineering

An Experimental Study on How to Build Efficient Multi-Core Clusters for High Performance Computing

Luiz Carlos Pinto, Luiz H. B. Tomazella, M. A. R. Dantas
Distributed Systems Research Laboratory (LaPeSD)
Department of Informatics and Statistics (INE)
Federal University of Santa Catarina (UFSC)
{luigi, tomazella, mario}@inf.ufsc.br

Abstract

Multi-core technology produces a new scenario for communicating processes in an MPI cluster environment, and consequently the trade-offs involved need to be uncovered. This motivation guided our research and led to a new approach for setting up more efficient clusters built with commodities. Thus, as an alternative to the utilization of non-commodity interconnects such as Myrinet and Infiniband, we present a proposal based on leaving cores idle relative to application processing in order to build economically more accessible clusters of commodities with higher performance. Execution of the fine-grained IS algorithm from the NAS Parallel Benchmark revealed a speedup of up to 25%. Interestingly, a cluster organized according to the proposed setup was able to outperform a single multi-core SMP host in which all processes communicate inside the host. Therefore, empirical results indicate that our proposal is successful for medium and fine-grained algorithms.

1. Introduction

Scientific applications used to be executed mostly on expensive, proprietary massively parallel processing (MPP) machines. As processing power and communication speed increasingly become off-the-shelf products, building clusters of commodities [27] has been taking a large share of the high performance computing (HPC) world [5]. Not long ago, identical single-processor computing nodes used to be aggregated to form a cluster, also known as a NoW (Network of Workstations). Such a parallel architecture demands a distributed-memory programming interface such as MPI [1] for inter-process communication. Since each computing node has its own memory subsystem and its own path to the interconnect fabric, each MPI process executes largely independently of the others.

Nowadays, multi-processor (SMP) and multi-core (CMP) technologies are increasingly finding their way into cluster computing. Inevitably, clusters built with SMP and also CMP-SMP nodes will become more and more common. Lacking a widely accepted term for the CMP-SMP cluster design, both architectures will be referred to as CLUMPS, a usual term for a cluster of SMP nodes.

Traditional MPI programs follow the SPMD (Single Program, Multiple Data) parallel programming model, which was designed basically for cluster architectures built with single-core nodes. In a modern cluster built with multi-core, multi-processor nodes, however, access to the interconnect fabric is shared by the locally executing processes. Main memory access and the usually deeper cache hierarchy of CMP-SMP nodes may also slow down inter-process communication, since the bus and memory subsystem of each node are shared. Thus, in such a cluster, the cost of moving data between communicating cores is a function not only of their physical distance (inside a processor socket, inside a node, or across nodes) but also of shared memory and network bandwidth limitations.

Our motivation concerns the importance of realizing, from the point of view of an architectural designer, that modern multi-core cluster designs create a different scenario for predicting performance. This pressing need to understand the trade-offs between these architectural cluster designs guided our research and finally led to a new approach for setting up more efficient clusters of commodities. Thus, an alternative to the utilization of non-commodity interconnects, such as Myrinet and Infiniband, is proposed in order to build economically more accessible clusters of commodities with higher performance.

A preliminary evaluation of the impact of multi-core technology on cluster performance has shown quite surprising results for scientific computing. For applications with small computation-to-communication ratios, the potentially advantageous characteristics of "many-core" hosts show little superiority over "few-core" hosts; depending on the application, performance and efficiency may even be lost. We investigated intra-node and inter-node communication behavior of four distinct cluster setups, described in Table 1, with MPI micro-benchmarks. Although a hybrid programming model with MPI for inter-node parallelism and OpenMP for intra-node parallelism is often proposed as the most efficient way to use multi-core computing nodes within a cluster [3, 4], traditional MPI programming is likely to remain important for portability reasons and to cope with the huge set of existing MPI-based applications. Moreover, in order to bring this study closer to a "real-world" application environment, all five MPI-based kernel algorithms of the NAS Parallel Benchmark suite [2] (NPB) were run on the same four cluster setups and analyzed together. NPB is derived from real computational fluid dynamics (CFD) applications required by NASA.

This paper continues with related work in Section 2. Our proposal of cluster setups is described in Section 3, and experimental results are discussed in Section 4. Conclusions and future work are presented in Section 5. Lastly, acknowledgements are found in Section 6.

2. Related work

Works related to this study comprise three streams: the impact of multi-core technology on cluster performance, the impact of dedicated network processors, and the characterization of the NAS Parallel Benchmark.

The impact of multi-core technology on cluster performance is investigated in several works. Chai, Hartono and Panda [26] focus on intra-node communication between MPI processes, whereas Pourreza and Graham [25] also take into account the advantages of resource sharing, both using communication micro-benchmarks only. Differently, Alam et al. [21] base their investigation on the characterization of scientific applications.

Moreover, other works relate to this study because they investigate the impact of dedicated processors for communication processing, though with non-commodity interconnects such as Myrinet and Infiniband. These are the works of Lobosco, Costa and de Amorim [24], of Brightwell and Underwood [22], and of Pinto, Mendonça and Dantas [23], all of which focus on broadly used scientific applications such as the NAS Parallel Benchmark (NPB).

Lastly, other works focus on characterizing the NPB algorithms. Kim and Lilja [20], Tabe and Stout [17], Martin [15], and Faraj and Yuan [16] concentrate on the MPI-based NPB algorithms in order to determine the types of communication, message sizes, and the quantity and frequency of communication phases. Additionally, Sun, Wang and Xu [19] and Subhlok, Venkataramaiah and Singh [18] take into account the amount of transferred data as well as processor and memory usage in order to characterize NPB.

3. Proposed setup approach

High performance computing based on commodities has become feasible with the growing popularity of multi-core technology and Gigabit Ethernet interconnects. Moreover, a computing host with more than one core offers the possibility of leaving, for instance, one core idle relative to application processing. Our proposal consists of leaving idle cores on some or all hosts of a cluster in order to process the communication overhead of a running application.

First, we shall define a few terms. A core is the atomic processing unit of a computing system. A socket contains one or more cores. A host, or node, is a single machine containing one or more sockets that share resources such as main memory and interconnect access. A cluster, also referred to as a system, is a set of interconnected hosts.
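In practice, the proposed setup is only a matter of how many MPI ranks are launched on each host relative to its number of cores. As a sanity check, a minimal MPI program such as the sketch below (our illustration, not part of the paper's experiments; the machinefile name and mpiexec options are assumptions for an MPICH-style launcher) can confirm which ranks share a host and therefore how many cores per host are left idle.

/* placement_check.c -- illustrative only; not part of the paper's experiments.
 * Prints which host each MPI rank runs on, so one can verify that a given
 * machinefile launches fewer ranks per host than there are cores, leaving
 * the intended cores idle for communication processing.
 *
 * Hypothetical build/run (MPICH-style launcher assumed):
 *   mpicc -O3 placement_check.c -o placement_check
 *   mpiexec -machinefile hosts.txt -n 8 ./placement_check
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Ranks reporting the same host name share that host's bus, memory
     * subsystem and network interface. */
    printf("rank %d of %d runs on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}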
Table 1. Cluster architectures and hardware setups

Table 1 describes all four clusters: computing nodes, setups and interconnects. The Xeon-based [6] cluster runs Linux kernel 2.6.8.1 and the Opteron-based [7] cluster runs Linux kernel 2.6.22.8; both have SMP support enabled. All systems use at most 8 cores for application processing. Systems A and C have idle cores whereas systems B and D do not. System A has one idle core per host, whereas in system C all 4 idle cores reside in the same host, so that the second host has no idle cores. The systems are otherwise idle, awaiting only the experiments to be run.

MPICH2 [8, 9], version 1.0.6, is used as the MPI library implementation on all systems. It is important to emphasize that Gigabit Ethernet assigns all communication processing to a host processor: system calls as well as protocol and packet processing are performed by a host processor. In contrast, Myrinet [14] and Infiniband [12] NICs are equipped with a dedicated network processor that is in charge of protocol processing; furthermore, the communication flow bypasses the OS via DMA data transfers. Figure 1 presents the distinct data flows of Ethernet technology and of the VI Architecture [28].

Figure 1. Dataflow of Ethernet and VI Architecture

4. Results

4.1. Communication benchmark: b_eff

In order to characterize bandwidth and latency of all four systems, we ran the b_eff communication benchmark [10], a version of which is part of the HPC Challenge Benchmark [11]. However, that version only tests bandwidth and latency for messages of 8 and 2,000,000 bytes, so we adapted b_eff to evaluate communication over a wider range of message sizes, from 2 bytes to 16 megabytes.

In Figure 2, latency for 2 communicating processes is shown as a function of message length. Results show that the Ethernet-based systems A and C exhibit higher latency than systems B and D for medium and large message lengths, in both one-way and two-way modes. Moreover, systems A and C show similar latencies, whereas system D has the lowest latency for all message lengths. In the specific case of two-way communication, the randomly ordered ring has no effect, behaving as if it were naturally ordered. Processes on systems B and D run on the same host and thus use the host bus to communicate; processes on systems A and C run on different hosts and thus communicate over the Gigabit Ethernet LAN.

Figure 2. Latency for one-way and two-way communication between 2 processes
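For context, the kind of measurement behind Figures 2 and 3 can be sketched with a simple ping-pong loop: time a round trip for each message length and derive one-way latency and bandwidth from it. The sketch below is our illustration in that spirit, with arbitrary repetition counts; it is not the adapted b_eff benchmark itself, which also measures ring patterns and derives one-way bandwidth from maximum latency and two-way bandwidth from average latency.

/* pingpong.c -- minimal sketch of a latency/bandwidth micro-benchmark in the
 * spirit of b_eff; this is NOT the adapted b_eff code used in the paper.
 * Run with at least two ranks, e.g. one on each host of interest. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 100

int main(int argc, char **argv)
{
    int rank, i;
    long len;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Message lengths from 2 bytes up to 16 MB, the range used in the text. */
    for (len = 2; len <= 16L * 1024 * 1024; len *= 2) {
        char *buf = malloc(len);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double roundtrip = (MPI_Wtime() - t0) / REPS;
        if (rank == 0) {
            double latency   = roundtrip / 2.0;          /* one-way, seconds */
            double bandwidth = len / latency / 1.0e6;    /* one-way, MB/s    */
            printf("%10ld bytes  %10.2f us  %10.2f MB/s\n",
                   len, latency * 1.0e6, bandwidth);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}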
From Figure 3, we can state that (1) bandwidth for either one-way or two-way communication on systems B and D is greater than on systems A and C for any message length. Moreover, (2) the bandwidth behavior of two-way communication on system D and of one-way communication on system B are quite similar. (3) Two-way communication bandwidth on system B is similar to its one-way pattern for small and medium-sized messages, but (4) for messages larger than 32 KB its pattern becomes flat and of lower performance. (5) The bandwidth patterns of one-way and two-way communication on systems A and C are very similar. However, (6) bandwidth is greater for two-way than for one-way communication for messages of up to 8 KB on system C and up to 64 KB on system A. That is in part because (7) b_eff calculates one-way bandwidth from the maximum latency, while two-way bandwidth is based on the average latency. Still, (8) for larger message lengths, two-way bandwidth reaches up to 80% of one-way bandwidth. Based on assertions (4) and (6), the idle cores of systems A and C do not seem to have a great positive effect on the performance of either one-way or two-way inter-process communication.

Figure 3. Bandwidth for one-way and two-way communication between 2 processes

However, one-way and two-way communication patterns do not provide a complete behavioral overview, so results for 8 simultaneously communicating processes are presented in Figure 4.

Figure 4. Latency and bandwidth for 8 communicating processes in a randomly ordered ring

In Figure 4, results indicate the best overall inter-process communication performance for system D; note, however, that no network access is needed there, since it is a single SMP host. Even so, there is a great negative impact on its bandwidth for messages larger than 64 KB. The worst overall inter-process communication performance is that of system B, which has no idle cores and in which each pair of processes competes for access to main memory and to the network card, because the host bus is shared.

Now, an interesting issue concerns the communication performance of systems A and C. Both systems are set up according to our proposal, with idle cores that can take charge of communication processing, which gives the impression that their performance should be better and somewhat similar. However, on the one hand, the bandwidth of system C shows a decreasing pattern and its overall inter-process communication performance is considerably worse than that of system A. On the other hand, the bandwidth of system A shows an increasing pattern and its overall inter-process communication performance is the second best of all. This is explained by the fact that system A has one idle core per host whereas system C has all four idle cores on a single host. For system C, overall performance is held back by the slowest host, which has all of its cores busy; for system A, the opposite is seen: each idle core offloads some communication overhead. Roughly speaking, one core is in charge of the user-level MPI process while the second executes the MPI communication operations.
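The mechanism described above, one core running the user-level MPI process while another progresses communication, presumes that communication and computation can overlap at all. A generic sketch of that overlap pattern is shown below (our illustration with non-blocking MPI calls; not code from the paper or from NPB).

/* overlap.c -- generic sketch of overlapping communication with computation
 * via non-blocking MPI calls; illustrative only, not code from the paper.
 * Whether the transfer really progresses in the background depends on the
 * MPI library and on a core being free to drive the Ethernet/TCP stack,
 * which is exactly the role the idle cores play in the proposed setup. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

static void do_application_work(double *v, int n)
{
    int i;
    for (i = 0; i < n; i++)              /* stand-in for real computation */
        v[i] = v[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    int rank, size, right, left;
    double *send_buf, *recv_buf, *work;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;
    left  = (rank - 1 + size) % size;

    send_buf = calloc(N, sizeof(double));
    recv_buf = calloc(N, sizeof(double));
    work     = calloc(N, sizeof(double));

    /* Post the exchange first ... */
    MPI_Irecv(recv_buf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send_buf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* ... compute while the messages are (ideally) in flight ... */
    do_application_work(work, N);

    /* ... and only then block until the communication has completed. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    free(send_buf);
    free(recv_buf);
    free(work);
    MPI_Finalize();
    return 0;
}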
4.2. The NAS Parallel Benchmark

The NAS Parallel Benchmark consists of 8 benchmark programs: five kernel benchmarks (EP, FT, IS, CG and MG) and three simulated application benchmarks (SP, BT and LU) [2]. The NPB version used is 2.3, and this study focuses on the five kernel benchmarks only. The algorithms were compiled identically on all systems, using the -O3 optimization flag, with mpif77 for the Fortran codes and mpicc for IS, the only algorithm written in C. All kernel benchmarks were run for the Class B problem size. Experiments were run 5 times for each case in order to obtain a fair mean execution time [13].

We first present the NPB algorithms that perform predominantly collective communication and then the algorithms dominated by point-to-point communication. The presentation also follows a descending order of granularity, that is, the ratio of computation to the amount of communication the algorithm performs. Greater demand for communication among processes characterizes lower granularity; conversely, a small amount of data communicated infrequently characterizes a coarse-grained algorithm and therefore a higher computation-to-communication ratio.

The EP (Embarrassingly Parallel) algorithm consists of much computation and negligible communication. It provides an estimate of the upper limit of floating-point computational power. Communication occurs only to distribute the initial data and to gather partial results into a final result, so EP is coarse-grained. Although systems A and B run at a 10% higher clock frequency (Table 1), they are outperformed by systems C and D for the EP algorithm, as shown in Figure 5. This advantage stems mostly from the more efficient memory subsystem of systems C and D. In conclusion, we assume higher per-core performance of systems C and D compared to systems A and B.

Figure 5. Mean execution time for EP class B

The FT (FFT 3D PDE) algorithm solves a 3D partial differential equation using a series of 1D FFTs. It requires a large amount of floating-point computation as well as communication, although mostly very large messages, in the range of megabytes, at low frequency. That is because the NPB authors put effort into aggregating messages at the application level in order to minimize message cost [15]. The result is a mid-grained algorithm with a perfectly balanced all-to-all communication pattern, meaning each process sends and receives the same amount of data. Additionally, although the required bandwidth increases proportionally with per-core performance, it does not increase as the number of cores is scaled up [19].

As we can see in Figure 6, systems A and B scale better than systems C and D. The bandwidth requirements of systems A and B are lower than those of systems C and D because of their lower per-core performance [19]. Note that for 8 processes on systems B and D, all cores are busy running the application. As a result, execution times on systems A and D turn out to be practically the same, despite the higher per-core performance of system D. That is basically because the idle cores of system A act as if they were dedicated network processors, allowing higher computing power for the application when communication is asynchronous, and also decreasing accesses to main memory and the overhead of context switches between application, OS and MPI communication processing.

Figure 6. Mean execution time for FT class B
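The remark above that the NPB authors aggregate messages at the application level refers to a standard technique: instead of paying per-message latency many times, several small payloads are packed into one buffer and sent in a single operation. A generic sketch of the idea follows (our illustration, not FT's actual code; the chunk sizes are arbitrary).

/* aggregate.c -- sketch of application-level message aggregation, the
 * technique used in NPB FT to reduce message cost; this is generic demo
 * code, not FT's implementation. Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define CHUNKS    64
#define CHUNK_LEN 1024                      /* doubles per chunk */

int main(int argc, char **argv)
{
    int rank, c;
    static double chunk[CHUNKS][CHUNK_LEN]; /* the "many small messages" */
    double *packed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    packed = malloc(CHUNKS * CHUNK_LEN * sizeof(double));

    if (rank == 0) {
        /* One copy loop followed by a single large send, so the per-message
         * latency of the interconnect is paid once instead of CHUNKS times. */
        for (c = 0; c < CHUNKS; c++)
            memcpy(packed + c * CHUNK_LEN, chunk[c], CHUNK_LEN * sizeof(double));
        MPI_Send(packed, CHUNKS * CHUNK_LEN, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(packed, CHUNKS * CHUNK_LEN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d chunks in one message\n", CHUNKS);
    }

    free(packed);
    MPI_Finalize();
    return 0;
}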
The IS (Integer Sort) benchmark is the only NPB algorithm that does not focus on floating-point computation, as it is an integer bucket sort. It is dominated by reductions and unbalanced all-to-all communication, relying on a random key distribution for load balancing, which means its communication pattern depends on the data set [15, 17]. The granularity of IS is smaller than that of EP and FT, and it is characterized as fine-grained. Additionally, its messages are smaller than in FT: the average message size is medium, with messages ranging from small to large but only a few actually mid-sized [20].

Once again, as shown in Figure 7, systems A and B outperform systems C and D as the number of processes is scaled up. Note, however, that for 8 processes the loss of efficiency of systems C and D is even greater than for the FT algorithm (compare with Figure 6). This loss of efficiency is due to a higher frequency of inter-process communication, even though messages are mid-sized on average.

Figure 7. Mean execution time for IS class B
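The unbalanced all-to-all pattern described for IS boils down to two collective steps: exchanging per-bucket counts and then exchanging the keys themselves. The following is a generic sketch of that exchange (our illustration, not NPB IS; key counts and ranges are arbitrary).

/* is_exchange.c -- sketch of the communication pattern that dominates an
 * integer bucket sort such as IS: count how many local keys belong to each
 * rank's bucket, exchange the counts with a balanced all-to-all, then move
 * the keys themselves with an unbalanced all-to-all. Generic illustration,
 * not NPB IS. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define KEYS_PER_RANK 4096
#define MAX_KEY       65536

static int bucket_of(int key, int nbuckets)
{
    return (int)(((long)key * nbuckets) / MAX_KEY);   /* always 0..nbuckets-1 */
}

int main(int argc, char **argv)
{
    int rank, size, i, b, total_recv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *keys       = malloc(KEYS_PER_RANK * sizeof(int));
    int *send_count = calloc(size, sizeof(int));
    int *recv_count = malloc(size * sizeof(int));
    int *send_disp  = malloc(size * sizeof(int));
    int *recv_disp  = malloc(size * sizeof(int));
    int *offset     = calloc(size, sizeof(int));
    int *send_keys  = malloc(KEYS_PER_RANK * sizeof(int));

    srand(rank + 1);                                   /* random key distribution */
    for (i = 0; i < KEYS_PER_RANK; i++) {
        keys[i] = rand() % MAX_KEY;
        send_count[bucket_of(keys[i], size)]++;
    }

    /* Balanced all-to-all: every rank tells every other rank how many keys
     * to expect; the amounts themselves are data dependent. */
    MPI_Alltoall(send_count, 1, MPI_INT, recv_count, 1, MPI_INT, MPI_COMM_WORLD);

    send_disp[0] = recv_disp[0] = 0;
    for (i = 1; i < size; i++) {
        send_disp[i] = send_disp[i - 1] + send_count[i - 1];
        recv_disp[i] = recv_disp[i - 1] + recv_count[i - 1];
    }
    total_recv = recv_disp[size - 1] + recv_count[size - 1];
    int *my_bucket = malloc(total_recv * sizeof(int));

    /* Group the keys by destination bucket before the exchange. */
    for (i = 0; i < KEYS_PER_RANK; i++) {
        b = bucket_of(keys[i], size);
        send_keys[send_disp[b] + offset[b]++] = keys[i];
    }

    /* Unbalanced all-to-all: each rank ends up with the keys of its bucket. */
    MPI_Alltoallv(send_keys, send_count, send_disp, MPI_INT,
                  my_bucket, recv_count, recv_disp, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d keys for its bucket\n", rank, total_recv);

    free(keys); free(send_count); free(recv_count); free(send_disp);
    free(recv_disp); free(offset); free(send_keys); free(my_bucket);
    MPI_Finalize();
    return 0;
}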
The CG (Conjugate Gradient) benchmark implements a conjugate gradient method; it consists of floating-point computation and tests frequent and irregular long-distance point-to-point communication [2]. Although it is also computing-intensive, CG is characterized as a fine-grained algorithm because of the large number of messages it communicates. Its average message size is smaller than that of FT and IS, with a predominant number of small messages, only a few bytes long, and the rest mostly large messages [20].

Fine-grained algorithms do not take as much advantage of greater per-core performance as mid-grained and coarse-grained applications. Compared to FT and IS, Figure 8 shows that the execution times of CG on systems A and B for an increasing number of processes are closer to those of systems C and D. Moreover, systems A and B even surpass systems C and D for 8 running processes.

Figure 8. Mean execution time for CG class B

The MG (Multi-Grid) algorithm is a simplified multi-grid method that approximates the solution to the discrete Poisson problem. It tests both short- and long-distance communication in a frequent manner, with predominantly point-to-point communication between neighboring processes. Communication phases are fairly evenly distributed throughout the execution, so MG is a fine-grained algorithm with small computation phases. Message size is medium on average because processes exchange messages of many sizes, from small to large, in a uniform pattern [15, 20]. Such characteristics hold back the higher per-core performance of systems C and D. As shown in Figure 9, the execution time of MG with 8 processes on system A is even shorter than on systems B, C and D. Moreover, when frequent communication phases alternate with short computation phases, overlapping communication and computation becomes difficult.

Figure 9. Mean execution time for MG class B
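The neighbour-to-neighbour exchanges that MG stresses follow the usual halo (ghost cell) pattern. A generic one-dimensional sketch is shown below (our illustration, not MG's actual code; the field size is arbitrary).

/* halo.c -- generic sketch of the neighbour (halo) exchange pattern that
 * dominates MG-like stencil codes; illustrative only, not NPB MG. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1024                  /* interior points per rank */

int main(int argc, char **argv)
{
    int rank, size, left, right, i;
    double u[LOCAL_N + 2];            /* +2 ghost cells at the ends */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (i = 0; i < LOCAL_N + 2; i++)
        u[i] = rank;                  /* dummy field values */

    /* Exchange boundary values with both neighbours; MPI_Sendrecv avoids
     * the deadlock a naive blocking send/receive ordering could cause. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d ghosts: left=%.0f right=%.0f\n", rank, u[0], u[LOCAL_N + 1]);

    MPI_Finalize();
    return 0;
}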
The better performance of FT, IS, CG and MG on system A, which has one core per host idle with respect to application processing, compared to system D, a single SMP host with 8 cores, indicates that the proposed cluster setup is advantageous for gaining performance on clusters of commodities. Furthermore, Table 2 presents the impact of not leaving one core idle in each host: the NPB algorithms were executed with 16 processes on system A. Coarse-grained EP achieved a considerable speedup, whereas the medium and fine-grained algorithms lost performance compared to the execution with 8 processes on the same system. These results confirm and quantify the benefits of the proposed cluster setup for medium and fine-grained applications.

Table 2. Speedup on System A (mean execution time)

Exec.       EP        FT        IS        CG        MG
8 proc.     47.696    53.232    3.234     45.268    6.798
16 proc.    25.860    57.492    4.066     49.988    7.724
Speedup     45.78%    -8.00%    -25.73%   -10.43%   -13.62%

5. Conclusion and Future Work

This research allowed us to identify that there is no simple relation for predicting the performance of multi-core clusters of commodities: depending on the application, cluster behavior and performance may vary unexpectedly. Additionally, a detailed analysis of four distinct multi-core cluster systems helped to better characterize the trade-offs involved. Corroborating our preliminary evaluation, the performance of multi-core clusters tends to converge as application granularity decreases.

Based on our experiments, a detailed analysis allowed us to point out considerable benefits of the proposed cluster setup, in which one processing core per host is left idle with respect to application processing. The proposed approach may therefore introduce considerable performance gains. However, if no core is left idle in one host of a cluster, that host may hold back overall performance, as evidenced by the results from b_eff and from the medium and fine-grained NPB algorithms.

A cluster set up according to the proposed approach was able to outperform a single eight-core SMP host in which all communication occurs over the host bus and thus no networking is required. Since the Ethernet interconnect overloads host processors with communication overhead, efficiency is penalized when all processors of a host are busy running the application, because communication and application processing compete for them. The resulting performance, however, depends on application granularity and behavior. Finally, this paper has succeeded in indicating economically more accessible alternatives, based on commodities only, for achieving better performance in clusters of small and medium size.

For future work, we plan to extend this study to other benchmarking suites and also to full "real-world" applications, towards a broader analysis of multi-core cluster trade-offs. Ongoing research focuses on quantifying the benefits of the proposed approach compared to clusters interconnected with Myrinet and Infiniband. It is also important to examine the scalability of our proposal against the currently increasing number of cores within a single host.

6. Acknowledgements

This research was supported with cluster environments by OMEGATEC and Epagri, in collaboration with CAPES.

7. References

[1] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard, Rel. 1.1", June 1995, www.mpi-forum.org.

[2] D. Bailey, E. Barszcz, et al., "The NAS Parallel Benchmarks", International Journal of Supercomputer Applications, Vol. 5, No. 3, 1991, pp. 63-73.

[3] R. Rabenseifner and G. Wellein, "Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures", International Journal of High Performance Computing Applications, Sage Science Press, Vol. 17, No. 1, 2003, pp. 49-62.

[4] F. Cappello and D. Etiemble, "MPI versus MPI+OpenMP on the IBM SP for the NAS benchmarks", Supercomputing '00, Dallas, TX, 2000.

[5] H. Meuer, E. Strohmaier, J. Dongarra, H. D. Simon, Universities of Mannheim and Tennessee, "TOP500 Supercomputer Sites", www.top500.org.

[6] Intel, "Intel® Xeon® Processor with 533 MHz FSB at 2 GHz to 3.20 GHz Datasheet", Publication 252135, 2004.

[7] AMD, "AMD Opteron™ Processor Product Data Sheet", Publication 23932, 2007.

[8] W. Gropp, E. Lusk, N. Doss and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard", Parallel Computing, Vol. 22, No. 6, 1996, pp. 789-828.

[9] Message Passing Interface Forum, "MPI-2: Extensions to the Message-Passing Interface", July 1997.

[10] R. Rabenseifner and A. E. Koniges, "The Parallel Communication and I/O Bandwidth Benchmarks: b_eff and b_eff_io", Cray User Group Conference, CUG Summit, 2001.

[11] P. Luszczek, D. Bailey, J. Dongarra, J. Kepner, R. Lucas, R. Rabenseifner and D. Takahashi, "The HPC Challenge (HPCC) Benchmark Suite", SC06 Conference Tutorial, IEEE, Tampa, Florida, 2006.

[12] D. Cassiday, "InfiniBand Architecture Tutorial", Hot Chips 12, 2000.

[13] H. Jordan and G. Alaghband, "Fundamentals of Parallel Processing", Prentice Hall, 2003.

[14] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, "Myrinet: A Gigabit-per-second Local Area Network", IEEE Micro, 1995.

[15] R. Martin, "A Systematic Characterization of Application Sensitivity to Network Performance", PhD thesis, University of California, Berkeley, 1999.

[16] A. Faraj and X. Yuan, "Communication Characteristics in the NAS Parallel Benchmarks", Parallel and Distributed Computing and Systems, 2002.

[17] T. Tabe and Q. Stout, "The use of the MPI communication library in the NAS parallel benchmarks", Technical Report CSE-TR-386-99, University of Michigan, 1999.

[18] J. Subhlok, S. Venkataramaiah and A. Singh, "Characterizing NAS Benchmark Performance on Shared Heterogeneous Networks", International Parallel and Distributed Processing Symposium, IEEE, 2002, pp. 86-94.

[19] Y. Sun, J. Wang and Z. Xu, "Architectural Implications of the NAS MG and FT Parallel Benchmarks", Advances in Parallel and Distributed Computing, 1997, pp. 235-240.

[20] J. Kim and D. Lilja, "Characterization of Communication Patterns in Message-Passing Parallel Scientific Application Programs", Communication, Architecture, and Applications for Network-Based Parallel Computing, 1998, pp. 202-216.

[21] S. R. Alam, R. F. Barrett, J. A. Kuehn, P. C. Roth and J. S. Vetter, "Characterization of Scientific Workloads on Systems with Multi-Core Processors", International Symposium on Workload Characterization, IEEE, 2006, pp. 225-236.

[22] R. Brightwell and K. Underwood, "An Analysis of the Impact of MPI Overlap and Independent Progress", International Conference on Supercomputing, 2004.
[23] L. C. Pinto, R. P. Mendonça and M. A. R. Dantas, "Impact of interconnects to efficiently build computing clusters", ERRC, 2007.

[24] M. Lobosco, V. S. Costa and C. L. de Amorim, "Performance Evaluation of Fast Ethernet, Giganet and Myrinet on a Cluster", International Conference on Computational Science, 2002, pp. 296-305.

[25] H. Pourreza and P. Graham, "On the Programming Impact of Multi-core, Multi-Processor Nodes in MPI Clusters", High Performance Computing Systems and Applications, 2007.

[26] L. Chai, A. Hartono and D. Panda, "Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters", IEEE International Conference on Cluster Computing, 2006.

[27] A. G. M. Rossetto, V. C. M. Borges, A. P. C. Silva and M. A. R. Dantas, "SuMMIT - A framework for coordinating applications execution in mobile grid environments", GRID, 2007, pp. 129-136.

[28] D. Dunning et al., "The Virtual Interface Architecture", IEEE Micro, 1998, pp. 66-76.