COSC6365 Lennart Johnsson 2013-04-02

Introduction to HPC
Lecture 21
Lennart Johnsson
Dept of Computer Science

Sample Sort
• Randomly select a set of splitters (at least N-1 for N processors) locally
• Sort the splitters globally
• Assign elements to buckets locally
• Permute
• Local sort

Figure: Bucket expansion for sample sorting $10^6$ keys on 1024 nodes as a function of the oversampling ratio s. The two dashed curves show bounds on the bucket expansion that hold with probability 0.999 and 0.999999, respectively. The solid curves show the maximum and average observed expansion over 1000 trials.

Figure: Sample sort time as a function of the oversampling ratio for 16384 keys per node of a 1024-node CM-2.

Figure: Sample sort execution time on a 1024-node CM-2 for 64-bit keys. Note that the broadcast time for the splitters is independent of the input size. For 4k keys per processor and beyond, the oversampling ratio is increased from 32 to 64 in order to reduce the time for the local sort, which improves with reduced bucket expansion. The per-key times for send and binary search remain constant.

Figure: Comparing 64-bit key sorting times on a 1024-node CM-2. Memory denotes the memory used by the algorithm relative to the original data. Rank denotes the time for rank relative to the time for sort.
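The five sample sort steps listed above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the CM-2 implementation behind the measurements: each "processor" is a Python list, the names `sample_sort` and `bucket_expansion` are hypothetical, samples are drawn with replacement, and taking every s-th sorted sample as a splitter is one common choice that the slides do not spell out.

```python
import bisect
import random


def sample_sort(blocks, s, rng):
    """Simulate sample sort; blocks[i] holds the (non-empty) keys of processor i.

    s is the oversampling ratio: each processor contributes s random samples.
    Returns one sorted bucket per processor; their concatenation is sorted.
    """
    p = len(blocks)
    # Step 1: every processor randomly selects s sample keys locally.
    samples = []
    for block in blocks:
        samples.extend(rng.choice(block) for _ in range(s))
    # Step 2: sort the samples globally and keep every s-th one,
    # which yields p - 1 splitters (assumed selection rule).
    samples.sort()
    splitters = [samples[i * s] for i in range(1, p)]
    # Steps 3 and 4: assign each key to a bucket by binary search over the
    # splitters, then "permute" by appending it to the owning processor's bucket.
    buckets = [[] for _ in range(p)]
    for block in blocks:
        for key in block:
            buckets[bisect.bisect_right(splitters, key)].append(key)
    # Step 5: each processor sorts its bucket locally.
    for bucket in buckets:
        bucket.sort()
    return buckets


def bucket_expansion(buckets):
    """Ratio of the largest bucket size to the average bucket size."""
    total = sum(len(b) for b in buckets)
    return max(len(b) for b in buckets) / (total / len(buckets))


if __name__ == "__main__":
    rng = random.Random(0)
    blocks = [[rng.randrange(10**6) for _ in range(256)] for _ in range(8)]
    for s in (1, 8, 64):
        print(s, bucket_expansion(sample_sort(blocks, s, rng)))
```

Running the script for several values of s lets you observe how the bucket expansion, and with it the cost of the local sort, depends on the oversampling ratio.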
Bitonic merge
[Figure: compare-exchange element; the two outputs are labeled Min and Max.]

Bitonic Merge
[Figure: bitonic merge network with inputs a1, a2, b1, b2 and outputs c1 to c4. Each comparator has inputs A and B and outputs L and H.]

Bitonic recursive merge
[Figure: recursive merge network with inputs a1 to a4 and b1 to b4 and outputs c1 to c8. The stages are connected by 2-Shuffle, 2-Unshuffle, 3-Shuffle, and 3-Unshuffle permutations.]
Complexity: $(N/2)\log_2 N$

Bitonic recursive merge
[Figure: network with inputs a1 to a8 and outputs c1 to c8, using the same shuffle and unshuffle permutations. In the early stages some comparators have their L and H outputs swapped.]
Complexity: $(N/2)\log_2 N$

Bitonic Sort
Complexity:
Bitonic merge: $N\log_2 N$
Bitonic sort: $N(\log_2 N)^2$

Bitonic Sort
Multiple elements per core
Cyclic allocation
[Figure: compare steps and their outcomes for bitonic sort with multiple elements per core under cyclic allocation.]
First k steps local: $2^{k-1}$ comparisons per core per step.
Last n steps local: $2^k$ bitonic sequences, each with one element per core.
Time: $k\,2^{k-1} + 2^k n$ (or $\sim (P/N)\log_2(P/N) + (P/N)\log_2 N = (P/N)\log_2 P$).

Sequential Merge
Complexity: P for P elements
Bitonic merge: $P\log_2 P$
Inefficient by a factor of $\log_2 P$

Hybrid Merge
[Figure: hybrid merge.]

Valiant’s Parallel Merge Sort
• Merge two sorted sequences P and R in parallel by splitting each sequence evenly into chunks of square-root size.
• Sequence P: chunk size √P, √P chunks
• Sequence R: chunk size √R, √R chunks
• Merge the √P and √R splitters
• Insert the √P splitters into the proper chunk of R
• Repeat the process recursively for each sublist created by the insertion
• Merging the √P + √R splitters requires √P × √R comparisons, which can be carried out on √P × √R cores in unit time.
• Second step the same …
• And the third … see notes

References
• Sorting networks and their applications, Kenneth E. Batcher, In Spring Joint Computer Conference, pp. 307-314, IEEE, 1968, http://dl.acm.org/citation.cfm?id=1468121
• The Art of Computer Programming, Vol. 3: Sorting and Searching, Donald E. Knuth, Addison-Wesley, 1973, http://dl.acm.org/citation.cfm?id=280635
• Parallelism in comparison problems, Leslie Valiant, SIAM Journal on Computing, 4(3), pp. 348-355, September 1975, http://epubs.siam.org/action/showAbstract?page=348&volume=4&issue=3&journalCode=smjcat&
• Sorting on a mesh-connected parallel computer, C.D. Thompson and H.T. Kung, CACM, 20(4), pp. 263-271, 1977, http://dl.acm.org/citation.cfm?id=359481
• Optimal sorting algorithms for parallel computers, Gerald M. Baudet and D. Stevenson, Trans. Computers, C27(1), pp. 84-87, 1978, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=1674957
• Fast parallel sorting algorithms, Daniel S. Hirschberg, Communications of the ACM, 21(8), pp. 657-661, 1978, http://dl.acm.org/citation.cfm?id=359582
• Bitonic sort on a mesh-connected parallel computer, David Nassimi and Sartaj Sahni, IEEE Trans. Computers, C-28(1), pp. 2-8, January 1979, http://dl.acm.org/citation.cfm?id=1309426
• An O(N log_2 N) sorting network, Michael Ajtai, J. Komlos, and E. Szemeredi, In Proceedings of the Fifteenth Annual ACM Symposium on the Theory of Computing (STOC), pp. 1-9, ACM Press, 1983, http://dl.acm.org/citation.cfm?id=808726
• An efficient implementation of Batcher's bitonic odd-even merge algorithm and its application in parallel sorting schemes, M. Kumar and Daniel S. Hirschberg, IEEE Trans. Computers, 32(3), pp. 254-264, 1983, http://www.computer.org/csdl/trans/tc/1983/03/01676217-abs.html
• Combining parallel and sequential sorting on a Boolean n-cube, S. Lennart Johnsson, In 1984 International Conference on Parallel Processing, pp. 444-448, IEEE Computer Society, 1984.
• Some parallel sorts on a mesh-connected processor array and their efficiency, K. Sado and Y. Igarashi, Journal of Parallel and Distributed Computing, vol. 3, pp. 398-410, September 1986, http://dl.acm.org/citation.cfm?id=19661.19668
• Shear-sort: A true two-dimensional sorting technique for VLSI networks, I. Scherson, S. Sen, and A. Shamir, In IEEE-ACM International Conference on Parallel Processing, pp. 903-908, 1986.
• Design and Analysis of Spatial Data Structures, H. Samet, Addison-Wesley, 1990, http://dl.acm.org/citation.cfm?id=77589
• A Comparison of Sorting Algorithms for the Connection Machine CM-2, Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, Marco Zagha, Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 3-16, 1991, http://dl.acm.org/citation.cfm?id=113380
• Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, F. Thomson Leighton, Morgan Kaufmann, 1992, http://dl.acm.org/citation.cfm?id=119339
• Parallel Sorting Patterns, Vivek Kyle, Edgar Solomonik, Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP, ACM, 2010, http://dl.acm.org/citation.cfm?id=1953621
• Highly Scalable Parallel Sorting, Edgar Solomonik and Laxmikant V. Kale, IEEE International Parallel and Distributed Processing Symposium, 2010, http://charm.cs.uiuc.edu/media/09-10, slides http://charm.cs.illinois.edu/talks/SortingIPDPS10.pdf
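Supplementary sketch for the bitonic merge and bitonic sort slides earlier in the lecture: a sequential Python version of the recursive compare-exchange structure. This is an illustrative assumption-laden sketch, not the network or CM-2 code from the lecture. The function names are hypothetical, the input length must be a power of two, and each function returns the number of compare-exchange operations so the $(N/2)\log_2 N$ merge count can be checked.

```python
def bitonic_merge(a, lo, n, ascending=True):
    """Sort the bitonic sequence a[lo:lo+n] in place (n a power of two).

    Returns the number of compare-exchange operations performed,
    which is (n/2) * log2(n).
    """
    if n <= 1:
        return 0
    half = n // 2
    for i in range(lo, lo + half):
        # Compare-exchange: for ascending order the smaller key ends up
        # in the lower half and the larger key in the upper half.
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    count = half
    # Both halves are again bitonic; merge them recursively.
    count += bitonic_merge(a, lo, half, ascending)
    count += bitonic_merge(a, lo + half, half, ascending)
    return count


def bitonic_sort(a, lo=0, n=None, ascending=True):
    """Sort a[lo:lo+n] in place with bitonic sort; returns the comparison count."""
    if n is None:
        n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    if n <= 1:
        return 0
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence,
    # then merge that sequence into the requested order.
    count = bitonic_sort(a, lo, half, True)
    count += bitonic_sort(a, lo + half, half, False)
    count += bitonic_merge(a, lo, n, ascending)
    return count


if __name__ == "__main__":
    keys = [5, 2, 7, 0, 6, 1, 4, 3]
    comparisons = bitonic_sort(keys)
    print(keys, comparisons)
```

In a parallel setting the compare-exchange operations inside one loop of `bitonic_merge` are independent and can run concurrently, which is what the network diagrams express; the sequential loop here only mirrors the data dependencies.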