Introduction to HPC Lecture 21 Sample Sort

COSC6365
Lennart Johnsson
2013-04-02
Dept of Computer Science
Sample Sort
• Randomly select a set of splitters (at least
N-1 for N processors) locally
• Sort the splitters globally
• Assign elements to buckets locally
• Permute
• Local sort
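The five steps above can be sketched as a single-process simulation. This is a minimal illustration, not the implementation measured in the lecture: the function name `sample_sort`, the `oversampling` parameter, and the round-robin initial distribution are assumptions made for the example.

```python
import bisect
import random


def sample_sort(keys, num_procs, oversampling=8, seed=0):
    """Simulate sample sort; returns one sorted bucket per simulated processor."""
    if not keys:
        return [[] for _ in range(num_procs)]
    rng = random.Random(seed)
    # Assumed initial layout: keys dealt round-robin to the processors.
    local = [keys[p::num_procs] for p in range(num_procs)]

    # 1. Each processor randomly selects candidate splitters from its own keys.
    sample = []
    for part in local:
        if part:
            sample.extend(rng.choices(part, k=oversampling))

    # 2. Sort the candidates globally and keep num_procs - 1 evenly spaced ones.
    sample.sort()
    splitters = [sample[i * len(sample) // num_procs] for i in range(1, num_procs)]

    # 3. Each processor assigns its keys to buckets by binary search.
    outgoing = [[[] for _ in range(num_procs)] for _ in local]
    for src, part in enumerate(local):
        for key in part:
            outgoing[src][bisect.bisect_right(splitters, key)].append(key)

    # 4. Permute: processor dst receives bucket dst from every source.
    received = [
        [key for src in range(num_procs) for key in outgoing[src][dst]]
        for dst in range(num_procs)
    ]

    # 5. Local sort. Concatenating the buckets in order gives the sorted keys.
    return [sorted(bucket) for bucket in received]
```

Buckets are unequal in size in general; how uneven they become depends on how many candidates are sampled per splitter.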
Sample Sort
Bucket expansion for sample sorting 10^6 keys on 1024 nodes as a function of
the oversampling ratio s. The two dashed curves show bounds on the bucket
expansion that hold with probability 0.999 and 0.999999, respectively. The solid
curves show the maximum and average observed expansion over 1000 trials.
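A small experiment in the spirit of this figure can be sketched as follows. It assumes that bucket expansion means the largest bucket size divided by the average bucket size n/p, and it samples p·s candidates globally without replacement rather than locally on each node. The function names and the default trial count are choices made for the example.

```python
import random
from statistics import mean


def bucket_expansion(n, p, s, rng):
    """Return the largest bucket size divided by the average size n / p."""
    # Keys are the distinct integers 0 .. n-1, so each key equals its rank.
    sample = sorted(rng.sample(range(n), p * s))
    # Keep every s-th candidate as a splitter (p - 1 splitters in total).
    splitters = [sample[i * s] for i in range(1, p)]
    bounds = [0] + splitters + [n]
    # Bucket i holds the keys in [bounds[i], bounds[i + 1]).
    sizes = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    return max(sizes) / (n / p)


def mean_expansion(n, p, s, trials=20, seed=0):
    """Average bucket expansion over several independent trials."""
    rng = random.Random(seed)
    return mean(bucket_expansion(n, p, s, rng) for _ in range(trials))


if __name__ == "__main__":
    for s in (1, 4, 16, 64):
        print(s, round(mean_expansion(100_000, 64, s), 2))
```

The expansion is at least 1 by definition, and it should shrink as s grows because each bucket boundary is then averaged over more candidates.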
Sample Sort
Sample sort time as a function of the oversampling ratio for
16384 keys per node of a 1024 node CM-2.
Sample Sort
Sample sort execution time on a 1024-node CM-2 for 64-bit keys. Note that the
broadcast time for the splitters is independent of the input size. For 4k keys per
processor and beyond, the oversampling ratio is increased from 32 to 64 in order to
reduce the time for the local sort, which improves with reduced bucket expansion.
The per-key times for send and binary search remain constant.
Sample Sort
Comparing 64-bit key sorting times on a 1024 node CM-2.
Memory denotes the memory used by the algorithm
relative to the original data.
Rank denotes the time for rank relative to the time for sort.
Bitonic merge
[Figure: compare-exchange element (Compare) with outputs labeled Min and Max]
Bitonic Merge
[Figure: bitonic merge network with inputs a1, a2, b1, b2 and outputs c1 to c4; comparator ports are labeled A, B, L, and H]
Bitonic recursive merge
[Figure: recursive bitonic merge network with inputs a1 to a4 and b1 to b4 and outputs c1 to c8; stages are labeled 2-Shuffle, 2-Unshuffle, 3-Shuffle, 3-Unshuffle; comparator ports are labeled A, B, L, and H]
Complexity: $(N/2)\log_2 N$
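The comparison count can be checked with a small recursive sketch of a bitonic merge. It assumes the input length is a power of two and builds the bitonic sequence by appending the second sorted list in reverse. The names `bitonic_merge` and `merge_sorted` and the one-element `counter` list are illustrative choices.

```python
def bitonic_merge(seq, counter):
    """Sort a bitonic sequence (length a power of two) into ascending order."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    low, high = [], []
    # One compare-exchange per pair (i, i + half): N/2 comparisons at this level.
    for i in range(half):
        counter[0] += 1
        a, b = seq[i], seq[i + half]
        low.append(min(a, b))
        high.append(max(a, b))
    # Both halves are bitonic again, and every element of low <= every element of high.
    return bitonic_merge(low, counter) + bitonic_merge(high, counter)


def merge_sorted(a, b):
    """Merge two ascending lists whose combined length is a power of two."""
    counter = [0]
    merged = bitonic_merge(list(a) + list(reversed(b)), counter)
    return merged, counter[0]


print(merge_sorted([1, 4, 6, 7], [2, 3, 5, 8]))
# -> ([1, 2, 3, 4, 5, 6, 7, 8], 12)
```

For N = 8 the count is (8/2) · 3 = 12, matching the formula.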
Bitonic recursive merge
[Figure: recursive bitonic merge network with inputs a1 to a8 and outputs c1 to c8; stages are labeled 2-Shuffle, 2-Unshuffle, 3-Shuffle, 3-Unshuffle; comparator ports are labeled A, B, L, and H]
Complexity: $(N/2)\log_2 N$
Bitonic Sort
Complexity:
Bitonic merge: $O(N\log_2 N)$
Bitonic sort: $O(N(\log_2 N)^2)$
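A compact iterative bitonic sort makes the $(\log_2 N)^2$ growth visible by counting compare-exchange operations. This is a minimal sketch assuming the input length is a power of two. The function name and the returned comparison count are additions for illustration.

```python
def bitonic_sort(values):
    """Return (sorted copy, number of compare-exchange operations)."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    comparisons = 0
    size = 2  # length of the bitonic sequences merged in this phase
    while size <= n:
        dist = size // 2  # distance between compared elements
        while dist >= 1:
            for i in range(n):
                j = i ^ dist
                if j > i:
                    comparisons += 1
                    # Blocks alternate direction so neighbours form bitonic sequences.
                    ascending = (i & size) == 0
                    if (data[i] > data[j]) == ascending:
                        data[i], data[j] = data[j], data[i]
            dist //= 2
        size *= 2
    return data, comparisons


print(bitonic_sort([5, 2, 7, 1, 8, 3, 6, 4]))
# -> ([1, 2, 3, 4, 5, 6, 7, 8], 24)
```

There are log2(N)(log2(N) + 1)/2 steps of N/2 comparisons each, so N = 8 gives 6 · 4 = 24 comparisons.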
Bitonic Sort
Multiple elements per core
Cyclic allocation
[Figure: Compare and Outcome panels showing the element pairing under cyclic allocation]
First k steps local: $2^{k-1}$ comparisons per core per step.
Last n steps: $2^k$ bitonic sequences, each with one element per core, so each comparison involves two cores.
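The split into k local steps followed by n steps across cores can be checked directly. The sketch below assumes a bitonic merge of 2^(k+n) elements on 2^n cores, with element i stored on core i mod 2^n (cyclic allocation) and with step distances halving from 2^(k+n-1) down to 1. The function name is an illustrative choice.

```python
def classify_merge_steps(k, n):
    """List (distance, 'local' or 'remote') for each step of one bitonic merge."""
    cores = 2 ** n
    total = 2 ** (k + n)
    steps = []
    dist = total // 2
    while dist >= 1:
        # In this step element i is paired with element i XOR dist.
        local = all((i % cores) == ((i ^ dist) % cores) for i in range(total))
        steps.append((dist, "local" if local else "remote"))
        dist //= 2
    return steps


print(classify_merge_steps(2, 3))
# -> [(16, 'local'), (8, 'local'), (4, 'remote'), (2, 'remote'), (1, 'remote')]
```

With k = 2 and n = 3 the first two distances are multiples of the core count 8, so both partners sit on the same core. The last three distances are smaller than 8, so the partners sit on different cores.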
Bitonic Sort
Multiple elements per core
Cyclic allocation
[Figure: Compare and Outcome panels showing the element pairing under cyclic allocation]
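Time: $k\,2^{k-1} + 2^k n$ (or approximately $(P/N)\log_2(P/N) + (P/N)\log_2 N = (P/N)\log_2 P$ for $P$ elements on $N = 2^n$ cores, with $2^k = P/N$ elements per core).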
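The approximation can be written out step by step. The derivation below assumes P elements on N = 2^n cores with 2^k = P/N elements per core, so that k = log2(P/N) and n = log2 N. The factor 1/2 on the local term is dropped in the bound.

```latex
\begin{align*}
k\,2^{k-1} + n\,2^{k}
  &= \frac{P}{N}\left(\tfrac{1}{2}\log_2\frac{P}{N} + \log_2 N\right) \\
  &\le \frac{P}{N}\left(\log_2\frac{P}{N} + \log_2 N\right) \\
  &= \frac{P}{N}\log_2 P .
\end{align*}
```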
Sequential Merge
Complexity: $O(P)$ for $P$ elements
Bitonic merge: $O(P\log_2 P)$
Bitonic merge is inefficient by a factor of $\log_2 P$
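The gap can be made concrete by counting comparisons. The sketch below counts the comparisons of an ordinary two-way merge and compares them with the (P/2) log2 P comparators of a bitonic merge network. Both function names are illustrative, and P is assumed to be a power of two for the network count.

```python
def sequential_merge(a, b):
    """Merge two ascending lists; return (merged list, comparisons used)."""
    merged, comparisons = [], 0
    i = j = 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One input is exhausted; the rest of the other needs no comparisons.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged, comparisons


def bitonic_merge_comparisons(p):
    """Comparators in a bitonic merge network for p = 2**m elements."""
    return (p // 2) * (p.bit_length() - 1)


print(sequential_merge([1, 4, 6, 7], [2, 3, 5, 8])[1], bitonic_merge_comparisons(8))
# -> 7 12
```

The sequential merge never needs more than P - 1 comparisons, while the network always performs (P/2) log2 P.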
Hybrid Merge
Valiant’s Parallel Merge Sort
• Merge two sorted sequences P and R in parallel
by splitting each sequence evenly into chunks of
square-root size (P and R also denote the lengths).
• Sequence P: chunk size √P, √P chunks
• Sequence R: chunk size √R, √R chunks
• Merge the √P and √R splitters
• Insert the √P splitters into the proper chunk of R
• Repeat the process recursively for each sublist
created by the insertion
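The recursive structure in the bullets above can be sketched sequentially. This is a simplified illustration under stated assumptions, not the parallel algorithm from the notes: the splitters are taken as the last element of each full chunk, the splitters of the first sequence are placed into the second by a two-stage search (splitters first, then one chunk), and every loop that would run on separate cores is executed serially. The names `valiant_merge` and `_insertion_point` are hypothetical.

```python
import math


def _insertion_point(x, b, b_step):
    """Index in b before which x belongs, found in two stages."""
    # Stage 1: compare x with the splitters of b to pick a chunk.
    chunk = sum(1 for j in range(b_step - 1, len(b), b_step) if b[j] < x)
    # Stage 2: compare x with the elements of that single chunk.
    start = chunk * b_step
    end = min(start + b_step, len(b))
    return start + sum(1 for y in b[start:end] if y < x)


def valiant_merge(a, b):
    """Merge two ascending lists by recursive square-root splitting."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    a_step = math.isqrt(len(a))  # chunk size of about sqrt(len(a))
    b_step = math.isqrt(len(b))
    merged = []
    prev_a = prev_b = 0
    for i in range(a_step - 1, len(a), a_step):
        pos = _insertion_point(a[i], b, b_step)
        # The sublists between consecutive splitters are merged recursively.
        merged += valiant_merge(a[prev_a:i], b[prev_b:pos])
        merged.append(a[i])
        prev_a, prev_b = i + 1, pos
    merged += valiant_merge(a[prev_a:], b[prev_b:])
    return merged


print(valiant_merge([1, 4, 6, 7], [2, 3, 5, 8]))
# -> [1, 2, 3, 4, 5, 6, 7, 8]
```

Each recursive call works on a piece of the first sequence that is shorter than a chunk, which is what keeps the recursion shallow.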
Valiant’s Parallel Merge Sort
• Merging the √P + √R splitters requires √P × √R
comparisons, which can be carried out on
√P × √R cores in unit time.
• Second step the same …
• And the third …… see notes
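The all-pairs comparison in the first step can be illustrated as follows. The sketch compares every splitter of one list with every splitter of the other. These comparisons are independent of each other, so each could run on its own core. Turning the comparison results into ranks is done serially here, and the function name is an assumption for the example.

```python
def merge_by_all_pairs(xs, ys):
    """Merge two short ascending lists using len(xs) * len(ys) comparisons."""
    # One independent comparison per (x, y) pair.
    less = [[y < x for y in ys] for x in xs]
    comparisons = len(xs) * len(ys)
    merged = [None] * (len(xs) + len(ys))
    for i, row in enumerate(less):
        # Final position of xs[i]: its own index plus the ys below it.
        merged[i + sum(row)] = xs[i]
    for j in range(len(ys)):
        # Final position of ys[j]: its own index plus the xs not above it.
        not_above = sum(1 for row in less if not row[j])
        merged[j + not_above] = ys[j]
    return merged, comparisons


print(merge_by_all_pairs([3, 8], [1, 5, 9]))
# -> ([1, 3, 5, 8, 9], 6)
```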