DIRECT AND ITERATIVE METHODS FOR BLOCK TRIDIAGONAL LINEAR SYSTEMS

Don Eric Heller
April 1977

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Carnegie-Mellon University.

This research was supported in part by the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422, by the National Science Foundation under Grant MCS75-222-55, and by the National Aeronautics and Space Administration under Grant NGR-47-102-001, while the author was in residence at ICASE, NASA Langley Research Center.

ABSTRACT

Block tridiagonal systems of linear equations occur frequently in scientific computations, often forming the core of more complicated problems. Numerical methods for solution of such systems are studied with emphasis on efficient methods for a vector computer. A convergence theory for direct methods under conditions of block diagonal dominance is developed, demonstrating stability and convergence properties of the methods. Block elimination (LU factorization) is linear, cyclic odd-even reduction is quadratic, and higher-order methods exist. The odd-even methods are variations of the quadratic Newton iteration for the inverse matrix, and are the only quadratic methods within a certain reasonable class of algorithms. Semi-direct methods based on the quadratic convergence of odd-even reduction prove useful in combination with linear iterations for an approximate solution. An execution time analysis for a pipeline computer is given, with attention to storage requirements and the effect of machine constraints on vector operations.

PREFACE

It is with many thanks that I acknowledge J. F. Traub for my introduction to numerical mathematics and parallel computation, and for his support and encouragement. Enlightening conversations with G. J. Fix, S. H. Fuller, H. T. Kung, J. M. Ortega, D. K. Stevenson and R. G. Voigt were most helpful in clarifying the ideas presented in the thesis. A number of early results were obtained while in residence at ICASE, NASA Langley Research Center, and R. G. Voigt must be thanked for his contribution to some of these results. The thesis was completed while at the Pennsylvania State University, and Professor P. C. Fischer must be thanked for his forbearance. Of course, the most important contribution has been that of my wife, Molly, who has put up with more than her fair share.

CONTENTS

1. Introduction ................................. 1
   A. Summary of Main Results ................... 5
   B. Notation .................................. 8
2. Some Initial Remarks ......................... 10
   A. Analytic Tools ............................ 10
   B. Models of Parallel Computation ............ 14
3. Linear Methods ............................... 18
   A. The LU Factorization ...................... 18
      1. Block Elimination ...................... 18
      2. Further Properties of Block Elimination  25
      3. Back Substitution ...................... 30
      4. Special Techniques ..................... 31
   B. Gauss-Jordan Elimination .................. 33
   C. Parallel Computation ...................... 36
4. Quadratic Methods ............................ 37
   A. Odd-Even Elimination ...................... 37
   B. Variants of Odd-Even Elimination .......... 45
   C. Odd-Even Reduction ........................ 48
      1. The Basic Algorithm .................... 48
      2. Relation to Block Elimination and Fill-In 51
      3. Back Substitution ...................... 56
      4. Storage Requirements ................... 58
      5. Special Techniques ..................... 59
   D. Parallel Computation ...................... 60
   E. Summary ................................... 64
5. Unification: Iteration and Fill-In ........... 65
6. Higher-Order Methods ......................... 71
   A. p-Fold Reduction .......................... 71
   B. Convergence Rates ......................... 75
   C. Back Substitution
7. Semidirect Methods ........................... 80
   A. Incomplete Elimination .................... 83
   B. Incomplete Reduction ...................... 84
   C. Applicability of the Methods .............. 86
8. Iterative Methods ............................ 90
   A. Use of Elimination ........................ 92
   B. Splittings ................................ 93
   C. Multi-line Orderings ...................... 94
   D. Spiral Orderings .......................... 97
   E. Parallel Gauss ............................ 101
9. Applications ................................. 105
   A. Curve Fitting ............................. 105
   B. Finite Elements ........................... 106
10. Implementation of Algorithms ................ 110
   A. Choice of Algorithms (Generalities) ....... 110
   B. Storage Requirements ...................... 113
   C. Comparative Timing Analysis ............... 120
Appendix A. Vector Computer Instructions ........ 131
Appendix B. Summary of Algorithms and Timings ... 136
References ...................................... 156

1. INTRODUCTION

Block tridiagonal matrices are a special class of sparse matrices which typically arise in a variety of scientific and engineering computations, primarily in the numerical solution of differential equations. For now it is sufficient to say that the matrix looks like

        ( b_1  c_1                              )
        ( a_2  b_2  c_2                         )
    A = (      a_3  b_3  c_3                    )
        (           .    .    .                 )
        (             a_{N-1}  b_{N-1}  c_{N-1} )
        (                      a_N      b_N     )

where b_i is an n_i × n_i matrix and the a_i, c_i are dimensioned conformally, so that the full matrix A has dimension (Σ_{j=1}^N n_j) × (Σ_{j=1}^N n_j).

In this thesis we study numerical methods for solution of the linear system Ax = v, with emphasis on efficient methods for a vector-oriented parallel computer (e.g., CDC STAR-100, Illiac IV). Our discussion primarily concerns analytic properties of the algorithms and areas of application. A unifying theme is that a wide variety of direct and iterative methods can be viewed as special cases of a general class of algorithms; for this reason we make considerable use of a convergence theory for direct methods and of the inherent parallelism of matrix computations. Algorithms are compared on this basis, with execution time estimates made for a simplified model of the CDC STAR-100; many of our results also apply to standard sequential computers.

There are two important cases, depending on whether the blocks are small and dense or large and sparse. Such considerations motivate the internal economy of a computational method, which should take into account the block structure of the linear system and the sparsity of the blocks, maintaining a compact storage representation and a low operation count. There are also important applications in which an approximate solution is satisfactory; these often motivate iterative methods, especially when a direct method would be far more expensive.

Since 1965, however, direct methods have created considerable interest for the solution of sparse systems derived from elliptic partial differential equations. The simplest case is the standard five-point finite difference approximation to Poisson's equation on a rectangle. With a standard ordering of the grid the a_i, c_i blocks are diagonal and the b_i blocks are tridiagonal, so A possesses a very regular sparse structure. For an n × n square grid, methods then in standard use would solve Ax = v in O(n^3) arithmetic operations or more. Hockney's seminal paper [H5], based on an original suggestion by Golub, showed that the use of Fourier transforms (via the Fast Fourier Transform) and cyclic reduction, specializing to Poisson's equation, was able to obtain the solution in significantly less time (O(n^2 log n) operations); methods then extant had considerably larger asymptotic constants and proved more expensive in practice. This new method attracted much attention to the direct solution of systems derived from two-dimensional elliptic equations, and subsequent work dealt with generalizations of cyclic reduction and with Buneman's algorithm.

Moreover, there has been an increased interest in direct methods for special subclasses of block tridiagonal systems, obtained by further specializing the structure of the blocks. The Buneman version of cyclic reduction is stable and accurate [B20], and the basic methods have been extended to general separable elliptic equations and to Poisson's equation on nonrectangular regions (e.g., [S6], [S8]); well-tested Fortran subroutine packages are now available to programmers and numerical analysts for the case of a rectangle. Other methods include the Fourier-Toeplitz methods [B5] and the marching methods, which are fast but unstable unless great care is taken to control the instability; see the recent work of Bank [B19]. Interesting direct methods for these problems have been developed by Eisenstat, Schultz and Sherman [E1], George [G1], and Rose [R5], among others; George's nested dissection is an attractive direct method for nonseparable and nonrectangular problems, although it is still hard to implement. The nonseparable case is, in general, best handled at present by combining a fast direct method for a separable equation with iteration, so that a sequence of separable problems is solved (e.g., [C4], [C5]). Brandt [B13] discusses multi-level adaptive techniques, a rather different approach combining iteration and discretization, and this approach will undoubtedly play an important role in future work. References [D1], [D2], [F2], [H6], [S5], [S7] and [S9] include analyses, generalizations and tests of these methods, and Ortega and Voigt [O1] survey methods for parallel and vector computers.

Despite this success, there is still no completely satisfactory fast direct method for general nonseparable elliptic equations, and iterative methods will continue to play an important role in their solution. Nevertheless, direct methods for block tridiagonal systems frequently form the core of the computation.

With these developments in mind, the goal of this thesis is to present a reasonably general analysis of direct methods for the solution of block tridiagonal linear systems Ax = v. Our investigation includes the major research themes in the area: block elimination (LU factorization), the cyclic (odd-even) reduction algorithm, Sweet's generalized cyclic reduction [S10], and variations on these basic algorithms. The methods are best applied in cases where the blocks are small enough to be stored explicitly and handled as full matrices, though some storage-efficient special techniques are also discussed. We show that, under conditions of block diagonal dominance, each basic algorithm is described by a sequence of matrix transformations which do not increase, and usually decrease, certain norms measuring the relative diagonal dominance of the matrix; by this measure block elimination is classified as linear, while the odd-even methods converge quadratically. The quadratic convergence properties lead naturally to semi-direct methods, which run a direct method only until the off-diagonal blocks are sufficiently small and then take an approximate solution from the resulting system. Some of our results previously appeared in [H2].

All of the algorithms discussed here can be run, with varying degrees of efficiency, on a vector-oriented parallel computer: an array processor such as Illiac IV, or a pipeline computer such as the CDC STAR-100 or the Texas Instruments ASC. These machines fall within the single instruction stream-multiple data stream classification, and have sufficiently many common features to make a general comparison of algorithms possible. However, there are also many machine restrictions that must be considered when designing programs for a particular computer. Our comparisons are made using a simplified model of the STAR, and we believe that the results indicate how similar comparisons should behave for other machines of this class. In almost every case a general method applied naively to a particular problem can be expected to be inferior to a method tailored to that problem and machine, but the costs of tailoring can be quite high; observations about the behavior of the general methods presented here will allow us to move in that direction. For background material on parallel computation, the survey of parallel methods in [H3] is recommended.

1.A. Summary of Main Results

Direct methods for solution of the linear system Ax = v are discussed in Sections 3-6, semidirect methods in Section 7, and iterative methods in Section 8. Preliminaries (analytic tools and models of parallel computation) are in Section 2, remarks on certain applications are in Section 9, and implementation of algorithms on a parallel computer is discussed in Section 10.

We consider a direct method for solution of Ax = v as a sequence of transformations applied to the linear system with the intention of simplifying its structure. For a general partitioned matrix A, the matrix B[A] = I - (D[A])^{-1}A, where D[A] is the block diagonal part of A, is shown to be important for stability analysis, error propagation in back substitutions, and approximation errors in semidirect methods. We appear to be the first to systematically exploit B[A] in the analysis of direct and semidirect methods, though it has long been used to analyze iterative methods.
Under the assumption that a direct method generates a sequence of matrices M_i with ||B[M_1]|| < 1, we classify the method as linear when ||B[M_{i+1}]|| ≤ ||B[M_i]||, and as quadratic when ||B[M_{i+1}]|| ≤ ||B[M_i]||^2. The ∞-norm is used throughout; extension of our results to other norms would be desirable, but proves quite difficult.

The block LU factorization and block Gauss-Jordan elimination are shown to be linear methods, and are discussed in Section 3. Sample results are bounds on the condition numbers of the pivotal blocks and stability properties of the factorization; the Gauss-Jordan algorithm is seen to be a variation of the LU factorization.

The quadratic methods, odd-even elimination and odd-even reduction, are closely related to the fast Poisson solvers and are investigated in Section 4, where convergence to zero of portions of the matrix is shown to be quadratic. Odd-even elimination can be considered as a variation of the quadratic Newton iteration for A^{-1}, and odd-even reduction as block elimination on a permuted system. The relationship of odd-even reduction to the general cyclic reduction technique of Hageman and Varga [H1] is also discussed, and certain aspects of the fast Poisson solvers can be considered as variations of the odd-even algorithms.

In Section 5 we discuss a general class of algorithms, characterized as combinations of iteration and fill-in, which includes all the previous methods as special cases. We show that the odd-even methods are the only quadratic methods within a certain reasonable subclass of these algorithms.

In Section 6 the quadratic methods are shown to be special cases of families of higher-order methods, characterized by ||B[M_{i+1}]|| ≤ ||B[M_i]||^p. We note that the p-fold reduction (Sweet's generalized reduction [S10]) is equivalent to a 2-fold (odd-even) reduction using a coarser partitioning of A.
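The linear/quadratic classification above can be illustrated numerically. The sketch below is my own illustration in NumPy, not code from the thesis; it writes one step of odd-even elimination in the form A ← (2D[A] - A)D[A]^{-1}A, which for a matrix with a single pair of off-diagonal bands annihilates those bands while doubling their distance from the diagonal, and it checks the quadratic bound ||B[M_{i+1}]|| ≤ ||B[M_i]||^2 on the model matrix (-1, 4, -1).

```python
import numpy as np

def inf_norm(M):
    # ||M|| = max_i sum_j |m_ij|, the infinity norm used throughout the thesis
    return np.abs(M).sum(axis=1).max()

def jacobi_norm(A):
    # ||B[A]|| with the point partition: B[A] = I - D[A]^{-1} A
    D = np.diag(np.diag(A))
    return inf_norm(np.eye(A.shape[0]) - np.linalg.solve(D, A))

# model problem: A = tridiag(-1, 4, -1), so ||B[A]|| = 1/2 initially
n = 32
A = (np.diag(4.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))

norms = [jacobi_norm(A)]
for _ in range(4):
    D = np.diag(np.diag(A))
    # one odd-even elimination step: A <- (2D - A) D^{-1} A.
    # This adds multiples of the neighboring rows to each row, zeroing the
    # current off-diagonal bands and creating fill at twice the distance.
    A = (2.0 * D - A) @ np.linalg.solve(D, A)
    norms.append(jacobi_norm(A))

# quadratic convergence: each norm is at most the square of its predecessor
for before, after in zip(norms, norms[1:]):
    assert after <= before**2 + 1e-12
print(norms)
```

On this example the norms drop roughly as 0.5, 0.14, 0.01, ..., i.e., at least as fast as squaring at every step.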
Semidirect methods, based on the quadratic convergence properties of the odd-even methods, are discussed in Section 7. Briefly, a semidirect method is a direct method that is stopped before completion, producing an approximate solution. We analyze the approximation error of odd-even reduction, its propagation in the back substitution, and rounding error properties, emphasizing the utility of the semidirect methods vis-a-vis iterative methods for systems arising from differential equations.

In Section 8 we consider iterative methods, some of which use the fast Poisson solvers as a basis, and general methods for systems with small blocks, emphasizing some practical limitations for parallel computers. We discuss algorithms using elimination and splittings, the natural, multi-line and spiral orderings of a rectangular grid, and also remark on the use of the Parallel Gauss iteration [T2], [H4].

A brief section on applications deals with two situations in which the assumption ||B[A]|| < 1 is not met, namely curve fitting and finite elements; in the latter case we describe a modification to the system Ax = v that will obtain ||B[A]|| < 1.

Methods for solution of block tridiagonal linear systems are compared in Section 10, using a simplified model of a pipeline (vector) computer. We conclude that a combination of odd-even reduction and LU factorization is a powerful direct method, and that a combination of semidirect and iterative methods may provide limited savings if an approximate solution is desired. A general discussion of storage requirements and the effect of machine constraints on vector operations is included; these are essential considerations for practical use of the algorithms.

1.B. Notation

Let the block tridiagonal matrix

        ( b_1  c_1                              )
        ( a_2  b_2  c_2                         )
    A = (      a_3  b_3  c_3                    )  = (a_j, b_j, c_j)_N.
        (           .    .    .                 )
        (             a_{N-1}  b_{N-1}  c_{N-1} )
        (                      a_N      b_N     )

Here b_i is an n_i × n_i matrix, and a_1 ≡ 0, c_N ≡ 0. The subscript N in the triplex notation will often be omitted for simplicity. Define

    L[A] = (a_j, 0, 0),    D[A] = (0, b_j, 0),    U[A] = (0, 0, c_j),
    B[A] = I - (D[A])^{-1}A = (-b_j^{-1}a_j, 0, -b_j^{-1}c_j),

    C[A] = I - A(D[A])^{-1} = (-a_j b_{j-1}^{-1}, 0, -c_j b_{j+1}^{-1}).

Use of these notations for general partitioned matrices should be evident; nonsingularity of D[A] is assumed in the definitions of B and C. The matrix B[A] is associated with the block Jacobi linear iteration for the solution of Ax = v. It also plays an important role in the study of certain semidirect methods, and will be central to our analysis of direct methods.

We further define E_j = diag(0, ..., 0, I, 0, ..., 0), where the n_j × n_j identity I occurs in the jth diagonal block position. Pre- (post-) multiplication by E_j selects the jth block row (column) of a matrix, so that, for example,

    D[A] = Σ_{i=1}^N E_i A E_i,    D[A]E_j = E_j D[A],    Σ_{i=1}^N E_i = I,    E_1 L[A] = 0.

For an n × n matrix M and an n-vector v, let

    ρ_i(M) = Σ_{j=1}^n |m_ij|,    γ_j(M) = Σ_{i=1}^n |m_ij| = ρ_j(M^T),
    ||M|| = max_i ρ_i(M), the ∞-norm,    ||M||_1 = max_j γ_j(M) = ||M^T||,
    ||v|| = max_i |v_i|,    ||v||_1 = Σ_{i=1}^n |v_i|,
    |M| = (|m_ij|),    ρ(M) = the spectral radius of M.

The reader should take care not to confuse the row sums ρ_i(M) with the spectral radius ρ(M). When A = (a_j, b_j, c_j), let

    λ(A) = max_j ( ||b_j^{-1}a_j|| + ||b_j^{-1}c_j|| ).

The condition λ(A) < 1 will be used as an alternative to ||B[A]|| < 1.

2. SOME INITIAL REMARKS

2.A. Analytic Tools

Much of the analysis presented in later sections involves the matrix B[A]. Why should we consider this matrix at all?

First, it is a reasonably invariant quantity which measures the block diagonal dominance of A. That is, if E is any nonsingular block diagonal matrix conforming with the partitioning of A, and A* = EA, then D[A*] = ED[A], so B* = B[A*] = I - (ED)^{-1}(EA) = I - D^{-1}A = B. Thus B[A] is invariant under block scalings of the equations Ax = v. On the other hand, B is not invariant under a change of variables y = Ex, which induces A' = AE^{-1}; we have B' = B[A'] = EBE^{-1}.
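These invariance claims are easy to check numerically. The following sketch is my own illustration (assuming NumPy); `block_jacobi_matrix` is a hypothetical helper, not a thesis routine, that forms B[A] = I - D[A]^{-1}A for a given partition.

```python
import numpy as np

def block_jacobi_matrix(A, sizes):
    # B[A] = I - D[A]^{-1} A, where D[A] is the block diagonal of A
    # under the partition sizes = (n_1, ..., n_N)
    D = np.zeros_like(A)
    s = 0
    for ni in sizes:
        D[s:s+ni, s:s+ni] = A[s:s+ni, s:s+ni]
        s += ni
    return np.eye(A.shape[0]) - np.linalg.solve(D, A)

rng = np.random.default_rng(0)
sizes = [2, 2, 2]
A = np.diag(4.0 * np.ones(6)) + 0.5 * rng.standard_normal((6, 6))
A[0:2, 4:6] = 0.0          # keep A block tridiagonal
A[4:6, 0:2] = 0.0

B = block_jacobi_matrix(A, sizes)

# block scaling of the equations, A* = EA with E block diagonal: B is unchanged
E = np.kron(np.diag([1.0, 2.0, 3.0]), np.eye(2))
B_scaled = block_jacobi_matrix(E @ A, sizes)
assert np.allclose(B, B_scaled)

# change of variables, A' = A E^{-1}: B becomes the similar matrix E B E^{-1}
B_vars = block_jacobi_matrix(A @ np.linalg.inv(E), sizes)
assert np.allclose(B_vars, E @ B @ np.linalg.inv(E))
print("invariance checks passed")
```

Since E B E^{-1} is a similarity transformation, the spectral radius ρ(B) survives both transformations even though ||B|| does not.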
Similarly, C[A] measures the block diagonal dominance of A by columns; it is invariant under a block change of variables, but not under block scalings of the equations, and for general block diagonal E the spectral radius ρ(B) is invariant under both transformations.

Secondly, one simple class of parallel semidirect methods (Section 7) can be created by transforming Ax = v into A*x = v*, A* = MA, v* = Mv, where M represents the initial stages of some direct method and ||B[A*]|| << ||B[A]||. An approximate solution y = D[A*]^{-1}v* is then computed. The error is x - y = B[A*]x, so ||B[A*]|| typically measures the relative error. Thus we study the behavior of B[A] under various transformations of the equations.

Thirdly, B[A] is the matrix which determines the convergence rate of the block Jacobi iteration for Ax = v. This is a natural parallel iteration, as are the faster semi-iterative methods based on it when A is positive definite. It is known that various transformations can affect the convergence rates of an iteration, so we study this property.

Finally, B[A] and C[A] arise naturally in the analysis of block elimination methods to solve Ax = v: C represents the multipliers used in elimination, and B represents quantities used in back substitution. In order to place upper bounds on the effects of elimination we consider conditions on A which guarantee ||B[A]|| < 1. Since we will make frequent use of this condition in the analysis, we now discuss conditions similar to it.

Lemma 2.1. Let J = J_1 + J_2 be order n, |J| = |J_1| + |J_2|, and suppose K satisfies K = J_1 K + J_2. Then ||J|| < 1 implies ρ_i(K) ≤ ρ_i(J) for 1 ≤ i ≤ n, and ||K|| ≤ ||J||.

Proof. From |J_1| + |J_2| = |J| we have ρ_i(J_1) + ρ_i(J_2) = ρ_i(J) for 1 ≤ i ≤ n. From K = J_1 K + J_2 we have ρ_i(K) ≤ ρ_i(J_1)||K|| + ρ_i(J_2). Suppose ||K|| ≥ 1.
Then ρ_i(K) ≤ ρ_i(J_1)||K|| + ρ_i(J_2)||K|| = ρ_i(J)||K|| ≤ ||J|| ||K||, which implies ||K|| ≤ ||J|| ||K|| and 1 ≤ ||J||, a contradiction. Thus ||K|| < 1, and ρ_i(K) ≤ ρ_i(J_1) + ρ_i(J_2) = ρ_i(J). QED

We note that if ρ_i(J_1) ≠ 0 then we actually have ρ_i(K) < ρ_i(J), while if ρ_i(J_1) = 0 then ρ_i(K) = ρ_i(J_2) = ρ_i(J). It follows that if each row of J_1 contains a nonzero element then ||K|| < ||J||. These conditions are reminiscent of the Stein-Rosenberg theorem ([V3], [H8]), which states that if J is irreducible, J_1 and J_2 are nonnegative, J_2 ≠ 0 and ρ(J_1) < 1, then ρ(J) < 1 implies ρ(K) < ρ(J), and ρ(J) = 1 implies ρ(K) = ρ(J). There are also correspondences to the theory of regular splittings [V3].

Lemma 2.2. Suppose J = (I - S)K + T is order n, |J| = |(I - S)K| + |T|, and ρ_i(S) ≤ ρ_i(T) for 1 ≤ i ≤ n. Then ||J|| < 1 implies ρ_i(K) ≤ ρ_i(J) for 1 ≤ i ≤ n, and ||K|| ≤ ||J||.

Proof. For 1 ≤ i ≤ n, ρ_i(T) ≤ ρ_i(J) ≤ ||J|| < 1. Now,

    ρ_i(J) = ρ_i((I - S)K) + ρ_i(T) ≥ ρ_i(K) - ρ_i(S)||K|| + ρ_i(T) ≥ ρ_i(K) + ρ_i(T)(1 - ||K||).

Suppose ||K|| = ρ_j(K) ≥ 1 for some j. Then 1 > ρ_j(J) ≥ ρ_j(K)(1 - ρ_j(T)) + ρ_j(T) ≥ 1, a contradiction. Thus ||K|| < 1, and ρ_i(K) ≤ ρ_i(K) + ρ_i(T)(1 - ||K||) ≤ ρ_i(J). QED

If S = T in Lemma 2.2 then K = SK + (J - S) and |J| = |S| + |J - S|, so by Lemma 2.1 we again conclude that ||K|| ≤ ||J||. Similar results hold for the 1-norm.

Lemma 2.3. Let J = J_1 + J_2 be order n, |J| = |J_1| + |J_2|, and suppose K satisfies K = KJ_1 + J_2. Then ||J||_1 < 1 implies ||K||_1 ≤ ||J||_1.

Proof. Since J^T = J_1^T + J_2^T, |J^T| = |J_1^T| + |J_2^T|, K^T = J_1^T K^T + J_2^T, and γ_i(M) = ρ_i(M^T), ||M||_1 = ||M^T||, Lemma 2.1 applies to the transposed system: ||J||_1 = ||J^T|| < 1 implies ||K||_1 = ||K^T|| ≤ ||J^T|| = ||J||_1. QED

Lemma 2.4. Suppose J = K(I - S) + T is order n, |J| = |K(I - S)| + |T|, and γ_i(S) ≤ γ_i(T) for 1 ≤ i ≤ n. Then ||J||_1 < 1 implies ||K||_1 ≤ ||J||_1.

Proof. We have J^T = (I - S^T)K^T + T^T, |J^T| = |(I - S^T)K^T| + |T^T|, and ρ_i(S^T) = γ_i(S) ≤ γ_i(T) = ρ_i(T^T), so by Lemma 2.2, ||K||_1 = ||K^T|| ≤ ||J^T|| = ||J||_1. QED

To investigate the condition ||B[A]|| < 1 we first note that it holds, for any partitioning of A, whenever A is strictly diagonally dominant by rows. Indeed, if we take J to be the point Jacobi linear iteration matrix for A, J = I - diag(A)^{-1}A, with J_1 = D[J] and J_2 = J - J_1, then strict diagonal dominance gives ||J|| < 1, while B[A] = (I - J_1)^{-1}J_2 satisfies B[A] = J_1 B[A] + J_2 and |J| = |J_1| + |J_2|; so ||B[A]|| ≤ ||J|| < 1 by use of Lemma 2.1.
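The remark above — strict point diagonal dominance by rows forces ||B[A]|| ≤ ||J|| < 1 for every partitioning of A — can be checked numerically. The sketch below is my own illustration in NumPy, with an assumed helper `block_jacobi_matrix` (not a thesis routine) forming B[A] for a given partition.

```python
import numpy as np

def inf_norm(M):
    return np.abs(M).sum(axis=1).max()

def block_jacobi_matrix(A, sizes):
    # B[A] = I - D[A]^{-1} A for the block partition sizes = (n_1, ..., n_N)
    D = np.zeros_like(A)
    s = 0
    for ni in sizes:
        D[s:s+ni, s:s+ni] = A[s:s+ni, s:s+ni]
        s += ni
    return np.eye(A.shape[0]) - np.linalg.solve(D, A)

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)
# make A strictly diagonally dominant by rows, so that ||J|| = 1/1.1 < 1
np.fill_diagonal(A, 1.1 * np.abs(A).sum(axis=1))

normJ = inf_norm(block_jacobi_matrix(A, [1] * n))   # point Jacobi matrix J
block_norms = [inf_norm(block_jacobi_matrix(A, sizes))
               for sizes in ([2, 2, 2, 2], [4, 4], [1, 3, 4])]

assert normJ < 1.0
for nb in block_norms:
    assert nb <= normJ + 1e-12   # ||B[A]|| <= ||J|| for every partition
print(normJ, block_norms)
```

Note that every diagonal block of a strictly row-diagonally-dominant matrix is itself strictly diagonally dominant, so the block diagonal D[A] is nonsingular for each partition tested.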
Now, if we denote the partitioning of A by the vector Π = (n_1, n_2, ..., n_N), then the above remark may be generalized. Suppose Π' = (m_1, m_2, ..., m_M) represents another partitioning of A which refines Π; that is, there are integers 1 = p_0 < p_1 < ... < p_N = M + 1 such that n_j = Σ_{i=p_{j-1}}^{p_j - 1} m_i. If B[A] and B'[A] are defined by the two partitions, then ||B[A]|| ≤ ||B'[A]||. This is an instance of the rule for iterative methods that "more implicitness means faster convergence", and will be useful in Section 6. Briefly stated, one simple way to decrease ||B[A]|| is to use a coarser partition.

Let us consider a simple example to illustrate these ideas. Using finite differences for the boundary value problem -u_xx - u_yy + λu = f on [0,1]^2, λ ≥ 0, with Dirichlet boundary conditions, gives rise to the matrix A = (-I, M, -I), M = (-1, 4 + λh^2, -1). The point Jacobi matrix J[A] satisfies ||J[A]|| = 4/(4 + λh^2) < 1, so by Lemma 2.1, ||B[A]|| ≤ 4/(4 + λh^2). A sharper analysis (cf. [H8, p. 107]) shows that ||B[A]|| = 2||M^{-1}|| ≤ 2/(2 + λh^2), which is smaller still.

Many of the block tridiagonal matrices considered in practice are derived from continuous problems of this type, and for these ||B[A]|| will frequently be quite close to 1. For certain finite element approximations the condition ||B[A]|| < 1 need not hold at all, so a different analysis will be needed; in Section 9.B we consider approximations of this type. The condition ||B[A]|| < 1 is easily related to the ∞-norm definition of block diagonal dominance ([F1], [V1]):

    ||b_j^{-1}|| (||a_j|| + ||c_j||) ≤ 1,    1 ≤ j ≤ N.

Varah [V2] gives an analysis based on another norm and on a change of variables, which allows weaker conditions than strict block diagonal dominance; for completeness we should also consider such generalizations, but this will not always be done.

Other concepts related to diagonal dominance are needed in order to read the references. If J = I - diag(A)^{-1}A = L + U, where L (U) is strictly lower (upper) triangular, then A is a G-matrix if ||(I - |L|)^{-1}|U||| < 1, and an H-matrix if ρ(|J|) < 1. A strictly diagonally dominant matrix is both a G-matrix and an H-matrix. Ostrowski [O2] shows that A is an H-matrix iff there exists a nonsingular diagonal matrix E such that AE is strictly diagonally dominant. Robert [R3] generalizes these definitions to block G- and H-matrices using vectorial and matricial norms, and proves convergence of the classical iterative methods for Ax = v; Varga [V4] summarizes recent results on H-matrices. We have found that block diagonal dominance, stated as ||B[A]|| < 1, is the right assumption to use when analyzing block elimination, and often more convenient than these concepts; for simplicity, and to keep the discussion reasonably concise, we will not be completely general.

2.B. Models of Parallel Computation

In order to read Sections 3-9, only a very loose conception of a vector computer is necessary; a more detailed description is needed for the execution time comparison of methods in Section 10. This subsection gives the initial part of the details, while additional information may be found in Appendix A, [H3], [S3], and the references given below.

Parallel computers fall into several general classes, of which we consider the vector computers, namely those capable of performing floating point vector operations in a synchronous manner. Examples are the array processors, such as Illiac IV with 64 parallel processing elements [B6], [B12], and the pipeline machines, such as the CDC STAR-100 [C2] and the Texas Instruments ASC [W1], which can handle vectors of (essentially) arbitrary length. These machines have become important tools for large scientific computations, and there is much interest in algorithms able to make good use of special machine capabilities. Details vary widely — the Cray-1, with eight 64-word vector registers [C8], is a more recent example — but there are sufficiently many common features to make a general comparison of algorithms possible.

An important computer not being considered here is the asynchronous parallel processor, such as Carnegie-Mellon University's C.mmp [W4], with up to 16 minicomputer processors. Baudet [B7] discusses the theory of asynchronous iterations for linear and nonlinear systems of equations, plus results of tests on C.mmp.

Instructions for our model vector computer consist of the usual scalar operations and a special set of vector operations, some of which will be described shortly. We shall regard a vector as an ordered set of storage locations, with no particular conceptual arrangement of the locations in storage. A machine vector is a more specialized set of locations, usually required to be in arithmetic progression. The CDC STAR restricts valid machine vectors to have increment = 1, which often forces programs to perform extra data manipulation simply to organize vectors into machine vectors.

We denote vector operations by a parenthesized list of indices. Thus the vector sum w = x + y is written in the form

    w_i = x_i + y_i,  (1 ≤ i ≤ N);

if the vectors are in R^N, the componentwise product would be w_i = x_i × y_i, (1 ≤ i ≤ N), and a matrix multiplication C = AB would be

    c_ij = Σ_{k=1}^N a_ik b_kj,  (1 ≤ i ≤ N; 1 ≤ j ≤ N).

Execution time for a vector operation depends on the length N of the vector and the internal mechanism of the computer. On a k-parallel computer (one with k parallel processors or k-word vector registers) the time for a vector operation would be t_op ⌈N/k⌉ + overhead, where t_op is the basic scalar operation time. On a pipeline computer the vector instruction execution time can usually be written in the form τ_op N + σ_op, where 1/τ_op is the asymptotic result rate and σ_op is the vector instruction startup cost; usually τ_op << t_op << σ_op. In either case the model τN + σ is adequate. Simplified instruction execution times for the CDC STAR-100, in terms of the 40 nanosecond cycle time, are given below.

    instruction        scalar time    vector time
    add, subtract          13          0.5N +  71
    multiply               17            N + 159
    divide                 46           2N + 167
    square root            74           2N + 155
    transmit               11          0.5N +  91
    maximum                 -           6N +  95
    summation               -         6.5N + 122
    inner product           -           6N + 130

The use of long vectors is essential in order to take full advantage of a vector computer: the rather severe penalty for operations on short vectors means that one must be careful to design algorithms which keep startup costs to a minimum. Scalar operations can also be a source of inefficiency even when most of the code consists of vector operations. Data manipulation is often an essential consideration, especially when there are strong restrictions on machine vectors. These issues and others are illustrated in our comparison of algorithms in Section 10, which is based on information for the CDC STAR-100.

3. LINEAR METHODS

The simplest approach to direct solution of a block tridiagonal linear system Ax = v is block elimination, which effectively computes an LU factorization of A. We discuss the process of block elimination on a general partitioned matrix under the assumption ||B[A]|| < 1; the process is shown to be linear and stable. Various special properties of block elimination on block tridiagonal systems are considered, which bound the condition numbers of the pivotal blocks, describe error propagation in the back substitution, and yield special techniques for special systems. Gauss-Jordan elimination is studied for its numerical properties, which foreshadow properties of the more important odd-even elimination and reduction algorithms.

3.A. The LU Factorization

3.A.1. Block Elimination

The block tridiagonal LU factorization

    A = (a_j, b_j, c_j)_N = (ℓ_j, I, 0)_N (0, d_j, c_j)_N

is computed by

    d_1 = b_1,
    d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},    j = 2, ..., N,
    ℓ_j = a_j d_{j-1}^{-1},                  j = 2, ..., N.

For the moment the existence of d_{j-1}^{-1} is assumed; it will be demonstrated when ||B[A]|| < 1. We are not concerned here with the particular methods used to compute ℓ_j or d_{j-1}^{-1}c_{j-1}, or to avoid computing either.

Let A_1 = A and consider the matrix A_2 which is obtained after the first block elimination step. We have A_2 = (ã_j, b̃_j, c_j)_N, where b̃_1 = d_1 = b_1, ã_2 = 0, b̃_2 = d_2, and ã_j = a_j, b̃_j = b_j for 3 ≤ j ≤ N. Let B_1 = B[A_1], B_2 = B[A_2]. Note that B_1 and B_2 differ only in the second block row; more specifically, in the (2,1) and (2,3) positions.

Theorem 3.1. If ||B_1|| < 1 then B_2 is well-defined and ||B_2|| ≤ ||B_1||.

Proof. Let T = B_1 E_1 and let S = T B_1. By construction, B_1 = (I - S)B_2 + T and |B_1| = |(I - S)B_2| + |T|. Also, ρ_j(S) ≤ ρ_j(T)||B_1|| ≤ ρ_j(T). We have d_2 = b_2(I - b_2^{-1}a_2 b_1^{-1}c_1), with ||b_2^{-1}a_2 b_1^{-1}c_1|| ≤ ||S|| ≤ ||B_1|| ||E_1|| ||B_1|| = ||B_1||^2 < 1, so d_2^{-1} exists and B_2 is well-defined. Thus Lemma 2.2 applies, yielding ||B_2|| ≤ ||B_1||. QED

Now, for any partitioned matrix the process of block elimination (no block pivoting) can be described by the N-1 major stages

    A_1 = A
    for i = 1, ..., N-1
        A_{i+1} = (I + L[C[A_i]]E_i)A_i = (I - L[A_i]E_i D[A_i]^{-1})A_i.

We transform A_i into A_{i+1} by eliminating all the blocks of column i below the diagonal; when A is block tridiagonal, A_N is the upper block triangular matrix (0, d_j, c_j)_N. The same effect can be had by eliminating the blocks individually:

    A_1 = A
    for i = 1, ..., N-1
        A_{i+1}^{(i)} = A_i
        for j = i+1, ..., N
            A_{i+1}^{(j)} = (I + E_j C[A_{i+1}^{(j-1)}]E_i)A_{i+1}^{(j-1)}
                          = (I - E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1})A_{i+1}^{(j-1)}
        A_{i+1} = A_{i+1}^{(N)}.

Each minor stage affects only block row j of the matrix. It is seen that

    block row j of A_{i+1}^{(j-1)} = block row j of A_i,
    block row i of A_{i+1}^{(j-1)} = block row i of A_i,

from which it follows that

    E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1} = E_j A_i E_i D[A_i]^{-1},

so the minor stages indeed generate the same sequence of matrices A_i. Let B_i = B[A_i].

Corollary 3.1. If A is block tridiagonal and ||B_1|| < 1, then ||B_{i+1}|| ≤ ||B_i||, 1 ≤ i ≤ N-1.

The proof of Corollary 3.1 requires a bit more care than the proof of Theorem 3.1 would seem to indicate; the difficulty will be resolved in Theorem 3.2. We should also note the trivial "error relations"

    ||A_{i+1} - A_N|| ≤ ||A_i - A_N||,    ||B_{i+1} - B_N|| ≤ ||B_i - B_N||,

which follow from the fact that, for block tridiagonal A,

    A_i = Σ_{j=1}^{i} E_j A_N + Σ_{j=i+1}^{N} E_j A,

and this holds regardless of the value of ||B_1||.

Wilkinson [W3] has given a result for Gaussian elimination that is similar to Corollary 3.1, namely that if A is strictly diagonally dominant by columns then ||A_{i+1}(reduced)||_1 ≤ ||A_i(reduced)||_1. This actually applies to arbitrary matrices and not just the tridiagonal case, and the same will be true of our results. Brenner [B14] anticipated the characterization of block elimination as a projection method. Our next two theorems generalize Corollary 3.1, leading to the characterization of block elimination as a norm-non-increasing process; in addition, many elimination schemes are equivalent to block elimination on a permuted matrix, so the results will be immediately useful in some special applications.

Theorem 3.2. If ||B_1|| < 1 then ||B_{i+1}|| ≤ ||B_i|| and ||A_{i+1}|| ≤ ||A_i|| for 1 ≤ i ≤ N-1.

Proof. Suppose ||B_i|| < 1. We have

    A_{i+1} = (I - L[A_i]E_i D[A_i]^{-1})A_i = A_i - L[A_i]E_i + L[A_i]E_i B_i,

so

    ρ_j(A_{i+1}) ≤ ρ_j(A_i - L[A_i]E_i) + ρ_j(L[A_i]E_i B_i)
                ≤ ρ_j(A_i - L[A_i]E_i) + ρ_j(L[A_i]E_i)||B_i||
                ≤ ρ_j(A_i - L[A_i]E_i) + ρ_j(L[A_i]E_i) = ρ_j(A_i),

and ||A_{i+1}|| ≤ ||A_i||. (The last equality holds because A_i - L[A_i]E_i and L[A_i]E_i have their nonzeros in disjoint positions.)

Now let Q = D[A_i]^{-1}L[A_i]E_i D[A_i]^{-1}A_i, so that A_{i+1} = A_i - D[A_i]Q, and let S = D[Q], so that D[A_{i+1}] = D[A_i](I - S). Since Q = (-L[B_i]E_i)(I - B_i) = -L[B_i]E_i + L[B_i]E_i B_i, we have S = D[L[B_i]E_i B_i] and ||S|| ≤ ||L[B_i]|| ||E_i|| ||B_i|| ≤ ||B_i||^2 < 1, so D[A_{i+1}]^{-1} exists and B_{i+1} is well-defined.

Let T = B_i - S + Q and J = S + T = B_i + Q. It may be verified that (I - S)B_{i+1} = T, so that B_{i+1} = SB_{i+1} + T. Since D[B_i] = 0 and D[Q - S] = 0, we have D[T] = 0 and |J| = |S| + |T|. Thus, by Lemma 2.1, ||J|| < 1 implies ||B_{i+1}|| ≤ ||J||. But after some algebraic manipulation,

    J = B_i + Q = (B_i - L[B_i]E_i) + L[B_i]E_i B_i,

and since B_i - L[B_i]E_i and L[B_i]E_i have their nonzeros in disjoint positions,

    ρ_j(J) ≤ ρ_j(B_i - L[B_i]E_i) + ρ_j(L[B_i]E_i)||B_i|| ≤ ρ_j(B_i),

from which it follows that ||J|| ≤ ||B_i||. QED

Although Theorem 3.2 is expressed in terms of the N-1 major stages of block elimination, it is clear from the proof that for the minor stages, with B_i^{(j)} = B[A_i^{(j)}], we also have ||B_i^{(j)}|| ≤ ||B_i^{(j-1)}|| and ||A_i^{(j)}|| ≤ ||A_i^{(j-1)}||. Moreover, such inequalities will hold regardless of the order of elimination, as long as the blocks eliminated are below the diagonal.

We shall adopt the term "linear method" for any process which generates a sequence of matrices M_i with ||B[M_{i+1}]|| ≤ ||B[M_i]||, given that ||B[M_1]|| < 1. A result similar to Theorem 3.2 is the following. Let Â_i be the reduced matrix consisting of the last N-i+1 block rows and columns of A_i (Â_1 = A_1 = A), and let C_i = C[A_i]. If ||C[A]||_1 < 1 then

    ||C[Â_{i+1}]||_1 ≤ ||C[Â_i]||_1,    ||Â_{i+1}||_1 ≤ ||Â_i||_1.
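The block LU factorization and the accompanying back substitution can be sketched as follows. This is my own illustration in NumPy, not the thesis's vector-computer implementation; `block_tridiag_solve` is a name of my choosing, and d_{j-1}^{-1} is applied through dense solves rather than formed explicitly.

```python
import numpy as np

def block_tridiag_solve(a, b, c, v):
    """Solve Ax = v for A = (a_j, b_j, c_j)_N block tridiagonal, using the
    block LU factorization d_1 = b_1, d_j = b_j - a_j d_{j-1}^{-1} c_{j-1}.
    a[j], b[j], c[j] are n x n blocks (a[0] and c[N-1] are unused)."""
    N = len(b)
    d = [None] * N
    g = [None] * N                 # forward-substituted right-hand side
    d[0], g[0] = b[0], v[0]
    for j in range(1, N):
        t = np.linalg.solve(d[j-1].T, a[j].T).T   # t = ell_j = a_j d_{j-1}^{-1}
        d[j] = b[j] - t @ c[j-1]
        g[j] = v[j] - t @ g[j-1]
    x = [None] * N                 # back substitution: d_j x_j = g_j - c_j x_{j+1}
    x[-1] = np.linalg.solve(d[-1], g[-1])
    for j in range(N - 2, -1, -1):
        x[j] = np.linalg.solve(d[j], g[j] - c[j] @ x[j+1])
    return x

# check against a dense solve on a small diagonally dominant example
rng = np.random.default_rng(2)
N, n = 5, 3
b = [np.diag(5.0 * np.ones(n)) + 0.3 * rng.standard_normal((n, n)) for _ in range(N)]
a = [0.3 * rng.standard_normal((n, n)) for _ in range(N)]
c = [0.3 * rng.standard_normal((n, n)) for _ in range(N)]
v = [rng.standard_normal(n) for _ in range(N)]

A = np.zeros((N * n, N * n))
for j in range(N):
    A[j*n:(j+1)*n, j*n:(j+1)*n] = b[j]
    if j > 0:
        A[j*n:(j+1)*n, (j-1)*n:j*n] = a[j]
    if j < N - 1:
        A[j*n:(j+1)*n, (j+1)*n:(j+2)*n] = c[j]

x = np.concatenate(block_tridiag_solve(a, b, c, v))
assert np.allclose(A @ x, np.concatenate(v))
print("residual norm:", np.abs(A @ x - np.concatenate(v)).max())
```

For this diagonally dominant example the pivotal blocks d_j are safely nonsingular and the computed solution agrees with a dense solve to rounding error.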
the differential Dirichlet general faster Briefly, If B[A] and II B'[A][I_<IiI _hen and of the more implicitness m.. ] on this norms, 14 assumption. However, this will not always onal dominance nation, be done. is the right but we have deal with in order found If J = I- assumption convenient related to diagonal II(I- ILI)-IIuI II< i, and an H-matrix diagonally shows that A is an H-matrix is strictly recent results on H-matrices. methods eral and will 2.B. Models oriented needed only consider to read parallel the references Parallel sider elimi- general to exists lower IIJIl< I, a G-matrix Ostrowski a nonsingular Varga JR3] generalizes if [02] diagonal matrix [V4] summarizes these definitions G- and block H-matrices convergence (upper)trian- using of the classical vectoriterative we will not be completely gen- results. 3-9 only a very is necessary. time comparison details A more detailed of methods information may loose conception be found description in Section for the initial in Appendix A, of a vector- 10. part [H3], is This of the [$3], and given below. computers the vector and capable if some representative sufficient text, while additional block Computation computer contains diag- are G- and H-matrices. if p(IJl) < I. for simplicity Sections for the execution subsection block dominance dominant. Robert and proves Again, of Parallel In order diagonally dominant, for Ax = v. that block and often more dominant iff there E such that AE norms, concise analyzing L (U) is strictly then A is strictly ial and matricial reasonably IIB[A]II < I. diag(A)-IA = L + U, where diagonally stated to use when gular, to block the discussion It is sometimes it more the assumption Other concepts to keep fall into several computers, of performing namely efficient those general operating floating point classes, of which in a synchronous vector we con- manner operations. 
15 Examples are the array processor elements [B6], [BI2]; and the pipeline [Wl], which machines can handle and there is much machine the Cray-l, processors have become many of important interest eight features 64-word (essentially) tools vector able of operation [C8]; length. These computations, good use of special vary widely, a general registers ASC scientific to make processing Instruments arbitrary for large to make 64 parallel [C2] and Texas in algorithms Details common with IV, with CDC STAR-100 vectors capabilities. ficiently llliac but there are suf- comparison of algorithms possible. An important chronous parallel Carnegie-Mellon processors. Baudet for linear computer not being University C.mmp [B7] discusses and nonlinear systems considered [W4], with the theory here is the asyn- up to 16 minicomputer of asynchronous of equations, plus results iterations of tests on C .mmp. Instructions scalar will operations be described simply of a vector and a special set of vector shortly. an ordered the details for our model set of storage of storage. locations valid locations in arithmetic sion to have for use increment data manipulation We shall in vector in a vector progression. = I, which order consider locations, A machine computer consist operations, some of which a conceptual with vector no particular is a more instruction, usually to organize forces programs conceptual to be regard specialized of the progres- to perform vectors to set of consisting The CDC STAR restricts often of the usual extra into machine vectors. We denote the vector vector operations sum w = x+y is by a parenthesized list of indices. Thus 16 wi x.l + yi' (I _ i _ N) N are in R , the componentwise if the vectors w.l = xi X Yi' and a matrix multiplication product would be (i _ i _ N), C = AB would be N cij = _k=l Execution aik bkj , (I < i _ N; time for a vector and the internal mechanism written in the form • computer, 1/T tion startup op cost. 
N + _ operations in parallel vector case operation times are given for the CDC STAR-100 are given << top operation add, subtract op cost in terms is the vector either time 13 instruc- k parallel be t op for k t FN/k_ + overhead. op operations; instruction scalar execution times of the 40 nanosecond cycle << O'op. scalar be On a pipeline time would for vector Simplified below, is N. (one with instruction is adequate as top. length the basic of the vector but can usually rate and _ computer registers) the TN + _ model 'top result on the length computer, if the vector and the vector In either Usually op On a k-parallel or k-word depends of the vector is the asymptotic op processors time. operation 1 _ j _ N). vector time I _N + 71 multiply 17 N + 159 divide 46 2N + 167 square root 74 2N + 155 transmit II 1 _N + 91 maximum - 6N + 95 summation - 6 °5N + 122 inner product - 6N + 130 17 The use of long vectors computer, the vector in order to take full advantage operations. severe penalty to design is essential The rather for operations algorithms which tions can also be a source sists of vector sideration, operations. especially tors. These issues rithms in Section when of increased large vector on short vectors, keep startup costs of inefficiency there use of a vector result startup to a minimum. most is often are strong restrictions are illustrated is based costs and one must even when Data manipulation and others I0, which for effective rates in create a be careful Scalar of the code con- an essential con- on machine vec- in our comparison on information opera- of algo- for the CDC STAR-100. 18 3. LINEAR METHODS The simplest approach to direct system Ax = v is block elimination, ization of A. We discuss general partitioned matrix under such as pivoting of the pivotal special blocks, techniques tant odd-even error properties, elimination of a block effectively in terms Various systems. 
3. LINEAR METHODS

The simplest approach to direct solution of a block tridiagonal linear system Ax = v is block elimination, which effectively computes an LU factorization of A. We discuss block elimination on a general partitioned matrix under the assumption ||B[A]|| < 1; the process is shown to be linear and stable. Various properties of block elimination on a block tridiagonal matrix are studied, such as bounds on the condition numbers of the pivotal blocks, error propagation in the back substitution, storage requirements, and special techniques for special systems. Gauss-Jordan elimination is also considered, and its numerical properties foreshadow those of the more important odd-even elimination and reduction algorithms.

3.A. The LU Factorization

3.A.1. Block Elimination

The block tridiagonal LU factorization A = (a_j, b_j, c_j)_N = (l_j, I, 0)_N (0, d_j, c_j)_N is computed by

    d_1 = b_1,
    d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},   j = 2,...,N,
    l_j = a_j d_{j-1}^{-1},                 j = 2,...,N.

For the moment we are not concerned with the particular methods used to compute l_j or d_{j-1}^{-1} c_{j-1}. The existence of d_{j-1}^{-1} is assumed, and will either be demonstrated or avoided when ||B[A]|| < 1.

Let A_1 = A and consider the matrix A_2 which is obtained after the first block elimination step. We have A_2 = (a'_j, b'_j, c_j)_N, where b'_1 = d_1 = b_1, a'_2 = 0, b'_2 = d_2, and a'_j = a_j, b'_j = b_j for 3 <= j <= N. Let B_1 = B[A_1] and B_2 = B[A_2]; note that B_1 and B_2 differ only in the second block row, more specifically in the (2,1) and (2,3) positions.

Theorem 3.1. If ||B_1|| < 1 then ||B_2|| <= ||B_1||.

Proof. Let T = B_1 E_1 and let S = T B_1, so that ||S|| <= ||B_1|| ||E_1|| ||B_1|| = ||B_1||^2 < 1. We have d_2 = b_2 (I - b_2^{-1} a_2 b_1^{-1} c_1) and ||b_2^{-1} a_2 b_1^{-1} c_1|| <= ||S|| < 1, so d_2^{-1} exists and B_2 is well-defined. It may be verified that B_1 = (I - S) B_2 + T, with |B_1| = |(I - S) B_2| + |T| by construction, and rho_j(S) <= rho_j(T) ||B_1|| < rho_j(T). Thus Lemma 2.2 applies, yielding ||B_2|| <= ||B_1||. QED

Now, for any partitioned matrix the process of block elimination (no block pivoting) can be described by the N-1 major stages

    A_1 = A
    for i = 1,...,N-1
        A_{i+1} = (I - L[A_i] E_i D[A_i]^{-1}) A_i = (I + L[C[A_i]] E_i) A_i.

We transform A_i into A_{i+1} by eliminating all the blocks of column i below the diagonal; A_N is upper block triangular, and when A is block tridiagonal it is the matrix (0, d_j, c_j)_N. The same effect can be had by eliminating the blocks individually:

    A_1 = A
    for i = 1,...,N-1
        A_{i+1}^{(i)} = A_i
        for j = i+1,...,N
            A_{i+1}^{(j)} = (I - E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1}) A_{i+1}^{(j-1)}
        A_{i+1} = A_{i+1}^{(N)}.

Each minor stage affects only block row j of the matrix. It is seen that

    block row j of A_{i+1}^{(j-1)} = block row j of A_i,
    block row i of A_{i+1}^{(j-1)} = block row i of A_i,

from which it follows that E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1} = E_j A_i E_i D[A_i]^{-1}, so the minor stages indeed generate the same sequence of matrices A_i.

Let B_i = B[A_i].

Corollary 3.1. If A is block tridiagonal and ||B_1|| < 1 then ||B_{i+1}|| <= ||B_i||, 1 <= i <= N-1.

The proof of Corollary 3.1 requires a bit more care than the proof of Theorem 3.1 would seem to indicate; this difficulty will be resolved in Theorem 3.2. We should also note the trivial "error relations"

    ||A_{i+1} - A_N|| <= ||A_i - A_N||,   ||B_{i+1} - B_N|| <= ||B_i - B_N||.

These follow from the fact that, for block tridiagonal A,

    A_i = sum_{j=1}^{i} E_j A_N + sum_{j=i+1}^{N} E_j A_i,

which holds regardless of the value of ||B_1||.

Wilkinson [W3] has given a similar result for Gaussian elimination applied to matrices that are strictly diagonally dominant by columns, namely ||A_{i+1} (reduced)||_1 <= ||A_i (reduced)||_1, and Brenner [B14] anticipated not just the result but the characterization of elimination as a projection method. Our next two theorems lead to the same conclusions for block elimination: they generalize Corollary 3.1 to arbitrary partitioned matrices, and prove that block elimination is stable when ||B[A]|| < 1 by bounding the growth of elements.
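Corollary 3.1 can be checked numerically. The sketch below (pure Python, scalar blocks n_j = 1, function names ours) runs the stages of elimination on a strictly diagonally dominant tridiagonal matrix and verifies that the row-sum norm of B[A_i] = I - D[A_i]^{-1} A_i never increases.

```python
def b_norm(A):
    """||B[A]||_inf with B[A] = I - D[A]^{-1} A, scalar blocks (n_j = 1)."""
    n = len(A)
    return max(sum(abs(A[i][j]) for j in range(n) if j != i) / abs(A[i][i])
               for i in range(n))

def elimination_stages(A):
    """Return ||B[A_i]|| for the successive stages A_1, A_2, ..., A_N."""
    n = len(A)
    A = [row[:] for row in A]          # work on a copy
    norms = [b_norm(A)]
    for i in range(n - 1):             # major stage i eliminates column i
        for j in range(i + 1, n):
            if A[j][i] != 0.0:
                m = A[j][i] / A[i][i]  # scalar multiplier l_j
                for k in range(n):
                    A[j][k] -= m * A[i][k]
        norms.append(b_norm(A))
    return norms

# Example: strictly diagonally dominant tridiagonal, so ||B[A]|| = 0.5 < 1.
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
norms = elimination_stages(A)
```

The sequence of norms is non-increasing, in agreement with Corollary 3.1.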
Theorem 3.2. If ||B_1|| < 1 then ||B_{i+1}|| <= ||B_i|| and ||A_{i+1}|| <= ||A_i|| for 1 <= i <= N-1.

Proof. Suppose ||B_i|| < 1. We have

    A_{i+1} = (I - L[A_i] E_i D[A_i]^{-1}) A_i = A_i - L[A_i] E_i + L[A_i] E_i B_i,

so

    rho_j(A_{i+1}) <= rho_j(A_i - L[A_i] E_i) + rho_j(L[A_i] E_i) ||B_i||.

The projections are onto spaces of matrices with zeros in appropriate positions, so that

    rho_j(A_i - L[A_i] E_i) + rho_j(L[A_i] E_i) = rho_j(A_i),

and ||A_{i+1}|| <= ||A_i||.

Now let Q = D[A_i]^{-1} L[A_i] E_i D[A_i]^{-1} A_i, so that A_{i+1} = A_i - D[A_i] Q, and let S = D[Q]. Since Q = (-L[B_i] E_i)(I - E_i + E_i B_i) = -L[B_i] E_i + L[B_i] E_i B_i, we have S = D[L[B_i] E_i B_i] and ||S|| <= ||L[B_i]|| ||E_i|| ||B_i|| <= ||B_i||^2 < 1. Thus D[A_{i+1}] = D[A_i](I - S) is invertible and B_{i+1} is well-defined.

It may be verified that (I - S) B_{i+1} = B_i - S + Q. Let T = B_i - S + Q and J = S + T = B_i + Q. Since D[B_i] = D[Q - S] = 0 we have D[T] = 0 and |J| = |S| + |T|, so by Lemma 2.1, ||J|| < 1 implies ||B_{i+1}|| <= ||J||. But, after some algebraic manipulation,

    J = B_i + Q = sum_{k=1}^{i} E_k B_i + sum_{k=i+1}^{N} E_k B_i (I - E_i + E_i B_i),

and it follows from this expression that ||J|| <= ||B_i||. QED

Although Theorem 3.2 is expressed in terms of the N-1 major stages of block elimination, it is clear that we also have ||A_i^{(j)}|| <= ||A_i^{(j-1)}|| and ||B_i^{(j)}|| <= ||B_i^{(j-1)}|| for B_i^{(j)} = B[A_i^{(j)}]. Moreover, such inequalities will hold as long as blocks below the diagonal are eliminated, regardless of the order of elimination.

We shall adopt the name "linear method" for any process which generates a sequence of matrices M_i such that ||B[M_{i+1}]|| <= ||B[M_i]||, given that ||B[M_1]|| < 1. Thus block elimination is a linear method.

A result that is dual to Theorem 3.2 is the following.

Definition. Let A*_i be the reduced matrix consisting of the last N-i+1 block rows and columns of A_i (A*_1 = A_1 = A).

Theorem 3.3. If ||C[A]||_1 < 1 then ||C[A*_{i+1}]||_1 <= ||C[A*_i]||_1 and ||A*_{i+1}||_1 <= ||A*_i||_1.

Remark. It is not true in general that ||C_{i+1}||_1 <= ||C_i||_1 if ||C_i||_1 < 1, where C_i = C[A_i].

Proof. Let the blocks of C[A*_i] be c_{mp}, i <= m, p <= N, and let K = (c_{mp} + c_{mi} c_{ip})_{i+1 <= m, p <= N}. Using Lemma 2.4, ||C[A*_{i+1}]||_1 <= ||K||_1 if ||K||_1 < 1. But

    gamma_q(K) <= gamma_q(sum_{k=i+1}^{N} E_k C[A*_i]) + gamma_q(E_i C[A*_i]) ||C[A*_i]||_1
               < gamma_q(sum_{k=i+1}^{N} E_k C[A*_i]) + gamma_q(E_i C[A*_i]) = gamma_q(C[A*_i]).

Thus ||K||_1 <= ||C[A*_i]||_1 < 1, and this implies ||C[A*_{i+1}]||_1 <= ||C[A*_i]||_1. Now let the blocks of A*_i be alpha_{mp}, i <= m, p <= N. Then A*_{i+1} = (alpha_{mp} + c_{mi} alpha_{ip})_{i+1 <= m, p <= N}, and by construction

    gamma_q(A*_{i+1}) <= gamma_q(sum_{k=i+1}^{N} E_k A*_i) + gamma_q(E_i A*_i) ||C[A*_i]||_1 <= gamma_q(A*_i),

so ||A*_{i+1}||_1 <= ||A*_i||_1. QED

We note that the statements of Theorems 3.2 and 3.3 are quite similar, but their proofs depend heavily on the particular norms used. The matrices B and C are always related by B[A]^T = C[A^T] and, with D = D[A], C[A] = D B[A] D^{-1}. When A is symmetric positive definite the eigenvalues of the reduced matrices A*_{i+1} and A*_i interlace, so that ||A*_{i+1}||_2 <= ||A*_i||_2; Cline [C3] has shown the corresponding result for the Cholesky factorization, and the proof here is nearly identical. In other norms it can be quite difficult to determine, for example, whether ||B[A*_2]||_1 <= ||B[A*_1]||_1.
3.A.2. Further Properties of Block Elimination

It is well-known that the block factorization A = L Delta U is given by

    Delta = D[A_N],   L = I - sum_{j=1}^{N} L[C_j] E_j,   U = I - sum_{j=1}^{N} E_j U[B_j].

We have shown in Theorems 3.2 and 3.3 that if ||B_1|| < 1 then ||U|| <= 1 + N n ||B_1||, and that if ||C_1||_1 < 1 then ||L||_1 <= 1 + N n ||C_1||_1, where n = max_j n_j. For block tridiagonal matrices the factor N can be removed, so that ||L||_1 <= 1 + n ||C_1||_1 and ||U|| <= 1 + n ||B_1||. Moreover, each ||L[C_j] E_j||_1 <= 1, and the factorization can be completed without pivoting (cf. [B11]).

The purpose of bounding the factors is related to a number of interesting properties of the factorization that are invariant under permuting rows, and to the effect of pivoting in Gaussian elimination. Broyden [B15] contains some more comments on pivoting and condition for arbitrary matrices; in particular, it is shown there that partial pivoting can actually force the bound on the condition of L to be quite large in the general case. We will have some more comments on pivoting after looking at the effect of block elimination on diagonal dominance.

Corollary 3.2. Block elimination preserves strict diagonal dominance.

Proof. It follows from Theorem 3.2, simply by partitioning A into 1 x 1 blocks, that Gaussian elimination preserves strict diagonal dominance. Now suppose A_i is strictly diagonally dominant, and generate A_{i+1} by one stage of block elimination. This can be done either using exact block arithmetic or by n_i stages of Gaussian elimination, with the same result for both processes [H8, p. 130]. Thus if A_i is strictly diagonally dominant, then so is A_{i+1}. QED

It is shown in [B4] that Gaussian elimination maps G-matrices into G-matrices. This can be extended in an obvious fashion to the case of block elimination. We also have

Corollary 3.3. Block elimination maps H-matrices into H-matrices.

Proof. Recall that A is an H-matrix if and only if there exists a nonsingular diagonal matrix E such that A' = AE is strictly diagonally dominant. Consider the sequence

    A'_1 = A',   A'_{i+1} = (I - L[A'_i] E_i D[A'_i]^{-1}) A'_i.

Clearly A'_1 = A_1 E; suppose that A'_i = A_i E is strictly diagonally dominant. We then have

    A'_{i+1} = (I - L[A_i] E E_i E^{-1} D[A_i]^{-1}) A_i E = (I - L[A_i] E_i D[A_i]^{-1}) A_i E = A_{i+1} E,

so A'_{i+1} is strictly diagonally dominant by Corollary 3.2, and A_{i+1} is an H-matrix. QED

We have chosen to give this proof of Corollary 3.3 because it allows us to estimate properties of B[A] when A is an H-matrix. Define another norm by ||M||_E = ||E^{-1} M E||. Then A' = AE implies B[A'] = E^{-1} B[A] E, and it follows that if block elimination is performed then ||B[A_{i+1}]||_E <= ||B[A_i]||_E. Since the matrix E will generally remain unknown, this is in the nature of an existential proof.

The proof of Corollary 3.2 brings out an important interpretation of our results: block elimination and Gaussian elimination generate the same reduced matrices, so Theorem 3.2 gives direct information about ordinary Gaussian elimination after each group of n_i steps. While we are not assured that pivoting will not be necessary within these n_i steps, we do know that pivot selection can be limited to elements of the diagonal block d_i. When A has been partitioned to deal with storage hierarchies this knowledge will be important (see Section 10.A).

One more consequence of Theorem 3.2 is

Corollary 3.4. In the block tridiagonal LU factorization, if ||B_1|| < 1 then cond(d_j) <= 2 cond(b_j) / (1 - ||B_1||^2).

Proof. Letting eps_j = b_j^{-1} a_j d_{j-1}^{-1} c_{j-1}, we have d_j = b_j (I - eps_j), and ||eps_j|| <= ||B_1|| ||B_N|| <= ||B_1||^2 < 1 since b_j^{-1} a_j is a block of B_1 and d_{j-1}^{-1} c_{j-1} is a block of B_N. Thus ||d_j|| <= ||b_j|| (1 + ||eps_j||) < 2 ||b_j|| and ||d_j^{-1}|| <= ||b_j^{-1}|| / (1 - ||B_1||^2), which combine to give the result. QED
It is important to have bounds on cond(d_j) since systems involving the individual d_j must be solved during the factorization. Of course, a posteriori bounds on ||d_j^{-1}|| may be computed at a moderate cost. We note also that it is entirely possible for the blocks b_j and d_j to be better-conditioned than A. The assumptions of Theorem 3.2 are the best possible since, for example, the first block rows of B_1 and B_2 are identical. However, it is possible to derive stronger results on cond(d_j) under assumptions different from the condition ||B_1|| < 1.

Theorem 3.4. [H4] For A = (a_j, b_j, c_j), let lambda(A) = max_j (4 ||b_j^{-1} a_j|| ||b_{j-1}^{-1} c_{j-1}||) and e_j = b_j^{-1} d_j. If lambda(A) <= 1 then

    ||I - e_j|| <= (1 - sqrt(1 - lambda)) / 2,   ||e_j^{-1}|| <= 2 / (1 + sqrt(1 - lambda)).

Some useful consequences of this theorem include

    ||d_j|| = ||b_j e_j|| <= ||b_j|| (1 + ||I - e_j||) <= (3/2) ||b_j||,
    ||d_j^{-1}|| <= ||e_j^{-1}|| ||b_j^{-1}|| <= 2 ||b_j^{-1}||,

so that cond(d_j) <= 3 cond(b_j); more precisely, cond(d_j) <= (1 + 2 sigma(A)) cond(b_j), where sigma = (1 - sqrt(1 - lambda)) / (1 + sqrt(1 - lambda)). We also have B_N = (0, 0, -d_j^{-1} c_j)_N = (0, 0, -e_j^{-1} b_j^{-1} c_j)_N, which gives estimates of ||B_N|| and a smaller upper bound on the condition of the block LU factorization: ||B_N|| <= 2 ||U[B_1]|| / (1 + sqrt(1 - lambda)). As an example, if A = (1, 3, 1) then Theorem 3.2 predicts only ||B_N|| <= ||B_1|| = 2/3, while Theorem 3.4 yields ||B_N|| <= 2 (1/3) / (1 + sqrt(1 - 4/9)) ~ 0.382.

Theorem 3.4 is designed for use in the analysis of an iteration to compute useful approximations to the d_j's, that is, to the block diagonal matrix (0, d_j, 0)_N. Such methods are potentially good for a parallel computer, and the matrices arising from time-dependent problems frequently have the initial approximations required; for details, see [H4] and Section 8.E. The theorem has also been used previously in the treatment of the stability of the block LU factorization [H4].

Two results on the block LU factorization of block tridiagonal matrices were obtained by Varah.

Theorem 3.5. [V1] If A is block diagonally dominant, ||b_j^{-1}|| (||a_j|| + ||c_j||) <= 1, with ||c_j|| != 0, then the factorization exists with

    ||l_j|| <= ||a_j|| / ||c_{j-1}||,   ||d_j|| <= ||b_j|| + ||a_j||,   ||d_j^{-1}|| <= 1 / ||c_j||.

In comparison to Theorem 3.2, the bound ||d_j^{-1} c_j|| <= 1, which follows from ||d_j^{-1}|| <= 1 / ||c_j||, corresponds to the bound ||B_N|| <= 1, and the bound on ||d_j|| corresponds to ||A_N|| <= ||A||. In contrast, under the condition ||B_1|| < 1 Theorem 3.2 gives ||d_j^{-1}|| <= ||b_j^{-1}|| / (1 - ||B_1||^2), and we can only estimate ||l_j c_{j-1}|| <= ||a_j|| ||d_{j-1}^{-1} c_{j-1}|| <= ||a_j|| in the weak form. Varah's second result is

Theorem 3.6. [V1] Suppose the a_j and c_j are nonzero, and let omega_j = (||b_j^{-1} a_j|| ||b_{j-1}^{-1} c_{j-1}||)^{1/2}, 2 <= j <= N. Then the block LU factorization of A is numerically stable (i.e., there exists a constant K(A) such that max(||l_j||, ||d_j||) <= K(A)) if (omega_j, 1, omega_{j+1}) is positive semidefinite.

We note that (omega_j, 1, omega_{j+1}) is a symmetrization of (||b_j^{-1} a_j||, 1, ||b_j^{-1} c_j||), and that lambda(A) <= 1 implies that it is positive definite. The actual bounds on ||l_j||, ||d_j|| given by Varah for this case will not be important for this discussion.

Since many block tridiagonal systems Ax = v, A = (a_j, b_j, c_j), arise from the discretization of differential equations, a number of results closely related to the original continuous problem have been obtained under assumptions on the particular system under consideration. Rosser [R7] shows that if A is a difference approximation satisfying a certain maximum principle, as with Poisson's equation, then A is nonsingular, the LU factorization exists (each d_j is nonsingular), and B_N is non-negative with ||B_N|| <= 1. Improved bounds exist for strongly related matrices; for the nine-point approximation the spectral condition number of each d_j is less than the remarkable 17/3 [C1]. Such results depend strongly on the connection to the original continuous problem; for examples see [B18], [D2].

3.A.3. Back Substitution
So far we have only considered the elimination phase of the solution of Ax = v. For block tridiagonal matrices A = (a_j, b_j, c_j)_N the elimination phase consists of the process

    d_1 = b_1,   f_1 = v_1,
    d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},   f_j = v_j - a_j d_{j-1}^{-1} f_{j-1},   j = 2,...,N,

and the back substitution phase consists of the process

    x_N = d_N^{-1} f_N,
    x_j = d_j^{-1} (f_j - c_j x_{j+1}),   j = N-1,...,1.

In matrix terms these are the processes

    A_1 = A,   f^{(1)} = v,
    A_{i+1} = (I + L[C_i] E_i) A_i,   f^{(i+1)} = (I + L[C_i] E_i) f^{(i)},   i = 1,...,N-1,

and

    g = x^{(N)} = D[A_N]^{-1} f^{(N)},
    x^{(i)} = g + E_i B_N x^{(i+1)},   i = N-1,...,1,   x = x^{(1)}.

An alternate form of back substitution is

    x^{(N)} = D[A_N]^{-1} f^{(N)},
    x^{(i)} = (I + B_N E_{i+1}) x^{(i+1)},   i = N-1,...,1,   x = x^{(1)}.

In either case it is seen that B_N plays an essential role in the back substitution process, and the matrix maintains its importance in studying roundoff errors. For those situations where a band elimination method might be preferred for Ax = v, we have A = (a'_j, m_j, 0)(0, u_j, c'_j), where m_j (u_j) is lower (upper) triangular, d_j = m_j u_j, a'_j = a_j u_{j-1}^{-1}, c'_j = m_j^{-1} c_j. The elimination phase generates f'_j = m_j^{-1} f_j, and the back substitution computes

    x_N = u_N^{-1} f'_N = d_N^{-1} f_N,
    x_j = u_j^{-1} (f'_j - c'_j x_{j+1}) = d_j^{-1} (f_j - c_j x_{j+1}).

Except for the specific form of the computed quantities, the band and block back substitutions must behave identically.
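For scalar blocks (n_j = 1) the two phases above reduce to the familiar tridiagonal solver. The following sketch (Python, function names ours) assumes the d_j are safely invertible, e.g. ||B[A]|| < 1.

```python
def solve_tridiagonal(a, b, c, v):
    """Solve Ax = v for A = (a_j, b_j, c_j)_N with scalar blocks.

    a[0] and c[N-1] are unused, matching the block tridiagonal convention.
    """
    N = len(b)
    d = [0.0] * N
    f = [0.0] * N
    d[0], f[0] = b[0], v[0]
    for j in range(1, N):              # elimination phase
        m = a[j] / d[j - 1]            # l_j = a_j d_{j-1}^{-1}
        d[j] = b[j] - m * c[j - 1]     # d_j = b_j - a_j d_{j-1}^{-1} c_{j-1}
        f[j] = v[j] - m * f[j - 1]
    x = [0.0] * N
    x[N - 1] = f[N - 1] / d[N - 1]
    for j in range(N - 2, -1, -1):     # back substitution phase
        x[j] = (f[j] - c[j] * x[j + 1]) / d[j]
    return x
```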
Let us briefly consider the effect of rounding errors in back substitution. Suppose the forward elimination yields d_j + delta_j^{(1)} and f_j + phi_j; the computed solution will be y_j = x_j + xi_j, and we require bounds on xi_j. Let psi_j represent the errors introduced in forming and solving for y_j in

    (d_j + delta_j^{(1)} + delta_j^{(2)}) y_j = f_j + phi_j - c_j y_{j+1} + psi_j,

where delta_j^{(2)} accounts for the errors of ordinary Gaussian elimination on d_j. This implies

    y_j = d_j^{-1} (f_j - c_j y_{j+1}) + e_j,   e_j = d_j^{-1} (phi_j + psi_j - (delta_j^{(1)} + delta_j^{(2)}) y_j),

so that xi_j = e_j - d_j^{-1} c_j xi_{j+1}. It follows that ||xi_j|| <= ||B_N|| ||xi_{j+1}|| + ||e_j||, and with a uniform upper bound eps for the ||e_j|| we obtain ||xi_j|| <= (N - j + 1) eps if ||B_N|| <= 1, and

    ||xi_j|| <= (1 - ||B_N||^{N-j+1}) eps / (1 - ||B_N||) < eps / (1 - ||B_N||)

if ||B_N|| < 1. Thus errors introduced in one step tend to be damped out in later computations.

3.A.4. Special Techniques

In some cases the matrix A = (a_j, b_j, c_j) possesses special properties that may be important to preserve in order to satisfy a strong stability criterion. For example, we might have b_j = t_j - a_j - c_j where ||t_j|| is small or zero; this requires that all blocks be n x n. Scalar tridiagonal systems with this property arise from the discretization of second order ODE boundary value problems. Babuska [B1-3] has discussed a combined LU - UL factorization for such a matrix which allows a very effective backward error analysis, and a stable numerical solution of a difficult singular perturbation problem posed by Dorr [D3]. (See [H7] for a symbolic approach to this problem.)

There are two essential features to Babuska's method. The first is to express b_j and d_j indirectly by starting with t_j and computing the sequence s_1 = t_1, s_j = t_j - a_j d_{j-1}^{-1} s_{j-1}; by this we have d_j = s_j - c_j. The second feature is to augment the forward elimination with a backward elimination and to dispense with the back substitution by combining the results.
Specifically, suppose A = (-p_j, t_j + p_j + q_j, -q_j)_N with p_j, q_j, t_j >= 0, p_1 = q_N = 0, and A nonsingular. The algorithm for Ax = v is

    s_1 = t_1,   f_1 = v_1,
    s_j = t_j + p_j s_{j-1} (s_{j-1} + q_{j-1})^{-1},
    f_j = v_j + p_j (s_{j-1} + q_{j-1})^{-1} f_{j-1},   j = 2,...,N,

    s*_N = t_N,   f*_N = v_N,
    s*_j = t_j + q_j s*_{j+1} (s*_{j+1} + p_{j+1})^{-1},
    f*_j = v_j + q_j (s*_{j+1} + p_{j+1})^{-1} f*_{j+1},   j = N-1,...,1,

    x_j = (s_j + s*_j - t_j)^{-1} (f_j + f*_j - v_j),   j = 1,...,N.

In matrix terms, we factor A = A_L + A_D + A_U = (I + L)(D + A_U) = (I + U*)(D* + A_L), and solve (I + L) f = v and (I + U*) f* = v. Adding (D + A_U) x = f and (D* + A_L) x = f* and subtracting Ax = v, we obtain (D + D* - A_D) x = f + f* - v.

Babuska's algorithm has some unusual round-off error properties. It handles dependent error distributions in the data (cf. [M3]) by taking account of confluence, and thus fits into Kahan's notion that "conserving confluence curbs ill-condition" [K1]; underflow is a problem, though correctable by scaling. To see how the method fits in with our results, we note that, for n = 1, since p_j, q_j, t_j >= 0 we have s_1 >= 0 and s_j = t_j + p_j d_{j-1}^{-1} s_{j-1} >= t_j. Thus the row sums can only increase during the factorization. On the other hand, they will not increase much, since s_j = t_j + p_j s_{j-1} (s_{j-1} + q_{j-1})^{-1} <= t_j + p_j. These are analogues of the results in Theorem 3.2.
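The combined LU-UL algorithm can be sketched as follows for n = 1 (Python, function names ours). The forward and backward sweeps are independent, and x_j is then recovered pointwise without a substitution phase.

```python
def babuska_solve(p, q, t, v):
    """Solve Ax = v for A = (-p_j, t_j + p_j + q_j, -q_j)_N, scalar case.

    Assumes p[0] = q[N-1] = 0 and p, q, t >= 0 with A nonsingular.
    """
    N = len(t)
    s = [0.0] * N; f = [0.0] * N
    s[0], f[0] = t[0], v[0]
    for j in range(1, N):                      # forward (LU) sweep
        r = s[j - 1] + q[j - 1]                # r = d_{j-1}
        s[j] = t[j] + p[j] * s[j - 1] / r
        f[j] = v[j] + p[j] * f[j - 1] / r
    ss = [0.0] * N; ff = [0.0] * N
    ss[N - 1], ff[N - 1] = t[N - 1], v[N - 1]
    for j in range(N - 2, -1, -1):             # backward (UL) sweep
        r = ss[j + 1] + p[j + 1]
        ss[j] = t[j] + q[j] * ss[j + 1] / r
        ff[j] = v[j] + q[j] * ff[j + 1] / r
    # x_j = (s_j + s*_j - t_j)^{-1} (f_j + f*_j - v_j)
    return [(f[j] + ff[j] - v[j]) / (s[j] + ss[j] - t[j]) for j in range(N)]
```

The right-hand side below was formed from the known solution x = (1, 2, 3), which the combination formula recovers directly.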
3.B. Gauss-Jordan Elimination

An algorithm closely related to block elimination is the block Gauss-Jordan elimination

    A~_1 = A,   A~_{i+1} = (I + C[A~_i] E_i) A~_i,   i = 1,...,N.

This avoids back substitution in the solution of Ax = v since A~_{N+1} is block diagonal. The rest of the Gauss-Jordan process is

    f~^{(1)} = v,   f~^{(i+1)} = (I + C[A~_i] E_i) f~^{(i)},   i = 1,...,N,   x = A~_{N+1}^{-1} f~^{(N+1)}.

Theorem 3.7. Let B~_i = B[A~_i]. If ||B~_1|| < 1 then ||A~_{i+1}|| <= ||A~_i|| and ||B~_{i+1}|| <= ||B~_i|| for i = 1,...,N.

Sketch of Proof. The proof parallels that of Theorem 3.2. Suppose ||B~_i|| < 1, and let F_i = L[A~_i] + U[A~_i], so that

    A~_{i+1} = (I - F_i E_i D[A~_i]^{-1}) A~_i = A~_i - F_i E_i + F_i E_i B~_i.

This will show the first part. Now define Q = D[A~_i]^{-1} F_i E_i D[A~_i]^{-1} A~_i, so that A~_{i+1} = A~_i - D[A~_i] Q, and let S = D[Q]. Since Q = (-B~_i E_i)(I - E_i + E_i B~_i) = -B~_i E_i + B~_i E_i B~_i, we have S = D[B~_i E_i B~_i], ||S|| <= ||B~_i||^2 < 1, D[A~_{i+1}] = D[A~_i](I - S), and (I - S) B~_{i+1} = B~_i - S + Q. Let T = B~_i - S + Q and J = S + T = B~_i + Q. Note that D[T] = 0, so |J| = |S| + |T| and Lemma 2.1 applies if ||J|| < 1. But J = B~_i (I - E_i + E_i B~_i), so ||J|| <= ||B~_i||, and from Lemma 2.1 we conclude ||B~_{i+1}|| <= ||J||. QED

When applied to block tridiagonal matrices the Gauss-Jordan algorithm requires many more arithmetic operations than the block LU factorization, and cannot be recommended on this basis. However, it has some numerical properties that foreshadow our analysis of the more important odd-even reduction algorithm.

The Gauss-Jordan elimination on block tridiagonals may be expressed in the following way:

    d_1 = b_1
    for j = 1,...,N-1
        u_j = -d_j^{-1} c_j
        d_{j+1} = b_{j+1} + a_{j+1} u_j
        for k = 1,...,j-1
            e_{k,j+1} = e_{k,j} u_j
        e_{j,j+1} = c_j

In this form the matrix A~_j is

            [ d_1                e_{1,j}                           ]
            [      d_2           e_{2,j}                           ]
            [           .           .                              ]
    A~_j =  [          d_{j-1}   e_{j-1,j}                         ]
            [                    d_j       c_j                     ]
            [                    a_{j+1}   b_{j+1}   c_{j+1}       ]
            [                                 .          .       . ]

with A~_{N+1} = diag(d_1, d_2, ..., d_N) = D[A_N]. We observe that, for 1 <= j < j + k <= N,

    -d_j^{-1} e_{j,j+k} = u_j u_{j+1} ... u_{j+k-1},

which is the (j, j+k) block of B_N. Thus we are able to refine Theorem 3.7 by giving closer bounds on the first i-1 block rows of B~_i.

Corollary 3.5. Suppose ||B~_1|| < 1. If (point) row l is contained in block row j with 1 <= j <= i-1, then rho_l(B~_i) <= ||d_j^{-1} e_{j,i}|| <= ||B_N||^{i-j}. If (point) row l is contained in block row j with i <= j <= N, then rho_l(B~_i) = rho_l(B_i) <= ||B_i||.

This is our first result on the convergence to zero of portions of a matrix during a reduction to block diagonal form. It also provides a rather different situation from Gauss-Jordan elimination with partial pivoting applied to an arbitrary matrix [P1]. In that case the multipliers above the main diagonal can be quite large, while we have seen for block tridiagonal matrices with ||B_1|| < 1 that the multipliers actually decrease in size.
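The fill-in formula -d_j^{-1} e_{j,j+k} = u_j ... u_{j+k-1} can be verified directly in the scalar case; the sketch below (Python, names ours, 0-based indices) runs the recurrence above and checks the products.

```python
def gauss_jordan_fill(a, b, c):
    """Run the scalar Gauss-Jordan recurrence; return d, u and the fill e.

    e[(k, j)] is the fill entry in row k, column j (k < j); a[0] and
    c[N-1] are unused, matching the tridiagonal convention.
    """
    N = len(b)
    d = [0.0] * N
    u = [0.0] * N
    e = {}
    d[0] = b[0]
    for j in range(N - 1):
        u[j] = -c[j] / d[j]                  # u_j = -d_j^{-1} c_j
        d[j + 1] = b[j + 1] + a[j + 1] * u[j]
        for k in range(j):                   # old fill is pushed one column right
            e[(k, j + 1)] = e[(k, j)] * u[j]
        e[(j, j + 1)] = c[j]
    return d, u, e
```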
3.C. Parallel Computation

We close this section with some remarks on parallel computation; the applications of the methods and their inherent parallelism are considered in [H3]. Block elimination is expressed in terms of matrix operations, and when the blocks are large enough these are natural for a parallel or a pipeline processor; the solution of the systems d_j u_j = -c_j, in particular, is naturally expressed in terms of operations on rows of an augmented matrix, executed in parallel. On the other hand, the vector lengths might not be long enough to overcome the startup costs of a pipeline processor, and then we often cannot make efficient use of vector operations; our comparison of algorithms for the CDC STAR-100 (Section 10.C) will therefore note when operations must be executed in scalar mode. We also note that the final stage of Babuska's LU-UL method, combining the results of the two eliminations, is entirely parallel.

When the blocks of A are originally sparse, as with systems derived from elliptic equations, the storage problems created by fill-in are well-known: the blocks of the factorization generally must be represented as full matrices, and this is often enough to disqualify the block elimination methods. However, the block LU factorization can be useful when the blocks are small enough, or when a stable implicit representation of the blocks is available; band LU factorizations applied directly to Ax = v can be used instead.

4. QUADRATIC METHODS
In Section 3 we have outlined some properties of the block LU factorization and block Gauss-Jordan elimination. These methods are characterized by the generation of a sequence of matrices M_i such that ||B[M_{i+1}]|| <= ||B[M_i]||, and we adopted the name "linear methods". Now we consider some quadratic methods, which are characterized by the inequality ||B[M_{i+1}]|| <= ||B[M_i]^2||.

There are two basic quadratic methods, odd-even elimination and odd-even reduction. The latter is a variation of the former, usually goes under the name cyclic reduction, and can be regarded as a compact version of odd-even elimination. We will show that odd-even elimination is a variation of the quadratic Newton iteration for A^{-1}, while odd-even reduction is a special case of the more general cyclic reduction technique of Hageman and Varga [H1] and is equivalent to block elimination on a permuted system. Both methods are ideally suited for use on a parallel computer, as many of the quantities involved may be computed independently of the others [H3]. Semidirect methods based on the quadratic convergence properties are discussed in Section 7.

4.A. Odd-Even Elimination

We first consider odd-even elimination. Pick three consecutive equations from the system Ax = v, A = (a_j, b_j, c_j):

    a_{k-1} x_{k-2} + b_{k-1} x_{k-1} + c_{k-1} x_k     = v_{k-1}
    a_k     x_{k-1} + b_k     x_k     + c_k     x_{k+1} = v_k            (4.1)
    a_{k+1} x_k     + b_{k+1} x_{k+1} + c_{k+1} x_{k+2} = v_{k+1}
This but others commutativity By collecting H2x = v The name then the new equation and left the even achieve two equations non-zero set of operations tions k-2, k and k+2 of H2 to produce have diagonals, may again pentadiagonal preserved system the fact but they now are farther be applied, a new equation combining involving equa- the unknowns (3) Xk_ 4, xk and Xk+ 4. These The transformations pressed in the following new equations from one system way. form the system H3x = v . to the next may be succinctly Let H I = A, v (I) = v, m = ex- Flog2 N_, and define Hi+ I = (I + C[Hi])Hi, v (i.l) = (I + C[Hi])v(i), Block diagrams distance of the H sequence between the block -I v(m+l) bY x = Mm+ I for N = 15 are shown the three non-zero diagonal matrix i = 1,2 ,... ,m. diagonals Hm+ I is obtained; doubles in Figure at each the solution 4. la. The stage until x is then obtained 39 Hi H2 H4 Figure 4.1a. Figure 4.lb. H3 H5 H., i=l,...,m+l; N = 15 1 H., i=l,...,m+l; N = 16 l 40 We note that the same elimination of the periodic A tridiagonal Cl a2 b2 c2 a3 b3 to n_trices al • c3 • • • • aN_ I • • bN_ I CN_ I N It is only necessary may be applied form 1 ._. principle aN to include bN row operations that "wrap around" the matrix, such as row(N) An H sequence for N = 16 is diagrammed the same general complications In fact, Theorem as before, i then row(l)• 4.1b, and is constructed are serious and interesting for any partitioned when IIB[Hi+I]II< IIC[Hi+I]II I < IIC[H i ]2 the sequence separate in the proof• we have Hi+ I = D[Hi](l- are independent Since H i = D[Hi](I JlHi+llJ :< IIHilJ. IIHi+lll I _ IIHillI ' - D[Hi]- iHi ) = Hi(l + B[Hi]) . These assumptions A, and [J B[A][ I< I. JlB[Hi]211 and II I and matrix may be continued. We have H I = A, Hi+ I = (I + C[Hi])H i = (21 - HiD[Hi] IIC[Hi]ll I < I. by of 2. lIB[Hi]l[ is quadratic 2H i - HiD[Hi]-IH i = Hi(21 be kept there under which If lJB[A]II < I then < in Figure though conditions the sequence 4•1. 
In this way we can define an H sequence for the periodic tridiagonal matrix. An H sequence for N = 16 is diagrammed in Figure 4.1b, and is constructed as before; the same general rule may be continued if N is not a power of 2. In fact, an H sequence may be defined for any partitioned matrix, and the conditions under which it is serious and interesting are given by

Theorem 4.1. If ||B[A]|| < 1 then ||B[H_{i+1}]|| <= ||B[H_i]^2|| and ||H_{i+1}|| <= ||H_i||; if ||C[A]||_1 < 1 then ||C[H_{i+1}]||_1 <= ||C[H_i]^2||_1 and ||H_{i+1}||_1 <= ||H_i||_1. These assumptions are independent of one another, and will be kept separate in the proof.

Proof. Clearly, we have H_1 = A and

    H_{i+1} = (I + C[H_i]) H_i = (2I - H_i D[H_i]^{-1}) H_i = 2 H_i - H_i D[H_i]^{-1} H_i = H_i (I + B[H_i]).

Suppose ||B[H_i]|| < 1 or ||C[H_i]||_1 < 1. Since H_i = D[H_i](I - B[H_i]) = (I - C[H_i]) D[H_i], we have

    H_{i+1} = D[H_i](I - B[H_i]^2) = (I - C[H_i]^2) D[H_i].
In IIgCH2]ll2 _ IIg[H I 4.1 are that the matrix decrease clusion that Indeed, consider decreases dominance be relatively is not reached ]2 112 the matrix though small considered in size, and that the if blocks The relative that the diagonal than the off-diagonals. they will A = (-1,2+¢,-1), be large for which if IIB[A]II is.very initially. of Hi, and shows larger is that elements, to the diagonal is quadratic, may be quite the diagonal will always do not increase in size relative The rate of decrease quantifies elements for arbitrary IIB[A]II2 than individually, to I the actual measure that to find counterexamples IIBIN 2]II2 < I. The conclusions as a whole it is not difficult One con- on an absolute the non-zero scale. diagonals of H. are I 2 el = ¢' ek+l = 4e k + ek . l-i For small ¢ the off-diagonal 2-i elements about 2 periodic rescaling As already . in magnitude. of the main hinted [$2]. will While diagonal in the above IIB[Hi+l]II may be made when diagonals elements dealing Let A = (a,b,a), be about 2 and the main not a serious to I would remarks, with so that a more tridiagonal decrease perhaps refined diagonal in size, be desirable. estimate matrices with of constant 43 I / / b-a2/b 0 0 -a b-2a2/b -a2/b 0 0 • H2 = 0 ". • ". 21al < • ". H2 does not have but most which of the rows determine reduction constant of H 2 are essentially This property = IIB[H I]211 IIB[HI]211 . Note that any of the other H matrices, identical, will and it is these be important rows for the odd-even algorithms• When A = (aj,bj,cj) = (aj ,0, 0 , , j _(i) = max. _!i) . J J _(i+l) nor will • b2 -a2/b / 0 Ibl, IIB[H2]II = 21a2/bl/Ib-2a2/bl diagonals, IIB[H2]II" ". 0 b-2a2/b i 2 II< IIB[H2]II< IIB[H I ]2 II), and thus _IIB[HI] /(2- • ". 0 -a 2/b -a2/b that when -a 2 • .. , Hi -a 2 b-2a2/b • It follows 2 _ _(i)2. 
is block 0 An 0 , , Without diagonally , , j proving irreducibility _j dominant, = it here, II( b we ) If( we can show condition plus can write II+ l-j , that if _(I) < I then 8 (1) _ I yields the same conc lus ion. We also note that if A is symmetric then so is Hi, i e I, and thus Jl B[Hi]II = IJ C[Hi]IJ I" Theorem 4.2. If A is block tridiagonal and positive definite then so is H., l i _ i, and p__(B[Hi]) _ < I. Proof• First we observe be permuted into block tridiagonal Since Di = D[Hi ] is also matrix Ei such that each H.l is a 2-cyclic positive 2 that Di = E_.l Let form. definite, i = Suppose there matrix IV3] since that H i is positive exists a positive it can definite. definite E-IH l iEi -I ' so Hi+l = E _ IHi+IE-il= 2-i _ - "H.2 i 44 Hi and Since of Hi+l are positive H'l is a positive Theorem Since B 1• = definite definite Corollary H.l and 2-cyclic 3.6 and I D- IH E" i^ l i = 1 B.E I i' - iff I of matrix, Theorem it Hi+ I are 4.3 follows we of that of 0 < 0 < X < 2 and which is Any Newton H..l convergence If observed sequence entirely unexpected Kung [K2] for Kung and Traub The Newton certain sense, We seen step might that it still P(Bi)< 2 I ' B i= I also. Let I - X be aim on ZY) of = the diagonal as = < I, and so we eigenvalue (2I- YZ)Y, and y(i+l) of H should have = 2H.-H2_ l i i+l (I- 9_(y(i),z) matrix and applied to idea D[Hi]. Changing destroy quadratic The by the cf. such aiming goal Nor Yun is not is it [YI] and computations. proceeds in a towards tridiagonal it. matrix iteration the of ' then because, a block general matrix techniques. always convergence, IIB[H i]ll" two. effective Z -I any H sequence in algebraic of is is the zY(i))2[H8]. the appear, unification to mind = _(y(i),z) the method calls y(0) = between Newton's this 4.1 ]-I) , so while matrix. 
sequence Theorem (I - ZY (i+l)) relation elimination in the is an II" II_, method incorporates to such _(Hi,D[Hi y(i+l) matrix X2 P(Bi) QED and a powerful a block thought holds exists Newton's odd-even I since 2X- norm is a close iteration diagonal But = Y(2I- Hi+ I = give takes I-X < result occurrences [K3] that < some it that other yields be if that 91(Hi,D[Hi ]-I) block i for there it < Corollary definite. _<Y,Z) to Z -I a Newton the < I. quadratic converges Hi+ I = X2 positive y(i) eventually -I therefore III - ZY (0) II < have Then 2X- with be from that p (Bi) definite. ^ iteration. It may know IV3] ^ an eigenvalue positive it-h step the but iteration we have towards at seen each i" 45 4.B. Variants of Odd-Even There are a number cussed originally diagonals in terms is prone form of the Poisson Buneman Hockney's (-I,X,-I)(sj) = (tj), where that the reduction, (-I,X (i),-l), cision. Then Stone parallel solution case. ed Stone's Jordan result solely application. [Jl], to general when analysis the block generates X >> 2. when a variation tridiagonal tridiagonals apply to the disby of the form of Toeplitz diagonal of cyclic systems, proved case, tridiagonal and could pre- be equation. quadratic for the conver- for the constant the odd-even block obser- the machine reduction of our own work and system It was Hockney's of the Poisson to general tridiagonal systems I/_ (i) fell below thus of the B matrices independently to the block applied tridiagonal a sequence to the solution given here k-I has been given of tridiagonal system was essentially I/X (i) and equation below). be stopped damage of general The results restricted could [$2], in recommending gence of the ratios diagonal which block to obtain and a stabilization X e 2 and often the tridiagonal solved as such without major equation, dis- = (bvk - avk_ I - aVk.l). instability into a collection matrices rithm. 
For constant block diagonals A = (a,b,a) with ab = ba, Hockney [H5] multiplies equation k by b, equations k-1 and k+1 by -a, and adds, obtaining the odd-even elimination equation

    (-a^2)x_{k-2} + (b^2 - 2a^2)x_k + (-a^2)x_{k+2} = bv_k - av_{k-1} - av_{k+1}.

This is the form in which the variations are usually derived for the discrete Poisson equation (see [B20] and the discussion below).

We now show that the quadratic convergence results for odd-even elimination carry over to odd-even reduction; we will see later how these results hold for other algorithms. In order to subsume several formulations into one, consider the sequences

    H_1 = A,  v^(1) = v,
    H_{i+1} = F_i(2D[H_i] - H_i)G_i H_i,
    v^(i+1) = F_i(2D[H_i] - H_i)G_i v^(i),

where F_i, G_i are nonsingular block diagonal matrices. The original H_i sequence corresponds to the choices F_i = I, G_i = D[H_i]^{-1}.

In the Hockney-Buneman algorithm for Poisson's equation with Dirichlet boundary conditions on a rectangle we have A = (a,b,a) with ab = ba, a^T = a, b^T = b, and

    F_i = diag(b^(i)),  G_i = D[H_i]^{-1},
    b^(1) = b,  b^(i+1) = (b^(i))^2 - 2(a^(i))^2,
    a^(1) = a,  a^(i+1) = -(a^(i))^2.

Sweet [S11] also uses these choices to develop an odd-even reduction algorithm. We again note that the H_i matrices lose the constant diagonal property for i ≥ 2, but symmetry is preserved. Each block of H_i may be represented as a bivariate rational function of a and b, and the scaling by F_i ensures that the blocks of the scaled rows are polynomials in a and b: the diagonal blocks of these rows will be b^(i+1), the non-zero off-diagonal blocks will be a^(i+1), and all of these blocks commute with each other.
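Hockney's combined equation can be checked numerically in the scalar case. A small NumPy sketch (illustrative; the values a = -1, b = 4 and N = 7 are hypothetical) verifies the identity at one interior index:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, N = -1.0, 4.0, 7
x = rng.standard_normal(N)
A = b * np.eye(N) + a * np.eye(N, k=-1) + a * np.eye(N, k=1)
v = A @ x                      # right-hand side consistent with x

k = 3                          # 0-based interior index; k-2, k+2 both valid
# b*(eq k) - a*(eq k-1) - a*(eq k+1) eliminates x_{k-1} and x_{k+1}:
lhs = -a * a * x[k - 2] + (b * b - 2 * a * a) * x[k] - a * a * x[k + 2]
rhs = b * v[k] - a * v[k - 1] - a * v[k + 1]
```

The two sides agree, confirming that the eliminated equation couples only x_{k-2}, x_k, x_{k+2}.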
Swarztrauber [S6] describes a Buneman-style algorithm for separable elliptic equations on a rectangle. This leads to A = (a_j, b_j, c_j) with all blocks n × n, a_j = α_j I_n, b_j = T + β_j I_n, c_j = γ_j I_n, so that the blocks of H_i may be expressed as polynomials in the single matrix T. In this case Swarztrauber essentially uses

    F_i = diag(LCM(b^(i)_{j-2^{i-1}}, b^(i)_{j+2^{i-1}})),  G_i = D[H_i]^{-1},

where LCM denotes the least common multiple of the matrices considered as polynomials in T. Stone [S2] considers similar choices for scalar (n = 1) tridiagonal systems.

In all cases the scaled sequences are H̃_i = F_{i-1}···F_1 H_i and ṽ^(i) = F_{i-1}···F_1 v^(i); since premultiplication by a nonsingular block diagonal matrix leaves the B matrices invariant (Section 2.A), B[H̃_i] = B[H_i]. The Buneman modifications represent the right-hand side sequence in a different form in order to assure numerical stability. Defining D_i = D[H̃_i] and Q_i = D_i - H̃_i, write ṽ^(i) = D_i p^(i) + q^(i), where

    p^(1) = 0,  q^(1) = v,
    p^(i+1) = p^(i) + D_i^{-1}(Q_i p^(i) + q^(i)),
    q^(i+1) = ṽ^(i+1) - D_{i+1} p^(i+1);

in Buneman's algorithm q^(i+1) is in fact computed without forming ṽ^(i+1), cf. [B20].

Theorem 4.3. Let P_1 = I and P_i = B[H̃_{i-1}]···B[H̃_1] for i ≥ 2. Then p^(i) = (I - P_i)x and q^(i) = (D_i P_i - Q_i)x.

Remark. Although they do not express the formulae in this form, Buzbee et al. [B20] prove them for the special case of odd-even reduction applied to Poisson's equation.

Proof. To begin the induction, p^(1) = 0 = (I - P_1)x and q^(1) = v = H̃_1 x = (D_1 P_1 - Q_1)x. Suppose p^(i) = (I - P_i)x and q^(i) = (D_i P_i - Q_i)x. Then, using D_i^{-1}Q_i = B[H̃_i],

    p^(i+1) = (I - P_i)x + D_i^{-1}(Q_i(I - P_i)x + (D_i P_i - Q_i)x)
            = (I - P_i)x + (P_i - D_i^{-1}Q_i P_i)x
            = (I - B[H̃_i]P_i)x = (I - P_{i+1})x,

and

    q^(i+1) = ṽ^(i+1) - D_{i+1}p^(i+1) = H̃_{i+1}x - D_{i+1}(I - P_{i+1})x
            = (D_{i+1} - Q_{i+1})x - D_{i+1}x + D_{i+1}P_{i+1}x
            = (D_{i+1}P_{i+1} - Q_{i+1})x.  QED

We note that ||D_i P_i - Q_i|| ≤ ||D_i|| ||P_i|| + ||Q_i||, so the q^(i) remain bounded, and that ||x - p^(i)|| = ||P_i x|| ≤ ||P_i|| ||x||, where ||P_i|| ≤ ||B[H_{i-1}]||···||B[H_1]||. Thus when ||B[A]|| < 1, p^(i) will itself be an increasingly good approximation to x; this observation is the basis of a semidirect method, and Buzbee [B17] has taken advantage of this fact for Poisson's equation.

4.C. Odd-Even Reduction
4.C.1. The Basic Algorithm

Returning to the linear system Ax = v, the groundwork for the derivation of odd-even reduction has now been laid. Take G_i = D[H_i]^{-1} in (4.2) and collect the even-numbered equations of (4.2); this yields the reduced system A^(2)x^(2) = w^(2), where

    A^(2) = (-a_{2j} b_{2j-1}^{-1} a_{2j-1},
             b_{2j} - a_{2j} b_{2j-1}^{-1} c_{2j-1} - c_{2j} b_{2j+1}^{-1} a_{2j+1},
             -c_{2j} b_{2j+1}^{-1} c_{2j+1})_{N_2},

    w^(2) = (v_{2j} - a_{2j} b_{2j-1}^{-1} v_{2j-1} - c_{2j} b_{2j+1}^{-1} v_{2j+1})_{N_2},

    x^(2) = (x_{2j})_{N_2},  N_2 = ⌊N/2⌋.

This is the reduction step, and it is clearly a reduction, since A^(2) is half the size of A^(1) = A. Once A^(2)x^(2) = w^(2) has been solved for x^(2) we can compute the remaining components of x^(1) = x by the back substitution

    x_{2j-1} = b_{2j-1}^{-1}(v_{2j-1} - a_{2j-1} x_{2j-2} - c_{2j-1} x_{2j}).

Since A^(2) is block tridiagonal, the reduction may be applied recursively to obtain a sequence of systems A^(i)x^(i) = w^(i), where A^(1) = A, w^(1) = v,

    A^(i) = (a^(i)_j, b^(i)_j, c^(i)_j)_{N_i},  N_1 = N,  N_{i+1} = ⌊N_i/2⌋,  x^(i)_j = x_{j·2^{i-1}}

(note the change in notation). Typically the recursion stops with A^(m)x^(m) = w^(m), m = ⌊log_2 N⌋ + 1, since A^(m) consists of a single block. By way of important simplifications, algorithms for A = (a,b,a) have usually been restricted to N = 2^m - 1, in which case N_i = 2^{m+1-i} - 1; this special case effectively avoids certain scalings of the diagonal blocks, mentioned earlier, which are needed in the general case [B20]. However, restriction to this particular choice of N is not necessary. Nor is it necessary to continue the reduction until a single block remains: at any point the current system can be solved by some other, possibly more efficient, method and the back substitution begun there. This leads to various polyalgorithms and to the semidirect methods of Section 7.

Before proceeding, we note that each A^(i) consists of blocks taken from H_i; for example, it is clear from Figure 4.1a that A^(4) is a single block of H_4.
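The reduction formulas can be exercised directly in the scalar case. The following NumPy sketch (illustrative; the diagonally dominant test system is hypothetical) forms A^(2) and w^(2) for a tridiagonal system and checks that the even-numbered unknowns of the original solution satisfy the reduced system:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 9
b = 4.0 + rng.random(N)        # diagonal (diagonally dominant)
a = -rng.random(N)             # subdiagonal; a[0] is a dummy
c = -rng.random(N)             # superdiagonal; c[N-1] is a dummy
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x = rng.standard_normal(N)
v = A @ x

N2 = N // 2
a2 = np.zeros(N2); b2 = np.zeros(N2); c2 = np.zeros(N2); w2 = np.zeros(N2)
for j in range(N2):
    i = 2 * j + 1              # 0-based position of the (j+1)-th even unknown
    b2[j] = b[i] - a[i] * c[i - 1] / b[i - 1] - c[i] * a[i + 1] / b[i + 1]
    w2[j] = v[i] - a[i] * v[i - 1] / b[i - 1] - c[i] * v[i + 1] / b[i + 1]
    if j > 0:
        a2[j] = -a[i] * a[i - 1] / b[i - 1]
    if j < N2 - 1:
        c2[j] = -c[i] * c[i + 1] / b[i + 1]

A2 = np.diag(b2) + np.diag(a2[1:], -1) + np.diag(c2[:-1], 1)
```

By construction A2 @ x[1::2] equals w2, i.e., the reduced system is consistent with the original solution.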
Indeed, A^(4) is the (8,8) block of H_4, and it follows immediately that ||A^(i)|| ≤ ||H_i||, ||B[A^(i)]|| ≤ ||B[H_i]||, and ||C[A^(i)]||_1 ≤ ||C[H_i]||_1. As a corollary of Theorem 4.1 we actually have

Corollary 4.1. If ||B[A]|| < 1 then ||A^(i+1)|| ≤ ||A^(i)|| and ||B[A^(i+1)]|| ≤ ||B[A^(i)]^2||; if ||C[A]||_1 < 1 then ||A^(i+1)||_1 ≤ ||A^(i)||_1 and ||C[A^(i+1)]||_1 ≤ ||C[A^(i)]^2||_1.

Recalling the definition in Theorem 3.4, let λ(A) = max_j 4||b_j^{-1}a_j|| ||b_{j-1}^{-1}c_{j-1}||.

Theorem 4.4. If λ(A) ≤ 1 then, with λ^(i) = λ(A^(i)),

    λ^(i+1) ≤ (λ^(i)/(2 - λ^(i)))^2.

Proof. It will suffice to consider i = 1; write λ = λ^(1). We have

    b^(2)_j = b_{2j}(I - φ_{2j}),
    φ_{2j} = b_{2j}^{-1}a_{2j} b_{2j-1}^{-1}c_{2j-1} + b_{2j}^{-1}c_{2j} b_{2j+1}^{-1}a_{2j+1},
    a^(2)_j = -a_{2j} b_{2j-1}^{-1} a_{2j-1},  c^(2)_j = -c_{2j} b_{2j+1}^{-1} c_{2j+1},

    ||φ_{2j}|| ≤ ||b_{2j}^{-1}a_{2j}|| ||b_{2j-1}^{-1}c_{2j-1}|| + ||b_{2j}^{-1}c_{2j}|| ||b_{2j+1}^{-1}a_{2j+1}|| ≤ 2(λ/4) = λ/2,
    ||(I - φ_{2j})^{-1}|| ≤ 1/(1 - λ/2) = 2/(2 - λ),

and therefore

    4||(b^(2)_j)^{-1}a^(2)_j|| ||(b^(2)_{j-1})^{-1}c^(2)_{j-1}||
        ≤ 4||(I - φ_{2j})^{-1}|| ||(I - φ_{2j-2})^{-1}||
          (||b_{2j}^{-1}a_{2j}|| ||b_{2j-1}^{-1}c_{2j-1}||)(||b_{2j-1}^{-1}a_{2j-1}|| ||b_{2j-2}^{-1}c_{2j-2}||)
        ≤ 4(2/(2 - λ))^2 (λ/4)^2 = (λ/(2 - λ))^2.

Thus λ^(2) ≤ (λ^(1)/(2 - λ^(1)))^2. QED

This may be compared to earlier results for constant diagonal matrices, which give an equality of the form λ^(2) = (λ^(1))^2/(2 - (λ^(1))^2); see also Theorem 4.8 of [V3].

4.C.2. Relation to Block Elimination

By regarding odd-even reduction as a compact form of odd-even elimination we have been able to relate it to the Newton iteration. It is also possible to express the reduction algorithm as block elimination on a permuted system (PAP^T)(Px) = (Pv) (cf. [B11], [E3], [W2]). In Section 5 we show how this formulation also yields quadratic convergence theorems.
Specifically, let P be the permutation matrix which reorders the vector (1,2,...,N)^T so that the odd multiples of 2^0 come first, followed by the odd multiples of 2^1, the odd multiples of 2^2, etc. For N = 7, P(1,2,...,7)^T = (1,3,5,7,2,6,4)^T and

    PAP^T = | b1  .   .   .   c1  .   .  |
            | .   b3  .   .   a3  .   c3 |
            | .   .   b5  .   .   c5  a5 |
            | .   .   .   b7  .   a7  .  |
            | a2  c2  .   .   b2  .   .  |
            | .   .   a6  c6  .   b6  .  |
            | .   a4  c4  .   .   .   b4 |.

The block factorization PAP^T = LU has, for N = 7,

    L = | I                                                      |
        | .        I                                             |
        | .        .        I                                    |
        | .        .        .        I                           |
        | a2b1^-1  c2b3^-1  .        .        I                  |
        | .        .        a6b5^-1  c6b7^-1  .        I         |
        | .        a4b3^-1  c4b5^-1  .        â2       ĉ2      I |

with â2 = a^(2)_2 (b^(2)_1)^-1 and ĉ2 = c^(2)_2 (b^(2)_3)^-1, and

    U = | b1  .   .   .   c1      .       .       |
        | .   b3  .   .   a3      .       c3      |
        | .   .   b5  .   .       c5      a5      |
        | .   .   .   b7  .       a7      .       |
        | .   .   .   .   b^(2)_1 .       c^(2)_1 |
        | .   .   .   .   .       b^(2)_3 a^(2)_3 |
        | .   .   .   .   .       .       b^(3)_1 |.

If b^(i)_j = l^(i)_j u^(i)_j is an LU factorization of each pivot block, with l (u) lower (upper) triangular, then the ordinary LU factorization PAP^T = L'U' is obtained as L' = L diag(l^(i)_j) and U' = diag(l^(i)_j)^{-1} U; thus L' carries the l^(i)_j on its diagonal with entries such as a2 u1^{-1} and c2 u3^{-1} below, and U' carries the u^(i)_j on its diagonal with entries such as l1^{-1}c1 and l3^{-1}a3 above.

This factorization gives rise to a related method for solving Ax = v: solve L'z = Pv, then U'(Px) = z; it is easily seen that the components of z are the vectors (l^(i)_j)^{-1} w^(i)_j. We may now apply any of the theorems about Gaussian elimination to PAP^T and obtain results about odd-even reduction, though the conclusions obtained this way are weaker than those of Corollary 4.1; in particular Theorem 3.2 applies. It also follows from the corresponding factorization PAP^T = L̃L̃^T that A^(i) will be positive definite if A is, and Theorem 4.2 then allows the conclusion that ρ(B[A^(i)]) < 1.
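The "odd multiples first" ordering is easy to generate. A small Python helper (an illustrative sketch; the function name is hypothetical) reproduces the ordering used in the text:

```python
def oe_order(N):
    """1-based odd-even reduction ordering: odd multiples of 1,
    then odd multiples of 2, then of 4, and so on."""
    order, step = [], 1
    while step <= N:
        order.extend(range(step, N + 1, 2 * step))  # odd multiples of step
        step *= 2
    return order
```

For N = 7 this yields (1,3,5,7,2,6,4), matching the permutation P above.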
The odd-even reduction is also a special case of the more general cyclic reduction defined by Hageman and Varga [H1] for the iterative solution of large sparse systems. Since PAP^T is in the 2-cyclic form

    PAP^T = | D1   A12 |
            | A21  D2  |,   D1, D2 block diagonal,

we can premultiply by

    | D1^-1        0 |
    | -A21 D1^-1   I |

to obtain

    | I   D1^-1 A12          |
    | 0   D2 - A21 D1^-1 A12 |.

It is clear that A^(2) = D2 - A21 D1^-1 A12, and one consequence of this formulation is that ρ(B[A^(i+1)]) ≤ ρ(B[A^(i)])^2 < 1 [H1].

Norm bounds for PAP^T = LU, where L (U) is block unit lower (upper) triangular, follow from the construction of the factors. For the LU factorization of A in the natural ordering, Section 3.A.2 gives, when ||B[A]|| < 1 and ||C[A]||_1 < 1, the bounds ||L||_1 ≤ 1 + ||C[A]||_1 and ||U|| ≤ 1 + ||B[A]||, but only ||L|| ≤ 1 + n||C[A]||_1 and ||U||_1 ≤ 1 + n||B[A]||, where n = max_j n_j. For the permuted system the corresponding bounds grow only logarithmically: by the construction of L and U,

    ||L||   ≤ 1 + max_{1≤i≤m} ||C[A^(i)]||,
    ||L||_1 ≤ 1 + Σ_{i=1}^m ||C[A^(i)]||_1,
    ||U||_1 ≤ 1 + Σ_{i=1}^m ||B[A^(i)]||_1,
    ||U||   ≤ 1 + max_{1≤i≤m} ||B[A^(i)]||.

Since each C[A^(i)] is block tridiagonal, ||C[A^(i)]|| ≤ 2n||C[A^(i)]||_1, and similarly ||B[A^(i)]||_1 ≤ 2n||B[A^(i)]||. Thus if ||B[A]|| < 1 and ||C[A]||_1 < 1, the quadratic decrease of ||B[A^(i)]|| and ||C[A^(i)]||_1 keeps all four quantities bounded by modest constants, e.g. ||L|| ≤ 1 + 2n||C[A]||_1, ||U||_1 ≤ 1 + 2nm||B[A]||, ||U|| ≤ 1 + ||B[A]||.

Much as in the proof of Corollary 3.4 we also have, if ||B[A]|| < 1,

    b^(i+1)_j = b^(i)_{2j}(I - φ^(i)_{2j}),  ||φ^(i)_{2j}|| ≤ ||B[A^(i)]^2|| = θ_i,
    ||b^(i+1)_j|| ≤ ||b^(i)_{2j}||(1 + θ_i),
    ||(b^(i+1)_j)^{-1}|| ≤ ||(b^(i)_{2j})^{-1}||/(1 - θ_i),

and cond(b^(i+1)_j) ≤ cond(b^(i)_{2j})(1 + θ_i)/(1 - θ_i).
Since our interest is in bounding cond(b^(i+1)_j) in terms of the cond(b_k), we must simplify the above expression. Let K_i = Π_{j=1}^i (1 + θ_j)/(1 - θ_j), so that by induction cond(b^(i+1)_j) ≤ [max_k cond(b_k)] K_i. Using θ_{j+1} ≤ θ_j^2, we obtain

    K_i ≤ (1 + θ_1)^i/(1 - θ_1)^i.

This is a rather unattractive upper bound, since it may be quite large when θ_1 is close to 1. Actually the situation is not so bad: the θ_j decrease quadratically, so the partial products K_i converge, and K_i is large only when the first few θ_j are all close to 1. If λ(A) ≤ 1 then ||φ^(i)_{2j}|| ≤ λ^(i)/2, and

    cond(b^(i+1)_j) ≤ [max_k cond(b_k)] Π_{j=1}^i (2 + λ^(j))/(2 - λ^(j)),

which, owing to the quadratic decrease of the λ^(j), remains a modest multiple (roughly 3 when λ(A) is not too close to 1) of max_k cond(b_k). When A is positive definite it follows from various eigenvalue-interlacing theorems that cond(b^(i)_j) ≤ cond(A^(i)) ≤ cond(A).

4.C.3. Back Substitution

Now let us consider error propagation in the back substitution. Suppose A^(i+1)x^(i+1) = w^(i+1) defines x^(i+1), and let y^(i+1) = x^(i+1) + ê^(i+1) be the computed solution at level i+1. Suppose the back substitution is computed as

    y^(i)_{2j-1} = (b^(i)_{2j-1})^{-1}(w^(i)_{2j-1} - a^(i)_{2j-1} y^(i+1)_{j-1} - c^(i)_{2j-1} y^(i+1)_j) + e^(i)_{2j-1},
    y^(i)_{2j} = y^(i+1)_j,

where e^(i)_{2j-1} is the local error. Then the accumulated errors ê^(i) = y^(i) - x^(i) satisfy

    ê^(i)_{2j-1} = -(b^(i)_{2j-1})^{-1}(a^(i)_{2j-1} ê^(i+1)_{j-1} + c^(i)_{2j-1} ê^(i+1)_j) + e^(i)_{2j-1},
    ê^(i)_{2j} = ê^(i+1)_j.

Now define the vectors

    z_1 = (e^(i)_1, 0, e^(i)_3, 0, ...)^T,  z_2 = (0, ê^(i+1)_1, 0, ê^(i+1)_2, ...)^T.

We have ||z_2|| = ||ê^(i+1)|| by construction, and ê^(i) = B^(i)z_2 + z_1 + z_2, where B^(i) = B[A^(i)]. But in consequence of the construction we actually have

    ||ê^(i)|| ≤ max(||B^(i)z_2 + z_1||, ||z_2||).

Suppose ||B[A]|| < 1, so that ||B^(i)|| < 1 for all i, and suppose ||ê^(m)|| ≤ ε and ||e^(i)|| ≤ ε for i = 1,...,m-1. Then

    ||ê^(i)|| ≤ max(||B^(i)|| ||z_2|| + ||z_1||, ||z_2||)
            ≤ ||z_2|| + ||z_1|| = ||ê^(i+1)|| + ||e^(i)|| ≤ (m - i + 1)ε

by induction on i,
and ||ê^(1)|| ≤ mε, m = ⌊log_2 N⌋ + 1. This upper bound on the error in the computed solution is certainly satisfactory, and it is in fact somewhat pessimistic: error propagation from level to level is governed by ||B^(i)||, which decreases quadratically, so we can expect errors made in the back substitution to be damped out quickly.

4.C.4. Storage Requirements

On the matter of fill-in, we note that there is no block fill-in during odd-even reduction, owing to the specific ordering of the blocks, while it has already been seen that odd-even elimination does generate block fill-in; block elimination in the natural ordering, as discussed in Section 3.A, likewise generates none. Storage of A requires 3N - 2 blocks. The matrices A^(2),...,A^(m) together require fewer than 3N additional blocks, of which the off-diagonal blocks number Σ_i 2(N_i - 1) < 2N, so this is not an excessive increase. However, we have not considered growth of storage within the blocks, since internal fill-in is highly dependent on the specific structure of A, nor have we considered temporary storage for intermediate computations.

We do note, however, that some economies may be achieved by overwriting. For example, in the transition from A^(1) to A^(2), the first equation of A^(2)x^(2) = w^(2) is generated from the first three equations of A^(1)x^(1) = w^(1); the first and third of these must be saved for the back substitution, but there is no reason to save the second once A^(2)x^(2) = w^(2) is formed. Thus the first equation of A^(2)x^(2) = w^(2) may overwrite the second equation of A^(1)x^(1) = w^(1). Similar considerations hold throughout the reduction, though some modifications are needed on the CDC STAR-100 due to its restrictions on vector addressing (see Section 10.B).
The special scalings of odd-even reduction derived for separable elliptic PDE's allow an implicit representation of the blocks of A^(i) as polynomials or rational functions of a single matrix. Since the blocks of the original A are quite sparse, some such device is necessary in order to limit intermediate storage to a small multiple of the original storage. Unfortunately, there does not seem to be a similar procedure that can be applied to nonseparable PDE's or to general matrices; the only recourse is to store the blocks of A^(i) as dense matrices, and on current computing systems this will limit them to small sizes.

4.C.5. Special Techniques

Now suppose that A = (-p_j, t_j + p_j + q_j, -q_j), with p_j, q_j, t_j ≥ 0 and p_1 = q_N = 0; we would like to apply Babuska's tricks (Section 3.A.4) to the factorization induced by odd-even reduction. This only partially succeeds. First, we can represent

    A^(2) = (-p^(2)_j, t^(2)_j + p^(2)_j + q^(2)_j, -q^(2)_j),
    p^(2)_j = p_{2j} b_{2j-1}^{-1} p_{2j-1},
    t^(2)_j = t_{2j} + p_{2j} b_{2j-1}^{-1} t_{2j-1} + q_{2j} b_{2j+1}^{-1} t_{2j+1},
    q^(2)_j = q_{2j} b_{2j+1}^{-1} q_{2j+1},

where b_k = t_k + p_k + q_k; note that t_{2j} ≤ t^(2)_j ≤ b^(2)_j ≤ b_{2j}. On the other hand, the technique of avoiding the back substitution by combining the LU and UL factorizations will not work. Let A' = PAP^T and write

    A' = (I + L)(D_1 + U_1) = (I + U_2)(D_2 + L_2);

solve (I + L)f_1 = Pv and (I + U_2)f_2 = Pv, and add the systems (D_1 + U_1)(Px) = f_1 and (D_2 + L_2)(Px) = f_2 to obtain

    (D_1 + D_2 - D[A'])(Px) = f_1 + f_2 - (L_2 + D[A'] + U_1)(Px).

The only way to have A' = L_2 + D[A'] + U_1, so that the right-hand side becomes f_1 + f_2 - Pv and the remaining system is block diagonal, is for no fill-in to occur in either factorization, in which case L_2 = L[A'] and U_1 = U[A']. Fill-in does occur in odd-even reduction, so the method cannot be used in its entirety; it will still be advantageous, however, in developing the t^(i) sequences.

Schroder and Trottenberg [S1] discuss a "total reduction" method for matrices drawn from PDE's and finite differences on a rectangular grid; numerical stability is discussed in [S1a]. Total reduction is similar to odd-even reduction but is derived by working more closely with the PDE than with the matrix, and part of the difficulty in analyzing it is that the algorithm is more naturally expressed in terms of the difference equations than in matrix form. (See [S1] for some examples.) Convergence results comparable to ours are undoubtedly obtainable, but we have not yet determined the proper setting for their analysis. Schroder and Trottenberg also consider "partial reductions," which are just different enough from odd-even reduction to prevent direct application of our techniques.
4.D. Parallel Computation

Finally, the utility of the odd-even methods for parallel and pipeline computation has been demonstrated in several studies, and further developments are undoubtedly obtainable. Lambiotte and Voigt [L2], Madsen and Rodrigue [M1], and Stone [S2] discuss odd-even reduction for tridiagonal systems; it is shown in [M1] to be most useful on a particular pipeline computer, the CDC STAR-100, when N is large (> 256). Jordan [J1] considers a machine on which odd-even elimination is the best method for middle ranges of N (64 ≤ N ≤ 256), while for small values of N the sequential LU factorization is best, owing to the effect of lower order terms engendered by the pipeline mechanism. The basic method given here is a block analogue of the method examined in [L2].

The inherent parallelism of the odd-even methods is in the formation of the equations of A^(i+1)x^(i+1) = w^(i+1) or H_{i+1}x = v^(i+1), and in the back substitution for odd-even reduction. It is readily seen from (4.2) that H_2 and v^(2) may be computed from H_1 and v^(1) by

    compute LU factors of b_k,                    (1 ≤ k ≤ N)
    solve b_k[â_k ĉ_k v̂_k] = [a_k c_k v_k],       (1 ≤ k ≤ N)
    b^(2)_k ← b_k - a_k ĉ_{k-1} - c_k â_{k+1},    (1 ≤ k ≤ N)
    v^(2)_k ← v_k - a_k v̂_{k-1} - c_k v̂_{k+1},    (1 ≤ k ≤ N)
    a^(2)_k ← -a_k â_{k-1},                       (3 ≤ k ≤ N)
    c^(2)_k ← -c_k ĉ_{k+1},                       (1 ≤ k ≤ N - 2)

(Special end conditions are handled by storing zeros strategically or by shortening vector lengths slightly.)
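The elimination step H_2 = (2D[H] - H)D[H]^{-1}H can be checked in the scalar tridiagonal case. The following NumPy sketch (illustrative; the diagonally dominant test matrix is hypothetical) forms H_2 and verifies both the elimination of the first sub/superdiagonals and the quadratic decrease of the row-scaled off-diagonal part:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 15
H = np.diag(4.0 + rng.random(N)) \
    + np.diag(-rng.random(N - 1), -1) + np.diag(-rng.random(N - 1), 1)

def B(M):
    """Scalar analogue of B[M]: I - D[M]^{-1} M (rows scaled by diagonals)."""
    return np.eye(len(M)) - M / np.diag(M)[:, None]

D = np.diag(np.diag(H))
H2 = (2 * D - H) @ np.linalg.solve(D, H)   # one odd-even elimination step

n1 = np.linalg.norm(B(H), np.inf)
n2 = np.linalg.norm(B(H2), np.inf)
```

H2 couples only unknowns k-2, k, k+2 (the first off-diagonals vanish), and n2 ≤ n1**2, the scalar analogue of ||B[H_2]|| ≤ ||B[H_1]||^2.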
For odd-even reduction, A^(2) and w^(2) are computed by generating only the new equations for even k, and the back substitution is then rather straightforward once x^(2) is computed.

The algorithm for odd-even reduction, with all blocks n × n and N = 2^m - 1, N_i = 2^{m+1-i} - 1, is as follows:

    for i = 1,...,m-1
        compute LU factors of b^(i)_{2k+1},                              (0 ≤ k ≤ N_{i+1})
        solve b^(i)_{2k+1}[â_{2k+1} ĉ_{2k+1} ŵ_{2k+1}]
                  = [a^(i)_{2k+1} c^(i)_{2k+1} w^(i)_{2k+1}],            (0 ≤ k ≤ N_{i+1})
        b^(i+1)_k = b^(i)_{2k} - a^(i)_{2k} ĉ_{2k-1} - c^(i)_{2k} â_{2k+1},   (1 ≤ k ≤ N_{i+1})
        w^(i+1)_k = w^(i)_{2k} - a^(i)_{2k} ŵ_{2k-1} - c^(i)_{2k} ŵ_{2k+1},   (1 ≤ k ≤ N_{i+1})
        a^(i+1)_k = -a^(i)_{2k} â_{2k-1},                                (2 ≤ k ≤ N_{i+1})
        c^(i+1)_k = -c^(i)_{2k} ĉ_{2k+1},                                (1 ≤ k ≤ N_{i+1} - 1)
    solve b^(m)_1 x^(m)_1 = w^(m)_1
    for i = m-1,...,1
        solve b^(i)_{2k+1} x^(i)_{2k+1}
                  = w^(i)_{2k+1} - a^(i)_{2k+1} x^(i+1)_k - c^(i)_{2k+1} x^(i+1)_{k+1},   (0 ≤ k ≤ N_{i+1})

The i-th reduction and corresponding back substitution use roughly (12 2/3 n^3 + O(n^2))N_{i+1} arithmetic operations, and the total operation count for the algorithm is roughly (12 2/3 N - 12m)n^3 + O(n^2 N + n^3). This may be compared to roughly 4 1/3 n^3 N + O(n^2 N + n^3) operations for the solution of Ax = v by the block LU factorization.

For a crude estimate of execution time, take τN + σ, τ << σ, as the execution time for a vector operation on vectors of length N, and t, τ < t << σ, as the execution time for a scalar operation [H3]. The total execution time of odd-even reduction on a pipeline computer would then be roughly (12 2/3 n^3 + O(n^2))(τN + σm) + O(n^3 t), versus (4 1/3 n^3 N + O(n^2 N + n^3))t for the LU factorization done completely in scalar mode. A qualitative comparison shows that for a fixed value of n, odd-even reduction becomes increasingly more effective than the LU factorization as N increases, though the actual crossover points are implementation dependent. A critical observation is that often 50% of the total execution time of odd-even reduction is spent in the first reduction, from A^(1) = A to A^(2).
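The complete reduce-then-back-substitute cycle can be written compactly in the scalar case. The following NumPy sketch (an illustrative recursive implementation, not the vectorized formulation above; the function name is hypothetical) solves a tridiagonal system with N = 2^m - 1:

```python
import numpy as np

def oe_reduction_solve(a, b, c, v):
    """Odd-even (cyclic) reduction for a scalar tridiagonal system.
    Row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = v[i];
    a[0] and c[-1] are dummies.  Requires len(b) = 2**m - 1."""
    N = len(b)
    if N == 1:
        return np.array([v[0] / b[0]])
    al = a[1::2] / b[0:-1:2]          # multipliers against the row above
    cu = c[1::2] / b[2::2]            # multipliers against the row below
    b2 = b[1::2] - al * c[0:-1:2] - cu * a[2::2]
    v2 = v[1::2] - al * v[0:-1:2] - cu * v[2::2]
    a2 = -al * a[0:-1:2]              # a2[0] is a dummy, like a[0]
    c2 = -cu * c[2::2]                # c2[-1] is a dummy, like c[-1]
    x2 = oe_reduction_solve(a2, b2, c2, v2)
    x = np.empty(N)
    x[1::2] = x2                      # even-numbered (1-based) unknowns
    x[0] = (v[0] - c[0] * x[1]) / b[0]
    x[2:-1:2] = (v[2:-1:2] - a[2:-1:2] * x[1:-2:2]
                 - c[2:-1:2] * x[3::2]) / b[2:-1:2]
    x[-1] = (v[-1] - a[-1] * x[-2]) / b[-1]
    return x
```

On a diagonally dominant test system the computed solution agrees with the exact one to machine precision.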
A more precise timing analysis is given in Section 10.C and Appendix B. In the operation count, roughly 60% of the high-order term n^3 N comes from solving the 2n + 1 linear systems b^(i)_{2k+1}[â ĉ ŵ] = [a c w] in each reduction (the factorizations and multiple right-hand sides), and roughly 30% from the matrix multiplications; special efforts to economize should concentrate there.

Some practical limitations must also be considered. First, the algorithm is restricted to cases in which no block pivoting is required: if A is strictly diagonally dominant or positive definite then no pivoting is needed, but the assumption ||B[A]|| < 1 only guarantees that pivoting between block rows may be dispensed with, and within the factorizations of the individual pivot blocks b^(i)_j ordinary pivoting may still be required. Secondly, the use of a synchronous parallel computer creates special problems, since the pivot selections may differ between blocks, and carrying out simultaneous factorizations of the b^(i)_{2k+1} with differing pivot sequences causes numerous inefficiencies; this is not as serious when the blocks are small. Thirdly, vector operations must conform to the machine's addressing and storage allocation requirements [L1]: in the case of the Illiac IV the essential problem is data communication between processing elements, while on the CDC STAR it is the allocation of data to represent the diagonal block structures. Furthermore, in the interests of uniform vector lengths the algorithm requires that the blocks all be the same size; otherwise considerable effort would be needed to program the data manipulation for varying vector sizes.
4.E. Summary

We now summarize the results of this section. The odd-even elimination and reduction algorithms may be safely applied to the solution of certain block tridiagonal linear systems, including important classes of systems arising from PDE's. Various quantities involved in the solution decrease quadratically to zero, and this quadratic convergence may be explained by expressing the methods as Newton-like iterations which compute a block diagonal matrix. Variations of the basic methods have important applications, and parallel computation is a natural format for them, although machine practicalities place some restrictions on the type of system that can be solved efficiently.

5. UNIFICATION: ITERATION AND FILL-IN

Having presented the block LU factorization, block Gauss-Jordan elimination, and odd-even elimination individually, we now treat them in a unified manner as instances of a general iteration, and conclude that the odd-even methods are the only instances of this general iteration to display quadratic convergence. These results lead to an interpretation of quadratic convergence in terms of matrix fill-in, and they apply to arbitrary matrices satisfying ||B[A]|| < 1, not only to the block tridiagonal case.

Let us first collect the formulae for the three methods:

    LU:  A_{i+1} = (I - L[A_i]E_i D[A_i]^{-1})A_i,               A_1 = A,
    GJ:  A_{i+1} = (I - (L[A_i] + U[A_i])E_i D[A_i]^{-1})A_i,    A_1 = A,
    OE:  H_{i+1} = (I - (L[H_i] + U[H_i])D[H_i]^{-1})H_i,        H_1 = A.

Now consider the function F(X,Y,Z) = (2I - XY)Z, and observe that

    LU:  A_{i+1} = F(D[A_i] + L[A_i]E_i, D[A_i]^{-1}, A_i),
    GJ:  A_{i+1} = F(D[A_i] + (L[A_i] + U[A_i])E_i, D[A_i]^{-1}, A_i),
    OE:  H_{i+1} = F(H_i, D[H_i]^{-1}, H_i).

Thus it is necessary to analyze sequences based on repeated applications of F.
For fixed nonsingular Y we can define the residual matrices r(M) = I - MY and s(M) = I - YM, so that F(X,Y,Z) = (I + r(X))Z and

    r(F(X,Y,Z)) = r(Z) - r(X) + r(X)r(Z),
    s(F(X,Y,Z)) = s(Z) - s(X) + s(X)s(Z).

By taking X = Z and assuming ||r(Z_1)|| < 1 or ||s(Z_1)|| < 1, the stationary iteration Z_{i+1} = F(Z_i,Y,Z_i) converges quadratically to Y^{-1}; this is, of course, the Newton iteration, previously studied as a means of solving linear systems [H8]. The related operator G(X,Y,Z) = Z + X(I - YZ) = Z + Xs(Z) has been studied by Bauer [B8], who observes that r(G(X,Y,Z)) = r(X)r(Z) and s(G(X,Y,Z)) = s(X)s(Z); in particular the iteration Z_1 = X, Z_{i+1} = G(X,Y,Z_i) satisfies

    s(Z_{i+1}) = s(X)s(Z_i),  Z_i = (I - r(X)^i)Y^{-1} = Y^{-1}(I - s(X)^i),

along with Z_{i+k} = G(Z_k,Y,Z_i) and Z_{2i} = G(Z_i,Y,Z_i). The last formula also leads to the Newton iteration, since G(Z,Y,Z) = F(Z,Y,Z) and s(Z_{2i}) = s(Z_i)^2.

In this section we study the nonstationary iteration Z_1 = A, Z_{i+1} = F(X_i,Y_i,Z_i), where Y_i = D[Z_i]^{-1} and X_i consists of blocks taken from Z_i, always including the diagonal. The selection of blocks may change from step to step, but we shall assume that it is predetermined and is not altered adaptively. (The cost of such an adaptive procedure, even on a highly parallel computer, would be prohibitive.)

For the solution of a linear system Ax = v we take (Z_1,v_1) = (A,v) and (Z_{i+1},v_{i+1}) = (2I - X_iY_i)(Z_i,v_i), the sequence ending in or approaching (Z*,v*), where Z* is block diagonal; then x = (Z*)^{-1}v*. One necessary condition for consistency is that 2I - X_iY_i = I + r_i(X_i) be nonsingular, where r_i(M) = I - MY_i and s_i(M) = I - Y_iM; this holds if ρ(r_i(X_i)) < 1, and since r_i(X_i) and s_i(X_i) are similar, ρ(r_i(X_i)) = ρ(s_i(X_i)). In the context of an iteration converging to a block diagonal matrix, the condition ||B[A]|| < 1 (which implies ||s_1(X_1)|| ≤ ||s_1(Z_1)|| < 1) states that A is sufficiently close to a particular block diagonal matrix, namely D[A]. The notation has been changed slightly from earlier sections in order to provide greater generality.
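The residual identities for F and G are purely algebraic and can be verified numerically for arbitrary matrices. A short NumPy sketch (illustrative; the random matrices are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
X, Y, Z = (rng.standard_normal((n, n)) for _ in range(3))
I = np.eye(n)

def r(M):
    return I - M @ Y

def s(M):
    return I - Y @ M

F = (2 * I - X @ Y) @ Z          # F(X,Y,Z)
G = Z + X @ (I - Y @ Z)          # G(X,Y,Z)

# r(F) = r(Z) - r(X) + r(X) r(Z);  s(G) = s(X) s(Z)
lhs_F, rhs_F = r(F), r(Z) - r(X) + r(X) @ r(Z)
lhs_G, rhs_G = s(G), s(X) @ s(Z)
```

No invertibility assumptions are needed for the identities themselves; Y enters only through the definitions of r and s.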
We first generalize earlier results on linear methods (||s_i(Z_i)|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||), and then determine conditions under which we obtain a quadratic method (||s_i(Z_i)|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||^2).

Theorem 5.1. In the sequence Z_{i+1} = F(X_i,Y_i,Z_i), suppose that

    {(p,p) | 1 ≤ p ≤ N} ⊆ T_i ⊆ {(p,q) | 1 ≤ p,q ≤ N} = U,
    X_i = Σ_{T_i} E_p Z_i E_q,  Y_i = D[Z_i]^{-1},  s_i(M) = I - Y_i M.

Then ||s_1(Z_1)|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_{i+1})|| ≤ ||s_i(Z_i)||.

Proof. Suppose ||s_i(Z_i)|| < 1. Define Q by Z_{i+1} = Z_i - D[Z_i]Q, so that

    Q = -Y_i Z_i + Y_i X_i Y_i Z_i = -s_i(X_i) + s_i(X_i)s_i(Z_i),

and let S = D[Q]. Since X_i includes the diagonal of Z_i, D[s_i(X_i)] = 0 and hence S = D[s_i(X_i)s_i(Z_i)]. Moreover s_i(X_i) = Σ_{T_i} E_p s_i(Z_i) E_q, so ||s_i(X_i)|| ≤ ||s_i(Z_i)|| and ||S|| ≤ ||s_i(Z_i)||^2 < 1. Thus D[Z_{i+1}] = D[Z_i](I - S), Y_{i+1} = (I - S)^{-1}Y_i, and

(5.1)    s_i(Z_{i+1}) = (I - S)s_{i+1}(Z_{i+1}) + S.

Lemma 2.2 applies, so ||s_i(Z_{i+1})|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_{i+1})||. But

    s_i(Z_{i+1}) = s_i(Z_i) + Q = s_i(Z_i) - s_i(X_i) + s_i(X_i)s_i(Z_i)
                 = Σ_{T_i'} E_p s_i(Z_i) E_q + Σ_{T_i} E_p s_i(Z_i) E_q s_i(Z_i),   T_i' = U - T_i,

so

    ρ_j(s_i(Z_{i+1})) ≤ ρ_j(Σ_{T_i'} E_p s_i(Z_i) E_q) + ρ_j(Σ_{T_i} E_p s_i(Z_i) E_q)||s_i(Z_i)|| ≤ ρ_j(s_i(Z_i)).

Thus ||s_i(Z_{i+1})|| ≤ ||s_i(Z_i)|| < 1. QED

To see when the method is actually quadratic, we first establish that at least some blocks of s_{i+1}(Z_{i+1}) are bounded in norm by ||s_i(Z_i)||^2.

Corollary 5.1. With the hypotheses of Theorem 5.1,

    ||Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2.

Proof. From (5.1),

    (I - S)s_{i+1}(Z_{i+1}) + S = s_i(Z_i) - s_i(X_i) + s_i(X_i)s_i(Z_i),

together with S = D[s_i(X_i)s_i(Z_i)] and E_p(I - S) = (I - S)E_p, we obtain

    (I - S)Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q
        = -S + Σ_{T_i} E_p s_i(Z_i) E_q - Σ_{T_i} E_p s_i(X_i) E_q + Σ_{T_i} E_p s_i(X_i)s_i(Z_i) E_q.

But Σ_{T_i} E_p s_i(Z_i) E_q = s_i(X_i) = Σ_{T_i} E_p s_i(X_i) E_q, so

    (I - S)Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q = Σ_{T_i} E_p s_i(X_i)s_i(Z_i) E_q - S.

By Lemma 2.1,

    ||Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||Σ_{T_i} E_p s_i(X_i)s_i(Z_i) E_q|| ≤ ||s_i(X_i)s_i(Z_i)|| ≤ ||s_i(Z_i)||^2. QED
In certain cases Corollary 5.1 provides no useful information, since we could have Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q = 0; block elimination is one example where this occurs. However, it is seen that if we are to have ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||^2, then we must also have

(5.2)    ||Σ_{T_i'} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2.

This is certainly true if T_i' is the null set, in which case we are generating the H_i sequence, so we shall now assume that T_i' ≠ ∅ and show that (5.2) will not hold in general.

We set one final condition on T_i, for notational purposes, which turns out to have practical consequences: if (p,q) ∉ T_i, then E_p Z_i E_q can be zero only by accident, and not by design. Inclusion in X_i of all blocks known in advance to be zero clearly does not affect the sequence Z_{i+1} = F(X_i,Y_i,Z_i), so if it is known by design that E_p Z_i E_q = 0 we require (p,q) to be in T_i. For example, with this condition the Gauss-Jordan algorithm is written as

    A_{i+1} = F(D[A_i] + (L[A_i] + U[A_i])(Σ_{j=1}^i E_j), D[A_i]^{-1}, A_i).

Corollary 5.2. With the hypotheses of Theorem 5.1 and the additional condition that T_i' ≠ ∅, (5.2) will not hold for arbitrary starting matrices Z_1 satisfying ||s_1(Z_1)|| < 1.

Proof. Following the proof of Corollary 5.1 we obtain

    (I - S)Σ_{T_i'} E_p s_{i+1}(Z_{i+1}) E_q
        = Σ_{T_i'} E_p s_i(Z_i) E_q + Σ_{T_i'} E_p s_i(X_i)s_i(Z_i) E_q.

The presence of the nonzero first-order term (Σ_{T_i'} E_p s_i(Z_i) E_q) dominates the second-order terms in general, and we cannot conclude that (5.2) will hold for arbitrary Z_1 satisfying ||s_1(Z_1)|| < 1. QED
Now let us consider what happens when block fill-in occurs during the algorithms for block tridiagonals. For an off-diagonal (p,q) block, p ≠ q, with (p,q) ∉ T_i, we have E_p Z_i E_q = 0 but E_p Z_{i+1} E_q ≠ 0. Since Z_{i+1} = 2Z_i - X_i Y_i Z_i and E_p X_i E_q = 0,

    E_p Z_{i+1} E_q = E_p X_i(-Y_i Z_i)E_q = E_p X_i s_i(Z_i) E_q,

so ||E_p Z_{i+1} E_q|| ≤ ||E_p X_i|| ||s_i(Z_i)E_q||. From (5.1), E_p s_i(Z_i) E_q = 0, and E_p S E_q = 0, we have

    (I - S)E_p s_{i+1}(Z_{i+1}) E_q = E_p s_i(X_i)s_i(Z_i) E_q,

which implies ||E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2. (One cannot conclude that ||E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_{i+1}(Z_{i+1})||^2, however.)

As shown by these inequalities, fill-in produces small blocks of Z_{i+1} when ||s_i(Z_i)|| < 1, and the corresponding blocks of s_{i+1}(Z_{i+1}) are quadratically small with respect to s_i(Z_i). Thus one source of quadratic convergence is block fill-in during the process Z_{i+1} = F(X_i,Y_i,Z_i).

In summary, we now have two explanations for the quadratic convergence of the odd-even methods applied to block tridiagonal matrices. The first is that the method is a variation of the Newton iteration for A^{-1}. The second is that, in going from H_i to H_{i+1}, all the non-zero off-diagonal blocks of H_i are eliminated and all the non-zero off-diagonal blocks of H_{i+1} arise from fill-in; each block of B[H_{i+1}] is therefore either zero or quadratic in B[H_i].

6. HIGHER-ORDER METHODS

For Poisson's equation, A = (-I, b, -I), the constant block diagonals lead to considerable simplification of the odd-even reduction algorithm.
The most important special properties are the representation of the blocks b^(i) as polynomials in b and the expression of these polynomials in product form, as in the Buneman algorithm, for the choice N = 2^m - 1. However, in some applications other choices of N are much more natural, and Sweet [S10] therefore generalized odd-even reduction to a method which could handle arbitrary N and still preserve (very nearly the same) polynomial representations. Bank's generalized marching algorithm [B5] for Poisson's equation uses odd-even reduction in order to control the stability of a fast marching algorithm; the technique is analogous to multiple shooting methods for ordinary differential equations.

It must be emphasized, though, that the methods considered in this section are not intended as practical methods for general systems. Rather, our purpose is to elaborate on the quadratic convergence properties of odd-even reduction by showing that they are special cases of convergence results for a family of higher-order methods. We analyze a generalization of odd-even reduction, p-fold reduction, and show that it possesses much the same convergence properties as before.

6.A. p-Fold Reduction
Next, A is repartitioned: the pivotal blocks b_p, b_{2p}, ..., b_{N_2 p} remain as block pivots, while the intervening groups form the larger diagonal blocks A-hat_1, ..., A-hat_{N_2+1}; call the new system K x-hat = v-hat. Three adjacent block rows of K then couple A-hat_k to the pivotal rows only through (0 ... 0 a_{kp}), b_{kp}, and (c_{kp} 0 ... 0). The p-fold reduction is obtained by performing a 2-fold reduction on this partitioning of A, and the back substitution is obtained by solving the block systems with coefficient matrices A-hat_k. Writing α, δ, and φ for the portions of A-hat_k^{-1} applied to the left coupling column, the right coupling column, and v-hat respectively, the reduction step simplifies to

(6.1)   a_k^{(2)} = -a_{kp} α_{kp-1},
        b_k^{(2)} = b_{kp} - a_{kp} δ_{kp-1} - c_{kp} α_{kp+1},
        c_k^{(2)} = -c_{kp} δ_{kp+1},
        w_k^{(2)} = v_{kp} - a_{kp} φ_{kp-1} - c_{kp} φ_{kp+1},

and the back substitution becomes

        x_{kp-j} = φ_{kp-j} - α_{kp-j} x_{kp-p} - δ_{kp-j} x_{kp},   1 <= j <= p-1.

Rather than computing the entire α, δ, and φ vectors, a more efficient technique is to use LU and UL processes on the A-hat blocks. For each k = 1, 2, ..., N_2 + 1 (LU process on A-hat_k):

    A~_{k,1} = a_{kp-p+1};  B_{k,1} = b_{kp-p+1};  φ_{k,1} = v_{kp-p+1};
    for j = 2, ..., p-1:
        solve  B_{k,j-1} [A_{k,j-1}  C_{k,j-1}  F_{k,j-1}] = [A~_{k,j-1}  c_{kp-p+j-1}  φ_{k,j-1}];
        A~_{k,j} = -a_{kp-p+j} A_{k,j-1};
        B_{k,j} = b_{kp-p+j} - a_{kp-p+j} C_{k,j-1};
        φ_{k,j} = v_{kp-p+j} - a_{kp-p+j} F_{k,j-1};
    solve  B_{k,p-1} [α_{kp-1}  δ_{kp-1}  φ_{kp-1}] = [A~_{k,p-1}  c_{kp-1}  φ_{k,p-1}].

For each k = 1, 2, ..., N_2 (UL process on A-hat_{k+1}): the symmetric recurrence running from row kp+p-1 down to row kp+1 produces α_{kp+1}, δ_{kp+1}, and φ_{kp+1} in the same way. The block tridiagonal system A^{(2)} x^{(2)} = w^{(2)} is now generated and solved by some procedure, and in the back substitution we take advantage of (6.1) and the quantities saved from the LU process: for each k = 1, 2, ..., N_2 + 1,

    x_{kp-1} = φ_{kp-1} - α_{kp-1} x_{kp-p} - δ_{kp-1} x_{kp};
    for j = p-2, ..., 1:
        x_{kp-p+j} = F_{k,j} - A_{k,j} x_{kp-p} - C_{k,j} x_{kp-p+j+1}.
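For scalar blocks the reduction can be checked directly against a general solver. The following sketch is a modern illustration, not part of the original development (the function names are ours): it forms the reduced system as the Schur complement onto the unknowns x_p, x_{2p}, ..., which is exactly what (6.1) computes, and then back-substitutes for the eliminated unknowns.

```python
import numpy as np

def pfold_solve(T, v, p):
    """One p-fold reduction for a tridiagonal system T x = v (dense sketch):
    eliminate all unknowns whose 1-based index is not a multiple of p
    (a Schur complement onto x_p, x_2p, ...), solve the reduced system,
    then back-substitute for the eliminated unknowns."""
    N = T.shape[0]
    keep = np.arange(p - 1, N, p)              # 0-based indices p-1, 2p-1, ...
    drop = np.setdiff1d(np.arange(N), keep)
    A11 = T[np.ix_(drop, drop)]                # block diagonal: decoupled groups
    A12 = T[np.ix_(drop, keep)]
    A21 = T[np.ix_(keep, drop)]
    S = T[np.ix_(keep, keep)] - A21 @ np.linalg.solve(A11, A12)
    w = v[keep] - A21 @ np.linalg.solve(A11, v[drop])
    xk = np.linalg.solve(S, w)                 # the reduced tridiagonal system
    x = np.empty(N)
    x[keep] = xk
    x[drop] = np.linalg.solve(A11, v[drop] - A12 @ xk)   # back substitution
    return x, S

N, p = 11, 3
T = np.diag(4.0 * np.ones(N)) + np.diag(-np.ones(N - 1), 1) + np.diag(-np.ones(N - 1), -1)
v = np.ones(N)
x, S = pfold_solve(T, v, p)
```

Because the eliminated unknowns fall into decoupled groups of p-1 (the A-hat_k blocks of the text), the reduced matrix S is again tridiagonal.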
6.B. Convergence Rates

As observed in Section 2, if ||B[A]|| < 1 then, under the A-hat partitioning for p-fold reduction, we have ||B[K]|| <= ||B[A]||. By Corollary 4.1, ||A^{(2)}|| <= ||K|| = ||A|| and ||B[A^{(2)}]|| <= ||B[K]||^2 <= ||B[A]||^2, so p-fold reduction will yield quadratic convergence. In fact, we can expect pth-order convergence, although our first result is somewhat weaker than might be hoped in comparison to earlier results.

We will actually consider a natural generalization of odd-even elimination. Let M_p be a nonsingular block (2p-1)-diagonal matrix, M_p = (m_{j,-p+1}, ..., m_{j,0}, ..., m_{j,p-1})_N. We want to pick M_p such that M_p A is a block (2p+1)-diagonal matrix of the form M_p A = (r_j, 0, ..., 0, s_j, 0, ..., 0, t_j), with p-1 zero diagonals on each side of the main diagonal. The jth block row of M_p A is sum_{k=-p+1}^{p-1} m_{j,k} [block row (j+k) of A], so that

    r_j = m_{j,-p+1} a_{j-p+1},
    s_j = m_{j,-1} c_{j-1} + m_{j,0} b_j + m_{j,1} a_{j+1},
    t_j = m_{j,p-1} c_{j+p-1}.

We require m_{j,k-1} c_{j+k-1} + m_{j,k} b_{j+k} + m_{j,k+1} a_{j+k+1} = 0 for |k| = 1, 2, ..., p-1, where we take m_{j,-p} = 0, m_{j,p} = 0. The matrix A^{(2)} is then constructed by selecting the intersections of block rows p, 2p, ... and block columns p, 2p, ... of M_p A:

    a_j^{(2)} = r_{jp},  b_j^{(2)} = s_{jp},  c_j^{(2)} = t_{jp},  w_j^{(2)} = (M_p v)_{jp}.

The special case p = 2, M_p = (-a_j b_{j-1}^{-1}, I, -c_j b_{j+1}^{-1}), has been considered in Section 4.A. For p > 2, define

    d_{j,-p+1} = b_{j-p+1},
    d_{j,-k} = b_{j-k} - a_{j-k} d_{j,-k-1}^{-1} c_{j-k-1},   k = p-2, ..., 1,
    λ_{j,-k} = -a_{j-k} d_{j,-k-1}^{-1},                      k = p-2, ..., 0,
    d_{j,p-1} = b_{j+p-1},
    d_{j,k} = b_{j+k} - c_{j+k} d_{j,k+1}^{-1} a_{j+k+1},     k = p-2, ..., 1,
    μ_{j,k} = -c_{j+k} d_{j,k+1}^{-1},                        k = p-2, ..., 0.

Then we may take

    m_{j,-k} = λ_{j,0} λ_{j,-1} ... λ_{j,-k+1},
    m_{j,0} = I,
    m_{j,k} = μ_{j,0} μ_{j,1} ... μ_{j,k-1},   k = 1, 2, ..., p-1.

The d_{j,-k} are derived by a UL process and the d_{j,k} by an LU factorization. It follows from earlier results that if ||C[A]||_1 < 1 then ||m_{j,±k}||_1 <= ||C[A]||_1^k, so the diagonals of M_p decrease in size as one moves away from the main diagonal. Our first theorem partially generalizes Theorem 4.1.
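The pth-order convergence claimed above can be observed numerically before stating the theorems. The sketch below is our own modern illustration, using the dense Schur-complement form of the p-fold reduction; the function names are ours.

```python
import numpy as np

def bnorm(T):
    """||B[T]||_inf for a point tridiagonal matrix T: row sums of
    off-diagonal magnitudes divided by the diagonal magnitude."""
    D = np.abs(np.diag(T))
    off = np.sum(np.abs(T), axis=1) - D
    return np.max(off / D)

def pfold_matrix(T, p):
    """Reduced matrix of p-fold reduction: the Schur complement onto the
    unknowns p, 2p, ... (1-based numbering)."""
    N = T.shape[0]
    keep = np.arange(p - 1, N, p)
    drop = np.setdiff1d(np.arange(N), keep)
    A11 = T[np.ix_(drop, drop)]
    return T[np.ix_(keep, keep)] - T[np.ix_(keep, drop)] @ np.linalg.solve(
        A11, T[np.ix_(drop, keep)])

p, N = 3, 26
T = np.diag(3.0 * np.ones(N)) + np.diag(-np.ones(N - 1), 1) + np.diag(-np.ones(N - 1), -1)
beta = bnorm(T)                 # 2/3 for this matrix
beta2 = bnorm(pfold_matrix(T, p))
```

Here beta2 comes out well below beta**p, consistent with the pth-order bound of Corollary 6.1 below.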
Theorem 6.1. Let B = (-b_j^{-1} a_j, 0, -b_j^{-1} c_j) = L + U with ||L||, ||U|| <= λ <= 1/2, and let J = (-s_j^{-1} r_j, 0, ..., 0, -s_j^{-1} t_j) = J_L + J_U; write μ = 2λ, so ||B|| <= μ. Then ||J_L||, ||J_U|| <= μ^p / 2 and ||J|| <= μ^p.

Proof. Let R_0 = 0, R_k = L(I - R_{k-1})^{-1} U, T_0 = 0, T_k = U(I - T_{k-1})^{-1} L, and S = R_{p-1} + T_{p-1}, so that

    (I - S) J_L = L(I - R_{p-2})^{-1} L ... L(I - R_0)^{-1} L,
    (I - S) J_U = U(I - T_{p-2})^{-1} U ... U(I - T_0)^{-1} U.

Define ψ_0 = 0, ψ_k = λ^2 (1 - ψ_{k-1})^{-1}, so that ||R_k||, ||T_k|| <= ψ_k and

    ||J_L||, ||J_U|| <= λ^p / [(1 - 2ψ_{p-1}) prod_{k=0}^{p-2} (1 - ψ_k)].

Define δ_0 = 0, δ_1 = 1, δ_{k+1} = δ_k - λ^2 δ_{k-1}, so that δ_k = det(tridiag(λ, 1, λ))_{(k-1)x(k-1)} > 0, ψ_k = λ^2 δ_k / δ_{k+1}, and 1 - ψ_k = δ_{k+2} / δ_{k+1}. With τ_k = δ_{k+1} - λ^2 δ_{k-1} we have τ_1 = 1, τ_2 = 1 - 2λ^2, prod_{k=0}^{p-2} (1 - ψ_k) = δ_p, and (1 - 2ψ_{p-1}) δ_p = τ_p, so ||J_L|| <= λ^p / τ_p. By induction τ_k(1/2) = 2^{-k+1}, so for λ <= 1/2 we have 2^{p-1} τ_p(λ) >= 1 and hence ||J_L|| <= 2^{p-1} λ^p = μ^p / 2; the same bound holds for ||J_U||, and ||J|| <= μ^p. QED

Corollary 6.1. If ||B[A]|| < 1 and ||L[B[A]]||, ||U[B[A]]|| <= (1/2) ||B[A]||, then ||B[A^{(2)}]|| <= ||B[A]||^p.

In generalization of Theorem 4.4, there is the very surprising

Theorem 6.2. With λ^{(i)} = λ(A^{(i)}), λ^{(1)} <= 1, and σ^{(i)} = (1 - sqrt(1 - λ^{(i)})) / (1 + sqrt(1 - λ^{(i)})), we have

    λ^{(i+1)} <= 4 σ^{(i)p} / (1 + σ^{(i)p})^2.

Proof. It suffices to consider i = 1, so we drop the superscripts. Define c-hat_{j,k} = b_{j+k}^{-1} d_{j,k} and c-hat_{j,-k} = b_{j-k}^{-1} d_{j,-k}, k = 0, ..., p-1, and E_1 = 1, E_k = (1 - λ E_{k-1}/4)^{-1}. Then ||c-hat_{j,p-k}^{-1}|| <= E_k, and similarly ||c-hat_{j,-p+k}^{-1}|| <= E_k. Writing s_j = b_j (I - q_j), one finds ||q_j|| <= λ E_{p-1} / 2, so ||(I - q_j)^{-1}|| <= 2/(2 - λ E_{p-1}). To bound λ^{(2)}, each factor of 4 ||b_j^{(2)-1} a_j^{(2)}|| ||b_{j-1}^{(2)-1} c_{j-1}^{(2)}|| is estimated in this way, yielding

    λ^{(2)} <= 4 (λ/4)^p (E_1 E_2 ... E_{p-1})^2 (2/(2 - λ E_{p-1}))^2.

It may be shown by induction that E_n = (1 + σ)(1 - σ^n)/(1 - σ^{n+1}), and, with a little effort, that (2/(2 - λ E_{p-1}))^2 = (1 - λ)^{-1} ((1 - σ^p)/(1 + σ^p))^2 for λ < 1; combined with the first bound this simplifies to the desired result, and the case λ = 1, σ = 1 follows by continuity. Note that we have not actually shown that λ^{(i)} < 1 implies λ^{(i+1)} < 1; this point is now to be resolved. QED
Corollary 6.2. If λ^{(i)} <= 1 then σ^{(i+1)} <= (σ^{(i)})^p and λ^{(i+1)} <= (λ^{(i)})^p, with strict inequality if λ^{(i)} < 1.

Proof. The case λ^{(i)} = 1 was considered in Theorem 6.2. Suppose 0 <= λ < 1, again considering only the case i = 1. If λ' = 4s/(1 + s)^2 with s = σ^p, then sqrt(1 - λ') = (1 - s)/(1 + s) and σ(λ') = s, so σ^{(2)} <= (σ^{(1)})^p follows directly from Theorem 6.2. For λ^{(2)}, it is not hard to show that 4σ^p/(1 + σ^p)^2 = λ^p / φ_p(λ)^2, where φ_p(λ) = (1 + sqrt(1 - λ))^p (1 + σ^p)/2; here one uses the identity σ (1 + sqrt(1 - λ))^2 = λ. Now φ_p(1) = 1 and φ_p(λ) > 1 for 0 <= λ < 1, so λ^{(2)} < λ^p in this case. QED

6.C. Back Substitution

Error propagation in the back substitution for p-fold reduction depends on the particular implementation. In the simplest form, errors in x_{kp-j} depend on newly generated rounding errors and on errors propagated from x_{kp-p} and x_{kp}.
Using x_{kp-j} = φ_{kp-j} - α_{kp-j} x_{kp-p} - δ_{kp-j} x_{kp}, we can write y_{kp-j} = x_{kp-j} + ξ_{kp-j} for the computed solution, to find

    ξ_{kp-j} = -α_{kp-j} ξ_{kp-p} - δ_{kp-j} ξ_{kp} + ε_{kp-j}.

It follows from the analysis of the Gauss-Jordan algorithm (Corollary 3.5) and from explicit formulae for α_{kp-j} and δ_{kp-j} that if ||B[A]|| < 1 then

    ||ξ_{kp-j}|| <= ||B[A]||^{p-j} ||ξ_{kp-p}|| + ||B[A]||^j ||ξ_{kp}|| + ||ε_{kp-j}||.

Errors in a newly computed component therefore depend most strongly on errors in its closest back-substituted components, and the "influence" decreases as the distance between components increases. Moreover, we have ||ξ_{kp-j}|| <= ||B[A]|| max(||ξ_{kp-p}||, ||ξ_{kp}||) + ||ε_{kp-j}||. If ε is a bound on errors introduced at each stage, it follows that errors in the final computed result will be bounded by ε log_p N.

Slightly different results prevail in the back substitution

    x_{kp-1} = φ_{kp-1} - α_{kp-1} x_{kp-p} - δ_{kp-1} x_{kp};
    for j = 2, ..., p-1:  x_{kp-j} = F_{k,p-j} - A_{k,p-j} x_{kp-p} - C_{k,p-j} x_{kp-j+1}.

Errors in the computed solution y_{kp-j} = x_{kp-j} + ξ_{kp-j} obey the rule

    ξ_{kp-1} = -α_{kp-1} ξ_{kp-p} - δ_{kp-1} ξ_{kp} + ε_{kp-1};
    for j = 2, ..., p-1:  ξ_{kp-j} = -A_{k,p-j} ξ_{kp-p} - C_{k,p-j} ξ_{kp-j+1} + ε_{kp-j},

so the errors in x_{kp-j} depend on propagation from x_{kp-p} and x_{kp-j+1} rather than from x_{kp-p} and x_{kp}. This implies a linear growth in p, as in the LU factorization, in conjunction with the logarithmic growth as in odd-even reduction. If we again suppose ||ε_{kp-j}|| <= ε, with ||ξ_{kp-p}||, ||ξ_{kp}|| <= η, it follows that ||ξ_{kp-1}|| <= ||B[K]|| η + ε and ||ξ_{kp-j}|| <= ||B[K]|| max(η, ||ξ_{kp-j+1}||) + ε, which leads to ||ξ_{kp-j}|| <= η + (p-1)ε. The end result is that errors in the final computed result will be bounded by εp log_p N.

7. SEMIDIRECT METHODS

Briefly, a semidirect method is a direct method that is stopped before completion. In common usage, a direct method for the solution of Ax = v computes x exactly in a finite number of steps when exact arithmetic is used, while an iterative method produces a sequence of approximations x^{(i)} converging to x. The method of conjugate gradients is an example of a direct method that is frequently terminated before completion, since it can produce a sufficiently good approximate solution long before its usual finite termination [R2].
The semidirect method computes an approximation to x in a finite number of steps; if the intermediate error is small enough, the computation is terminated prematurely and an approximate solution is returned, while an additional number of steps could produce x exactly. The accelerated Parallel Gauss iteration (cf. Section 8.E) of Stone [S2] and Jordan [J1] is another example, as is Malcolm and Palmer's method [M2] for A = (a, b, a) with a and b positive definite and ρ(b^{-1}a) < 1. Here we consider semidirect methods based on the quadratic convergence properties of the odd-even methods; this will generalize results on the Buneman algorithm for Poisson's equation obtained by Hockney [H5], Buzbee [B17], and Stone [S2] (cf. Theorem 4.3). The name semi-iterative has already been appropriated for other purposes [V3]; Stone [S2] initiated the name semidirect.

7.A. Incomplete Elimination

Let us first consider the odd-even elimination process for Ax = v, which is

    H_1 = A,  v^{(1)} = v;   H_{i+1} = (I + C[H_i]) H_i,  v^{(i+1)} = (I + C[H_i]) v^{(i)}.

In Theorem 4.1 we proved that if ||B[A]|| < 1 then ||B[H_{i+1}]|| <= ||B[H_i]||^2, so that the off-diagonal blocks decrease quadratically in magnitude relative to the diagonal blocks. This motivates the approximate solution z^{(i)} = D[H_i]^{-1} v^{(i)}; if we halt the process before completion, we call this technique incomplete odd-even elimination.

Theorem 7.1.  ||x - z^{(i)}|| / ||x|| <= ||B[H_i]||.

Proof. Since v^{(i)} = H_i x = D[H_i](I - B[H_i])x, we have z^{(i)} = D[H_i]^{-1} v^{(i)} = x - B[H_i] x. QED

Thus the B matrix determines the relative error of the approximation z^{(i)}.
If β = ||B[A]|| < 1, 0 < ε < 1, and m = ceil(log_2 N) is the number of steps in complete elimination, then since ||B[H_k]|| <= β^{2^{k-1}},

(7.1)   k = min(m, 1 + max(0, ceil(log_2(log ε / log β))))

guarantees ||B[H_k]|| <= ε; (7.1) should be regarded as an upper bound on the number of steps actually required. For example, if β = 1/2 and ε = 2^{-20}, then k = 6; if N > 32, so that incomplete elimination would actually be applied, we would have ||B[H_6]|| <= β^{2^5} = 2^{-32}, about 10^{-9.6}, much better than required. In the case n = 1, a_j = c_j = a, b_j = b, we have, with x = ||B[H_i]||,

(7.2)   x^2/2 <= ||B[H_{i+1}]|| = x^2/(2 - x^2) < x^2,   0 < x < 1.

At least 1 + log_2(log(ε/2)/log(β/2)) steps are therefore needed in order to have ||B[H_k]|| <= ε: by induction ||B[H_k]|| >= 2(β/2)^{2^{k-1}}, and if 2(β/2)^{2^{k-1}} > ε then ||B[H_k]|| > ε. A more refined lower bound is obtained by defining

(7.3)   T_1 = β,  T_{n+1} = T_n^2 / (2 - T_n^2),

and computing the largest k such that T_k > ε.

Since B[H_i] is the iteration matrix of the block Jacobi iteration for H_i x = v^{(i)}, when ||B[H_k]|| is small we can use this iteration to improve the approximation. If z^{(k,0)} is an initial estimate, then

    z^{(k,ℓ+1)} = B[H_k] z^{(k,ℓ)} + D[H_k]^{-1} v^{(k)},

and with the choice z^{(k,0)} = 0 the first iterate is z^{(k,1)} = D[H_k]^{-1} v^{(k)} = z^{(k)}. Then ||x - z^{(k,ℓ)}|| <= ||B[H_k]||^ℓ ||x - z^{(k,0)}||, and k and ℓ may now be chosen to minimize computation time subject to the constraint (β^{2^{k-1}})^ℓ <= ε. Any of the other standard iterative methods may be used in place of the Jacobi iteration; for related results using the theory of non-negative matrices and regular splittings, see [H1].

We have remarked earlier that if the tridiagonal matrix A = (a_j, b_j, c_j) is irreducibly diagonally dominant then β^{(1)} = ||B[H_1]|| < 1 and β^{(i+1)} <= β^{(i)2}. However, the example A = (-1, 2, -1), for which β^{(i)} = 1 for 1 <= i <= m-1 and β^{(m)} = 0, shows that strict inequality is necessary for incomplete elimination to be effective.
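The three step-count estimates (7.1)-(7.3) can be tabulated in a few lines; the sketch below is our own (the lower bounds are reported as the smallest k for which the respective bound falls to ε, i.e., a lower bound on the steps needed).

```python
import math

def k_upper(beta, eps):
    # (7.1): beta**(2**(k-1)) <= eps guarantees ||B[H_k]|| <= eps
    return 1 + max(0, math.ceil(math.log2(math.log(eps) / math.log(beta))))

def k_lower_simple(beta, eps):
    # (7.2)-based: ||B[H_k]|| >= 2*(beta/2)**(2**(k-1))
    k = 1
    while 2 * (beta / 2) ** (2 ** (k - 1)) > eps:
        k += 1
    return k

def k_lower_refined(beta, eps):
    # (7.3): T_1 = beta, T_{n+1} = T_n^2/(2 - T_n^2)
    k, t = 1, beta
    while t > eps:
        k, t = k + 1, t * t / (2.0 - t * t)
    return k

print(k_upper(0.5, 2 ** -20))   # 6, the worked example in the text
```

With ε = 10^{-8} these three functions reproduce the (7.2), (7.3), and (7.1) columns of the table in Section 7.C.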
Indeed, when A = (-1, 2, -1) arises from a second-order approximation to boundary value problems, β^{(1)} is very close to 1, as will be the case for most differential equations, and strict inequality does not imply that β^{(i)} will be small for i < m; this is a case of practical importance. With regard to the particular approximation z^{(i)} = D[H_i]^{-1} v^{(i)}, note that z_j^{(i)} depends only on the v_ℓ with j - (2^{i-1} - 1) <= ℓ <= j + (2^{i-1} - 1). Thus for N = 2^m - 1 and i < m, z^{(i)}_{2^{m-1}}, the center point, will be independent of the boundary data. Although this is not the only possible approximation, complete odd-even elimination will be necessary to ensure convergence to the continuous solution.

7.B. Incomplete Reduction

The incomplete odd-even reduction algorithm is analogous to incomplete odd-even elimination. Complete reduction generates a sequence of block tridiagonal systems A^{(i)} x^{(i)} = w^{(i)}, i = 1, ..., m, with x = x^{(1)}; solves A^{(m)} x^{(m)} = w^{(m)} exactly for x^{(m)}; and back-substitutes to obtain x. Incomplete reduction stops with A^{(k)} x^{(k)} = w^{(k)} for some k between 1 and m, solves this system approximately for y^{(k)} ≈ x^{(k)}, and back-substitutes to obtain y ≈ x. We remind the reader of the observations made in Section 4.D: complete reduction performs roughly half of its total work in the first step, a quarter in the second, and so forth.
In contrast to incomplete elimination, where the work per step is roughly constant, the actual savings from stopping the reduction early are therefore much smaller; for more details, see Section 10.C. But incomplete reduction must deal with approximation errors as well as roundoff errors, so a different analysis is needed. We have already shown that roundoff errors in complete reduction are well-behaved, growing at most logarithmically, through no more than about ceil(log_2 N) + 1 stages; quite possibly the approximation errors will be larger than the roundoff errors, especially in the back substitution. We will see a number of interesting results under certain assumptions, including a superconvergence-like effect. (This observation is due to an anonymous referee of [H2].)

An initial approximation y^{(k)} may be computed from A^{(k)} and w^{(k)} according to our analysis of incomplete elimination. The choice y^{(k)} = D[A^{(k)}]^{-1} w^{(k)} is a simple and useful one, with ||x^{(k)} - y^{(k)}|| / ||x^{(k)}|| <= ||B[A^{(k)}]||, and the parameter k should be determined for computational economy. Regardless of the source of the errors, the important effect is that errors in the approximation y^{(k)} decrease due to damping in the back substitution; here both the quadratic decrease of ||B[A^{(i)}]|| and the damping are important in some (but certainly not all) situations.

Let us suppose that y_j^{(k)} = x_j^{(k)} + δ_j^{(k)}. In the back substitution, y^{(i)} is computed from y^{(i+1)} by

    y_{2j}^{(i)} = y_j^{(i+1)},
    y_{2j-1}^{(i)} = (b_{2j-1}^{(i)})^{-1} (w_{2j-1}^{(i)} - a_{2j-1}^{(i)} y_{j-1}^{(i+1)} - c_{2j-1}^{(i)} y_j^{(i+1)}) + ε_{2j-1}^{(i)},

with ε^{(i)} representing rounding errors. We then have

    δ_{2j}^{(i)} = δ_j^{(i+1)},
    δ_{2j-1}^{(i)} = ε_{2j-1}^{(i)} - (b_{2j-1}^{(i)})^{-1} (a_{2j-1}^{(i)} δ_{j-1}^{(i+1)} + c_{2j-1}^{(i)} δ_j^{(i+1)}).

Theorem 7.2. If ||ε^{(i)}|| <= (1 - ||B[A^{(i)}]||) ||δ^{(i+1)}|| then ||δ^{(i)}|| = ||δ^{(i+1)}||.

Proof. By construction, ||δ^{(i)}|| >= ||δ^{(i+1)}||.
in size than by component absence in y(1) (k) (k) II/l_ II° IIx(k) - y(k)If, many errors, IIx- Yll • for the tridiagonal (-I, 4, -l)x = v, N = 31, k = m-I = 4, v chosen and yj-(k)= b(k)-lw(k)j j . that no round- follows I_-Yll /I_II < l_(k)-Y to see that while IIx- illustrate First (k). will not grow of x-y will QED ¢(i) = 0 and all the error error IIx(k) II< IIxll , we conclude it is easy we have some explanation. from the approximation In fact, IIB[A(i) ]II II6(i+i)II )" II 8(i) II_ II6(i+i)II • requires are committed, IIB[A]II < I that From odd-even we have II 6(i) II_ max(ll 6(i+I)II, II¢(i)II + From for complete so that x. = I, J errors, creating the back a very sub- pleasing super convergence- like effect. Now suppose bound have II¢(i)II < ¢essentially propagation have must that rounding If solved analysis errors are committed and we have II6(k) II< v¢ for some moderate constant the system A(k)x (k) = w (k) exactly, is the same as for complete reduction. the upper _ then we and the error That is, we II8(i) II< (k - i + _)e, I < i _ k. The remaining case be considered in competition esis of Theorem 7.2, is when II8(k) II>> II 6(k) II (i - with ¢- Here an iterative IIB[A(k)]ll) the semidirect method. _ ¢, should method The hypothbe viewed as 89 -3 -4 -5 m I ._ _ 6 X £) ¢=m 0 -7 -8 -9 0 5 I I0 I 15 I 20 i 25 I 30 Component number j Figure 7.1. togl0 Ixj-yjl vs. j. J 35 90 an attempt to formalize esis remains true, the difference tirely be attributed hypothesis the condition effects effect would will As between approximation it is possible still be able to say that convergence (in norm) to the original is violated II6(k) II>> ¢- x (i) and y(i) can enerror. for the errors As soon as the to grow, but we will IIx - Yll = II 6(k) II+ O(ke). be moderated somewhat still be felt and we can expect long as the hypoth- Although by roundoff, this bound the super- its qualitative on accumulated er- ror to be an overestimate. 7.C. 
7.C. Applicability of the Methods

We close with some remarks on the practical application of the quadratic semidirect methods. In Buzbee's analysis [B17] of the Buneman algorithm for the differential equation (-Δ + d)u = f on a rectangle with sides ω_1 and ω_2, d a nonnegative constant, it is shown that if d + (π/ω_1)^2 + (π/ω_2)^2 is sufficiently large then a truncated Buneman algorithm will be useful. Geometrically, this requires that the rectangle be long and thin if d is small. By directly considering the block tridiagonal matrix when the finite difference approximation to Poisson's equation on a rectangle has M nodes in one direction and N in the other, M <= N, it is possible to reach a similar conclusion; this also serves as a simple model for more general elliptic equations. With equal mesh spacing in both directions, and with the nodes properly ordered, the matrix is A = (-I, P_M, -I)_N, P_M = (-1, 4, -1)_M. It is readily verified that ||B[A]|| = 2||Q_M||, Q_M = P_M^{-1}, and

    ||Q_M|| = (1/2)(1 - (D_m + D_{m-1})/D_M),   M = 2m,
    ||Q_M|| = (1/2)(1 - 2 D_m / D_M),           M = 2m+1,

where D_0 = 1, D_1 = 4, D_n = 4 D_{n-1} - D_{n-2}. The following table gives ||B[A]|| for M = 1(1)6, along with the lower-bound estimates (7.2) and (7.3) and the upper bound (7.1) for the number k of odd-even reductions needed to obtain ||B[A^{(k)}]|| < 10^{-8}:

    M    ||B[A]||          (7.2)   (7.3)   (7.1)
    1    1/2   = .50         5       5       6
    2    2/3   = .67         6       6       7
    3    6/7   = .86         6       7       8
    4    10/11 = .91         6       7       9
    5    25/26 = .96         6       8      10
    6    40/41 = .98         6       8      11
as part of each of a block on methods to be such that the method Of itera- tridiagonal for a parallel already dif- considered computer, can be Use of Elimination applications Jacobi that several to iterative iteration elliptic equations of A can be simplified subsequent Indeed, When converges; suitable will be handled the iteration acceleration) _I = I, v2 = 2/(2- Jacobi definite x (i+l) = _i+l (B[A]x(i) forming for If the structure A'x = v', then efficiently, and since any slower. iteration the Jacobi that is satisfied IV3]. will not converge of the block A is also positive more elimination Suppose this condition step, block the system Ax = v and partitioning by an elimination convergence (J - Sl or Chebychev Consider about x (i+l) = B[A ix(i) + D[A]- Iv. with computations IIB[A' ]II< IIB[A]II of our theorems methods. [[B[A][ I< I, so the iteration slow. elliptic employ, solution of a finite efficiently. the block many which or semi-direct Let us first note have solution nonseparable be methods Continuing this system the iterative will often semi-iterative be quite method IV3] + DIAl -i v- x (i- 1))+ x (i-l) , 02), Vi+ I = (I - 02Vi/4) - I , 0 = 0(B[A]), will converge at a much faster rate and is a natural candidate for parallel 93 computation. sequence If we use _ = IIB[A]II as an upper bound on P and use the of parameters * Vl then the J-SI iteration "preprocessing" be smaller; * _i+l = (I- _2 vi/4 * )- I, for A' will not be slower Ax = v we unfortunately Juncosa tive methods and Mullikin when * = 2/(2 - _2) , = I, v2 than can expect that the actual a complete proof [J2] discuss that for A. computation of this claim the effect By time will is not at hand. of elimination on itera- A = I - M, M e 0, m rr = 0, and .... m rs _ I with strict s inequality for at least one value liptic h_undary the interior be useful value problems nodes, of r. 
is to order and then to eliminate for domains with curved Thus the point particular the boundary the boundary or irregular that, with A' = I - M' representing p(M') < 0(M). Their iteration nodes will for el- followed nodes. boundaries. the system after Jacobi interest by This would It is shown eliminating one point, not be slower. Also, -I if G = (I - L[M]) tion, [B4]. U[M] then 0(G') < o(G). As in the proof for block elimination, is the matrix Similar results of Corollary though describing follow 3.2 we obtain not under the more the Gauss-Seidel from work on G-matrices corresponding general itera- results assumption II B AIII < 8.B. Splittinss The block diagonal Many Jacobi portion of A, and using other iterative MlX(i+l) = M2x(i) and semidireet iteration methods + v. methods When is created by "splitting the iteration are also defined M 1 is a block can be used to solve off" the block D[A]x (i+l) = (D[A]-A)x (i) + v. by a splitting tridiagonal the system A = M 1 - M2, matrix, for x various (i+l) direct 94 A particularly [C4] is readily effective adapted to parallel _u on a rectangle of variable procedure = -V discussed computation. " (a(x,y)vu) = a(x, y)2 u(x,y) and Golub The equation = f R, a(x,y) > _ and sufficiently w(x,y) by Concus smooth,lundergoes and scaling by a 2 to yield a change the equiv- alent equation -Aw + p(x,y)w I i I on R, p(x,y) = a 2 A (a2 ), q = a 2 f. = (k - p)w n + q, k a constant, (-Ah+ = q The shifted is solved in the discrete Kl)w (n+l) = (KI - P)w (n) + Q, where the system Chebychev (-A h + Kl)b = c may be solved acceleration lel and rapidly 8.C. is applicable, iteration (-A + k)Wn+ I form P is a diagonal matrix. by the fast Poisson the overall method Since methods is highly and paral- convergent. Multi- line Orderinss There are a number solution tions of block tridiagonal for a rectangular for 2-1ine iteration the resulting square grid; classical systems. is shown technique tangle as part of the iteration. 
systems schemes which with in Figure involve the multi-line of unknowns itera- and partitioning 8.1. A diagram of 8.2. of the multi-line form of the elliptic a few lines we obtain We begin is shown in Figure or long and broad rectangle A discrete iterative the ordering on a 6×6 grid matrix The underlying of more methods is to divide a into a set of long and thin rectangles. equation is then solved By partitioning over each the grid that can be solved well subrec- into groups by general block of 95 I 3 5 7 9 II / 2 4 6 8 I0 12 13 15 17 19 21 23 14 16 18 20 22 24 25 27 29 31 33 35 26 28 30 32 34 36 Figure J 8.1 1 I I 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Figure 8.2 The 2-1ine ordering. 96 I Figure 2 3 4 20 21 22 2.3 24 19 32 34 25 I8 31 I 36_ 35 26 17 30 29 28 27 t6 15 14 13 Figure 8.3 8.4. 33 r------'--I The single 5 12 spiral 6 II ordering 97 methods, particularly The multi-line is probably computer the semidirect block the simplest and should an n×n grid method and a k-line matrix with plete than odd-even cording 8. D. reductions is used, given useful spiral ordering order diagrammed produces represented by the tridiagonal in Figure a very 8.4. in Figures of locations tridiagonal out of this construccase. When incom- of lines may be chosen 7.C for Poisson's to solve systems We note ac- equation. matrix involving tridiagonal SOR method SOR method 8.3 and 8.4. forming is the The spiral region, the central order which portion is of the A = M I - M2 we take M I to be the 2 2 For an nXn grid M I is n ×n and has IIB[MI]II spiral form, is shown [BI0]. reduction can be used quite efficiently M I. the block peripheral of the periodic computation In the splitting that the single ting to yield for parallel long and thin rectangular tridiagonal center of A. 
I about _; thus incomplete odd-even block each ILl] points because as for the general of unknowns essentially eral block matrices, as one block are needed the number A = M I - M2 is in strategic Lambiotte With Orderings Another matrix or pipeline for other methods. zeros purposes in Section acceleration the splitting By storing log N reductions reduction Chebychev tridiagonal and N = nn' = n2/k. to the guidelines Spiral single N = n. block 7. on a parallel n = kn', for computational that only log n odd-even tion, rather with as a benchmark of n' decoupled k×k blocks of Section to implement iteration, has k×k blocks with M I may be regarded iteration be considered such that M I consists which Jacobi methods order method has been used with [B9], [E2]. and the convergence experimentally Here another split- the blocks are rate of the periph- to be similar to the 2-line 98 The double bine features of the single (Figure spiral 8.5, 8.6) order with is an attempt advantages to com- of a 2-cyclic For an n×n grid, -peripheral orders can also For the single submatrix tion, M I. and double spiral orders, of A should be included when n is even in the implicit This can be done at little extra row and column similar be defined. of M2 has at most modification The mapping ordering should from two non-zero and ensures elements. ordering by a permutation of the vector With is unknown (i-l)n + j in the natural ordering. tors forming the coordinates that each n is odd a (i,j) of each unknown or single T = (1,2,...,n 2) an n×n grid, Given point For simplicity are computed by taking (I _ k < n2) i(k) = I + q(k), (I _ k < n2) j(k) = k - nq(k), (I _ k _ n 2). let n = 2m-l. p(i,j) L(k-l)/nj, = min(i,j, Define 2m-i, (i,j) is part of peripheral 2m-j), number (i _ i, j < n). p(i,j), (i,j) T, two vec- and remainder: q(k) = The point When to the peripheral S = (S(1),S(2),...,S(n2)). number part of the itera- be made. 
the natural is defined cost, into a new vector quotient the trailing of A looks like The e'd elements spiral ordering 1 2 n odd, one spiral contains _(n + 2n - I) 1 2 and the other contains _(n - 2n + I) points. Multi-spiral oi: ordering. points spiral I _ p(i,j) _ m. 99 I 2 3 4 5 6 25 26 27 28 29 7 19 20 21 22 30 8 J Figure 18 _]36 I I,, 24 2_ 31 9 17 1,35 I _4 33 32 I0 16 15 14 13 12 II Figure 8.5 8.6. The double spiral ordering IO0 Peripheral p has 8(m-p) tains the single point The peripheral peripheral points is taken in it, except for peripheral m which con- (m,m) . ordering is per. in clockwise I, per.2, order: ..., per. m, where top, rhs, bottom, each lhs. Define p-I f(i,j) the number = _k=l of elements g(i,j) 8(m-k) = 4(2m-p)(p-l), contained = f(i,j) + in peripherals I (i-p) + 8(m-p) = 4p(2m-p) + I (j-p) + easily S(k) = g(i,j) -8m + evaluated of a vector may permute quite may be generated Of course, "unnatural" here. A more complete ILl]. be performed (if i _ j) We performed. (if i> these the vector should if i > j (2p + i + j). j) the necessary or pipeline - I, and if i < j, permutation, computer. on the CDC STAR by using though Once be used and can be The actual the built-in sparingly S is generated, reordering since the elements vector they are of A and v directly. the implementation ordering sented been on a parallel instructions, expensive. defines i - (i-p) - (j-p) + I - The vector 1,2,...,p(i,j) involves only mean comparison of an iterative much more developmental to show the potential based on actual For such a discussion method based attention for parallel machine capabilities of some other methods, on an than is pre- computation. has not see Lambiotte I01 8.E. Parallel We close Gauss our section on iterative tems with some brief remarks IT2] with extensions by Heller, we consider the normalized on the Parallel Stevenson tion is D (i) = Iof D. 
This _(A) = (I - _)/(I Techniques out, this dratic convergence (parabolic necessary the constant of odd-even ]ID- is much reduction. making D (i-l) II• in [H4]. As less attractive problems, ]Id' j - d j I]< (_(_ Thus Stone approximation [S2] than the qua- in certain discussed situations here) and this would that D' has already been it is hope- 2 + ^_ i ) + 4' 2 (4"-)(I + _F computed 3.4 ) + ]ID' - DII : maxj] I d'.]- djlI -< I - }(_I_--_-% + ^I-_-_ F), and we may as an initial estimate k(A) < i, D (i) II_ _II D- However, itera- for the L and U Since d' I cj-I - a'(d_ j - d j = a j dj -I j - i) Ic' j-l' by Theorem _ parallel estimates. Let A' = (a'.],I, c')] and suppose : let up the Parallel iterations iterations of related initial dj, cj), a natural _ are given or the multi-line suitable For simplicity in [H4] that, with we have to solve a sequence fully provide are Jacobi linear convergence equations [H4]. sys- due to Traub D (0) is some initial iterations It is shown + _), for reducing points and Traub where of three two parts systems. tridiagonal iteration A = (£j, I, 0)(0, L[A](D(i'I))-Iu[A], the other block bidiagonal Gauss Since D = I - L[A]D-Iu[A], forms the first Gauss method; for block form A = (aj, I, c j). In the block LU factorization D = diag(dl,...,dN). methods to D. Now, the Parallel Gauss take D' iteration is 102 most useful initial When when X and X' are not too close approximations X and X' are close the iteration For will should tridiagonal be available to 1 the bound to i, and in this case good by the above will probably not be used matrices we have bound on IID' - DII. be large, but in this case due to slow convergence. d' I _ dI = 0 -I -I d'.J d j = a j (d_ - I) (dj. I - d j- l)dj -I cj -I -I + (d _i ) (aj cj_ I - a'. J c'. J- i) , so that 2 ]d' - d ] < I- _/_ J J 1 + _,l-/Tf_ d'. I J-1 - d ,I+ j-1 1 + _ Let A = maxjlajcj_ 1 -a'c' j j-I ] _ (X + _')/4, a = (i - 1,,_-X)//(1 so by induction + Ji'_), 6. b - < (1 + a + 2/(1 ... 
Thus

    |d'_j - d_j| < b / (1 - alpha) = 2*Delta / ((1 - lambda)^(1/2) + (1 - lambda')^(1/2)) = Delta / (1 - lambda*)^(1/2),

where (1 - lambda*)^(1/2) = ((1 - lambda)^(1/2) + (1 - lambda')^(1/2)) / 2 defines an intermediate value lambda* between lambda and lambda'. If lambda is close to 1 the upper bound may be large and the iteration should not be used; when lambda is not so close to 1 this bound may give more complete information than the one derived by the theorem.

Let us now consider the situation where tridiagonal systems must be solved at each time step of the parabolic equation

    u_t = (a u_x)_x,   a = a(t,x) > 0,   0 <= x <= 1,   0 < t,

with u(0,x), u(t,0), u(t,1) specified. Let u_j^i ~ u(ik, jh), k = Delta t, h = Delta x, rho = k/h^2. The Crank-Nicolson iteration for advancing the solution is the tridiagonal system

    (I + (rho/2) P^(i+1)) u^(i+1) = (I - (rho/2) P^i) u^i,

where P^i = (-a^i_{j-1/2}, a^i_{j-1/2} + a^i_{j+1/2}, -a^i_{j+1/2}). We therefore have

    A = (-rho a^(i+1)_{j-1/2}/2,  1 + rho(a^(i+1)_{j-1/2} + a^(i+1)_{j+1/2})/2,  -rho a^(i+1)_{j+1/2}/2);

let us now drop the superscript i+1 for convenience. A will always be strictly diagonally dominant, and we would like to know when we also have lambda(A) < 1. Using the mean value theorem, write a_{j+1/2} = a_{j-1/2} + h b_j, b_j = a_x((i+1)k, x~_j); using some algebraic manipulation it is seen that lambda(A) <= 1 is guaranteed if (rho/2) h |a_x(t,x)| <= 1. If m_1 = max |a_x(t,x)|, this becomes the condition

    k m_1 <= 2h.

Suppose we want lambda(A) <= (1 + c)^{-2}, c > 0. This may be satisfied if we choose h and k such that, for each j,

    c rho a_{j-1/2} <= 1 + rho h b_j / 2,   c rho a_{j-1/2} <= 1 - rho h b_j / 2.

With m_0 = max a(t,x), a sufficient condition is c rho m_0 <= 1 - rho h m_1 / 2, or

(8.1)    k <= 2h^2 / (2c m_0 + h m_1).
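The bound (8.1) is easy to exercise as a function; the two limiting cases claimed in the text (c = 0 and c = h) fall out algebraically. The function name and sample values are ours.

```python
# The step-size bound (8.1): the largest time step k guaranteeing
# lambda(A) <= (1+c)^(-2), given m0 = max a(t,x) and m1 = max |a_x(t,x)|.

def k_bound(h, c, m0, m1):
    return 2.0 * h * h / (2.0 * c * m0 + h * m1)

h, m0, m1 = 0.01, 2.0, 1.5
k0 = k_bound(h, 0.0, m0, m1)     # c = 0: recovers k * m1 <= 2h
kh = k_bound(h, h, m0, m1)       # c = h: k <= 2h/(2 m0 + m1) = O(h)
```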
For c = 0, (8.1) reduces to the previous condition on h and k, while for c = h we have k <= 2h/(2m_0 + m_1) = O(h). Now, one of the best features of the Crank-Nicolson method is that it allows k to be chosen O(h), and a small value of c still allows this choice. Conversely, if k = rh satisfies (8.1) and r m_1 < 2, then c <= h(1 - r m_1/2)/(r m_0) = O(h). When h and k are chosen in the usual way (k = O(h)), lambda(A) will be close to 1. Thus the Parallel Gauss method cannot be expected to be very effective in the usual application of the Crank-Nicolson method; when we try to pick h and k so that the method will be effective (c >> h) we are necessarily restricted to small time steps. Once several time steps have been performed, Delta and lambda* may be sufficiently small to warrant use of Parallel Gauss and its variations; but since this choice of h and k is restrictive, Parallel Gauss appears to be less attractive than originally believed, and odd-even reduction would be preferred.

9. APPLICATIONS

In this section we discuss two applications of practical importance in which the assumption ||B[A]|| < 1 is not met; in the second case it is shown how to get around this problem.

9.A. Curve Fitting

Cubic spline interpolation of the data (x_i, f_i), 1 <= i <= N + 1, gives rise to the tridiagonal system [C6, p. 238]

    h_i s_{i-1} + 2(h_i + h_{i-1}) s_i + h_{i-1} s_{i+1} = 3(f[x_{i-1},x_i] h_i + f[x_i,x_{i+1}] h_{i-1}),

h_i = x_{i+1} - x_i. We have A = (h_i, 2(h_i + h_{i-1}), h_{i-1}) and ||B[A]|| = 1/2, which is quite satisfactory. However, more general problems need not behave so nicely even though they might give rise to block tridiagonal matrices. Cox [C7] presents an algorithm for least-squares curve fitting with piecewise polynomials which must solve a system of the block form

    [ H_1   C_1^T                      ] [ v_1 ]     [ b_1 ]
    [ C_1   0     C_2^T                ] [ l_1 ]     [ 0   ]
    [       C_2   H_2   .              ] [ v_2 ]  =  [ b_2 ]
    [             .     .    .         ] [  .  ]     [  .  ]
    [                   .    .  C_{2m-2}] [  .  ]     [  .  ]

The v vectors contain coefficients of the Chebychev polynomial expansion of the piecewise polynomial, and the l vectors contain multipliers derived through the constrained least-squares problem.
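The claim ||B[A]|| = 1/2 for the spline system holds for any positive knot spacings, since each interior row of B[A] sums to (h_i + h_{i-1}) / (2(h_i + h_{i-1})). A quick numerical check (names and random spacings are ours):

```python
# Interior row sums of B[A] for the cubic-spline system
# A = (h_i, 2(h_i + h_{i-1}), h_{i-1}): always exactly 1/2.

import random

random.seed(1)
h = [random.uniform(0.1, 2.0) for _ in range(12)]   # arbitrary positive spacings

row_sums = []
for i in range(1, len(h)):                          # interior equations
    sub, diag, sup = h[i], 2.0 * (h[i] + h[i - 1]), h[i - 1]
    row_sums.append((abs(sub) + abs(sup)) / diag)
```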
The C_j's always have full rank, and if there are sufficiently many well-placed data points between adjacent pairs of knots then the H_j's will all be positive definite. When insufficient data is available the H_j's will not have full rank and B[A] is not even defined, so some other method is needed; a straightforward band elimination may be used, and this is quite satisfactory when the number of knots is large. Indeed, for the most obvious partitioning the condition ||B[A]|| < 1 need not hold even when B[A] is defined.

9.B. Finite Elements

In Section 2.A we noted that matrices derived from finite element approximations to differential equations often fail to satisfy ||B[A]|| < 1 for any reasonable partition; usually the only assumption made is that A is positive definite. Thus some other technique must be used to establish numerical stability in a direct method or convergence in an iterative method. The following example, admittedly a simple one, shows how an easily constructed change of variables can yield a new system A'x' = v' such that ||B[A']|| < 1; it is also shown that even if the change of variables is not actually made, this property will serve to establish numerical stability for the original system.

Let us begin with Hermite cubics for the equation -u'' = 1 on the unit interval, with a uniform partition. Strang and Fix [S4, p. 59] derive the 4th order accurate equations

(9.1)    -(6/(5h^2))(u_{j+1} - 2u_j + u_{j-1}) + (1/(10h))(u'_{j+1} - u'_{j-1}) = 1,

         -(1/(10h))(u_{j+1} - u_{j-1}) - (1/30)(u'_{j+1} + u'_{j-1}) + (4/15) u'_j = 0.

Taking the natural choice x_j = (u_j, u'_j)^T, we obtain the block equation

    a x_{j-1} + b x_j + c x_{j+1} = (1, 0)^T,

    a = [ -6/(5h^2)   -1/(10h) ]    b = [ 12/(5h^2)   0    ]    c = [ -6/(5h^2)    1/(10h) ]
        [  1/(10h)    -1/30    ],       [  0         4/15  ],       [ -1/(10h)    -1/30    ].

The matrix A = (a, b, c) is positive definite, but ||B[A]|| = 1/4 + 3/(4h).
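The norm 1/4 + 3/(4h) can be reproduced numerically from the block coefficients stated above for (9.1), as block row sums of |b^{-1}a| + |b^{-1}c|. The helper names below are ours, and the computation assumes the coefficients as reconstructed in (9.1).

```python
# ||B[A]|| for the Hermite-cubic system (9.1) with x_j = (u_j, u'_j)^T:
# the block row sums of |b^-1 a| + |b^-1 c| give max(1 + h/12, 1/4 + 3/(4h)).

def binv_rowsums(a, b, c):
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    binv = [[b[1][1] / det, -b[0][1] / det], [-b[1][0] / det, b[0][0] / det]]
    def mul(m1, m2):
        return [[sum(m1[i][k] * m2[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]
    ba, bc = mul(binv, a), mul(binv, c)
    return [sum(abs(ba[i][j]) + abs(bc[i][j]) for j in range(2)) for i in range(2)]

h = 0.1
a = [[-6 / (5 * h * h), -1 / (10 * h)], [1 / (10 * h), -1 / 30]]
b = [[12 / (5 * h * h), 0.0], [0.0, 4 / 15]]
c = [[-6 / (5 * h * h), 1 / (10 * h)], [-1 / (10 * h), -1 / 30]]
norm = max(binv_rowsums(a, b, c))
```

For h = 0.1 this gives 1/4 + 3/(4h) = 7.75, confirming that the out-of-scale unknowns make ||B[A]|| grow like 1/h.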
The problem is that the unknowns u_j, u'_j are out of scale. A second set of unknowns, which are actually more natural than the original since the Hermite cubics can be derived from B-splines by coalescing adjacent nodes, is

    x_j = (u_j - (h/2) u'_j,  u_j + (h/2) u'_j)^T ~ (u_{j-1/2}, u_{j+1/2})^T.

Since these unknowns are in scale, we now have a well-behaved system: multiplying the first equation of (9.1) by 2h^2 and the second by 2h, the block equation becomes a x_{j-1} + b x_j + c x_{j+1} = (2h^2, 0)^T with

(9.2)    a = [ -1     -7/5  ]    b = [ 12/5    12/5 ]    c = [ -7/5    -1   ]
             [  1/6    1/30 ],       [ -8/15   8/15 ],       [ -1/30   -1/6 ].

The blocks are independent of h, and we have ||B[A]|| = 1, with rho_k(B[A]) < 1 for the first and last block rows. (Actually, this assumes Dirichlet boundary conditions; Neumann conditions give rho_k(B[A]) = 1.)

For the more general equation -u'' + Qu = f, Q a constant, the following equations are derived [S4, p. 59]:

    (1/(30h))(-36 u_{j-1} - 3h u'_{j-1} + 72 u_j - 36 u_{j+1} + 3h u'_{j+1})
        + (Qh/420)(54 u_{j-1} + 13h u'_{j-1} + 312 u_j + 54 u_{j+1} - 13h u'_{j+1}) = F_j,

    (1/30)(3 u_{j-1} - h u'_{j-1} + 8h u'_j - 3 u_{j+1} - h u'_{j+1})
        + (Qh^2/420)(-13 u_{j-1} - 3h u'_{j-1} + 8h u'_j + 13 u_{j+1} - 3h u'_{j+1}) = F'_j.

Again taking x_j = (u_j - (h/2) u'_j, u_j + (h/2) u'_j)^T and suitably scaling the equations, they become

    a x_{j-1} + b x_j + c x_{j+1} = (h F_j, F'_j)^T,

where now the blocks are O(h^2) perturbations of the blocks in (9.2). Omitting details, for h sufficiently small this gives

    ||B[A]|| = 1 - (5/12) Q h^2 + O(h^4).
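The claim ||B[A]|| = 1 for the rescaled blocks can be verified in exact rational arithmetic, using the block entries stated in (9.2) above (the helper code is ours):

```python
# Block row sums of |b^-1 a| + |b^-1 c| for the rescaled blocks (9.2):
# both come out to exactly 1, matching ||B[A]|| = 1 for the in-scale unknowns.

from fractions import Fraction as F

a = [[F(-1), F(-7, 5)], [F(1, 6), F(1, 30)]]
b = [[F(12, 5), F(12, 5)], [F(-8, 15), F(8, 15)]]
c = [[F(-7, 5), F(-1)], [F(-1, 30), F(-1, 6)]]

det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
binv = [[b[1][1] / det, -b[0][1] / det], [-b[1][0] / det, b[0][0] / det]]

def mul(m1, m2):
    return [[sum(m1[i][k] * m2[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

ba, bc = mul(binv, a), mul(binv, c)
row_sums = [sum(abs(ba[i][j]) + abs(bc[i][j]) for j in range(2)) for i in range(2)]
```

Note that B[A] is unchanged by any common row scaling of (a, b, c), so the result does not depend on the factors used to normalize (9.1).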
Since rescaling and change of basis in the approximation subspace is equivalent to a change of variables in the linear equations, the process carries over in terms of modifying the underlying finite element equations, and the technique should be applicable to more general finite element approximations; for spline subspaces it is natural to collapse nodal points as in the B-spline construction. Strictly in terms of the linear algebra, it is natural to ask if either of the following problems is solvable in general:

1. Given an arbitrary block tridiagonal matrix A, find a block diagonal matrix E such that ||B[AE]|| < 1.

2. In 1., if ||B[A]|| < 1, find E such that ||B[AE]|| <= ||B[A]||, E yielding an optimal diagonal scaling.

These problems are closely related to the theory of diagonal scaling to minimize condition numbers, and as such are outside the scope of this thesis.

10. IMPLEMENTATION OF ALGORITHMS

This section contains some general remarks on the choice of algorithms for particular applications, and a more detailed analysis of the implementation of block tridiagonal algorithms for use on a vector computer, the CDC STAR-100. The comparison of algorithms will be in two parts: first a simplified comparison in which machine vectors may be defined as storage locations in arithmetic progression, and second a more specialized comparison in which vectors must be defined as contiguous storage locations (arithmetic progression with increment = 1). The second model allows us to pinpoint the effect of data manipulation requirements on the choice of algorithm and on the execution time comparison. The problem is assumed to be core-contained throughout. For a description of the STAR facilities necessary for this discussion, see Section 2.B and Appendix A. The implementation of algorithms for tridiagonal linear systems, n = 1, is adequately discussed elsewhere ([H4], [J1], [L2], [M1], [S2]), and these references may be considered as background material for this section.

10.A. Choice of Algorithm (Generalities)

The salient point of our analysis is that a number of simplifying assumptions must sometimes be made in order to reduce computing costs or estimate error. When computing costs are high (such as for one enormous problem, or a problem of special structure solved repeatedly) an extra effort must be made
to choose the most efficient algorithm, but we feel that it is not at all difficult to choose good algorithms for the general problems likely to arise in practice. Regardless of the computer, special methods should be used for special problems, and in our opinion block tridiagonal algorithms should be implemented for three general cases: systems with small blocks, band or profile systems, and large sparse systems.

The first case should be treated with some form of block elimination. On a standard sequential computer the LU factorization is appropriate; on a k-parallel or vector computer some form of odd-even reduction should be used. Implementation can be greatly simplified by restricting the block sizes, and proceeding in this manner a good implementation should be a set of subroutines for small uniform blocks, used either directly or as a module in the iterative or semidirect solution of large sparse problems, as indicated in Section 8. Options include the insertion of special techniques for specific block sizes, and scalings or pivoting strategies for efficiency or stability.

The second case, general band or profile matrices, may be treated in a different manner: either a combination of methods with a cross-over point between them, or some form of band/profile LU factorization, should be used. The experience gained from tests on one machine may or may not carry over to others. In Lambiotte's opinion [L1] the band methods for the CDC STAR are better implemented by vector LU factorization than by odd-even reduction, but this choice is heavily influenced by his insertion of a block size restriction of 8x8; that restriction is efficient for vector operations on the STAR, but may or may not prove satisfactory for more general problems. The best choice is not clear at this time, since storage allocation and access for band methods must also be considered (cf. our remarks in Section 10.B).
The third basic class of problems includes the elliptic difference and finite element equations: very large sparse systems which must frequently be solved by iterative methods. There are several useful kinds of algorithms with inherent parallelism. The fast Poisson solvers are among the most effective, and recent studies of their implementation on parallel and vector computers are available [R4]; a more complete analysis of methods for the Poisson equation is still needed. Brandt's multi-level adaptive technique [B13] must also be seriously considered, though it demands flexibility in program structure, and its generalizations form a promising class of algorithms.

Finally, we note that block elimination based on a nested partitioning of the system has been extensively exploited in other contexts. One example is the hypermatrix scheme of the ASKA structural analysis package developed at the University of Stuttgart [N1] (see Section 2.A); Noor and Voigt discuss implementation of such solvers on the CDC STAR-100. The usual situation is that the bulk of the data must reside in secondary storage, and the partitioning is chosen accordingly: block sizes are chosen to optimize vector lengths and to minimize transfers between storage levels. A partitioned system A_pp x_p = v_p is given; suppose that p' is another partition such that p refines p'. The factorization of the partitioned matrix can be computed by the block LU or Cholesky procedure [E3], and when A_pp is block tridiagonal so is A_p'p'. Moreover, if the partition is localized, the diagonal blocks of A_p'p' are themselves block tridiagonal, and the work can be arranged so that only a few blocks need be in main memory at any time.
As noted in Sections 2.A and 3.A.2, if ||B[A_p']|| < 1 then ||B[A_p]|| <= ||B[A_p']||, and pivoting will confine itself to blocks of A_p' resident in main memory. This, of course, is precisely what is desired.

10.B. Storage Requirements

Data layout is an essential part of planning the implementation of an algorithm on any parallel or vector computer, since computational speed can be severely decreased if data cannot be accessed conveniently. On a vector computer this means that conceptual vectors must conform to the machine's definition of a vector operand. We consider some consequences of this requirement for implementation on the CDC STAR-100, assuming that the problem is core-contained.

Storage on the STAR consists of 256 registers and either 512K or 1M words of main memory (64-bit words). A sizable secondary disc storage is also available, but will not be considered here. A vector on STAR consists of s consecutive storage locations, 1 <= s < 64K. Our data arrangement for block tridiagonal linear systems Ax = v with uniform n x n blocks is illustrated in fig. 10.1 for the case n = 2, N = 7. Define m = ceil(log2 N), N* = 2^m, and if N < N* then extend the system with N* - N trivial equations x_i = 0. The matrix A and vector v can then be held in three vectors of length n^2 N* and one of length nN*. Taking the subdiagonal blocks as an example, the n^2 components of a_k are a_{ij,k}, 1 <= i,j <= n, 1 <= k <= N*, and the storage vector a is defined by

    a((i - 1)N*n + (j - 1)N* + k) = a_{ij,k}.

Thus a total of (3n^2 + n)N* locations are needed to store the original system. We do not consider the cost of setting up the storage vectors, since each of our algorithms will
Fig. 10.1. Data layout for n = 2, N = 7, N* = 8. [listing of the storage vectors a, b, c, v omitted]

use the same arrangement. This particular data layout has been chosen because it makes the odd-even reduction algorithm more efficient for the true STAR model, while the other competitive algorithms (LU factorization, odd-even elimination) are mostly unaffected by the choice of data layout, and can do without the extra storage for trivial equations.

Since the STAR forces vector lengths to be under 64K, we must have n^2 N* <= 64K = 65,536. For a given value of n, this restricts the system size as follows:

    n              2      3      4      5      6      7      8
    64K/n^2        16384  7281   4096   2621   1820   1337   1024
    max N*         16384  4096   4096   2048   1024   1024   1024

These restrictions will probably not be serious in most cases. A more important restriction comes from our assumption that the problem remains core-contained. Assuming a 512K main store, and supposing that temporary storage amounts to T times the original (3n^2 + n)N* locations, so that (T + 1)(3n^2 + n)N* words plus the program must fit, we obtain the following restrictions on N and N*:

    n              2      3      4      5      6      7      8
    T = 1          18724  8738   5041   3276   2299   1702   1310
    T = 2          12483  5825   3360   2184   1533   1134   873
    T = 3          9362   4369   2520   1638   1149   851    655

    max N*:
    T = 1          16384  8192   4096   2048   2048   1024   1024
    T = 2          8192   4096   2048   2048   1024   1024   512
    T = 3          8192   4096   2048   1024   1024   512    512
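Both tables above are simple consequences of the two limits just stated, so they can be regenerated mechanically (function names are ours):

```python
# Vector-length limit: n^2 * N* <= 65,536, N* a power of two.
# Memory limit (512K store): (T+1)(3n^2+n) * N <= 524,288.

def max_Nstar_veclen(n):
    limit = 65536 // (n * n)
    p = 1
    while 2 * p <= limit:
        p *= 2
    return p

def max_N_memory(n, T):
    return 524288 // ((T + 1) * (3 * n * n + n))
```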
The amount of temporary storage depends on the algorithm and the extent to which A and v are overwritten. It can range from practically no temporary storage (other than O(n^2) registers) for an inefficient version of the LU factorization up to T ~ m for an inefficient version of odd-even reduction; odd-even reduction with complete overwriting requires T <= 2. We now consider the costs of data manipulation on STAR for the basic algorithms.

The LU factorization is executed mostly in scalar mode. The major considerations are the speed of scalar arithmetic and the requirement that scalars be transferred between main memory and the register file; a load/store requires a memory cycle, though this may be overlapped with scalar arithmetic to a considerable extent. The only data manipulation required for the LU factorization is this load/store of scalars. With this in mind, we suggest four possible implementations. Our estimates of execution time are based on cycle counts for an untested assembly language program, counting arithmetic operations and load/stores; actual run times will probably be greater, since the algorithm is practically I/O bound. (Times given are for n = 2 and represent minimal cycle counts; for details, see Appendices A, B.)

    Version 1, a one-time complete solution, all scalar mode:
        time >= 508N - 362.
    Version 2, factor + solve:
        version a, time >= 324N + 3421;
        version b, time >= 374N + 1769;
        version c, time >= 420N - 280   (use for N <= ~38),
                   time >= 303N + 1039  (use for N >= ~38).

The three versions of factor + solve use increasingly more vector operations and increasingly fewer scalar operations. If more than one system with the same matrix must be solved, it is best to use Version 2; the time comparison in Section 10.C will use Version 1.

Odd-even elimination is executed in vector mode. An important feature of this algorithm is that no data manipulation is needed: with a layout like ours the conceptual vectors all correspond directly to the storage locations needed for machine vector operations (cf. [M1] for the tridiagonal case).

Odd-even reduction is also executed in vector mode. If the vector computer model allows vectors in general arithmetic progression, then no additional data manipulation is needed. In the strictest model, however, with the STAR's requirement of contiguous vectors, data manipulation must be performed. We need two basic operations: (1) separate a vector into its even- and odd-indexed components, and (2) rejoin the even and odd parts of a separated vector. The first is accomplished with two STAR compress instructions, and the second with a STAR merge instruction. Before the first reduction step can be performed, the separation must be applied to a, b, c, and v; we illustrate the process for b in fig. 10.2.
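The two primitives can be modeled as list operations to check the logic; on STAR each is a single hardware instruction, and the sketch below (our own naming) only mirrors their effect.

```python
# Odd-even separation and rejoining, modeled as list operations:
# "compress" under a control vector keeps the flagged components,
# "merge" interleaves the odd- and even-indexed parts again.

def compress(v, ctrl):
    return [x for x, keep in zip(v, ctrl) if keep]

def merge(odds, evens):
    out = []
    for o, e in zip(odds, evens):
        out.extend([o, e])
    return out

v = list(range(1, 9))                    # components v_1 .. v_8
z = [k % 2 == 1 for k in range(1, 9)]    # control vector: odd indices
odds = compress(v, z)                    # v_1, v_3, v_5, v_7
evens = compress(v, [not b for b in z])  # v_2, v_4, v_6, v_8
```

Merging the two halves restores the original vector, which is exactly what the back substitution relies on.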
Having set up a control vector z of alternating bits (a single vector instruction on STAR), we execute, for the vector b of length n^2 N*,

    b -> b_1(1) per z:                  b(2j - 1) -> b_1(j),              1 <= j <= n^2 N*/2,
    b -> b_1(n^2 N*/2 + 1) per z':      b(2j)     -> b_1(n^2 N*/2 + j),   1 <= j <= n^2 N*/2,

where z' is the complement of z. The cost for each compress is n^2 N* + (1/16)(n^2 N*) + 92, and the total cost to odd-even separate a, b, c, and v is (17/8)(3n^2 + n)N* + 736.

Fig. 10.2. Odd-even separation for n = 2, N = 7, N* = 8. [intermediate and final states of b omitted]

Our initial extension of A to N* block equations makes it possible to perform this separation using only 8 vector operations, independent of n. If A had not been extended we would need either 2(3n^2 + n) vector compress operations, compressing the vectors one component subvector at a time, or n^2 vector operations to generate a modified control vector from copies of z(1:N), followed by the original 8 vector operations. The net effect of our extended data layout is to reduce the number of vector startups and hence startup costs. Lambiotte [L1] rejected use of odd-even reduction for n > 1 on this basis, having considered only the natural data layout; despite the increased storage and modified vector lengths, the extended layout keeps the method competitive.

In general the ith reduction manipulates vectors of length n^2 N*_i, N*_i = 2^(m+1-i), 1 <= i <= m - 1, and the total separation cost for m - 1 reductions is (17/4)(3n^2 + n)(N* - 2) + 736(m - 1). For the case n = 2 this yields 59.5N* + 736m - 855. We will see shortly that this is roughly half as much time as is spent in the arithmetic, so the overhead of dealing with a restrictive vector definition can be considerable.

Data manipulation for the back substitution is simplified by saving copies of the odd-indexed equations of each A^(i) in appropriate form (see fig. 10.2 for the final state of b). Having computed x^(i+1) = the even components of x^(i), we compute the odd components and merge the two sets into x^(i).
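The arithmetic behind the n = 2 total can be checked by summing the per-stage cost over the stages, since sum_{i=1}^{m-1} N*_i = 2(N* - 2). A short verification (our own code):

```python
# Total odd-even separation cost over m-1 reductions, n = 2 (3n^2 + n = 14):
# per stage (17/8)*14*N*_i + 736 with N*_i = 2^(m+1-i), summed, equals both
# (17/4)*14*(N* - 2) + 736(m - 1) and 59.5*N* + 736*m - 855.

def total_separation_cost(n, m):
    return sum(17.0 / 8.0 * (3 * n * n + n) * 2 ** (m + 1 - i) + 736
               for i in range(1, m))
```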
of A to N* block step using only 8 vector had not been extended equations compress we would makes it possible operations, need either Our initial to perform this 2 of n . If A independent 2 (3n2 + n) vector operations 2 to compress the vectors one component subvector at a time, or n vector 2 operations to generate of z(l:N), followed Lambiotte ILl] rejected tor lengths needed startup vector 8 vector use of odd-even of our extended operations to reduce vector control by the original The net effect of vector a modified and hence storage compress reduction data layout to manipulate costs. would natural the increased despite operations are needed for the first reduction. startup vec- impor- approach costs. In any part of the that this is roughly vector Data manipulation the odd-indexed time as is spent of dealing with - 2) We will in the a restrictive can be considerable. for the back equations 59.5N* + 736m - 855. half as much so the overhead definition to pre- I For the case n = 2 this yields operations, to prepare 17 (3n2 + n)N* + 736, is -_ 17 " cost for m-I reductions is _-(3n 2 + n)(N* I < i < m- + 736(m-I). see shortly separation the cost of the separation and the total (see fig. and thus increased vector for the arithmetic N*l = 2m+l-i' machine we have control only the odd-even In general, for the ith reduction, arithmetic for the STAR, step. So far we have considered pare the number from N to N*, and this may be a more The modified reduction copies operations. is to reduce On the other hand, in some cases. case, O(n 3) vector of n for n > I on this basis. vectors tant consideration then be most consisting substitution of A (i) in appropriate 10.2 for the final state of b). Having is simplified form by saving for vector computed operations x (i+l) = the 120 even components of x (i) , we compute the two sets into x (i). more expensive the solution The merge than the compress vector the odd components operation operation, and not blocks of A. 
on STAR p-fold reduction: The cost of the merge for p-fold odd-even reduction can readily reduction. into p vectors, data arrangement The odd-even which requires separation p compress the cost per reduction would m = substitution rlOgpN_. using from x The back p-I merge instructions (i+l) (i) to x ° be 12N* + 123m - 147. generalizes instructions requires at cost of 3(_-that these vectors from our discussion merging per storage vector; N i* = pm+l-i ' p vectors - l)nN*+ i P costs of to separation (p + l)(3n 2 + n)N i* + 368p, It is clear is and use of control be generalized with operation for m-I reductions For n = 2 this yields Suitable is comparatively but we are only dealing at stage i is 3nN.* + 123, and the total cost I 6n(N* - 2) + 123(m - i). of x (i) and merge into 123(p-I) are minimized l, to go by p=2. Jacobi iterations: Data manipulation is unnecessary when A and v or B[A] -i and DIAl 10.C. v are stored according Comparative Timing to our component tridiagonal systems the CDC STAR-100. and odd-even earlier statements siderations include for operations with This will ization concerning time comparison 2×2 blocks, expand reduction vector scheme. Analysis. we close with an execution block subvector based the crude given of algorithms on timing comparison in Section 4.D, the cost of various startup on short vectors, costs, creating for general information of the LU factor- and illustrates methods. Essential a substantial and data manipulation for many con- penalty to deal with 12 1 restrictive cussion definitions I. which simplified vector b. we operands pivoting 10.B began our dis- on the following only arithmetic conven- costs: may be any sequence is never necessary; b. address c. load/store d. storage management costs, and more realistic analysis of consecutive overlap costs in Section Algorithms here we present the main loop control for scalar storage also be considered, Execution capabilities, calculations, of iterations. 
as follows locations progression, instruction store will of storage ignore ment as explained 10.B. (m = as a consequence will demand Overlap, consider loop control are given l.a. operands storage and scalar calculation and timing derivations and make of assumption that vector so we must but not address results tests, manageload/ or termination in Appendix direct methods LU factorization, are initially estimated using scalar operations only 784 elimination, 94.5Nm + 12999m using - 93.5N vector - 1901 B; comparisons. Flog2N]): the block odd-even and termination operands, locations, times for the basic 1012N2. is based a. Our second, I. Section we assume in arithmetic consist analysis allow us to consider a. 2. operands. of the latter problem. Our initial tions, of vector operations only (OEE): (LU): 122 3. odd-even block 4. reduction, system using (OER): p-fold reduction, 145N + 19206m eral problems ratios since either - (19351preduction vector of using combining wise .23 .13 OER: OEE .24 .II of vector startups is faster when the two methods This reduces solve a savings that vector uses O(N log N) operations although method it is always inefficient. reducing A polyalgorithm if m _ 6 then use LU, othersystem by LU, and back substi- are the dominant and te_n for OER when for the other methods it is faster slower than LU for 455 Of course, OEE and soon becomes _ N < 1023, than OER. is that the OER- of those considered Now consider by successively term for OEE when N _ 137. vs. O(N) Nevertheless, Our conclusion OER and LU, of 13_. startups N _ 1532, and are the dominant inefficient. 0. the time for OER to 106.5 N + 14839m - 46908.5, for N = 1023 it represents We also note N _ .II this becomes the remaining time to LU and OEE. l_it_ OER works is straightforward: for gen- Execution is seen by comparing N < I00. and at some point do m-5 reductions, tute. N...... = 8191 method cheaper. 
Now consider the cost of individual eliminations or reductions with respect to the total cost of the algorithm. For odd-even elimination, each elimination step has roughly equal cost, about 1/m of the total, the last being slightly cheaper. For odd-even reduction, the cost per reduction decreases by a factor of about 1/2 as the algorithm proceeds, so the kth reduction costs about 1/2^k of the whole:

    (cost of kth reduction = 106.5(2^(m-k)) + 14839, when N = 2^m - 1)

    fraction of total cost          k = 1    2     3     4     5
    N = 1023 (total 242,622)         28%    17%   12%    9%    8%
    N = 8191 (total 1,050,531)       43%    22%   12%    7%    4%

For small N, when startups are dominant, the successive reductions cost approximately equal amounts, while for large N, when true arithmetic costs are dominant, the later reductions are a small part of the total. The consequences for the effectiveness of semidirect methods are immediate, since the omitted reductions must represent a reasonable fraction of the total cost.

For our comparison of semidirect and iterative methods, the goal will be to approximate the solution x with a relative error of 1/N, that is, to reduce the initial error by a factor of 1/N. Thus if we perform either the initial k odd-even eliminations or k reductions, we want ||B[H_{k+1}]|| <= 1/N or ||B[A^(k+1)]|| <= 1/N, assuming ||B[A]|| < 1; Theorem 7.1 is then applied to produce an approximate solution by incomplete elimination or reduction. Taking beta = ||B[A]||, we have ||B[H_{k+1}]||, ||B[A^(k+1)]|| <= beta^(2^k), and demanding beta^(2^k) <= 1/N yields

    k >= log2((log2 N) / (-log2 beta)).

The execution time for k eliminations is 94.5Nk + 12999k + 10.5N - 105(2^k) + 1205, and the time for k reductions is

    106.5N + 14839k - 96(2^(m-k)) + 1287.
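The cutoff for k can be computed directly, and the chosen k does satisfy the accuracy demand beta^(2^k) <= 1/N (our own function name; k values match the examples quoted below for beta = .9 and beta = 1/2):

```python
# Smallest k with beta^(2^k) <= 1/N, i.e. k >= log2(log2(N) / (-log2(beta))).

import math

def choose_k(N, beta):
    return math.ceil(math.log2(math.log2(N) / (-math.log2(beta))))

k_half = choose_k(1023, 0.5)     # the beta = 1/2 example
k_nine = choose_k(1023, 0.9)     # the beta = .9 example
```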
Typical savings for incomplete reduction over complete reduction would be:

    beta = .9:   pick k >= log2 log2 N + 2.718:  N = 1023, k = 7, savings = 12%;  N = 8191, k = 7, savings = 7%
    beta = 2/3:  pick k >= log2 log2 N + .774:   N = 1023, k = 5, savings = 25%;  N = 8191, k = 5, savings = 12%
    beta = 1/2:  pick k >= log2 log2 N:          N = 1023, k = 4, savings = 33%;  N = 8191, k = 4, savings = 16%

We next consider iterative methods and their combination with incomplete reduction. The number of iterations needed to reduce the initial error by a factor of 1/N is estimated using parameters such as rho = rho(B[A]). One step of the block Jacobi iteration costs 16N + 2217; this decreases to 12N + 1840 by directly computing B[A] and D[A]^{-1}v (a one-time cost of 38.5N + 4367, worthwhile when 3 or more iterations are performed). One step of Jacobi semi-iteration costs 22.5N + 3031, or 26.5N + 3408 including the parameter overhead. J-SI would be used in its simplest form when many steps are demanded, the Jacobi iteration when only a few are needed.

Under the second model, with data manipulation costs included, the LU factorization is always faster than OEE, and is faster than OER for N < 312. Vector startup costs are dominant in OEE for N <= 152, and in OER for N <= 798. The OER-LU polyalgorithm should perform m - 6 reductions, solve the remaining system by LU, and back substitute; the time is 183N + 16233m - 70677, yielding a 16% saving over OER for N = 1023 but only a 3% saving for N = 8191.

The execution time for odd-even elimination is allocated as follows:

                       N = 1023    N = 8191
    arithmetic            96%         96%
    overhead               4%          4%
    results               87%         98%
    startups              13%          2%

Analysis of the total time for odd-even reduction shows the greater importance of overhead operations:

    N = 1023               results   startups   total
      arithmetic             33%       40%       73%
      overhead               23%        4%       27%
      total                  56%       44%

    N = 8191               results   startups   total
      arithmetic             52%       10%       62%
      overhead               37%        1%       38%
      total                  88%       12%
    limit, N -> oo         results   startups   total
      arithmetic             58%        --        58%
      overhead               42%        --        42%
      total                 100%        --

The increase in time for OER due to overhead is 37% for N = 1023, 61% for N = 8191, and 72% in the limit as N -> oo. As a consequence of the data layout of Section 10.B, overhead costs cannot be ignored: the strong requirement that machine vectors consist of contiguous storage locations leads to much "wasted" effort. Most of the overhead consists of compress and merge costs; the individual vector startup costs are quite small, and probably cannot be decreased. The fraction of the execution time of OER actually spent in each vector operation is:

    operation, as fraction of total cost
                     N = 1023    N = 8191    limit, N -> oo
    divide              12%         12%          12%
    multiply            43%         35%          32%
    add/subtract        18%         15%          14%
    compress            20%         29%          33%
    merge                4%          6%           7%
    transmit             3%          3%           3%

The compress operation assumes increasing importance as N increases, reflecting its low startup time. As noted earlier, most of OER's time is spent in the first few reductions and the subsequent back substitutions.
any benefits out the importance estimate. + 3881, the rows of An estimate rows at a cost of 7.5N . 308. reduces than a computational _. from A at a cost of 46.5N at a cost of 15N + 308, we recognize B[A] are available of a savings IIB[A][I can be computed or from B[A] of _ = _, N = 8191, gained For from using of an analytic estimate 128 This ends our initial assumption overlap scalar mate that vector comparison operands of scalar operations, load/store operations. Odd-even than simply recompute a few cases, we are consecutive and consider and raise elimination try to indicate the locations, is to lower substantially presented where storage the time estimate remains the figures We now make allow the cost of loop control The net effect for the LU factorization reduction. of algorithms. earlier, the algorithms and the time estifor odd-even unchanged. which Rather will be done in spend most of their time. Revised indication we note i. time estimates of actual changes from runtime LU factorization, Odd-even (a 4> 3. Odd-even Time ratios These set of timings, 600N- (OEE) m = increase give a better all underestimates; rlog2N]. 150 overlap) 98.5Nm + 13403m - 101.5N - 2261 due to overhead (OER) should but are probably due to use of instruction reduction. (a I0-70_ larger (LU) elimination, increase below. on STAR, the first (a 40_ decrease 2. are given costs) 183N + 16233m - 16108 due to overhead costs; larger N incurs increases). show the effect of decreasing the LU time and increasing OER time N = 1023 N = 8191 limit, N -_ oo .31 OER :LU .54 .34 OER :OEE .32 .17 0. the 129 cost of ith reduction i = N = 1023 33_ ........ | as fraction 2 3 19_ of total cost 4 5 6 124 8_ 7_ 6_ 6_ 4_ 2_ 6_ 3_ J N = 8191 45_ 23_ 12_ limit, N _ _ 5"0_ 25_ 13_ , Of the total time, 86_ is spent I--2_ - ., in the reduction phase and 144 in the back substitution. One of the effects tance of vector Of course, be made. 
display startup when costs, polyalgorithm N = 1023 N = 8191 time 14_ 2_ vector time 86_ 98_ results 634 91_ startups 23#0 startup costs reduction results. are already small not much and incomplete reduction reduction, savings N = 1023 2 the impor- 7_0 (8 = IIB[A]II )" Incomplete 8 = .9 is to reduce as seen below. scalar Incomplete similar of the OER-LU over OER N = 8191 k savings k 7 I0_ 7 5 21_ 5 4 27_ 4 savinss 54 94o I 13_ improvement + Jacobi can iterations 130 Incomplete reduction + Jacobi iteration, N- 1023 = .9 k 7 3 9 2 I saving over OER N = 8191 savings k __ sav_ 22_ 4 6 5 364 2 6 13< 5 47_ 2 4 21_ 8_o 2 i 2 As seen earlier, odd-even reductions considerable roughly the semidirect make advantage doubling methods are best applied up a significant to using the savings incomplete obtained when the omitted part of the OER time. reduction by simple plus incomplete Jacobi There is iterations, reduction. 131 APPENDIX VECTOR COMPUTER This section our comparison derived Laboratory of the STAR's All times should manuals I0. Information given and timing measurements of 40 nanosecond be accurate represents somewhat needed here made for is at Lawrence a complete computers the usual discussion instructions, for each instruction. though actual conflicts. storage locations, is usually Other this restriction. scalar point as noted, due to memory is crucial. do not have with 5_ except restriction storage floating needed of N consecutive The vect_length TI ASC in particular, (64 bit) cycles to within on STAR consists for consecutive pipeline in Section of the CDC STAR-100 [C2], and by no means may be degraded i < N < 65,536. the need features times are for full-word as the number A vector INSTRUCTIONS facilities. Execution performance those of algorithms from CDC reference Livermore given describes A not serious, vector computers, Both machines instructions but the are as well as vector ins truc t ions. 
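As a small illustration of this cycle accounting, the following sketch evaluates the cost of one vector add. It assumes the STAR vector-add estimate of (1/2)N + 71 cycles given later in this appendix; the function name is ours.

```python
# Cycle accounting sketch: a STAR-100 vector add of length N is estimated
# at N/2 + 71 cycles (71-cycle startup plus one result per two cycles),
# with each cycle taking 40 nanoseconds.

def vector_add_cycles(n):
    """Estimated cycles for instruction 82 (add, normalized) on N operands."""
    return 0.5 * n + 71

cycles = vector_add_cycles(1000)
microseconds = cycles * 40 / 1000   # 40 ns per cycle

print(cycles, microseconds)         # startup dominates only for short vectors
```

For N = 1000 this gives 571 cycles, about 22.84 microseconds; the fixed startup is why the O(m) terms in the timing formulae below are significant for small systems.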
Useful features of vector instructions on STAR include:

broadcast constants - Most vector instructions allow a scalar (the broadcast constant, held in a register) to be transmitted repeatedly to form one or both of the vector operands. Use of this facility decreases execution time in a few cases.

control vectors - Bit vectors can modify the effect of vector instructions, such as suppressing storage to selected components of the result vector or determining which components of an operand are to be used in an operation. Their use in vector data manipulation (compress and merge) is explained later.

offset - Modification to the base address of the operands.

sign control - Certain instructions allow the signs of operands to be modified before the operation.

Individual classes of instructions follow.

STAR scalar instructions. Issue time is the number of cycles required to initiate execution. Result-to-register time is the number of cycles after issue until the result is placed in a register. Shortstop is the time after issue when the result is first available; a shortstopped result can be used at the precise time given, but if not, it cannot be used until it is placed in a register. Overlapped execution of scalar instructions is possible; the exact effect on execution time is best determined by actual testing, which we have not done.

    instruction                              issue    shortstop    result to register
    62   add, normalized                       2          6              11
    66   subtract, normalized                  2          6              11
    6B   multiply, significant                 2         10              15
    6F   divide, significant                   3          -              43
    73   square root                           3          -              71
    78   transmit (register to register)       2          4               9

STAR scalar load/store instructions. Up to three instructions may be stacked in the load/store unit at any given time; the stack is processed sequentially at a rate of 19 cycles per load and 16 cycles per store minimum, assuming no memory bank conflicts. Once a memory bank has been referenced it remains busy for 31 cycles; references to other banks may proceed, but further references to the same bank are delayed. Thus a load requires at least 31 cycles to finish, and a minimum of 66 cycles is required to store an item and reload it. Memory conflicts will affect load/store times accordingly.

    instruction         issue    result to register
    7E   load             3            19
    7F   store            3            16

Execution time for a branch instruction can range from 8 to 84 cycles, depending on the particular instruction, whether a branch is actually made, whether the target instruction is in the instruction stack, etc. We will be only slightly pessimistic if we take 40 cycles to be the time for a typical branch.

STAR vector instructions. Execution time consists of an initial startup time followed by processing of the vector(s) at a fixed rate per result component; part of the startup is spent loading the first operands. Execution of vector instructions may not be overlapped, except that scalar instructions may be issued during a vector instruction for a negligible cost. In the table below, N is the length of the input vector(s) and V is the number of components of the result vector; times for items D8 - DC may vary by as much as ±20% due to use of a control vector.

    instruction                     time                 effect
    82   add, normalized            (1/2)N + 71          c_i ← a_i + b_i
    86   subtract, normalized       (1/2)N + 71          c_i ← a_i - b_i
    8B   multiply, significant      N + 159              c_i ← a_i × b_i
    8F   divide, significant        2N + 167             c_i ← a_i ÷ b_i
    93   square root                2N + 155             c_i ← √a_i
    98   transmit                   (1/2)N + 91          c_i ← a_i
    99   absolute value             (1/2)N + 91          c_i ← |a_i|
    D8   maximum                    6N - 2V + 95         c ← max_i a_i
    DA   summation                  6.5N - 2V + 122      c ← Σ_{i=1}^{N} a_i
    DB   product                    6N - V + 118         c ← Π_{i=1}^{N} a_i
    DC   inner product              6N - V + 130         c ← Σ_{i=1}^{N} a_i b_i

Vector data manipulation instructions. The compress instruction transmits selected components of one vector into a shorter result vector. Given a vector a and a control vector z, the instruction essentially executes the code

    j ← 1
    for i = 1, 2, ..., N
        if z_i = 1 then { c_j ← a_i ; j ← j+1 }

The result vector c has length bitsum(z). The merge instruction "shuffles" two vectors into a third, according to the control vector:

    length(c) = length(z) = length(a) + length(b)
    j ← 1 ; k ← 1
    for i = 1, 2, ..., length(z)
        if z_i = 1 then { c_i ← a_j ; j ← j+1 }
                   else { c_i ← b_k ; k ← k+1 }

In the table below, M1 = length(a), M2 = length(c), R = number of input items broadcast from register.

    instruction                       time
    BC   compress  a → c per z        (1/2)M1 + (1/2)M2 + 92
    BD   merge     a, b → c per z     3M2 - 2R + 123

Execution times for these instructions may vary by as much as ±20%.

APPENDIX: SUMMARY OF ALGORITHMS AND TIMINGS

We collect the algorithms discussed in the main text, and present timing formulae in general terms for a vector computer. This is then specialized to block tridiagonal systems with 2 × 2 blocks, using timing information for the CDC STAR-100. A comparison of methods based on the latter set of timings is given in Section 10.C. It must be emphasized that our time estimates have not been verified by direct testing on STAR.

The algorithms solve Ax = v, A = (a_j, b_j, c_j)_N, with blocks of uniform size n × n. The basic block operations and their arithmetic costs are (S = add or subtract, M = multiply, D = divide):

    1. factor an n × n block b
           cost = (1/2)(n^2 - n)D + (1/6)(2n^3 - 3n^2 + n)(M + S)
    2. solve bg = f, where f is n × n', having b factored
           cost = n'(nD + (n^2 - n)(M + S))
    3. product of n × n and n × n' blocks
           cost = n'(n^2 M + (n^2 - n)S)
    4. difference of two n × n' blocks
           cost = n'(nS)

Depending on context, the symbols S, M, D can mean either a single scalar operation or a vector operation with vector length specified by the algorithm.
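The compress and merge semantics described above can be sketched directly from their loop definitions; a minimal sketch, with the control vector z represented as a list of 0/1 bits (the function names are ours):

```python
# Sketch of the STAR compress (BC) and merge (BD) instruction semantics.

def compress(a, z):
    """Keep a[i] wherever z[i] = 1; the result length is the bit sum of z."""
    return [ai for ai, zi in zip(a, z) if zi == 1]

def merge(a, b, z):
    """Shuffle a and b into one vector: take from a where z[i] = 1, else from b."""
    result, j, k = [], 0, 0
    for zi in z:
        if zi == 1:
            result.append(a[j]); j += 1
        else:
            result.append(b[k]); k += 1
    return result

print(compress([10, 20, 30, 40], [1, 0, 1, 0]))   # [10, 30]
print(merge([10, 30], [20, 40], [1, 0, 1, 0]))    # [10, 20, 30, 40]
```

The odd-even algorithms below use exactly this pair: compress with z = (1 0 1 0 ...) to extract the odd- or even-indexed blocks, and merge to interleave the solution vectors during back substitution.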
For each algorithm we give a code in terms of blocks of A, since the translation to individual scalar or vector operations is generally obvious; in those cases where additional detail seems to be necessary, it is provided. Two basic sets of timings are given, five formulae in all for each algorithm:

1. counting arithmetic operations only, assuming vectors can be any sequence of storage locations, and ignoring loop control, instruction overlap, address calculation, pivoting, termination tests for iterations, load/store costs for scalar operands, and any other forms of storage management
   (a) for systems with n × n blocks, in general terms for a vector computer
   (b) for systems with n × n blocks, with timing data for the CDC STAR-100
   (c) for systems with 2 × 2 blocks, with STAR data

2. a more realistic model requiring vectors to be contiguous storage locations, counting loop control and load/store costs to some extent, allowing overlap of scalar operations, but still ignoring pivoting, termination tests for iterations, address calculations and storage management
   (a) for systems with n × n blocks, with STAR data
   (b) for systems with 2 × 2 blocks, with STAR data.

Timings 1(c) and 2(b) are used in Section 10.C. Additional notes are provided in many cases.

A few remarks on the timings for 2 × 2 blocks are necessary. Since we eventually want estimates of minimum runtime, it is assumed that the main loops are expanded into straightline code, so that the only loop control counted is that on m = ⌈log2 N⌉. The address calculations that have been ignored would be executed in scalar mode, and can sometimes be completely overlapped with the vector arithmetic operations; their effect here would be to increase the O(m) terms, but those terms consist of vector startup costs and are already large. While address calculations would probably destroy the efficiency of the odd-even methods in scalar code, we believe that our basic conclusions will not be altered by including them. Operations on n × n blocks or n-vectors in the LU factorization represent only a small relative increase in execution time.

1. Block LU factorization, using mostly scalar operations.

Algorithm:

    d_1 = b_1 ; f_1 = v_1
    for i = 2, ..., N
        solve d_{i-1} (u_{i-1} g_{i-1}) = (c_{i-1} f_{i-1})
        d_i = b_i - a_i u_{i-1}
        f_i = v_i - a_i g_{i-1}
    solve d_N x_N = f_N
    for i = N-1, ..., 1
        x_i = g_i - u_i x_{i+1}

Storage handling: if desired we can overwrite d_i → b_i, u_i → c_i, g_i → f_i → v_i, x_i → g_i. Variations on the basic algorithm would save either the factorization of d_i or the multipliers ℓ_i = a_i d_{i-1}^(-1) (overwriting ℓ_i → a_i, u_i → c_i), with a corresponding increase in execution time due to the subsequent loads and stores.

Version 1, a one-time solution. Counting both loops, each step makes 4n^2 + 2n loads and n^2 + 2n stores, with an initialization that loads b_1, v_1 and stores x_N. At 19 cycles per load and 16 per store,

    total time ≥ {19(4n^2 + 2n) + 16(n^2 + 2n)}(N - 1) + 19(n^2 + n) + 16n
               = (92n^2 + 70n)(N - 1) + 19(n^2 + n) + 16n ;

for n = 2, total time ≥ 508N - 362.

Version 2, factor + solve. The factorization A = (ℓ_j, I, 0)(0, d_j, 0)(0, I, u_j) requires n^2(3N - 2) loads.

Version a. The factorization loop stores d_j only, and computes ℓ_j = a_j d_{j-1}^(-1), u_j = d_j^(-1) c_j by vector operations, vector length = N:

    total time ≥ 19n^2(3N - 2) + 16n^2 N + {(1/2)(5n^2 - n)D + (1/6)(14n^3 - 15n^2 + n)(M + S)},

where D = τ_D N + σ_D, etc. For n = 2, total time ≥ 323.5N + 3421.

Version b. The factorization loop stores d_i and u_i, computing u_i by vector operations, vector length = N:

    total time ≥ 19n^2(3N - 2) + 16n^2(2N - 1) + {(1/2)(3n^2 - n)D + (1/6)(8n^3 - 9n^2 + n)(M + S)} ;

for n = 2, total time ≥ 373.5N + 1769.

Version c. The factorization loop stores ℓ_i, d_i and u_i:

    total time ≥ 19n^2(3N - 2) + 16n^2(3N - 2) ;

for n = 2, total time ≥ 420N - 280.

Solution of Ax = v, given the factorization of A, by vector operations: the forward elimination loop loads ℓ_i, v_i and loads v_1 initially; the back substitution loads x_N initially, and its loop loads u_i, g_i and stores x_i.

    total time ≥ 19(2n^2(N - 1) + 2nN) + 16(2nN) + (1/2)(n^2 + n)D + (1/6)(2n^3 + 3n^2 - 5n)(M + S) ;

for n = 2, total time ≥ 302.5N + 1039.

Timings:

1. Version 1, for the initial comparison, without overlap and with load/stores ignored
   (a) {(1/2)(3n^2 + n)t_D + (1/6)(14n^3 + 9n^2 - 5n)(t_M + t_S)}N - {n^2 t_D + (2n^3 + n^2)(t_M + t_S)}
   (b) with STAR data: (70n^3 + 114n^2 - 2n)N - (60n^3 + 76n^2)
   (c) for n = 2, with STAR data: 1012N - 784
2. Version 1, for the second comparison, allowing overlap and including load/stores. In the absence of facilities for direct testing on STAR, a scalar code was written for the special case n = 2 and time estimates were computed from that code. The details are tedious, and are well illustrated by the simpler timing of tridiagonal code in [L2], so they will not be given here. The main estimates are:

    initialize                       200 cycles
    factorization loop               400(N - 1)
    initialize back substitution     250
    back substitution loop           200(N - 1)
    total time estimated             600N - 150 cycles

This is probably optimistic due to unforeseen memory conflicts.

Notes:
1. Gaussian elimination by scalar operations, taking advantage of the block structure, requires the same execution time as the block LU factorization.
2. If u_j, 1 ≤ j ≤ N-1, is saved in the factorization, then ||B[A_N]|| may be computed following the first loop at a cost of ((n-1)τ_+ + τ_max)(nN) + (n-1)σ_+ + σ_max, using the vector maximum instruction. This yields 13N + 166 for n = 2 on STAR.
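The block LU recurrence above can be sketched for the scalar case n = 1, where each block solve collapses to a division; a minimal sketch (the function name is ours):

```python
# Sketch of the block LU factorization/back substitution for n = 1:
# a, b, c are the sub-, main- and super-diagonals of A (a[0], c[-1] unused),
# v is the right-hand side.

def block_lu_solve(a, b, c, v):
    n = len(b)
    d, f, u, g = [b[0]], [v[0]], [], []
    for i in range(1, n):
        u.append(c[i - 1] / d[i - 1])        # solve d_{i-1} u_{i-1} = c_{i-1}
        g.append(f[i - 1] / d[i - 1])        # solve d_{i-1} g_{i-1} = f_{i-1}
        d.append(b[i] - a[i] * u[i - 1])     # d_i = b_i - a_i u_{i-1}
        f.append(v[i] - a[i] * g[i - 1])     # f_i = v_i - a_i g_{i-1}
    x = [0.0] * n
    x[n - 1] = f[n - 1] / d[n - 1]           # solve d_N x_N = f_N
    for i in range(n - 2, -1, -1):
        x[i] = g[i] - u[i] * x[i + 1]        # x_i = g_i - u_i x_{i+1}
    return x

# Small check: the tridiagonal system (1, 2, 1) with solution x = (1, 1, 1).
xs = block_lu_solve([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [3.0, 4.0, 3.0])
print(xs)
```

For n × n blocks the divisions become block solves against the factored d_{i-1}, exactly as in the cost formulae above; no pivoting is performed, matching the assumptions of the timing model.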
2. Odd-even elimination, using vector operations solely.

We give the algorithm and timing for N = 2^m - 1, deriving formulae in terms of N and m; timing for other values of N is approximated by setting m = ⌈log2 N⌉.

Algorithm: a_j^(1) = a_j, etc.

    for i = 1, 2, ..., m     (r = 2^(i-1))                                vector lengths
        for each j = 1, 2, ..., N
            solve b_j^(i) (α_j γ_j φ_j) = (a_j^(i) c_j^(i) v_j^(i))            N
                (take α_j = γ_j = φ_j = 0 if j < 1 or j > N)
            b_j^(i+1) = b_j^(i) - a_j^(i) γ_{j-r} - c_j^(i) α_{j+r}            N - r
            v_j^(i+1) = v_j^(i) - a_j^(i) φ_{j-r} - c_j^(i) φ_{j+r}            N - r
            a_j^(i+1) = -a_j^(i) α_{j-r}                                       N - 2r
            c_j^(i+1) = -c_j^(i) γ_{j+r}                                       N - 2r
    for each j = 1, 2, ..., N
        solve b_j^(m+1) x_j = v_j^(m+1)

Storage handling: temporary vectors are needed for the factorization of b_j^(i) and for the α_j, γ_j, φ_j computed at each stage, and we can overwrite b_j^(i+1) → b_j^(i), a_j^(i+1) → a_j^(i), etc.

Timing:
1. arithmetic only
   (a)  {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 3n^2 - 5n)τ_M + (1/6)(38n^3 - 9n^2 - 5n)τ_S}Nm
      + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 3n^2 - 5n)σ_M + (1/6)(38n^3 - 9n^2 - 5n)σ_S}m
      - {(1/2)(3n^2 - n)τ_D + (1/6)(46n^3 + 3n^2 + 5n)τ_M + (1/6)(46n^3 - 27n^2 + 5n)τ_S}N
      - {(1/2)(n^2 + n)σ_D + (1/3)(2n^3)τ_M + (2n^3 - 2n^2)τ_S + (1/6)(-10n^3 + 3n^2 - 5n)σ_M + (1/6)(-10n^3 + 15n^2 - 5n)σ_S}
   (b) with STAR data,
        (1/4)(38n^3 + 19n^2 - n)Nm + (1/6)(8740n^3 + 2343n^2 - 649n)m - (1/4)(46n^3 + n^2 + n)N - (1/6)(2282n^3 - 2037n^2 + 649n)
   (c) for n = 2, with STAR data:  94.5Nm + 12999m - 93.5N - 1901.
2. With the overwriting a_j^(i+1) → a_j^(i), c_j^(i+1) → c_j^(i) it is necessary to develop the products one row at a time in a temporary vector and then use a transmit operation on the whole row; for specifics see the discussion of odd-even reduction to follow. The added cost for loop control and transmit instructions is

        40m + Σ_{i=1}^{m-1} 2n((1/2)n(N - 2^i) + 91) = 40m + n^2(N(m - 2) + 1) + 182n(m - 1),

   or 4Nm + 404m - 8N - 360 for n = 2.
   (a) with STAR data, the total time is
        (1/4)(38n^3 + 23n^2 - n)Nm + (1/6)(8740n^3 + 2343n^2 + 443n + 240)m - (1/4)(46n^3 + 9n^2 + n)N - (1/6)(2282n^3 - 2043n^2 + 1741n)
   (b) for n = 2, with STAR data, the total time is  98.5Nm + 13403m - 101.5N - 2261.

Notes:
1. ||B[A]|| may be computed in the first time through the loop at a cost of ((2n-1)τ_+ + τ_max)(nN) + ((2n-1)σ_+ + σ_max); this yields 15N + 308 for n = 2 on STAR. ||B[H_i]|| may be computed at the appropriate time through the loop at a cost of 15(N - 2^(m-i)) + 308.

3. Odd-even reduction.

As for odd-even elimination, the algorithm and timing are given for N = 2^m - 1, m = ⌈log2 N⌉; for general N, even though N is not in proper form, m = ⌈log2(N+1)⌉ is the proper choice, and the timing formulae are derived using the storage layout described in Section 10.B. The basic algorithm is given in Section 4.D. Set N_i = 2^(m+1-i) - 1, N_i* = 2^(m+1-i). Each stage uses compress operations under the control vector z = (1 0 1 0 ...); on STAR the eight compresses on a^(i), b^(i), c^(i), w^(i) cost (1/2)(3n^2 + n)N_i* + 736.

Algorithm: A^(1) = A, w^(1) = v.

    for i = 1, ..., m-1
        factor b_{2j+1}^(i), (0 ≤ j ≤ N_{i+1})
            cost = (1/2)(n^2 - n)D + (1/6)(2n^3 - 3n^2 + n)(M + S),  vector lengths = N_{i+1}*
        solve b_{2j+1}^(i) (α_{2j+1} γ_{2j+1} φ_{2j+1}) = (a_{2j+1}^(i) c_{2j+1}^(i) w_{2j+1}^(i)), (0 ≤ j ≤ N_{i+1})
            overwrites a_{2j+1}^(i), c_{2j+1}^(i), w_{2j+1}^(i); need a temporary vector
            cost = (2n + 1)(nD + (n^2 - n)(M + S)),  vector lengths = N_{i+1}*
        b_j^(i+1) = b_{2j}^(i) - a_{2j}^(i) γ_{2j-1} - c_{2j}^(i) α_{2j+1}, (1 ≤ j ≤ N_{i+1})
            overwrites b_{2j}^(i); need a temporary vector, length N_{i+1}
            cost = 2n^3(M + S); for simplicity, take vector lengths = N_{i+1}*
        w_j^(i+1) = w_{2j}^(i) - a_{2j}^(i) φ_{2j-1} - c_{2j}^(i) φ_{2j+1}, (1 ≤ j ≤ N_{i+1})
            overwrites w_{2j}^(i); need a temporary vector
            cost = 2n^2(M + S); for simplicity, take vector lengths = N_{i+1}*
        a_j^(i+1) = -a_{2j}^(i) α_{2j-1}, (2 ≤ j ≤ N_{i+1}),  and  c_j^(i+1) = -c_{2j}^(i) γ_{2j+1}, (1 ≤ j < N_{i+1})
            the products are developed one row at a time,
                t_{kℓ,j} = -Σ_{r=1}^{n} a_{kr,2j}^(i) α_{rℓ,2j-1},   k = 1, ..., n,   ℓ = 1, ..., n,
            and the rows t_{kℓ,j} are placed by n transmits; need two temporary vectors, length nN_{i+1}
            cost = n^3 M + (n^3 - n^2)S plus n transmits,  vector lengths = N_{i+1} and nN_{i+1}
    solve b_1^(m) x_1^(m) = w_1^(m) in scalar mode
        cost = (1/2)(n^2 + n)D + (1/6)(2n^3 + 3n^2 - 5n)(M + S)
    for i = m-1, ..., 1
        x_{2j+1}^(i) = φ_{2j+1} - α_{2j+1} x_j^(i+1) - γ_{2j+1} x_{j+1}^(i+1), (0 ≤ j ≤ N_{i+1})
            cost = 2n^2(M + S),  vector length = N_{i+1}*
        merge x_{2j+1}^(i) with x_j^(i+1) to form x^(i); need a temporary vector
            cost = transmit + merge = 3nN_i* + 123

Timing:
1. arithmetic only
   (a)  {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 15n^2 - 5n)τ_M + (1/6)(38n^3 + 3n^2 - 5n)τ_S}(N - 1)
      + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 15n^2 - 5n)σ_M + (1/6)(38n^3 + 3n^2 - 5n)σ_S}(m - 1)
      + {(1/2)(n^2 + n)t_D + (1/6)(2n^3 + 3n^2 - 5n)(t_M + t_S)}
   (b) with STAR data,
        (1/4)(38n^3 + 31n^2 - n)(N - 1) + (1/6)(8740n^3 + 5103n^2 - 649n)(m - 1) + (10n^3 + 38n^2 - 2n)
   (c) for n = 2, with STAR data:  106.5N + 14839m - 14717.5.
2. with loop control and storage manipulation (compress, merge, transmit)
   (a) for general n, add  80m + (1/4)(55n^2 + 43n)(N - 1) + (182n + 950)(m - 1)
   (b) for n = 2, add  76.5N + 1394m - 1390.5,  so the total time is  183N + 16233m - 16108.

Notes:
1. An estimate of ||B[A]|| may be computed in the first time through the loop at a cost of ((2n-1)τ_+ + τ_max)(Nn/2) + ((2n-1)σ_+ + σ_max); this yields 7.5N + 308 for n = 2 on STAR. The estimate is simply the maximum row sum taken over every other row of B[A]. The true value of ||B[A]|| could be found by computing the remaining rows, but this is more expensive and defeats the purpose of odd-even reduction.

4. p-fold reduction, p ≥ 3.

The algorithm and timings are given for N = p^m - 1, taking m = ⌈log_p N⌉ = ⌈log2 N⌉ / log2 p; for general N, m = ⌈log_p(N+1)⌉ is the proper choice. Set N_i = p^(m+1-i) - 1.

Algorithm (see Section 6.A for details):

    for i = 1, ..., m-1
        for each k = 1, ..., N_{i+1} + 1, LU process
        for each k = 1, ..., N_{i+1}, UL process
        generate A^(i+1), w^(i+1)
    solve b_1^(m) x_1^(m) = w_1^(m)
    for i = m-1, ..., 1
        for each k = 1, ..., N_{i+1} + 1, back substitute for x_{kp-j}, 1 ≤ j ≤ p-1

Timing:
1. (a) for general n,
        {(1/2)(5n^2 + n)τ_D + (1/6)(26n^3 + 3n^2 - 5n)τ_M + (1/6)(26n^3 - 3n^2 - 5n)τ_S}(N - p + 1)
      + {(1/2)(5n^2 + n)σ_D + (1/6)(26n^3 + 3n^2 - 5n)σ_M + (1/6)(26n^3 - 3n^2 - 5n)σ_S}(m - 1)(p - 1)
      + {(1/2)(n^2 + n)t_D + (1/6)(2n^3 + 3n^2 - 5n)(t_M + t_S)}
   (b) with STAR data,
        (1/4)(26n^3 + 21n^2 - n)(N - p + 1) + (1/6)(5980n^3 + 2769n^2 - 649n)(m - 1)(p - 1) + (10n^3 + 38n^2 - 2n)
   (c) using STAR data, n = 2:
        145N + (19206m - 19351)(p - 1) + 228 = 145N + 19206m(p - 1) - (19351p - 19579),   m = ⌈log2 N⌉ / log2 p.
2. Data manipulation costs are outlined in Section 10.B. We have not derived the complete timing formulae, as the method is expected to be too slow.

Notes:
1. In the arithmetic timing, p-fold reduction is always cheaper than (p+1)-fold reduction for p ≥ 3. Comparison of 2-fold (odd-even) and 3-fold reduction yields an N(n) such that if N ≤ N(n) then 3-fold reduction is faster, although we have not determined N(n) for any n. Similar results should hold when overhead costs are included in the time estimates, and the block LU factorization will probably be faster than either reduction for N ≤ N(n).

5. Mixed odd-even reduction and LU factorization.

Algorithm: do k odd-even reductions, solve A^(k+1) x^(k+1) = w^(k+1) by the LU factorization, and back substitute.
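The choice of k can be checked numerically. A minimal sketch, using the n = 2 STAR estimate 183N + 16233k + 417·2^(m-k) + 33 for the polyalgorithm time (the function name is ours):

```python
# Sketch: evaluate the n = 2 STAR estimate for the OER-LU polyalgorithm,
# T(k) = 183N + 16233k + 417*2**(m-k) + 33, and locate the optimal k.
# Each reduction saves one stage cost but leaves a system of 2**(m-k) - 1
# blocks for the LU factorization, hence the trade-off in k.

def polyalgorithm_time(N, m, k):
    return 183 * N + 16233 * k + 417 * 2 ** (m - k) + 33

m = 13                       # N = 2**13 - 1 = 8191
N = 2 ** m - 1
best_k = min(range(1, m), key=lambda k: polyalgorithm_time(N, m, k))
print(best_k, m - best_k)    # the optimum falls at k = m - 6
```

At the optimum, T(m - 6) = 183N + 16233m - 70677, the value quoted for the polyalgorithm.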
Timing (based on N = 2^m - 1):
1. (a)  {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 15n^2 - 5n)τ_M + (1/6)(38n^3 + 3n^2 - 5n)τ_S}(N + 1 - 2^(m-k))
      + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 15n^2 - 5n)σ_M + (1/6)(38n^3 + 3n^2 - 5n)σ_S}k
      + {(1/2)(3n^2 + n)t_D + (1/6)(14n^3 + 9n^2 - 5n)(t_M + t_S)}(2^(m-k) - 1)
      + {n^2 t_D + (2n^3 + n^2)(t_M + t_S)}
   The timing difference between odd-even reduction and the mixed algorithm will be an expression of the form μ_n(m - k) - ν_n 2^(m-k) - λ_n, and k should be chosen to maximize this expression. For n = 2 using STAR data we have μ_2 = 14839, ν_2 = 905.5, λ_2 = 13028, k = m - 5, yielding 106.5N + 14839m - 46908.5.
2. For n = 2 and STAR data, the polyalgorithm time is 183N + 16233k + 417(2^(m-k)) + 33, and the optimal choice is k = m - 6, yielding 183N + 16233m - 70677.

6. Incomplete reduction.

Algorithm: do k odd-even reductions, solve D[A^(k+1)] y^(k+1) = w^(k+1), and back substitute to obtain y = y^(1) ≈ x^(1) = x.

Timing (based on N = 2^m - 1):
1. (a)  {(1/2)(5n^2 + n)τ_D + (1/6)(38n^3 + 15n^2 - 5n)τ_M + (1/6)(38n^3 + 3n^2 - 5n)τ_S}(N + 1 - 2^(m-k))
      + {(1/2)(5n^2 + n)σ_D + (1/6)(38n^3 + 15n^2 - 5n)σ_M + (1/6)(38n^3 + 3n^2 - 5n)σ_S}k
      + {(1/2)(n^2 + n)τ_D + (1/6)(2n^3 + 3n^2 - 5n)(τ_M + τ_S)}(2^(m-k) - 1)
      + {(1/2)(n^2 + n)σ_D + (1/6)(2n^3 + 3n^2 - 5n)(σ_M + σ_S)}
   (b) for general n, with STAR data,
        (1/4)(38n^3 + 31n^2 - n)N + (1/6)(8740n^3 + 5103n^2 - 649n)k - (9n^3 + 6n^2)(2^(m-k) - 1) + (1/6)(460n^3 + 1191n^2 - 649n)
   (c) using STAR data, n = 2:  106.5N + 14839k - 96(2^(m-k)) + 1287.
2. (a) for general n, with STAR data,
        (1/4)(38n^3 + 86n^2 + 42n)N + (1/6)(8740n^3 + 5103n^2 + 443n + 5700)k - (1/4)(36n^3 + 79n^2 + 51n)(2^(m-k) - 1) + (1/6)(460n^3 + 1191n^2 - 1651n) + 80(k + 1)
   (b) for n = 2, with STAR data:  183N + 16233k - 176.5(2^(m-k)) + 1113.5.

7. Jacobi iteration, semi-iteration.

Basic step:

    for each j = 1, ..., N
        solve b_j y_j^(i+1) = v_j - a_j y_{j-1}^(i) - c_j y_{j+1}^(i)

that is, y^(i+1) = -B[A] y^(i) + D[A]^(-1) v.
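For the scalar case n = 1 each block solve collapses to a division, and the basic step can be sketched as follows (a sketch only; the function name and the small test system are ours):

```python
# Sketch of one block Jacobi step y <- -B[A] y + D[A]^{-1} v for n = 1:
# each "solve" with the diagonal block b_j is just a division.

def jacobi_step(a, b, c, v, y):
    n = len(b)
    y_new = []
    for j in range(n):
        left = a[j] * y[j - 1] if j > 0 else 0.0
        right = c[j] * y[j + 1] if j < n - 1 else 0.0
        y_new.append((v[j] - left - right) / b[j])   # solve b_j y_j = v_j - ...
    return y_new

# A small diagonally dominant system with solution x = (1, 1, 1);
# dominance gives ||B[A]|| <= 1/2, so the iteration converges.
a, b, c, v = [0.0, 1.0, 1.0], [4.0, 4.0, 4.0], [1.0, 1.0, 0.0], [5.0, 6.0, 5.0]
y = [0.0, 0.0, 0.0]
for _ in range(50):
    y = jacobi_step(a, b, c, v, y)
print(y)
```

The semi-iteration below accelerates exactly this step by combining the new iterate with the previous one through the parameters ω_m.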
Timing (identical for both vector computer models):
1. (a)  (1/2)(n^2 + n)(τ_D N + σ_D) + (1/6)(2n^3 + 15n^2 - 5n)((τ_M + τ_S)N + (σ_M + σ_S))
   (b) using STAR data:  (1/4)(2n^3 + 19n^2 - n)N + (1/6)(460n^3 + 3951n^2 - 649n)
   (c) using STAR data, n = 2:  22.5N + 3031.
2. Optional preprocessing, with times for n = 2 and STAR data:
   (1) solve b_j (β_j γ_j φ_j) = (a_j c_j v_j), cost = 38.5N + 4367; the basic step is then
            y_j^(i+1) = φ_j - β_j y_{j-1}^(i) - γ_j y_{j+1}^(i),   cost = 12N + 1840.
   (2) factor b_j, cost = 3.5N + 367; the basic step is as in the original, but its cost is 19N + 2634.

Semi-iteration, basic step: let ŷ^(m) be the result of a Jacobi step from y^(m); then

    y^(m+1) = ω_{m+1}(ŷ^(m) - y^(m-1)) + y^(m-1),   ω_{m+1} = 1/(1 - (ρ^2/4)ω_m),   ρ = ρ(B[A]).

The basic Jacobi step is unchanged; this is the algorithm used in the mixed incomplete reduction/Jacobi method.

Timing, n = 2, using STAR data: 26.5N + 3408, or 16N + 2217 with the first preprocessing.

Notes:
1. Varga [V3, p. 139] shows that for the Jacobi semi-iteration ||e^(i)|| ≤ 2(ω_b - 1)^(i/2) ||e^(0)||, where e^(i) = y^(i) - x. A conservative estimate of the number of iterations needed to obtain ||e^(n)|| ≤ (1/N)||e^(0)|| is n ≈ 2 log N / (-log(ω_b - 1)). It follows that the execution time for an iterative solution is O(N log N) vs. O(N) for the better direct methods.

8. Incomplete reduction plus Jacobi iterations.

Algorithm: do k odd-even reductions, solve D[A^(k+1)] y^(k+1) = w^(k+1), do ℓ Jacobi iterations to improve y^(k+1), and back substitute.

As a result of solving D[A^(k+1)] y^(k+1) = w^(k+1) we already have factors for D[A^(k+1)], so the cost is (using STAR data, n = 2):

    1.(c)  106.5N + 14839k - 96(2^(m-k)) + 1287 + ℓ(19(2^(m-k)) + 2615)
    2.(b)  183N + 16233k - 176.5(2^(m-k)) + 1113.5 + ℓ(19(2^(m-k)) + 2615).

REFERENCES

[B1] I. Babuska, "Numerical stability in mathematical analysis," IFIP Congress 1968, North-Holland, Amsterdam, pp. 11-23.
[B2] I. Babuska, Numerical stability in the solution of tridiagonal matrices, Rept. BN-609, Inst. for Fluid Dynamics and Appl. Math., Univ. of Md., 1969.
[B3] I. Babuska, "Numerical stability in problems of linear algebra," SIAM J. Numer. Anal., vol. 9, 1972, pp. 53-77.
[B4] D. W.
Bailey and D. E. Crabtree, "Bounds for determinants," Lin. Alg. Appl., vol. 2, 1969, pp. 303-309.
[B5] R. E. Bank, "Marching algorithms and block Gaussian elimination," in [B16, pp. 293-307].
[B6] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick and R. A. Stoker, "The Illiac IV computer," IEEE Trans. on Comp., vol. C-17, 1968, pp. 746-757.
[B7] G. M. Baudet, Asynchronous iterative methods for multiprocessors, Dept. of Computer Science, Carnegie-Mellon Univ., Nov. 1976.
[B8] F. L. Bauer, "Der Newton-Prozess als quadratisch konvergente Abkürzung des allgemeinen linearen stationären Iterationsverfahrens 1. Ordnung (Wittmeyer-Prozess)," Z. Angew. Math. Mech., vol. 35, 1955, pp. 469-470.
[B9] A. Benson and D. J. Evans, "The successive peripheral block overrelaxation method," J. Inst. Maths. Applics., vol. 9, 1972, pp. 68-79.
[B10] A. Benson and D. J. Evans, "Successive peripheral overrelaxation and other block methods," J. Comp. Phys., vol. 21, 1976, pp. 1-19.
[B11] G. Birkhoff and A. George, "Elimination by nested dissection," in [T1, pp. 221-269].
[B12] W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh and D. L. Slotnick, "The Illiac IV system," Proc. IEEE, vol. 60, 1972, pp. 369-388.
[B13] A. Brandt, Multi-level adaptive techniques I. The multi-grid method, Rept. RC 6026, IBM T. J. Watson Res. Cen., Yorktown Heights, N. Y., 1976.
[B14] J. L. Brenner, "A bound for a determinant with dominant main diagonal," Proc. AMS, vol. 5, 1954, pp. 631-634.
[B15] C. G. Broyden, "Some condition number bounds for the Gaussian elimination process," J. Inst. Maths. Applics., vol. 12, 1973, pp. 273-286.
[B16] J. R. Bunch and D. J. Rose, eds., Sparse Matrix Computations, Academic Press, N. Y., 1976.
[B17] B. L. Buzbee, Application of fast Poisson solvers to the numerical approximation of parabolic problems, Rept. LA-4950-T, Los Alamos Sci. Lab., 1972.
[B18] B. L. Buzbee and A. Carasso, "On the numerical computation of parabolic problems for preceding times," Math. Comp., vol. 27, 1973, pp. 237-266.
[B19] B. L.
Buzbee, F. W. Dorr, J. A. George and G. H. Golub, "The direct solution of the discrete Poisson equation on irregular regions," SIAM J. Numer. Anal., vol. 8, 1971, pp. 722-736.
[B20] B. L. Buzbee, G. H. Golub and C. W. Nielson, "On direct methods for solving Poisson's equations," SIAM J. Numer. Anal., vol. 7, 1970, pp. 627-656.
[C1] A. Carasso, "The abstract backward beam equation," SIAM J. Math. Anal., vol. 2, 1971, pp. 193-212.
[C2] Control Data Corporation, CDC STAR-100 Instruction Execution Times, Arden Hills, Minn., Jan. 1976; used in conjunction with timing measurements made at Lawrence Livermore Laboratory.
[C3] A. K. Cline, personal communication, 1974.
[C4] P. Concus and G. H. Golub, "Use of fast direct methods for the efficient solution of nonseparable elliptic equations," SIAM J. Numer. Anal., vol. 10, 1973, pp. 1103-1120.
[C5] P. Concus, G. H. Golub and D. P. O'Leary, "A generalized conjugate gradient method for the numerical solution of elliptic partial differential equations," in [B16, pp. 309-332].
[C6] S. D. Conte and C. de Boor, Elementary Numerical Analysis, McGraw-Hill, N. Y., 1972.
[C7] M. G. Cox, "Curve fitting with piecewise polynomials," J. Inst. Maths. Applics., vol. 8, 1971, pp. 36-52.
[C8] Cray Research, Inc., "Cray-1 Computer," Chippewa Falls, Wis., 1975.
[D1] M. A. Diamond and D. L. V. Ferreira, "On a cyclic reduction method for the solution of Poisson's equations," SIAM J. Numer. Anal., vol. 13, 1976, pp. 54-70.
[D2] F. W. Dorr, "The direct solution of the discrete Poisson equation on a rectangle," SIAM Review, vol. 12, 1970, pp. 248-263.
[D3] F. W. Dorr, "An example of ill-conditioning in the numerical solution of singular perturbation problems," Math. Comp., vol. 25, 1971, pp. 271-283.
[E1] S. C. Eisenstat, M. H. Schultz and A. H. Sherman, "Applications of an element model for Gaussian elimination," in [B16, pp. 85-96].
[E2] D. J.
Evans, "A new iterative method for the solution of sparse systems of linear difference equations," in [R6, pp. 89-100].
[E3] R. K. Even and Y. Wallach, "On the direct solution of Dirichlet's problem in two dimensions," Computing, vol. 5, 1970, pp. 45-56.
[F1] D. C. Feingold and R. S. Varga, "Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem," Pacific J. Math., vol. 12, 1962, pp. 1241-1250.
[F2] D. Fischer, G. Golub, O. Hald, C. Leiva and O. Widlund, "On Fourier-Toeplitz methods for separable elliptic problems," Math. Comp., vol. 28, 1974, pp. 349-368.
[F3] G. von Fuchs, J. R. Roy and E. Schrem, "Hypermatrix solution of large sets of symmetric positive definite linear equations," Comp. Meth. in Appl. Mech. and Engng., vol. 1, 1972, pp. 197-216.
[G1] J. A. George, "Nested dissection of a regular finite element mesh," SIAM J. Numer. Anal., vol. 10, 1973, pp. 345-363.
[H1] L. A. Hageman and R. S. Varga, "Block iterative methods for cyclically reduced matrix equations," Numer. Math., vol. 6, 1964, pp. 106-119.
[H2] D. E. Heller, "Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems," SIAM J. Numer. Anal., vol. 13, 1976, pp. 484-496.
[H3] D. E. Heller, A survey of parallel algorithms in numerical linear algebra, Dept. of Comp. Sci., Carnegie-Mellon Univ., 1976; SIAM Review, to appear.
[H4] D. E. Heller, D. K. Stevenson and J. F. Traub, "Accelerated iterative methods for the solution of tridiagonal systems on parallel computers," J.ACM, vol. 23, 1976, pp. 636-654.
[H5] R. W. Hockney, "A fast direct solution of Poisson's equation using Fourier analysis," J.ACM, vol. 12, 1965, pp. 95-113.
[H6] R. W. Hockney, "The potential calculation and some applications," Methods in Computational Physics, vol. 9, 1970, pp. 135-211.
[H7] E. Horowitz, "The application of symbolic mathematics to a singular perturbation problem," Proc. ACM Annual Conf., 1972, vol. II, pp. 816-825.
[H8] A. S. Householder,
Y., The Theory 1964. [Jl] T.L. Jordan, A new parallel algorithm for diagonally tridiagonal matrices, Los Alamos Sci. Lab., 1974. IJ2] M.L. Juncosa and T. W. Mullikin, "On the increase of convergence rates of relaxation procedures for elliptic partial difference equations," J.ACM, vol. 7, 1960, pp. 29-36. [KI] W. Kahan, Conserving confluence curbs ill-condition, Comp. Sci., Univ. of Calif., Berkeley, 1972. [K2] H.T. Kung, "On computing reciprocals Math., vol. 22, 1974, pp. 341-348. [K3] H.T. Kung and J. F. Traub, All algebraic fast, Dept. of Comp. Sci., Carnegie-Mellon ILl] J.J. Lambiotte, Jr., The solution of linear systems of equations on a vector computer, Dissertation, Univ. of Virginia, 1975. [L2] J.J. Lambiotte, Jr. and R. G. Voigt, "The solution of tridiagonal linear systems on the CDC STAR-100 computer, " ACM Trans. Math Software, vol. I, 1975, pp. 308-329. [MI] N. Madsen and G. Rodrigue, A comparison of direct methods for tridiagonal systems on the CDC STAR-100, Lawrence Livermore Lab.:, 1975. [M2] M.A. Malcolm and J. Palmer, "A fast direct method for solving CACM vol . 17 , 1974 , pp class of tridiagonal linear systems, " __, 17. [M3] W • Miller, "Software for roundoff analysis, " ACM Trans. Software, vol. I, 1975, pp. 108-128. IN1] A.K. Noor and S. J. Voigt, "Hypermatrix scheme for finite element systems on CDC STAR-100 computer, " Computers and Structures, vol 5, 1975, pp. 287-296. [01] J.M. Ortega and R. G. Voigt, Solution of partial differentia]: equations on vector computers, ICASE, Hamton, Va., 1977. [02] A.M. Ostrowski, "_ber die Determinanten mit _berwiegender Hauptdiagonale," Comment. Math. Helv., vol. I0, 1937, pp. 69-96. [PI] G • Peters and J. H. Wilkinson, "On the stability of Gauss-Jordan elimination with pivoting," CACM, vol. 18, 1975, pp. 20-24. JR1] J.K. Reid, ed., Larse Press, London, 1970. Sparse of Matrices in Numerical of power Analysis, dominant Dept. series, of " Numer. functions can be computed Univ., July 1976. 
Sets of Linear Equations, a 14- Math Academic 161 JR2 ] J.K. Reid, "On the method of conjugate gradients for the solution of large sparse systems of linear equations," in [RI, pp. 231-254 ]. [R3] F. Robert, "Blocks-H-matrices tives classiques par blocks," pp. 223-265. [R4] D.J. Rose and J. R. Bunch, "The role of partitioning in the numerical solution of sparse systems," in JR6, pp. 177-187]. JR5] D.J. Rose and G. F. Whitten, "A recursive strategies," in [BI6, pp. 59-83]. [R6] D.J. Rose and R. A. Willoughby, eds., Sparse Matrices Applications, Plenum Press, N. Y., 1972. JR7] J.B. Rosser, The direct solution of difference analogs of Poisson's equation, rep. MRC-TSR-797, Math. Res. Cen., Madison, Wis., 1967. [SI] J. Schr_der and U. Trottenberg, "Reduktionsverfahren fur Differenzengleichungen bei Randwertaufgaben I," Numer. Math., vol. 22, 1973, pp. 37-68. [Sla] J. SchrSder, U. Trottenberg and H. Reutersberg, "Reduktionsverfahren f_r Differenzengleichungen bei Randwertaufgaben II," Numer. Math., vol. 16, 1976, pp. 429-459. [$2] H.S. Math. [$3] H.S. Stone, ed., Introduction to Computer Architecture, Research Associates, Palo Alto, Calif., 1975. [$4] G. Strang and G. J. Fix, An Analysis of the Finite Prentice-Hail, Englewood Cliffs, N. J., 1973. [$5] P.N. et convergence des methods iteraLin. Alg. Appl., vol. 2, 1960, Stone, "Parallel tridiagonal equation Software, vol. I, 1975, pp. 289-307. Swarztrauber, "The direct equation on the surface 1974, pp. 46-54. solution of a sphere," analysis of dissection solvers," and Their ACM Trans. Element of the discrete J. Comp. Science Phys., Method, Poisson vol. 15, [$6] P.N. Swarztrauber, "A direct method for the discrete solution separable elliptic equations," SlAM J. Numer. Anal., vol. II, 1974, pp. 1136- 1150. of [$7] P.N. Swarztrauber and R. A. Sweet, "The direct solution of the discrete Poisson equation on a disc," SlAM J. Numer. Anal., vol. I0, 1973, pp. 900-907. [$8] P. Swarztrauber and R. 
Sweet, Efficient FORTRAN subprograms for the solution of elliptic partial differential equations, Rept. NCAR-TN/IA-109, Nat. Cen. for Atmospheric Res., Boulder, Col., 1975. 162 [$9] R.A. tion 428. Sweet, "Direct methods for the solution of Poisson's equaon a staggered grid," J. Comp. Phys., vol. 12, 1973, pp. 422- [Sl0] R.A. Sweet, "A generalized cyclic reduction Numer. Anal., vol. II, 1974, pp. 506-520. [SIll R.A. Sweet, "A cyclic reduction algorithm for block tridiagonal systems of arbitrary dimension," contributed paper, Symp. on Sparse Matrix Computations, Argonne Nat. Lab., Sept. 1975. [TI] J.F. Traub, ed., Complexity of Sequential Algorithms, Academic Press, N. Y., 1973. IT2] J.F. Traub, "Iterative solution of tridiagonal or vector computers," in [TI, pp. 49-82]. IT3] J.F. Traub, ed., Analytic Press, N. Y., 1976. IV1] J.M. Varah, "On the solution of block tridiagonal systems arising from certain finite difference equations," Math. Comp., vol. 26, 1972, pp. 859-868. IV2 ] J.M. Varah, "A comparison of some numerical methods for two-point boundary value problems," Math. Comp., vol. 28, 1974, pp. 743-755. IV3] R.S. Varga, Matrix Iterative Cliffs, N. J., 1962. IV4] R.S. Varga, "On recurring theorems on diagonal Alg. Appl., vol. 13, 1976, pp. 1-9. [WI] W.J. Watson, "The TI ASC, a highly modular and flexible supercomputer architecture," AFIPS Fall 1972, AFIPS Press, Montvale, N. J., vol. 41, pt. I, pp. 221-229. [W2] O.B. Widlund, "On the use of fast methods for separable finite difference equations for the solution of general elliptic problems," in JR6, pp. 121-131]. [W3] J.H. Wilkinson, "Error analysis of direct methods sion," J.ACM, vol. 8, 1961, pp. 281-330. [W4] W.A. Fall 777. [yl] D.Y.Y. analytic Computational Analysis, algorithm," and Parallel systems Complexity, SlAM J. Numerical on parallel Academic: Prentice-Hall, Englewood dominance," of matrix Lin. inver- Wulf and C. G. Bell, "C.mmp, a multi-mini-processor," AFIPS 1972, AFIPS Press, Montvale, N. J., vol. 
41, pt. 2, pp. 765- Yun, "Hensel meets Newton - algebraic setting," in IT3, pp. 205-215]. constructions in an