DIRECT AND ITERATIVE METHODS
FOR BLOCK TRIDIAGONAL LINEAR SYSTEMS
Don Eric Heller

April 1977

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Carnegie-Mellon University.

This research was supported in part by the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422, by the National Science Foundation under Grant MCS75-222-55, and by the National Aeronautics and Space Administration under Grant NGR-47-102-001, while the author was in residence at ICASE, NASA Langley Research Center.
ABSTRACT

Block tridiagonal systems of linear equations occur frequently in scientific computations, often forming the core of more complicated problems. Numerical methods for solution of such systems are studied with emphasis on efficient methods for a vector computer. A convergence theory for direct methods, under conditions of block diagonal dominance, is developed, demonstrating stability, convergence and approximation properties of direct methods. Block elimination (LU factorization) is linear, cyclic odd-even reduction is quadratic, and higher-order methods exist. The odd-even methods are variations of the quadratic Newton iteration for the inverse matrix, and are the only quadratic methods within a certain reasonable class of algorithms. Semi-direct methods based on the quadratic convergence of odd-even reduction prove useful in combination with linear iterations for an approximate solution. An execution time analysis for a pipeline computer is given, with attention to storage requirements and the effect of machine constraints on vector operations.
PREFACE

It is with many thanks that I acknowledge Professor J. F. Traub, for my introduction to numerical mathematics and parallel computation, and for his support and encouragement. Enlightening conversations with G. J. Fix, S. H. Fuller, H. T. Kung and D. K. Stevenson have contributed to some of the ideas presented in the thesis. A number of early results were obtained while in residence at ICASE, NASA Langley Research Center; J. M. Ortega and R. G. Voigt were most helpful in clarifying these results. The thesis was completed while at the Pennsylvania State University, and P. C. Fischer must be thanked for his forbearance. Of course, the most important contribution has been that of my wife, Molly, who has put up with more than her fair share.
CONTENTS

                                                              page
1. Introduction .........................................        1
   A. Summary of Main Results ...........................        5
   B. Notation ..........................................        8
2. Some Initial Remarks .................................       10
   A. Analytic Tools ....................................       10
   B. Models of Parallel Computation ....................       14
3. Linear Methods .......................................       18
   A. The LU Factorization ..............................       18
      1. Block Elimination ..............................       18
      2. Further Properties of Block Elimination ........       25
      3. Back Substitution ..............................       30
      4. Special Techniques .............................       31
   B. Gauss-Jordan Elimination ..........................       33
   C. Parallel Computation ..............................       36
4. Quadratic Methods ....................................       37
   A. Odd-Even Elimination ..............................       37
   B. Variants of Odd-Even Elimination ..................       45
   C. Odd-Even Reduction ................................       48
      1. The Basic Algorithm ............................       48
      2. Relation to Block Elimination ..................       51
      3. Back Substitution ..............................       56
      4. Storage Requirements ...........................       58
      5. Special Techniques .............................       59
   D. Parallel Computation ..............................       60
   E. Summary ...........................................       64
5. Unification: Iteration and Fill-In ...................       65
6. Higher-Order Methods .................................       71
   A. p-Fold Reduction ..................................       71
   B. Convergence Rates .................................       75
   C. Back Substitution .................................       80
7. Semidirect Methods ...................................       83
   A. Incomplete Elimination ............................       84
   B. Incomplete Reduction ..............................       86
   C. Applicability of the Methods ......................       90
8. Iterative Methods ....................................       92
   A. Use of Elimination ................................       92
   B. Splittings ........................................       93
   C. Multi-line Orderings ..............................       94
   D. Spiral Orderings ..................................       97
   E. Parallel Gauss ....................................      101
9. Applications .........................................      105
   A. Curve Fitting .....................................      105
   B. Finite Elements ...................................      106
10. Implementation of Algorithms ........................      110
   A. Choice of Algorithms (Generalities) ...............      110
   B. Storage Requirements ..............................      113
   C. Comparative Timing Analysis .......................      120
Appendix A. Vector Computer Instructions ................      131
Appendix B. Summary of Algorithms and Timings ...........      136
References ..............................................      156
1. INTRODUCTION

Block tridiagonal matrices are a special class of matrices which arise in a variety of scientific and engineering computations, typically in the numerical solution of differential equations. For now it is sufficient to say that the matrix looks like

        ( b_1  c_1                          )
        ( a_2  b_2  c_2                     )
    A = (      a_3  b_3  c_3                )
        (            .    .    .            )
        (         a_{N-1}  b_{N-1}  c_{N-1} )
        (                  a_N      b_N     )

where the b_i are square matrices, say n_i × n_i, so that the full matrix A has dimension (Σ_{j=1}^N n_j) × (Σ_{j=1}^N n_j); in the linear system Ax = v the vectors x and v are partitioned conformally.

In this thesis we study numerical methods for the solution of block tridiagonal linear systems Ax = v, with primary emphasis on efficient algorithms for a vector-oriented parallel computer (e.g., CDC STAR-100, Illiac IV). Our discussion concerns the numerical properties of direct and iterative methods, analysis of their inherent parallelism, and execution time estimates for a simplified model of the CDC STAR-100. A unifying theme is that a wide variety of methods can be viewed as special cases of a general class of algorithms, and considerable use is made of a convergence theory for direct methods; on this analytic basis, algorithms are compared and recommendations are made for areas of application. Many of our results also apply to standard sequential computers.

There are two important cases, depending on whether the blocks of A are small and dense or large and sparse. A computational method should take into account the internal block structure, storage requirements, and the desirability of a low operation count. Such considerations often motivate the use of an approximate solution obtained by iterative methods, even for systems in which a direct method is satisfactory.
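In matrix-oriented languages the triplet description of A translates directly into an assembly routine. The following sketch (NumPy; the function name block_tridiag and the example blocks are our own illustrative choices, not part of the thesis) builds the full matrix from lists of blocks:

```python
import numpy as np

def block_tridiag(a, b, c):
    """Assemble A = (a_j, b_j, c_j)_N from lists of blocks.

    b[j] is the j-th diagonal block; a[j] (sub-) and c[j] (super-)
    are the off-diagonal blocks, with a[0] and c[N-1] unused.
    """
    N = len(b)
    sizes = [blk.shape[0] for blk in b]
    offs = np.cumsum([0] + sizes)          # block row/column offsets
    A = np.zeros((offs[-1], offs[-1]))
    for j in range(N):
        r = slice(offs[j], offs[j + 1])
        A[r, r] = b[j]                      # diagonal block
        if j > 0:
            A[r, offs[j - 1]:offs[j]] = a[j]        # subdiagonal block
        if j < N - 1:
            A[r, offs[j + 1]:offs[j + 2]] = c[j]    # superdiagonal block
    return A

# Example: N = 3 blocks of order 2
I2 = np.eye(2)
b = [4 * I2, 5 * I2, 6 * I2]
a = [None, -I2, -I2]
c = [-I2, -I2, None]
A = block_tridiag(a, b, c)
```

Unequal block orders n_j are handled automatically, since the offsets are accumulated from the individual block sizes.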
Since 1965, however, there has been an increased interest in direct methods for the solution of the sparse linear systems arising from partial differential equations. The seminal paper that attracted much attention was Hockney's [H5], based on an original suggestion by Golub, on the use of fast Fourier transforms and the cyclic reduction algorithm for the solution of Poisson's equation on a rectangle. With the simplest five-point finite difference approximation to the differential equation on a rectangular grid, the a_i, c_i blocks are diagonal and the b_i blocks are tridiagonal, so A possesses a very regular sparse structure. For an n × n square grid, the sparse matrix methods then extant required O(n^3) arithmetic operations or more to solve Ax = v; Hockney's new method required O(n^2 log n) operations, and methods with larger asymptotic constants proved to be significantly more time-consuming in practice.
Subsequent work has extended these ideas considerably. Buneman's algorithm [B20] provided a numerically stable version of cyclic reduction, and the Fourier-Toeplitz method [B5] derives from the same principles. The fast direct methods have been extended to general separable elliptic equations and to regions other than a rectangle (e.g., [S6], [S8]), and well-tested Fortran subroutine packages are now available (e.g., [C5]), so the methods are accessible to programmers and numerical analysts. The generalized marching algorithms recently developed by Bank and Rose are still faster, but great care must be taken in their use to control instability.

The general nonseparable case is still hard, and no completely satisfactory direct method is available, although specialized techniques have had some success. Direct methods for sparse problems include George's nested dissection algorithm and the related work of Eisenstat, Sherman and Schultz [E1], which apply to quite general discretizations but are considerably harder to implement. An interesting approach combines the fast Poisson solvers with iteration and applies them to nonseparable elliptic equations; this has been analyzed theoretically and successfully tested on certain problems (e.g., [F2], [C4], [G1], [R5], [H6], [D2]). Brandt's multi-level adaptive methods are a rather different approach, combining iteration with a sequence of discretizations, and have been applied to several interesting problems. Some of this work is summarized in [B19], [D1], [S5], [S7] and [S9]. Despite these developments, iterative methods undoubtedly will continue to play an important role, and with parallel and vector computers now in use for state-of-the-art computation it is important to reconsider the available methods; Ortega and Voigt [O1] survey work on the solution of partial differential equations on such computers.
With the application of parallel computers in mind, the major goal of this thesis is a general analysis of block elimination methods for block tridiagonal linear systems. The methods include the block LU factorization, block Gauss-Jordan elimination, odd-even (cyclic) elimination and reduction, and Sweet's generalized cyclic reduction [S10], along with variations on these basic themes. They are most efficiently applied when the blocks are small enough to be handled explicitly as full matrices; however, much of the analysis holds regardless of the sparsity of the blocks, and some of the algorithms can be implemented with storage-efficient representations for special matrices, such as those arising from nonseparable elliptic equations.

Our analysis is based on a sequence of matrix transformations applied to A, and on matrix norms describing the relative block diagonal dominance of the transformed systems. We show that, under reasonable conditions on the diagonal dominance, certain transformations cause the off-diagonal blocks to decrease quadratically relative to the diagonal blocks; this leads to methods that compute an approximate solution of Ax = v (called semi-direct methods here), and the direct methods are naturally classified by their convergence properties. Some of our results previously appeared in [H2].

All of the algorithms discussed here can be run, with varying degrees of efficiency, on parallel computers such as Illiac IV (an array processor), or the CDC STAR-100 or the Texas Instruments ASC (pipeline computers). These machines fall within the single instruction stream-multiple data stream classification and have many common features relevant to the design of algorithms, though details vary widely; to make a general comparison possible we use a simplified model of STAR. There are also many special restrictions to be observed when designing programs for a particular machine, and a general method applied naively cannot be expected to be superior to a method tailored to a particular problem and machine; however, the costs of tailoring every program can be quite high. Our comparisons are made with the simplified model, and we believe that the results presented here will be a reliable guide. Observations about the behavior of the general methods can then be used to guide the tailoring of programs for a particular problem and machine, and the material presented here will allow us to move in that direction. For a survey of parallel methods in linear algebra, plus background material on parallel computation, [H3] is recommended.
1.A. Summary of Main Results

Direct methods for the solution of block tridiagonal linear systems are discussed in Sections 3-6, semidirect methods in Section 7, and iterative methods in Section 8. Applications are discussed in Section 9. Preliminaries (analytic tools and models of parallel computation) are in Section 2, and remarks on implementation of algorithms on a parallel computer are in Section 10.

We consider a direct method for solution of the linear system Ax = v as a sequence of transformations applied to the system, with the intention of simplifying the solution. The matrix B[A] = I - (D[A])^{-1}A, where D[A] is the block diagonal part of A, is shown to be important in the analysis: it is used for stability analysis, for approximation errors in semidirect methods, and for error propagation in the back substitution. We appear to be the first to systematically exploit B[A] in the analysis of direct and semidirect methods, though it has long been used to analyze iterative methods. The ∞-norm is used throughout; other norms would be quite desirable, and sample results are given under the assumption ||B[A]|| < 1.

A direct method generates a sequence of matrices M_i; if ||B[M_{i+1}]|| ≤ ||B[M_i]|| we say that the method is linear, and if ||B[M_{i+1}]|| ≤ ||B[M_i]||^2 we say that it is quadratic. The block LU factorization and block Gauss-Jordan elimination are shown to be linear methods; their stability properties, bounds on the condition numbers of the pivotal blocks, and parallel computation of the factorization are investigated in Section 3.

Odd-even elimination and odd-even reduction are shown to be quadratic, and are discussed in Section 4. Odd-even reduction is seen to be a variation of the quadratic Newton iteration for A^{-1}, as a special case of the methods of Hageman and Varga [H1], as equivalent to block elimination on a permuted system, and as a compact form of odd-even elimination. The relationship to the fast Poisson solvers, certain aspects of which fall within this framework, is also discussed, as is the relationship between the odd-even methods and the Gauss-Jordan algorithm.

In Section 5 we discuss a general class of algorithms which includes all the previous direct methods as particular cases; the class is described by the fill-in generated in a sequence of transformations of A. We show that the odd-even methods are the only quadratic methods within a certain reasonable subclass.

In Section 6 the quadratic methods are shown to be special cases of families of higher-order methods, characterized by ||B[M_{i+1}]|| ≤ ||B[M_i]||^p. We note that the p-fold reduction (Sweet's generalized cyclic reduction [S10]) is equivalent to a 2-fold (odd-even) reduction using a coarser partitioning of A.

Semidirect methods, discussed in Section 7, are direct methods that are stopped before completion, producing an approximate solution; the incomplete elimination and incomplete reduction algorithms are based on the quadratic convergence of the odd-even methods. We analyze the approximation error and its propagation in the back substitution, along with rounding error, and consider the applicability and practical limitations of the semidirect methods vis-a-vis direct and iterative methods for parallel computers.

In Section 8 we consider iterative methods for systems arising from differential equations, emphasizing the natural, multi-line and spiral orderings for systems derived on a rectangular grid, and splittings with small blocks. We also remark on the use of the Parallel Gauss iteration [T2], [H4], and on the use of fast direct methods as the basis for iterations for more general equations.

A brief section on applications (Section 9) deals with two situations in which the assumption ||B[A]|| < 1 is not met, namely curve fitting and finite elements; in the latter case we describe a modification to the system Ax = v that will obtain ||B[A]|| < 1. We also note that a combination of a powerful direct method and a simple iteration may provide an effective solution.

Proposed algorithms for block tridiagonal systems are compared in Section 10 using a simplified model of a pipeline (vector) computer. The comparison includes odd-even reduction and the LU factorization, and the use of semidirect methods, which yield limited savings if an approximate solution is desired. A general discussion of storage requirements and of the effect of machine constraints on vector operations is included; these are essential considerations for practical use of the algorithms.
1.B. Notation

Let the block tridiagonal matrix

        ( b_1  c_1                          )
        ( a_2  b_2  c_2                     )
    A = (      a_3  b_3  c_3                )  =  (a_j, b_j, c_j)_N.
        (            .    .    .            )
        (         a_{N-1}  b_{N-1}  c_{N-1} )
        (                  a_N      b_N     )

Here b_j is an n_j × n_j matrix, and a_1 = 0, c_N = 0. The subscript N in the triplet notation will often be omitted for simplicity. Define

    L[A] = (a_j, 0, 0),    D[A] = (0, b_j, 0),    U[A] = (0, 0, c_j),
    B[A] = I - (D[A])^{-1}A = (-b_j^{-1}a_j, 0, -b_j^{-1}c_j),
    C[A] = I - A(D[A])^{-1} = (-a_j b_{j-1}^{-1}, 0, -c_j b_{j+1}^{-1}).

Use of these notations for general partitioned matrices should be evident; nonsingularity of D[A] is assumed in the definitions of B and C. The matrix B[A] is associated with the block Jacobi linear iteration for the solution of Ax = v. It also assumes an important role in the study of certain semidirect methods, and will be central to our analysis of direct methods.

We further define E_j = diag(0, ..., 0, I, 0, ..., 0), where I is n_j × n_j and occurs in the jth block diagonal position. Pre- (post-) multiplication by E_j selects the jth block row (column) of a matrix, so that, for example,

    D[A] = Σ_{i=1}^N E_i A E_i,    D[A]E_j = E_j D[A],    Σ_{i=1}^N E_i = I,    E_1 L[A] = 0.

For an n × n matrix M and an n-vector v, let

    ρ_i(M)  = Σ_{j=1}^n |m_ij|                 (row sums),
    γ_j(M)  = Σ_{i=1}^n |m_ij| = ρ_j(M^T)      (column sums),
    ||M||   = max_i ρ_i(M),  the ∞-norm,
    ||M||_1 = max_j γ_j(M) = ||M^T||,
    ||v||   = max_i |v_i|,
    ||v||_1 = Σ_{i=1}^n |v_i|,
    |M|     = (|m_ij|),
    ρ(M)    = the spectral radius of M.

The reader should take care not to confuse the row sums ρ_i(M) with the spectral radius ρ(M). When A = (a_j, b_j, c_j), let

    X(A) = max_j ( ||b_j^{-1} a_j|| + ||b_j^{-1} c_j|| ).

Since block row j of B[A] contains only the blocks -b_j^{-1}a_j and -b_j^{-1}c_j, we have ||B[A]|| ≤ X(A), and the condition X(A) < 1 will be used as an alternative to ||B[A]|| < 1.
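The definitions above are straightforward to evaluate numerically. A sketch (NumPy; equal block sizes for brevity, and all function names are ours) forms B[A] and C[A] for a small block tridiagonal matrix with b_j = 4I and a_j = c_j = -I, for which X(A) = 1/4 + 1/4 = 1/2 and the interior block rows give ||B[A]|| = 1/2:

```python
import numpy as np

def block_jacobi_matrices(A, n):
    """Return B[A] = I - D[A]^{-1} A and C[A] = I - A D[A]^{-1},
    for A partitioned into equal blocks of order n."""
    m = A.shape[0]
    D = np.zeros_like(A)
    for off in range(0, m, n):              # copy the block diagonal
        D[off:off+n, off:off+n] = A[off:off+n, off:off+n]
    Dinv = np.linalg.inv(D)
    I = np.eye(m)
    return I - Dinv @ A, I - A @ Dinv

def inf_norm(M):
    return np.abs(M).sum(axis=1).max()      # max row sum = infinity norm

# Block tridiagonal example: N = 3 blocks of order n = 2
A = 4 * np.eye(6)
for k in range(2, 6):
    A[k, k-2] = A[k-2, k] = -1.0            # a_j = c_j = -I
B, C = block_jacobi_matrices(A, 2)
```

Here inf_norm(B) evaluates to 0.5, matching X(A), so the diagonal-dominance condition ||B[A]|| < 1 holds for this matrix.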
2. SOME INITIAL REMARKS

2.A. Analytic Tools

Much of the analysis presented in later sections involves the matrix B[A]. Why should we consider this matrix at all?

First, it is a reasonably invariant quantity which measures the block diagonal dominance of A. That is, if E is any nonsingular block diagonal matrix conforming with the partitioning of A, let A* = EA. Then B* = B[A*] = I - D*^{-1}A* = I - (ED)^{-1}(EA) = I - D^{-1}A = B, so B[A] is invariant under block scalings of the equations Ax = v. On the other hand, B is not invariant under a change of variables: y = Ex induces A' = AE^{-1}, and B' = B[A'] = EBE^{-1}, so B itself changes but the spectral radius ρ(B) is invariant. Similarly, C[A] is invariant under the block change of variables, and for block scalings of the equations ρ(C) is invariant.

Secondly, various stages of parallel elimination methods can be created by transforming Ax = v into A*x = v*, A* = MA, v* = Mv, for some matrix M, where typically ||B[A*]|| << ||B[A]||. This transformation represents the initial stages of some direct method. An approximate solution y = D[A*]^{-1}v* is then computed, with error x - y = B[A*]x, so ||B[A*]|| measures the behavior of one simple class of semidirect methods (Section 7). Thus we study the behavior of B[A] under transformations of the equations.

Thirdly, B[A] is the matrix which determines the convergence rate of the block Jacobi iteration for Ax = v. This is a natural parallel iteration, as are the faster semi-iterative methods based on it when A is positive definite. It is known that various transformations can affect the rate of convergence of an iteration, so we study this property as well.
Finally, B[A] and C[A] arise naturally in elimination methods used to solve Ax = v: C represents the multipliers used in elimination, and B represents quantities used in the back substitution. In order to place upper bounds on the size of these quantities we consider conditions on A which guarantee ||B[A]|| < 1. Since we will make frequent use of this condition in the analysis, we now discuss conditions on A which guarantee that it holds.

Lemma 2.1. Let J = J_1 + J_2 be order n, |J| = |J_1| + |J_2|, and suppose K satisfies K = J_1 K + J_2. Then ||J|| < 1 implies ρ_i(K) ≤ ρ_i(J) for 1 ≤ i ≤ n, and ||K|| ≤ ||J|| < 1.

Proof. From |J_1| + |J_2| = |J| we have ρ_i(J_1) + ρ_i(J_2) = ρ_i(J) for 1 ≤ i ≤ n. From K = J_1 K + J_2 we have ρ_i(K) ≤ ρ_i(J_1)||K|| + ρ_i(J_2). Suppose ||K|| ≥ 1. Then ρ_i(K) ≤ ρ_i(J_1)||K|| + ρ_i(J_2)||K|| = ρ_i(J)||K|| ≤ ||J|| ||K||, which implies ||K|| ≤ ||J|| ||K|| and 1 ≤ ||J||, a contradiction. Thus ||K|| < 1 and ρ_i(K) ≤ ρ_i(J_1) + ρ_i(J_2) = ρ_i(J). QED.

We note that if ρ_i(J_1) ≠ 0 then we actually have ρ_i(K) < ρ_i(J), while if ρ_i(J_1) = 0 then ρ_i(K) = ρ_i(J_2) = ρ_i(J). It follows that if each row of J_1 contains a nonzero element then ||K|| < ||J||. These conditions are reminiscent of the Stein-Rosenberg Theorem ([V3], [H8]), which states that if J_1 and J_2 are nonnegative, J is irreducible, J_2 ≠ 0 and ρ(J_1) < 1, then ρ(J) < 1 implies ρ(K) < ρ(J), and ρ(J) = 1 implies ρ(K) = ρ(J). There are also correspondences to the theory of regular splittings [V3].

Lemma 2.2. Suppose J = (I - S)K + T is order n, |J| = |(I - S)K| + |T|, and ρ_i(S) ≤ ρ_i(T) for 1 ≤ i ≤ n. Then ||J|| < 1 implies ρ_i(K) ≤ ρ_i(J), 1 ≤ i ≤ n, and ||K|| ≤ ||J|| < 1.

Proof. For 1 ≤ i ≤ n, ρ_i(T) ≤ ρ_i(T) + ρ_i((I - S)K) = ρ_i(J) ≤ ||J|| < 1, so ρ_i(T) < 1. Now,

    1 > ρ_i(J) = ρ_i((I - S)K) + ρ_i(T) ≥ ρ_i(K) - ρ_i(S)||K|| + ρ_i(T) ≥ ρ_i(K) + ρ_i(T)(1 - ||K||).

Suppose ||K|| = ρ_j(K) ≥ 1. Then 1 > ρ_j(K) + ρ_j(T)(1 - ρ_j(K)) implies 1 < ρ_j(T), a contradiction. Thus ||K|| < 1 and ρ_i(J) ≥ ρ_i(K). QED.

If S = T in Lemma 2.2 then K = SK + (J - S) and |J| = |J - S| + |S|, so by Lemma 2.1 we again conclude ||K|| ≤ ||J||. Similar results hold for the 1-norm.

Lemma 2.3. Let J = J_1 + J_2, |J| = |J_1| + |J_2|, and suppose K satisfies K = K J_1 + J_2. Then ||J||_1 < 1 implies ||K||_1 ≤ ||J||_1.

Proof. Since J^T = J_1^T + J_2^T, |J^T| = |J_1^T| + |J_2^T|, K^T = J_1^T K^T + J_2^T, and ||M||_1 = ||M^T||, we have ||J^T|| = ||J||_1 < 1, and by Lemma 2.1, ||K||_1 = ||K^T|| ≤ ||J^T|| = ||J||_1. QED.

Lemma 2.4. Suppose J = K(I - S) + T, |J| = |K(I - S)| + |T|, and γ_i(S) ≤ γ_i(T) for 1 ≤ i ≤ n. Then ||J||_1 < 1 implies ||K||_1 ≤ ||J||_1.

Proof. Since γ_i(M) = ρ_i(M^T) and J^T = (I - S^T)K^T + T^T, the result follows by Lemma 2.2. QED.

To investigate the condition ||B[A]|| < 1 under diagonal dominance, we first note that if A is strictly diagonally dominant by rows then ||B[A]|| < 1 for any partitioning of A. Indeed, take J to be the point Jacobi iteration matrix for A, J = I - diag(A)^{-1}A, with J_1 = D[J] and J_2 = J - J_1. Then ||J_1|| ≤ ||J|| < 1, |J| = |J_1| + |J_2|, and B[A] = (I - J_1)^{-1}J_2, so ||B[A]|| ≤ ||J|| < 1 by use of Lemma 2.1.
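The closing remark is easy to check numerically: for a strictly row diagonally dominant matrix, ||B[A]|| ≤ ||J|| < 1 holds for any partitioning. A sketch (the random test matrix and the 2 × 2 block partition are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, blk = 8, 2                              # order 8, partitioned into 2x2 blocks
A = rng.uniform(-1, 1, (n, n))
# Force strict row diagonal dominance by inflating the diagonal
s = np.where(A.diagonal() >= 0, 1.0, -1.0)
A += np.diag(s * (np.abs(A).sum(axis=1) + 1.0))

def inf_norm(M):
    return np.abs(M).sum(axis=1).max()

# Point Jacobi matrix J = I - diag(A)^{-1} A
J = np.eye(n) - A / A.diagonal()[:, None]

# Block Jacobi matrix B[A] for the blk-partition
D = np.zeros_like(A)
for off in range(0, n, blk):
    D[off:off+blk, off:off+blk] = A[off:off+blk, off:off+blk]
B = np.eye(n) - np.linalg.solve(D, A)
```

With strict dominance, inf_norm(J) < 1 by construction, and Lemma 2.1 then forces inf_norm(B) ≤ inf_norm(J) regardless of the partition chosen.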
Now, if we denote the partitioning of A by the vector Π = (n_1, n_2, ..., n_N), then the above remark may be generalized. Suppose Π' = (m_1, m_2, ..., m_M) represents another partitioning of A which refines Π; that is, there are integers 1 = p_0 < p_1 < ... < p_N = M + 1 such that n_i = Σ_{j=p_{i-1}}^{p_i - 1} m_j. If B[A] and B'[A] are defined by the two partitions, then ||B[A]|| ≤ ||B'[A]||. This is an instance of the general rule for iterative methods that "more implicitness means faster convergence", and will be useful in Section 6. Briefly, one simple way to decrease ||B[A]|| is to use a coarser partition.

Let us consider a simple example to illustrate these ideas. Using finite differences for the differential equation -u_xx - u_yy + λu = f on [0,1]^2, λ ≥ 0, with Dirichlet boundary conditions, gives rise to the matrix A = (-I, M, -I), M = (-1, 4+λh^2, -1). By Lemma 2.1, ||B[A]|| ≤ ||J[A]|| = 4/(4+λh^2) < 1. Actually, a better estimate is easily determined (cf. [H8, p. 107]), since ||B[A]|| = 2||M^{-1}|| < 2/(2+λh^2); for small λh^2 this estimate will still be quite close to 1.

The condition ||B[A]|| < 1 is closely related to the definition of block diagonal dominance relative to the ∞-norm, ||b_j^{-1}|| (||a_j|| + ||c_j||) < 1, 1 ≤ j ≤ N, which implies ||B[A]|| < 1 [V3]; thus ||B[A]|| < 1 is a weaker condition. The general convergence theorems allow the use of different norms ([F1], [V1], [W1]). However, Varah [V2] has examples of matrices A, determined by elliptic boundary value problems, for which the condition ||B[A]|| < 1 need not hold, so a different analysis, based on a change of variable, will be needed in those cases; many matrices arising from continuous problems are of this type. In Section 9.B we consider certain finite element approximations, which are closely related to the finite difference approximations considered here, and the general convergence theorems of later sections will be based on this assumption.
assumption.
However,
this will not always
onal dominance
nation,
be done.
is the right
but we have
deal with
in order
found
If J = I-
assumption
convenient
related
to diagonal
II(I-
ILI)-IIuI II< i, and an H-matrix
diagonally
shows that A is an H-matrix
is strictly
recent results
on H-matrices.
methods
eral and will
2.B.
Models
oriented
needed
only consider
to read
parallel
the references
Parallel
sider
elimi-
general
to
exists
lower
IIJIl< I, a G-matrix
Ostrowski
a nonsingular
Varga
JR3] generalizes
if
[02]
diagonal
matrix
[V4] summarizes
these definitions
G- and block H-matrices
convergence
(upper)trian-
using
of the classical
vectoriterative
we will not be completely
gen-
results.
3-9 only a very
is necessary.
time comparison
details
A more
detailed
of methods
information
may
loose conception
be found
description
in Section
for the initial
in Appendix
A,
of a vector-
10.
part
[H3],
is
This
of the
[$3], and
given below.
computers
the vector
and capable
if
some representative
sufficient
text, while additional
block
Computation
computer
contains
diag-
are G- and H-matrices.
if p(IJl) < I.
for simplicity
Sections
for the execution
subsection
block
dominance
dominant.
Robert
and proves
Again,
of Parallel
In order
diagonally
dominant,
for Ax = v.
that block
and often more
dominant
iff there
E such that AE
norms,
concise
analyzing
L (U) is strictly
then A is strictly
ial and matricial
reasonably
IIB[A]II < I.
diag(A)-IA = L + U, where
diagonally
stated
to use when
gular,
to block
the discussion
It is sometimes
it more
the assumption
Other concepts
to keep
fall into several
computers,
of performing
namely
efficient
those
general
operating
floating
point
classes,
of which
in a synchronous
vector
we
con-
manner
operations.
Examples are the array processors, such as Illiac IV with its 64 parallel processing elements [B6], [B12], and the pipeline machines, the CDC STAR-100 [C2] and the Texas Instruments ASC [C8], which can handle vectors of (essentially) arbitrary length; the Cray-1 has eight 64-word vector registers. These machines have become important tools for large scientific computations, and there is much interest in algorithms able to make good use of special machine features. Details of operation vary widely, but there are sufficiently many common capabilities to make a general comparison of algorithms possible.

An important class of parallel computer not being considered here is the asynchronous multiprocessor, such as Carnegie-Mellon University's C.mmp [W4], with up to 16 minicomputer processors. Baudet [B7] discusses the theory of asynchronous iterations for linear and nonlinear systems of equations, plus results of tests on C.mmp.
Instructions for our model vector computer consist of the usual scalar operations and a special set of vector operations, some of which will be described shortly. A vector is an ordered set of storage locations, usually consisting of locations in arithmetic progression; the CDC STAR requires valid vectors to have increment = 1 in the progression, which often forces programs to perform extra data manipulation operations in order to organize conceptual vectors into machine vectors.

We denote vector operations by a parenthesized list of indices. Thus the vector sum w = x + y is written

    w_i = x_i + y_i,  (1 ≤ i ≤ N);

if the vectors are in R^N, the componentwise product would be

    w_i = x_i × y_i,  (1 ≤ i ≤ N),

and a matrix multiplication C = AB would be

    c_ij = Σ_{k=1}^N a_ik b_kj,  (1 ≤ i ≤ N; 1 ≤ j ≤ N).

Execution time for a vector operation depends on the length of the vector and the internal mechanism of the computer, but can usually be written in the form τ_op N + σ_op, where N is the vector length. On a pipeline computer, 1/τ_op is the asymptotic result rate and σ_op is the vector instruction startup cost; usually τ_op << σ_op. On a k-parallel computer (one with k parallel processors) or a machine with k-word vector registers, the basic vector instruction time would be t_op ⌈N/k⌉ + overhead, where the scalar instruction execution time is adequate as t_op. In either case the τN + σ model applies. Simplified operation times for the CDC STAR-100 are given below, in terms of the 40 nanosecond cycle time; scalar times are for a single operation and vector times are for operands of length N.

    operation        scalar    vector
    add, subtract      13      0.5N +  71
    multiply           17        N + 159
    divide             46       2N + 167
    square root        74       2N + 155
    transmit           11      0.5N +  91
    maximum             -       6N +  95
    summation           -      6.5N + 122
    inner product       -       6N + 130
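The τN + σ model makes the scalar/vector trade-off concrete. A brief sketch, using the add times from the table (13 cycles per scalar add, 0.5N + 71 cycles for a vector add of length N), finds the break-even vector length at which the vector instruction becomes cheaper than N scalar adds:

```python
# Vector operation cost model: time(N) = tau*N + sigma (STAR-100 cycles).

def vector_time(N, tau, sigma):
    return tau * N + sigma

def scalar_time(N, t_scalar):
    return t_scalar * N

# Break-even length for addition: smallest N with the vector add cheaper
N = 1
while vector_time(N, 0.5, 71) >= scalar_time(N, 13):
    N += 1
```

The loop terminates at N = 6, so even for addition, the cheapest vector operation in the table, vectors shorter than six elements are better handled by scalar code; for operations with larger startup costs the break-even length is correspondingly larger.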
The use of long vectors is essential in order to take full advantage of the vector operations: the rather severe startup penalty for operations on short vectors is often a source of inefficiency, and one must be careful to design algorithms which keep startup costs to a minimum. Data manipulation operations can also be a source of inefficiency, even when most of the code consists of vector operations, and scalar operations are an essential consideration, especially when there are strong restrictions on machine vectors. These issues are illustrated in our comparison of algorithms in Section 10, which is based on information for the CDC STAR-100.
3. LINEAR METHODS

The simplest approach to direct solution of a block tridiagonal linear system Ax = v is block elimination, which effectively computes an LU factorization of A. We discuss block elimination on a general partitioned matrix under the assumption ||B[A]|| < 1; the process is shown to be linear and stable. Various special properties of block tridiagonal elimination are considered, such as pivoting requirements, bounds on the condition numbers of the pivotal blocks, error propagation in the back substitution, and special techniques for special systems. Gauss-Jordan elimination is studied in terms of its numerical properties, and our analysis foreshadows the analysis of the more important odd-even elimination and reduction algorithms.
3.A. The LU Factorization

3.A.1. Block Elimination

The block tridiagonal LU factorization

    A = (a_j, b_j, c_j)_N = (l_j, I, 0)_N (0, d_j, c_j)_N

is computed by

    d_1 = b_1,
    d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},    j = 2, ..., N,
    l_j = a_j d_{j-1}^{-1},                  j = 2, ..., N,

where the existence of d_{j-1}^{-1} is assumed for the moment, and will be demonstrated when ||B[A]|| < 1. We are not concerned here with the particular methods used to compute, or to avoid computing, either l_j or d_{j-1}^{-1} c_{j-1}.
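The recurrence for d_j and l_j can be sketched directly with dense blocks; a minimal illustration (Python with NumPy, 0-based block indices, and explicit inverses for brevity only):

```python
import numpy as np

def block_lu(a, b, c):
    """d_1 = b_1; d_j = b_j - a_j d_{j-1}^{-1} c_{j-1}; l_j = a_j d_{j-1}^{-1}."""
    N = len(b)
    d = [b[0]]
    l = [None]                          # there is no l_1
    for j in range(1, N):
        dinv = np.linalg.inv(d[j - 1])
        l.append(a[j] @ dinv)
        d.append(b[j] - a[j] @ dinv @ c[j - 1])
    return l, d

rng = np.random.default_rng(0)
n, N = 3, 5
# strongly diagonally dominant diagonal blocks keep the d_j invertible
b = [rng.standard_normal((n, n)) + 10 * np.eye(n) for _ in range(N)]
a = [None] + [rng.standard_normal((n, n)) for _ in range(N - 1)]
c = [rng.standard_normal((n, n)) for _ in range(N - 1)]

l, d = block_lu(a, b, c)
# block-by-block check of A = (l_j, I, 0)(0, d_j, c_j):
# a_j = l_j d_{j-1} and b_j = l_j c_{j-1} + d_j
for j in range(1, N):
    assert np.allclose(a[j], l[j] @ d[j - 1])
    assert np.allclose(b[j], l[j] @ c[j - 1] + d[j])
```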
Let A_1 = A and consider the matrix A_2 which is obtained after the first block elimination step. We have A_2 = (ã_j, b̃_j, c_j)_N, where b̃_1 = d_1 = b_1, ã_2 = 0, b̃_2 = d_2, and ã_j = a_j, b̃_j = b_j for 3 ≤ j ≤ N. Let B_1 = B[A_1] and B_2 = B[A_2]. Note that B_1 and B_2 differ only in the second block row; more specifically, in the (2,1) and (2,3) positions.

Theorem 3.1. If ||B_1|| < 1 then ||B_2|| ≤ ||B_1||.

Proof. Let T = B_1 E_1 and let S = T B_1. We have d_2 = b_2 (I - b_2^{-1} a_2 b_1^{-1} c_1), with

    ||b_2^{-1} a_2 b_1^{-1} c_1|| ≤ ||S|| ≤ ||B_1|| ||E_1|| ||B_1|| ≤ ||B_1||^2 < 1,

so d_2^{-1} exists and B_2 is well-defined. By construction, B_1 = (I - S) B_2 + T, |B_1| = |(I - S) B_2| + |T|, and ρ_j(S) ≤ ρ_j(T) ||B_1|| ≤ ρ_j(T). Thus Lemma 2.2 applies, yielding ||B_2|| ≤ ||B_1||. QED

Now, for any partitioned matrix, the N stages of block elimination (no block pivoting) can be described by the process

    A_1 = A
    for i = 1, ..., N-1
        A_{i+1} = (I + L[C[A_i]] E_i) A_i = (I - L[A_i] E_i D[A_i]^{-1}) A_i.

We transform A_i into A_{i+1} by eliminating all the blocks of column i below the diagonal; when A is block tridiagonal, A_N is the upper block triangular matrix (0, d_j, c_j)_N. The same effect can be had by eliminating the blocks individually, by the process
    A_1 = A
    for i = 1, ..., N-1
        A_{i+1}^{(i)} = A_i
        for j = i+1, ..., N
            A_{i+1}^{(j)} = (I + E_j C[A_{i+1}^{(j-1)}] E_i) A_{i+1}^{(j-1)}
                          = (I - E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1}) A_{i+1}^{(j-1)}
        A_{i+1} = A_{i+1}^{(N)}.

Each minor stage affects only block row j of the matrix. It is seen that

    block row j of A_{i+1}^{(j-1)} = block row j of A_i,
    block row i of A_{i+1}^{(j-1)} = block row i of A_i,

from which it follows that

    E_j A_{i+1}^{(j-1)} E_i D[A_{i+1}^{(j-1)}]^{-1} = E_j A_i E_i D[A_i]^{-1},

so the major stages indeed generate the same sequence of matrices A_i. Let B_i = B[A_i].

Corollary 3.1. If A is block tridiagonal and ||B_1|| < 1 then ||B_{i+1}|| ≤ ||B_i||, 1 ≤ i ≤ N-1.

The proof of Corollary 3.1 requires a bit more care than the proof of Theorem 3.1 would seem to indicate; this difficulty will be resolved in Theorem 3.2. We should also note the trivial "error relations"

    ||A_{i+1} - A_N|| ≤ ||A_i - A_N||,    ||B_{i+1} - B_N|| ≤ ||B_i - B_N||.
These follow from the fact that, for block tridiagonal matrices,

    A_i = (Σ_{j=1}^{i} E_j) A_N + (Σ_{j=i+1}^{N} E_j) A_i,

which holds regardless of the value of ||B_1||. Wilkinson [W3] has given a result for Gaussian elimination that is similar to Corollary 3.1: if A is strictly diagonally dominant by columns then the same is true of all reduced matrices, and ||A_{i+1} (reduced)||_1 ≤ ||A_i (reduced)||_1. This applies to arbitrary matrices, and not just the block tridiagonal case. Brenner [B14] anticipated the characterization of block elimination as a norm-non-increasing projection method, but in a different context. Our next theorem generalizes Corollary 3.1 and Wilkinson's result by bounding the growth of the A_i and B_i; the two theorems lead to the immediate observation that block elimination is stable when ||B[A]|| < 1. In addition, many special schemes for block tridiagonal systems are equivalent to block elimination on a permuted matrix, so the result will be useful in some special applications.

Theorem 3.2. If ||B_1|| < 1 then ||B_{i+1}|| ≤ ||B_i|| and ||A_{i+1}|| ≤ ||A_i|| for 1 ≤ i ≤ N-1.

Proof. Suppose ||B_i|| < 1. We have

    A_{i+1} = (I - L[A_i] E_i D[A_i]^{-1}) A_i = A_i - L[A_i] E_i + L[A_i] E_i B_i,

so, with the projections ρ_j taken onto spaces of matrices with zeros in appropriate positions,

    ρ_j(A_{i+1}) ≤ ρ_j(A_i - L[A_i] E_i) + ρ_j(L[A_i] E_i B_i)
                ≤ ρ_j(A_i - L[A_i] E_i) + ρ_j(L[A_i] E_i) ||B_i||
                ≤ ρ_j(A_i - L[A_i] E_i) + ρ_j(L[A_i] E_i) = ρ_j(A_i),

so ||A_{i+1}|| ≤ ||A_i||.

Now let Q = D[A_i]^{-1} L[A_i] E_i D[A_i]^{-1} A_i, so that A_{i+1} = A_i - D[A_i] Q, and let S = D[Q]. Then D[A_{i+1}] = D[A_i](I - S). Since

    Q = (-L[B_i] E_i)(I - B_i) = -L[B_i] E_i + L[B_i] E_i B_i,

we have S = D[L[B_i] E_i B_i] and

    ||S|| ≤ ||L[B_i]|| ||E_i|| ||B_i|| ≤ ||B_i||^2 < 1,

so D[A_{i+1}]^{-1} exists and B_{i+1} is well-defined. It may be verified that (I - S) B_{i+1} = B_i - S + Q. Let T = B_i - S + Q and J = S + T = B_i + Q. Since D[B_i] = D[Q - S] = 0, we have D[T] = 0 and |J| = |S| + |T|. Thus, by Lemma 2.1, ||J|| < 1 implies ||B_{i+1}|| ≤ ||J||. But, after some algebraic manipulation,

    J = B_i + Q = Σ_{k=1}^{i} E_k B_i + Σ_{k=i+1}^{N} E_k B_i (I - E_i + E_i B_i),

and it follows from this expression that ||J|| ≤ ||B_i||. QED

Although expressed in terms of the N-1 major stages of block elimination, it is clear that we also have, for B_i^{(j)} = B[A_i^{(j)}], ||A_i^{(j)}|| ≤ ||A_i^{(j-1)}|| and ||B_i^{(j)}|| ≤ ||B_i^{(j-1)}||. Moreover, such inequalities will hold regardless of the order of elimination, as long as blocks below the diagonal are eliminated.
We shall adopt the name "linear method" for any process which generates a sequence of matrices M_i such that ||B[M_{i+1}]|| ≤ ||B[M_i]||, given that ||B[M_1]|| < 1. Thus block elimination is a linear method.

A result that is dual to Theorem 3.2 is the following.

Definition. Let Â_i be the reduced matrix consisting of the last N-i+1 block rows and columns of A_i. (Â_1 = A_1 = A.)

Theorem 3.3. If ||C[A]||_1 < 1 then ||C[Â_{i+1}]||_1 ≤ ||C[Â_i]||_1, ||Â_{i+1}||_1 ≤ ||Â_i||_1, and ||L[C_i] E_i||_1 ≤ ||C[A]||_1, where C_i = C[A_i].

Remark. It is not true in general that ||C_{i+1}||_1 ≤ ||C_i||_1 if ||C_1||_1 < 1.

Proof. Let the blocks of C_i be c_{mp}, i ≤ m, p ≤ N, so that C[Â_i] = (c_{mp})_{i≤m,p≤N}. Let

    K = (c_{mp} + c_{mi} c_{ip})_{i+1≤m, p≤N}.

Using Lemma 2.4, ||C[Â_{i+1}]||_1 ≤ ||K||_1 if ||K||_1 < 1. But

    γ_q(K) ≤ γ_q(Σ_{k=i+1}^{N} E_k C[Â_i]) + ||(c_{i+1,i}, ..., c_{Ni})^T||_1 γ_q(E_i C[Â_i])
           ≤ γ_q(Σ_{k=i+1}^{N} E_k C[Â_i]) + ||C[Â_i]||_1 γ_q(E_i C[Â_i])
           ≤ γ_q(Σ_{k=i+1}^{N} E_k C[Â_i]) + γ_q(E_i C[Â_i]) = γ_q(C[Â_i]).

Thus ||K||_1 ≤ ||C[Â_i]||_1 < 1, and this implies ||C[Â_{i+1}]||_1 ≤ ||C[Â_i]||_1. A similar argument bounds the multipliers, ||L[C_i] E_i||_1 ≤ ||C[A]||_1 for 1 ≤ i ≤ N. Now let the blocks of A_i be a_{mp}, 1 ≤ m, p ≤ N. By construction Â_{i+1} = (a_{mp} + c_{mi} a_{ip})_{i+1≤m, p≤N}, and

    γ_q(Â_{i+1}) ≤ γ_q(Σ_{k=i+1}^{N} E_k Â_i) + ||C[Â_i]||_1 γ_q(E_i Â_i)
                ≤ γ_q(Σ_{k=i+1}^{N} E_k Â_i) + γ_q(E_i Â_i) = γ_q(Â_i),

so ||Â_{i+1}||_1 ≤ ||Â_i||_1. QED
We note that Theorems 3.2 and 3.3 are quite similar, since with B = B[A], C = C[A] and D = D[A] we have C = D B D^{-1}; B and C always have the same eigenvalues, and the statements of the two theorems become nearly identical. The norms used are different, however, and the proofs depend heavily on the particular norm; in general it is not true, for example, that ||B[A_1]||_1 < 1 implies ||B[A_m]||_1 ≤ ||B[A_1]||_1. When A is symmetric positive definite the eigenvalues of the reduced matrices Â_{i+1} interlace those of Â_i, and as a consequence ||Â_{i+1}||_2 ≤ ||Â_i||_2. Cline [C3] has shown this for the Cholesky factorization, but the extension to block elimination is identical. Cline's proof depends heavily on use of the 2-norm, and in general the corresponding quantities in other norms are quite difficult to determine.
3.A.2. Further Properties of Block Elimination

It is well-known that the block factorization A = LΔU is given by

    L = I - Σ_{j=1}^{N} L[C_j] E_j,    Δ = D[A_N],    U = I - Σ_{j=1}^{N} E_j U[B_j].

We have then shown in Theorems 3.2 and 3.3 that if ||B_1|| < 1 the factorization can be completed, with ||U|| ≤ 1 + N n ||B_1||, n = max_j n_j, and if ||C_1||_1 < 1 then ||L||_1 ≤ 1 + N n ||C_1||_1. (For block tridiagonal matrices the factor N can be removed, so that ||L||_1 ≤ 1 + n ||C_1||_1 and ||U|| ≤ 1 + n ||B_1||.) Moreover, ||L[C_j] E_j||_1 ≤ 1, so the factorization can therefore be completed without permuting the rows of A. Broyden [B15] contains a number of interesting related results for arbitrary matrices under the condition ||B_1|| < 1; in particular, it is shown that partial pivoting can actually force the inequality ||L[C_j] E_j|| ≤ 1, and the resulting bound on the condition number of L, though large in the general case, follows. We will have some more comments on pivoting after looking at some matrix properties that are invariant under block elimination (cf. also block partial pivoting [B11]).

Corollary 3.2. Block elimination preserves strict diagonal dominance: if A_i is strictly diagonally dominant, then so is A_{i+1}.

Proof. Gaussian elimination preserves strict diagonal dominance [H8, p. 130]. Now simply partition A into 1×1 blocks. The transformation of A_i into A_{i+1} can be done either by one stage of block elimination or by n_i stages of Gaussian elimination, and exact arithmetic produces the same result for both processes. Thus if A_i is strictly diagonally dominant then so is A_{i+1}. QED
It is shown in [B4] that Gaussian elimination maps G-matrices into G-matrices. This can be extended in an obvious fashion to the case of block elimination. We also have

Corollary 3.3. Block elimination maps H-matrices into H-matrices.

Proof. Recall that A is an H-matrix if and only if there exists a nonsingular diagonal matrix E such that A' = AE is strictly diagonally dominant. Consider the sequence

    A'_1 = A',    A'_{i+1} = (I - L[A'_i] E_i D[A'_i]^{-1}) A'_i.

Clearly A'_1 = A_1 E; suppose that A'_i = A_i E is strictly diagonally dominant. We then have

    A'_{i+1} = (I - L[A_i] E E_i E^{-1} D[A_i]^{-1}) A_i E
             = (I - L[A_i] E_i D[A_i]^{-1}) A_i E
             = A_{i+1} E,

so A_{i+1} is an H-matrix, since A'_{i+1} = A_{i+1} E is strictly diagonally dominant by Corollary 3.2. QED
We have chosen to give this proof of Corollary 3.3 because it allows us to estimate the size of B[A] when A is an H-matrix. Define the norm ||M||_E = ||E^{-1} M E||. Then since A' = AE implies B[A'] = E^{-1} B[A] E, it follows that if A is an H-matrix and block elimination is performed, then ||B[A_{i+1}]||_E ≤ ||B[A_i]||_E. The proof of Corollary 3.3 is actually in the nature of an existential proof, since the matrix E will generally remain unknown; this more concrete interpretation of Theorem 3.2 brings out an important point.
Our result gives direct information about the block LU factorization, and Theorem 3.2 also provides information about ordinary Gaussian elimination applied to A. While we are not assured that pivoting will be unnecessary, we do know that pivot selections for each group of n_i steps will be limited to elements within the d_i block. When A has been partitioned to deal with mass storage hierarchies this will be important knowledge (see Section 10.A). If not, then Theorem 3.2 provides the relevant information after each n_i steps.

One more consequence of Theorem 3.2 is

Corollary 3.4. In the block tridiagonal LU factorization, if ||B_1|| < 1 then

    cond(d_j) = ||d_j|| ||d_j^{-1}|| ≤ 2 cond(b_j) / (1 - ||B_1||^2).

Proof. Letting ĉ_j = b_j^{-1} a_j d_{j-1}^{-1} c_{j-1}, we have d_j = b_j (I - ĉ_j), with

    ||ĉ_j|| ≤ ||B_1|| ||B_N|| ≤ ||B_1||^2 < 1,

so that ||d_j|| ≤ ||b_j|| (1 + ||ĉ_j||) ≤ 2 ||b_j||, and ||d_j^{-1}|| ≤ ||b_j^{-1}|| / (1 - ||B_1||^2). QED
It is important to have bounds on cond(d_j), since systems involving the d_j must be solved during the factorization. However, a posteriori bounds on ||d_j^{-1}|| may be computed at a moderate cost. We note also that the hypothesis involves only ||B_1|| < 1, and it is entirely possible for the blocks b_j and d_j to be better-conditioned than A. The inequalities involving B_1 and B_2 in Theorem 3.2 are the best possible, since, for example, the first block rows of B_1 and B_2 are identical. Of course, it is possible to derive stronger results about cond(d_j) by using different assumptions on A, as in the following theorem.
Theorem 3.4. [H4] For A = (a_j, b_j, c_j), let λ(A) = max_j (4 ||b_j^{-1} a_j|| ||b_{j-1}^{-1} c_{j-1}||) and e_j = b_j^{-1} d_j. If λ(A) ≤ 1 then

    ||I - e_j|| ≤ (1 - √(1-λ)) / 2,    ||e_j^{-1}|| ≤ 2 / (1 + √(1-λ)).

Some useful consequences of this theorem include better estimates of ||B_N|| and a smaller upper bound on the condition numbers of the d_j's. For we have

    B_N = (0, 0, -d_j^{-1} c_j) = (0, 0, -e_j^{-1} b_j^{-1} c_j),

which implies ||B_N|| ≤ 2 ||U[B_1]|| / (1 + √(1-λ)). As an example, if A = (1, 3, 1) then λ(A) = 4/9 and we obtain

    ||B_N|| ≤ 2(1/3) / (1 + √(1 - 4/9)) ≈ 0.382,

while Theorem 3.2 predicts only ||B_N|| ≤ ||B_1|| = 2/3. Also,

    ||d_j|| ≤ ||b_j|| (1 + ||I - e_j||) ≤ (3/2) ||b_j||,
    ||d_j^{-1}|| ≤ ||e_j^{-1}|| ||b_j^{-1}|| ≤ 2 ||b_j^{-1}|| / (1 + √(1-λ)),

and it follows that cond(d_j) ≤ (1 + 2μ(A)) cond(b_j) ≤ 3 cond(b_j), where μ = (1 - √(1-λ)) / (1 + √(1-λ)).

Theorem 3.4 has been used previously [H4] in the analysis of the stability of an iteration designed to compute good initial approximations to the inverses of the d_j's, for the block diagonal matrix (0, d_j, 0)_N arising in the treatment of time-dependent problems. Such methods are potentially useful on a parallel computer and have several remarkable properties; for details, see [H4] and Section 8.E.

Two results on the block LU factorization for block tridiagonal matrices were obtained earlier by Varah.

Theorem 3.5. [V1] If A is block diagonally dominant, ||b_j^{-1}|| (||a_j|| + ||c_j||) ≤ 1, and ||c_j|| ≠ 0, then

    ||d_j^{-1}|| ≤ ||c_j||^{-1},    ||l_j|| ≤ ||a_j|| / ||c_{j-1}||,    ||d_j|| ≤ ||b_j|| + ||a_j||.
In comparison to Theorem 3.2, the bound ||d_j|| ≤ ||b_j|| + ||a_j|| corresponds to the bound ||A_N|| ≤ ||A||, and the bound on ||d_j^{-1}|| corresponds to ||d_j^{-1}|| ≤ ||b_j^{-1}|| / (1 - ||B_1||^2), which follows from the condition ||B_1|| < 1. In contrast, that condition only allows us to conclude ||B_N|| < 1, and we can only estimate ||l_j|| in the weak form

    ||l_j c_{j-1}|| ≤ ||a_j|| ||d_{j-1}^{-1} c_{j-1}|| ≤ ||a_j||.

Varah's second result is

Theorem 3.6. [V1] Let ω_j = (||b_j^{-1} a_j|| ||b_{j-1}^{-1} c_{j-1}||)^{1/2}, 2 ≤ j ≤ N, and suppose ω_j ≠ 0. Then the block LU factorization of A is numerically stable (i.e., there exists a constant K(A) such that max(||l_j||, ||d_j||) ≤ K(A)) if the tridiagonal matrix (ω_j, 1, ω_{j+1}) is positive semidefinite.

We note that (ω_j, 1, ω_{j+1}) is a symmetrization of (||b_j^{-1} a_j||, 1, ||b_j^{-1} c_j||), and that λ(A) ≤ 1 implies that it is positive definite. The actual bounds on ||l_j||, ||d_j|| given by Varah [V1] for this case will not be important for this discussion.
Since many block tridiagonal systems arise from the discretization of differential equations, there exist a number of stability results under conditions closely related to the original continuous problem. Rosser [R7] shows that if the system Ax = v, A = (a_j, b_j, c_j), is obtained from the nine-point difference approximation to Poisson's equation, then A is nonsingular, the LU factorization exists (each d_j is nonsingular), B_N is non-negative, and the spectral condition number of each d_j is less than 17/3; this is the remarkable consequence of a certain maximum principle for the discrete equations. Improved bounds have been obtained in related work [C1]; in particular, when A is strongly diagonally dominant the LU factorization satisfies ||B_N|| ≤ 1. Such results depend on the particular difference approximation under consideration; for examples see [B18], [D2].
3.A.3. Back Substitution

So far we have only considered the elimination phase of the solution of Ax = v. For block tridiagonal matrices A = (a_j, b_j, c_j)_N this consists of the process

    d_1 = b_1,    f_1 = v_1,
    d_j = b_j - a_j d_{j-1}^{-1} c_{j-1},    f_j = v_j - a_j d_{j-1}^{-1} f_{j-1},    j = 2, ..., N.

The back substitution phase consists of the process

    x_N = d_N^{-1} f_N,
    x_j = d_j^{-1} (f_j - c_j x_{j+1}),    j = N-1, ..., 1.

In matrix terms these are the processes

    A_1 = A,    f^{(1)} = v,
    A_{i+1} = (I + L[C_i] E_i) A_i,    f^{(i+1)} = (I + L[C_i] E_i) f^{(i)},    i = 1, ..., N-1,

    g = x^{(N)} = D[A_N]^{-1} f^{(N)},
    x^{(i)} = g + B_N x^{(i+1)},    i = N-1, ..., 1,    x = x^{(1)}.

An alternate form of back substitution is

    x^{(N)} = D[A_N]^{-1} f^{(N)},
    x^{(i)} = (I + B_N E_{i+1}) x^{(i+1)},    i = N-1, ..., 1,    x = x^{(1)}.
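The two phases can be put together in a short dense-block sketch (Python/NumPy, 0-based indices; a production code would factor the d_j rather than invert them):

```python
import numpy as np

def solve_block_tridiag(a, b, c, v):
    """Elimination for d_j, f_j followed by back substitution for x_j."""
    N = len(b)
    d, f = [b[0]], [v[0]]
    for j in range(1, N):                        # elimination phase
        g = a[j] @ np.linalg.inv(d[j - 1])
        d.append(b[j] - g @ c[j - 1])
        f.append(v[j] - g @ f[j - 1])
    x = [None] * N                               # back substitution phase
    x[N - 1] = np.linalg.solve(d[N - 1], f[N - 1])
    for j in range(N - 2, -1, -1):
        x[j] = np.linalg.solve(d[j], f[j] - c[j] @ x[j + 1])
    return x

rng = np.random.default_rng(1)
n, N = 2, 6
b = [rng.standard_normal((n, n)) + 8 * np.eye(n) for _ in range(N)]
a = [None] + [rng.standard_normal((n, n)) for _ in range(N - 1)]
c = [rng.standard_normal((n, n)) for _ in range(N - 1)]
v = [rng.standard_normal(n) for _ in range(N)]

x = solve_block_tridiag(a, b, c, v)

# compare with a dense solve of the assembled matrix
A = np.zeros((n * N, n * N))
for j in range(N):
    A[n*j:n*(j+1), n*j:n*(j+1)] = b[j]
    if j > 0:
        A[n*j:n*(j+1), n*(j-1):n*j] = a[j]
    if j < N - 1:
        A[n*j:n*(j+1), n*(j+1):n*(j+2)] = c[j]
assert np.allclose(np.concatenate(x), np.linalg.solve(A, np.concatenate(v)))
```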
In either case it is seen that B_N plays an essential role in the back substitution phase, and this is its importance in studying roundoff in the solution process.

For those situations when a band elimination method for Ax = v might be preferred, let us briefly consider the relation between the band and block back substitutions. Suppose we have A = (a_j, b_j, c_j) = (â_j, l̂_j, 0)(0, û_j, ĉ_j), where

    d_j = l̂_j û_j,    â_j = a_j û_{j-1}^{-1},    ĉ_j = l̂_j^{-1} c_j,

and l̂_j (û_j) is lower (upper) triangular. The band elimination phase generates f̂_j = l̂_j^{-1} f_j, and the back substitution computes

    x_N = û_N^{-1} f̂_N = d_N^{-1} f_N,
    x_j = û_j^{-1} (f̂_j - ĉ_j x_{j+1}) = d_j^{-1} (f_j - c_j x_{j+1}).

Except for the specific form of the computed vectors, the band and block back substitutions must behave identically.

Let us briefly consider the effect of rounding errors in back substitution; the computed solution will be y_j = x_j + δ_j, and we require bounds on the δ_j. Forward elimination yields d*_j = d_j + δ_j^{(1)}, f*_j = f_j + φ_j. If ordinary Gaussian elimination is used to solve for y_j in d*_j y_j = f*_j - c_j y_{j+1}, the computed y_j satisfies (d_j + δ_j^{(2)}) y_j = f*_j - c_j y_{j+1} + ψ_j, where ψ_j represents the errors introduced in forming f*_j - c_j y_{j+1}. It follows that

    δ_j = e_j - d_j^{-1} c_j δ_{j+1},    e_j = d_j^{-1} (φ_j + ψ_j - (δ_j^{(1)} + δ_j^{(2)}) y_j),

so that ||δ_j|| ≤ ||B_N|| ||δ_{j+1}|| + ||e_j||. In some cases we obtain a uniform upper bound ε for the ||e_j||. If ||B_N|| ≤ 1, then ||δ_j|| ≤ (N - j + 1) ε, and if ||B_N|| < 1, then

    ||δ_j|| ≤ (1 - ||B_N||^{N-j+1}) ε / (1 - ||B_N||) < ε / (1 - ||B_N||),

so that the errors tend to be damped out in later computations.

3.A.4. Special Techniques

In some cases the matrix A = (a_j, b_j, c_j) possesses special properties that may be important to preserve in order to satisfy a strong stability criterion. For example, we might have b_j = t_j - a_j - c_j, where ||t_j|| is small or zero (this requires that all blocks be n × n). Scalar tridiagonal systems
with this property arise from the discretization of second order ODE boundary value problems. Babuska [B1-3] has discussed a combined LU - UL factorization for such a matrix which allows a very effective backward error analysis relative to the ODE, and a stable numerical solution of a difficult singular perturbation problem posed by Dorr [D3]. (See [H7] for a symbolic approach to this problem.)

There are two essential features to Babuska's method. The first is to express b_j and d_j indirectly by starting with t_j and computing the sequence

    s_1 = t_1,    s_j = t_j - a_j d_{j-1}^{-1} s_{j-1},    j = 2, ..., N.

By this we have d_j = s_j - c_j. The second feature is to augment the forward elimination with a backward elimination and dispense with the back substitution, by combining the results of the two eliminations. Specifically, suppose

    A = (-p_j, t_j + p_j + q_j, -q_j)_N,    p_j, q_j, t_j ≥ 0,    p_1 = q_N = 0,    A nonsingular.

The algorithm for Ax = v is

    s_1 = t_1,  f_1 = v_1,  s*_N = t_N,  f*_N = v_N,

    s_j = t_j + p_j s_{j-1} (s_{j-1} + q_{j-1})^{-1}
    f_j = v_j + p_j (s_{j-1} + q_{j-1})^{-1} f_{j-1}        j = 2, ..., N,

    s*_j = t_j + q_j s*_{j+1} (s*_{j+1} + p_{j+1})^{-1}
    f*_j = v_j + q_j (s*_{j+1} + p_{j+1})^{-1} f*_{j+1}     j = N-1, ..., 1,

    x_j = (s_j + s*_j - t_j)^{-1} (f_j + f*_j - v_j),       j = 1, ..., N.

In matrix terms, we factor A = A_L + A_D + A_U = (I + L)(D + A_U) = (I + U*)(D* + A_L) and solve (I + L) f = v, (I + U*) f* = v. Adding (D + A_U) x = f, (D* + A_L) x = f*, and -A_D x = -A_D x, we obtain

    (D + D* - A_D) x = f + f* - v.
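For the scalar case (n = 1) the whole algorithm is only a few lines; a minimal sketch, checked against a dense solve, with positive p_j, q_j, t_j as in the hypothesis:

```python
import numpy as np

def babuska_solve(p, q, t, v):
    """Two-sided LU-UL solve of A x = v, A = (-p_j, t_j+p_j+q_j, -q_j)."""
    N = len(t)
    s = [0.0] * N; f = [0.0] * N; ss = [0.0] * N; fs = [0.0] * N
    s[0], f[0] = t[0], v[0]
    for j in range(1, N):                        # forward sweep
        w = 1.0 / (s[j-1] + q[j-1])
        s[j] = t[j] + p[j] * s[j-1] * w
        f[j] = v[j] + p[j] * w * f[j-1]
    ss[N-1], fs[N-1] = t[N-1], v[N-1]
    for j in range(N - 2, -1, -1):               # backward sweep
        w = 1.0 / (ss[j+1] + p[j+1])
        ss[j] = t[j] + q[j] * ss[j+1] * w
        fs[j] = v[j] + q[j] * w * fs[j+1]
    # combine the two eliminations; no back substitution is needed
    return [(f[j] + fs[j] - v[j]) / (s[j] + ss[j] - t[j]) for j in range(N)]

N = 7
rng = np.random.default_rng(2)
p = np.r_[0.0, rng.uniform(0.5, 1.5, N - 1)]     # p_1 = 0
q = np.r_[rng.uniform(0.5, 1.5, N - 1), 0.0]     # q_N = 0
t = rng.uniform(0.1, 1.0, N)
v = rng.standard_normal(N)

x = babuska_solve(p, q, t, v)

A = np.diag(t + p + q) - np.diag(p[1:], -1) - np.diag(q[:-1], 1)
assert np.allclose(x, np.linalg.solve(A, v))
```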
Babuska's algorithm has some unusual round-off error properties, though underflow is a problem, correctable by scaling. It conserves row sums by taking account of dependent error distributions in the data (cf. [M3]), and thus fits into Kahan's notion that "conserving confluence curbs ill-condition" [K1].

To see how the method fits in with our results, we note that, for n = 1, since p_j, q_j, t_j ≥ 0 we have s_1 ≥ 0 and s_j = t_j + p_j d_{j-1}^{-1} s_{j-1} ≥ t_j. Thus the row sums can only increase during the factorization. On the other hand, they will not increase much, since s_j = t_j + p_j s_{j-1} (s_{j-1} + q_{j-1})^{-1} ≤ t_j + p_j. These are analogues of the results in Theorem 3.2.

3.B. Gauss-Jordan Elimination

An algorithm closely related to block elimination is the block Gauss-Jordan elimination

    Ā_1 = A,    Ā_{i+1} = (I + C[Ā_i] E_i) Ā_i,    i = 1, ..., N.

This avoids back substitution in the solution of Ax = v, since Ā_{N+1} is block diagonal. The rest of the Gauss-Jordan process is

    f̄^{(1)} = v,    f̄^{(i+1)} = (I + C[Ā_i] E_i) f̄^{(i)},    i = 1, ..., N,
    x = Ā_{N+1}^{-1} f̄^{(N+1)}.

Theorem 3.7. Let B̄_i = B[Ā_i]. If ||B̄_1|| < 1 then ||Ā_{i+1}|| ≤ ||Ā_i|| and ||B̄_{i+1}|| ≤ ||B̄_i|| for i = 1, ..., N.
Sketch of Proof. The proof parallels that of Theorem 3.2. Suppose ||B̄_i|| < 1. Let F_i = L[Ā_i] + U[Ā_i], so that

    Ā_{i+1} = (I - F_i E_i D[Ā_i]^{-1}) Ā_i = Ā_i - F_i E_i + F_i E_i B̄_i.

This will show the first part. Now define Q = D[Ā_i]^{-1} F_i E_i D[Ā_i]^{-1} Ā_i, so that Ā_{i+1} = Ā_i - D[Ā_i] Q, and let S = D[Q]. Note that

    Q = (-B̄_i E_i)(I - B̄_i) = -B̄_i E_i + B̄_i E_i B̄_i,

so S = D[B̄_i E_i B̄_i] and ||S|| ≤ ||B̄_i||^2 < 1. We have D[Ā_{i+1}] = D[Ā_i](I - S) and (I - S) B̄_{i+1} = B̄_i - S + Q. Let T = B̄_i - S + Q and J = S + T, so that |J| = |S| + |T| and Lemma 2.1 applies if ||J|| < 1. But J = B̄_i (I - E_i + E_i B̄_i), so ||J|| ≤ ||B̄_i||, and we conclude ||B̄_{i+1}|| ≤ ||J|| ≤ ||B̄_i||. QED

When applied to block tridiagonal matrices the Gauss-Jordan algorithm requires many more arithmetic operations than the block LU factorization, and cannot be recommended on this basis. However, it has some numerical properties that foreshadow our analysis of the more important odd-even reduction algorithm.

The Gauss-Jordan elimination on block tridiagonals may be expressed in the following way:

    d_1 = b_1
    for j = 1, ..., N-1
        u_j = -d_j^{-1} c_j
        d_{j+1} = b_{j+1} + a_{j+1} u_j
        for k = 1, ..., j-1
            e_{k,j+1} = e_{kj} u_j
        e_{j,j+1} = c_j
In this form the matrix Ā_j is

    [ d_1            e_{1j}                      ]
    [     d_2        e_{2j}                      ]
    [         ...      ...                       ]
    [       d_{j-1}  e_{j-1,j}                   ]
    [                  d_j     c_j               ]
    [                a_{j+1}  b_{j+1}  ...       ]
    [                           ...     ...      ],

diagonal in its first j-1 rows except for the fill-in column of e's, and still tridiagonal in the remaining rows, with Ā_{N+1} = diag(d_1, d_2, ..., d_N) = D[Ā_N].

We observe that, for 1 ≤ j < j+k ≤ N,

    -d_j^{-1} e_{j,j+k} = u_j u_{j+1} ... u_{j+k-1},

which is the (j, j+k) block of B_N^k. Thus we are able to refine Theorem 3.7 by giving closer bounds on the (point) rows of B̄_i.

Corollary 3.5. Suppose ||B̄_1|| < 1. If (point) row l is contained in block row j with 1 ≤ j ≤ i-1, then

    ρ_l(B̄_i) ≤ ||-d_j^{-1} e_{ji}|| ≤ ||B_N||^{i-j}.

If (point) row l is contained in block row j with i ≤ j ≤ N, then ρ_l(B̄_i) = ρ_l(B_i) ≤ ||B_i||.

This is our first result on the convergence to zero of portions of a matrix during the reduction to a block diagonal matrix. It also provides a rather different situation from Gauss-Jordan elimination with partial pivoting applied to an arbitrary matrix [P1]. In that case the multipliers above the main diagonal can be quite large, while we have seen for block tridiagonal matrices with ||B_1|| < 1 that the multipliers actually decrease in size.
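The matrix form of the process, Ā_{i+1} = (I + C[Ā_i] E_i) Ā_i with C[M] = I - M D[M]^{-1}, can be sketched for 1×1 blocks on a dense array (Python/NumPy; an illustration of the transformation, not an efficient implementation):

```python
import numpy as np

def gauss_jordan_solve(A, v):
    """Column-by-column Gauss-Jordan: after step i, column i is diagonal."""
    A = A.astype(float).copy()
    f = v.astype(float).copy()
    N = A.shape[0]
    for i in range(N):
        C = np.eye(N) - A / np.diag(A)   # C[A_i] = I - A_i D[A_i]^{-1}
        T = np.eye(N)
        T[:, i] += C[:, i]               # T = I + C[A_i] E_i
        A = T @ A                        # eliminate column i above and below
        f = T @ f
    return f / np.diag(A)                # A_{N+1} is diagonal

rng = np.random.default_rng(3)
N = 6
A = (np.diag(rng.uniform(4, 5, N))
     + np.diag(rng.uniform(-1, 1, N - 1), 1)
     + np.diag(rng.uniform(-1, 1, N - 1), -1))
v = rng.standard_normal(N)
assert np.allclose(gauss_jordan_solve(A, v), np.linalg.solve(A, v))
```

Since each step clears column i both above and below the diagonal, no back substitution remains at the end; this is exactly the extra arithmetic that makes the method uncompetitive with the LU factorization in serial operation counts.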
3.C. Parallel Computation

We close this section with some remarks on the inherent parallelism of the methods and their use on vector computers; see also [H3]. Our comparison of algorithms is expressed in terms of vector operations, which are natural for a pipeline processor. With the general block methods we often cannot make efficient use of a pipeline processor, since the vector lengths might not be long enough to overcome the vector startup costs; the rather large startup costs (Section 10.C) will often be enough to disqualify these approaches, so that the block operations are executed in scalar mode. On the other hand, the solution of a triangular system such as d_j u_j = -c_j is naturally expressed as eliminations on rows of the augmented matrix (d_j | -c_j), and these row operations can be executed efficiently in parallel.

The storage representation of the blocks is a further consideration. When the blocks are small enough they can be represented as full matrices; when the systems are originally derived from elliptic equations the problems created by fill-in are well-known, and a general sparse representation can be used instead. The block LU factorization (or a reduction form of it) and the band LU factorization are then competitive approaches. We also note that the final stage of Babuska's method is the solution of a block diagonal system, which is entirely parallel, and that the forward (LU) and backward (UL) eliminations of the stable implicit method can themselves be executed in parallel.
4. QUADRATIC METHODS

In Section 3 we have outlined some properties of the block LU factorization and block Gauss-Jordan elimination. These methods are characterized by the generation of a sequence of matrices M_i such that ||B[M_{i+1}]|| ≤ ||B[M_i]||, and from this inequality we adopted the name "linear methods". Now we consider some quadratic methods, which are characterized by the inequality

    ||B[M_{i+1}]|| ≤ ||B[M_i]^2||.

There are two basic quadratic methods, odd-even elimination and odd-even reduction. The latter usually goes under the name cyclic reduction, and can be regarded as a compact version of the former. We will show that odd-even elimination is a variation of the quadratic Newton iteration for A^{-1}, while odd-even reduction is a special case of the more general cyclic reduction technique of Hageman and Varga [H1] and is equivalent to block elimination on a permuted system. Both methods are ideally suited for use on a parallel computer, as many of the quantities involved may be computed independently of the others [H3]. Semidirect methods based on the quadratic convergence properties are discussed in Section 7.

4.A. Odd-Even Elimination

We first consider odd-even elimination. Pick three consecutive equations from the system Ax = v, A = (a_j, b_j, c_j):

    a_{k-1} x_{k-2} + b_{k-1} x_{k-1} + c_{k-1} x_k = v_{k-1}
    a_k x_{k-1} + b_k x_k + c_k x_{k+1} = v_k                        (4.1)
    a_{k+1} x_k + b_{k+1} x_{k+1} + c_{k+1} x_{k+2} = v_{k+1}.

If we multiply equation k-1 by -a_k b_{k-1}^{-1}, equation k+1 by -c_k b_{k+1}^{-1}, and add, the result is

    (-a_k b_{k-1}^{-1} a_{k-1}) x_{k-2}
      + (b_k - a_k b_{k-1}^{-1} c_{k-1} - c_k b_{k+1}^{-1} a_{k+1}) x_k
      + (-c_k b_{k+1}^{-1} c_{k+1}) x_{k+2}                          (4.2)
      = v_k - a_k b_{k-1}^{-1} v_{k-1} - c_k b_{k+1}^{-1} v_{k+1}.
For k = 1 or N there are only two equations involved, and the necessary modifications should be obvious. The name odd-even comes from the fact that if k is even then the new equation has eliminated the odd unknowns and left the even ones. By collecting the new equations into a matrix equation H_2 x = v^{(2)}, it is seen that the row eliminations have preserved the fact that the matrix has only three non-zero diagonals, but they are now farther apart. This is not the only possible set of row operations to achieve this effect, but others differ only by scaling and possible use of a commutativity condition. A similar set of operations may again be applied, combining equations k-2, k and k+2 of H_2 to produce a new equation involving the unknowns x_{k-4}, x_k and x_{k+4}.
These new equations form the system H_3 x = v^{(3)}.

The transformations from one system to the next may be succinctly expressed in the following way. Let H_1 = A, v^{(1)} = v, m = ⌈log_2 N⌉, and define

    H_{i+1} = (I + C[H_i]) H_i,    v^{(i+1)} = (I + C[H_i]) v^{(i)},    i = 1, 2, ..., m.

Block diagrams of the H sequence for N = 15 are shown in Figure 4.1a. The distance between the three non-zero diagonals doubles at each stage until the block diagonal matrix H_{m+1} is obtained; the solution x is then obtained by x = H_{m+1}^{-1} v^{(m+1)}.
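For 1×1 blocks the H sequence can be sketched directly on a dense array (Python/NumPy; the point is the doubling of the diagonal spacing, not efficiency):

```python
import numpy as np

def odd_even_solve(A, v):
    """H_1 = A; H_{i+1} = (I + C[H_i]) H_i = (2I - H_i D[H_i]^{-1}) H_i."""
    H = A.astype(float).copy()
    w = v.astype(float).copy()
    N = H.shape[0]
    for _ in range(int(np.ceil(np.log2(N)))):    # m = ceil(log2 N) stages
        T = 2 * np.eye(N) - H / np.diag(H)       # I + C[H_i]
        H = T @ H                                # off-diagonal distance doubles
        w = T @ w
    return w / np.diag(H)                        # H_{m+1} is diagonal

rng = np.random.default_rng(4)
N = 15
A = (np.diag(rng.uniform(4, 6, N))
     + np.diag(rng.uniform(-1, 1, N - 1), 1)
     + np.diag(rng.uniform(-1, 1, N - 1), -1))
v = rng.standard_normal(N)
assert np.allclose(odd_even_solve(A, v), np.linalg.solve(A, v))
```

Each stage combines every row with its two neighbors at the current spacing, so all N new equations can be formed simultaneously; this is the source of the method's parallelism.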
    [Figure 4.1a. H_i, i = 1, ..., m+1; N = 15]

    [Figure 4.1b. H_i, i = 1, ..., m+1; N = 16]
We note that the same elimination principle may be applied to matrices of the periodic tridiagonal form

    [ b_1  c_1                     a_1     ]
    [ a_2  b_2  c_2                        ]
    [      a_3  b_3  c_3                   ]
    [            .    .    .               ]
    [              a_{N-1}  b_{N-1}  c_{N-1} ]
    [ c_N                   a_N      b_N   ].

It is only necessary to include row operations that "wrap around" the matrix, such as

    row(N) - c_N b_1^{-1} row(1) - a_N b_{N-1}^{-1} row(N-1).

An H sequence for N = 16 is diagrammed in Figure 4.1b, and is constructed by the same general rule as before, though there are serious and interesting complications if N is not a power of 2. In fact, we can define an H sequence for any partitioned matrix A, and there are independent conditions under which the sequence may be continued.

Theorem 4.1. If ||B[A]|| < 1 then ||B[H_{i+1}]|| ≤ ||B[H_i]^2|| and ||H_{i+1}|| ≤ ||H_i||. If ||C[A]||_1 < 1 then ||C[H_{i+1}]||_1 ≤ ||C[H_i]^2||_1 and ||H_{i+1}||_1 ≤ ||H_i||_1. Thus the convergence of ||B[H_i]|| is quadratic. These assumptions are independent of one another, and will be kept separate in the proof.

Proof. Suppose ||B[H_i]|| < 1 or ||C[H_i]||_1 < 1. We have

    H_{i+1} = (I + C[H_i]) H_i = (2I - H_i D[H_i]^{-1}) H_i = H_i (2I - D[H_i]^{-1} H_i).

Since H_i = D[H_i](I - B[H_i]) = (I - C[H_i]) D[H_i], this yields

    H_{i+1} = D[H_i](I - B[H_i]^2) = (I - C[H_i]^2) D[H_i].

Setting S = D[B[H_i]^2],
T = D[C[H_i]^2], we obtain

    D[H_{i+1}] = D[H_i](I - S) = (I - T) D[H_i].

Since ||S|| ≤ ||B[H_i]^2|| < 1, ||T||_1 ≤ ||C[H_i]^2||_1 < 1, and D[H_i] is invertible, it follows that D[H_{i+1}] is invertible. Now

    B[H_{i+1}] = I - D[H_{i+1}]^{-1} H_{i+1} = I - (I - S)^{-1} D[H_i]^{-1} D[H_i](I - B[H_i]^2),

so B[H_i]^2 = (I - S) B[H_{i+1}] + S. Since S is block diagonal and the block diagonal part of B[H_{i+1}] is null, we have |B[H_i]^2| = |(I - S) B[H_{i+1}]| + |S|. Lemma 2.2 now applies, and we conclude ||B[H_{i+1}]|| ≤ ||B[H_i]^2||. Similarly, C[H_i]^2 = C[H_{i+1}](I - T) + T, and we conclude ||C[H_{i+1}]||_1 ≤ ||C[H_i]^2||_1.

To show ||H_{i+1}|| ≤ ||H_i|| if ||B[H_i]|| < 1, write

    H_{i+1} = D[H_i] + (H_i - D[H_i]) B[H_i].

We then have, for 1 ≤ l ≤ Σ_{j=1}^{N} n_j,

    ρ_l(H_{i+1}) ≤ ρ_l(D[H_i]) + ρ_l(H_i - D[H_i]) ||B[H_i]||
                ≤ ρ_l(D[H_i]) + ρ_l(H_i - D[H_i]) = ρ_l(H_i).

To show ||H_{i+1}||_1 ≤ ||H_i||_1 if ||C[H_i]||_1 < 1, write

    H_{i+1} = D[H_i] + C[H_i](H_i - D[H_i]).

We have

    γ_l(H_{i+1}) ≤ γ_l(D[H_i]) + ||C[H_i]||_1 γ_l(H_i - D[H_i])
                ≤ γ_l(D[H_i]) + γ_l(H_i - D[H_i]) = γ_l(H_i).

For a block tridiagonal matrix A we have B[H_{m+1}] = 0, since H_{m+1} is block diagonal; thus the process stops at this point. QED

The proof of Theorem 4.1 implicitly uses the convexity-like formulae

    H_{i+1} = H_i B[H_i] + D[H_i](I - B[H_i]) = C[H_i] H_i + (I - C[H_i]) D[H_i].
As with Theorems 3.2 and 3.3, it is not difficult to find counterexamples to show that Theorem 4.1 does not hold for arbitrary matrix norms. In particular, it is not true that ||B[A]||_2 < 1 implies ||B[H_2]||_2 ≤ ||B[H_1]^2||_2, or even ||B[H_2]||_2 < 1. The conclusions of Theorem 4.1 are that the matrix as a whole does not increase in size, and that the off-diagonal blocks decrease in size relative to the diagonal blocks, rather than individually on an absolute scale; the diagonal blocks will always be relatively larger than the off-diagonals, though they may be quite small. The rate of decrease of the relative measure of diagonal dominance is quadratic, though if ||B[A]|| is very close to 1 the actual decrease may be quite small initially.

Indeed, consider the matrix A = (-1, 2+ε, -1), for which the ratio of main diagonal to off-diagonal elements in the interior rows of H_i is 2 + ε_i, where

    ε_1 = ε,    ε_{k+1} = 4 ε_k + ε_k^2.

For small ε the off-diagonal elements of H_i will be about 2^{1-i} in magnitude and the main diagonal elements about 2^{2-i}, so that a periodic rescaling of the main diagonal would perhaps be desirable, though this is not a serious problem.

As already hinted in the above remarks, a more refined estimate of ||B[H_{i+1}]|| may be made when dealing with tridiagonal matrices with constant diagonals [S2]. Let A = (a, b, a), 2|a| < |b|. Then

    H_2 = [ b - a^2/b      0        -a^2/b                                ]
          [     0      b - 2a^2/b      0       -a^2/b                     ]
          [  -a^2/b        0       b - 2a^2/b     0       .               ]
          [             -a^2/b        0          .        .     -a^2/b    ]
          [                           .          .    b - 2a^2/b   0      ]
          [                        -a^2/b        0        0    b - a^2/b  ].

H_2 does not have constant diagonals, nor will any of the other H matrices, but most of the rows of H_2 are essentially identical, and it is these rows which determine ||B[H_2]||. Note that

    ||B[H_2]|| = 2|a^2/b| / |b - 2a^2/b| = ||B[H_1]||^2 / (2 - ||B[H_1]||^2),

and thus (1/2) ||B[H_1]^2|| ≤ ||B[H_2]|| ≤ ||B[H_1]^2||. This property will be important for the odd-even reduction algorithms.

When A = (a_j, b_j, c_j) is block diagonally dominant, we can write, with H_i = (a_j^{(i)}, b_j^{(i)}, c_j^{(i)}) in the natural notation,

    ω_j^{(i)} = ||(b_j^{(i)})^{-1} a_j^{(i)}|| + ||(b_j^{(i)})^{-1} c_j^{(i)}||,    ω^{(i)} = max_j ω_j^{(i)}.

Without proving it here, we can show that if ω^{(1)} < 1 then ω^{(i+1)} ≤ (ω^{(i)})^2; an irreducibility condition plus ω^{(1)} ≤ 1 yields the same conclusion. We also note that if A is symmetric then so is H_i, i ≥ 1, and thus ||B[H_i]|| = ||C[H_i]||_1.
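The recurrence ε_{k+1} = 4 ε_k + ε_k^2 is easy to check numerically from the interior-row update of the constant-diagonal case, where one elimination step maps (a, b) to (-a^2/b, b - 2a^2/b); a small sketch:

```python
# For A = (-1, 2+eps, -1): writing b/|a| = 2 + eps_i, one odd-even step
# gives b'/|a'| = b^2/a^2 - 2 = (2 + eps_i)^2 - 2 = 2 + 4*eps_i + eps_i^2.
eps = 0.01
a, b = -1.0, 2.0 + eps
for _ in range(4):
    eps_next = 4 * eps + eps * eps            # predicted eps_{k+1}
    a, b = -a * a / b, b - 2 * a * a / b      # interior-row elimination
    assert abs(b / abs(a) - (2.0 + eps_next)) < 1e-9
    eps = eps_next
```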
Theorem 4.2. If A is block tridiagonal and positive definite then so is H_i, i ≥ 1, and ρ(B[H_i]) < 1.

Proof. First we observe that each H_i is a 2-cyclic matrix [V3], since it can be permuted into block tridiagonal form. Suppose that H_i is positive definite. Since D_i = D[H_i] is also positive definite, there exists a positive definite matrix E_i such that D_i = E_i^2. Let Ĥ_i = E_i^{-1} H_i E_i^{-1}, so that

    Ĥ_{i+1} = E_i^{-1} H_{i+1} E_i^{-1} = 2 Ĥ_i - Ĥ_i^2,

and H_i and H_{i+1} are positive definite iff Ĥ_i and Ĥ_{i+1} are. Since

    B̂_i = I - Ĥ_i = E_i B_i E_i^{-1},

B_i and B̂_i have the same eigenvalues. B̂_i is symmetric and 2-cyclic, so its eigenvalues occur in ± pairs, and the positive definiteness of Ĥ_i = I - B̂_i then forces ρ(B̂_i) < 1. Let λ be an eigenvalue of Ĥ_i; then 1 - λ is an eigenvalue of B̂_i, so 0 < λ < 2, and 2λ - λ^2 is an eigenvalue of Ĥ_{i+1} with 0 < 2λ - λ^2 ≤ 1. Thus Ĥ_{i+1} is positive definite, and ρ(B_{i+1}) = ρ(B̂_{i+1}) = ρ(I - Ĥ_{i+1}) < 1. QED
Any
Newton
H..l
convergence
If
observed
sequence
entirely
unexpected
Kung
[K2]
for
Kung
and
Traub
The
Newton
certain
sense,
We
seen
step
might
that
it
still
P(Bi)<
2
I ' B i=
I also.
Let
I -
X be
aim
on
ZY)
of
=
the
diagonal
as
=
<
I, and
so we
eigenvalue
(2I-
YZ)Y,
and
y(i+l)
of
H
should
have
= 2H.-H2_
l
i
i+l
(I-
9_(y(i),z)
matrix
and
applied
to
idea
D[Hi].
Changing
destroy
quadratic
The
by
the
cf.
such
aiming
goal
Nor
Yun
is not
is
it
[YI]
and
computations.
proceeds
in
a
towards
tridiagonal
it.
matrix
iteration
the
of
' then
because,
a block
general
matrix
techniques.
always
convergence,
IIB[H i]ll"
two.
effective
Z -I
any
H sequence
in algebraic
of
is
is
the
zY(i))2[H8].
the
appear,
unification
to mind
= _(y(i),z)
the
method
calls
y(0)
=
between
Newton's
this
4.1
]-I) , so while
matrix.
sequence
Theorem
(I - ZY (i+l))
relation
elimination
in the
is an
II" II_,
method
incorporates
to
such
_(Hi,D[Hi
y(i+l)
matrix
X2
P(Bi)
QED
Any occurrence of quadratic convergence calls Newton's method to mind, and the H sequence does incorporate a Newton iteration in a certain sense. Newton's method applied to the computation of Z^{-1} is

    Y^(i+1) = Y^(i)(2I - Z Y^(i)) = (2I - Y^(i) Z) Y^(i),

which converges quadratically since I - Z Y^(i+1) = (I - Z Y^(i))^2 [H8]; if ||I - Z Y^(0)|| < 1 then Y^(i) converges to Z^{-1}. Defining φ(Y,Z) = (2I - YZ)Y, we have Y^(i+1) = φ(Y^(i), Z) and

    H_{i+1} = 2H_i - H_i D[H_i]^{-1} H_i = φ(H_i, D[H_i]^{-1}).

Thus odd-even elimination takes one Newton-like step at a time, the ith step aiming towards the block diagonal matrix D[H_i]. Changing the aim at each step might be expected to destroy the quadratic convergence, but we have seen that it does not, for ||B[H_{i+1}]|| ≤ ||B[H_i]||^2. The observed quadratic convergence is therefore not entirely unexpected, though it is not a result of the general techniques for algebraic computations of Kung [K2], Kung and Traub [K3], or Yun [Y1]. It may eventually prove useful to know that odd-even elimination is a close relation of a powerful and effective iteration; in Section 5 this unification of Newton's method, odd-even elimination, and block elimination yields some other results.
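The squaring of residuals in the Newton iteration can be seen in a small numerical example (our own illustration, not from the text):

```python
import numpy as np

n = 5
Z = 5.0 * np.eye(n) + (np.ones((n, n)) - np.eye(n))  # diagonally dominant target
Y = np.eye(n) / 5.0                                  # ||I - Z Y||_inf = 0.8 < 1

residuals = []
for _ in range(7):
    residuals.append(np.linalg.norm(np.eye(n) - Z @ Y, np.inf))
    Y = Y @ (2.0 * np.eye(n) - Z @ Y)    # Newton step: Y <- Y(2I - ZY)

# I - Z Y^(i+1) = (I - Z Y^(i))^2, hence at least squaring of the norm
for r_now, r_next in zip(residuals, residuals[1:]):
    assert r_next <= r_now**2 + 1e-12
assert np.allclose(Y, np.linalg.inv(Z))
```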
4.B. Variants of Odd-Even Elimination

There are a number of variations of the odd-even methods, mostly discussed originally in terms of the discrete form of the Poisson equation. For constant diagonals A = (a,b,a) with ab = ba, Hockney [H5] multiplies equation k-1 by -a, equation k by b, equation k+1 by -a, and adds to obtain

    (-a^2) x_{k-2} + (b^2 - 2a^2) x_k + (-a^2) x_{k+2} = (b v_k - a v_{k-1} - a v_{k+1}).

This is a variation of the elimination formulation of odd-even reduction [H2]. With the use of Fourier analysis, the block tridiagonal system Ax = v arising from the discrete Poisson equation is transformed into a collection of tridiagonal systems of the form (-1, λ, -1)(s_j) = (t_j), where λ ≥ 2 and often λ >> 2. It was Hockney's observation that the reduction, which generates a sequence of Toeplitz tridiagonal matrices (-1, λ^(i), -1), could be stopped when 1/λ^(i) fell below the machine precision; the tridiagonal system was then essentially diagonal and could be solved as such without major damage to the solution. The basic reduction is prone to numerical instability, and a stabilization is the Buneman formulation (see [B20] and the discussion below). Stone [S2], in recommending cyclic reduction for the parallel solution of general tridiagonal systems, independently proved quadratic convergence of the ratios 1/λ^(i), and thus of the B matrices, for the constant diagonal case. Jordan [J1] extended Stone's analysis to the block tridiagonal case, and this observation was a major part of the origin of our own work. The results given here apply to general block tridiagonal systems, and are not restricted solely to the Poisson equation or its variations.
We now show that quadratic convergence results hold for other formulations of odd-even elimination; we will see later how these results carry over to odd-even reduction. In order to subsume several algorithms into one, consider the sequences

    H_1 = A,   v^(1) = v,
    H_{i+1} = F_i (2D[H_i] - H_i) G_i H_i,
    v^(i+1) = F_i (2D[H_i] - H_i) G_i v^(i),

where F_i, G_i are nonsingular block diagonal matrices. The sequence H_i corresponds to the choices F_i = I, G_i = D[H_i]^{-1}.

In the Hockney-Buneman algorithm for Poisson's equation with Dirichlet boundary conditions on a rectangle we have A = (a,b,a), ab = ba, a^T = a, b^T = b, and

    F_i = diag(b^(i)),   G_i = D[H_i]^{-1},
    b^(1) = b,   b^(i+1) = b^(i)2 - 2a^(i)2,
    a^(1) = a,   a^(i+1) = -a^(i)2.

Sweet [S11] also uses these choices to develop an odd-even reduction algorithm. We again note that the H_i matrices lose the constant diagonal property for i ≥ 2, but symmetry is preserved. Each block of H_i may be represented as a bivariate rational function of a and b; the scaling by F_i ensures that the blocks are actually polynomials in a and b. In fact, most block rows are identical: the diagonal blocks of these rows will be b^(i+1) and the non-zero off-diagonal blocks -a^(i+1).

Swarztrauber [S6] describes a Buneman-style algorithm for separable elliptic equations on a rectangle. This leads to A = (a_j, b_j, c_j) where the blocks are all n × n, and a_j = α_j I_n, b_j = T + β_j I_n, c_j = γ_j I_n, with T tridiagonal. In this case the blocks commute with each other and may be expressed as polynomials in T. Swarztrauber's algorithm essentially uses

    F_i = diag(LCM(b^(i)_{j-2^{i-1}}, b^(i)_{j+2^{i-1}})),   G_i = D[H_i]^{-1},

where LCM = least common multiple of the matrices considered as polynomials in T, and b_j^(i) = I for j ≤ 0 or j ≥ N + 1. Note that for tridiagonal matrices (n = 1) this reduces essentially to the earlier scalings.

In all cases the scaled sequences are related to the basic one by premultiplication with block diagonal matrices, Ĥ_{i+1} = F_i···F_1 H_{i+1} and v̂^(i+1) = F_i···F_1 v^(i+1), which leaves the solution invariant; moreover B[F H] = B[H] for block diagonal F, so the convergence theory for the basic sequence applies to these variations as well. In Section 2.A we showed that growth of the B matrices signals instability, which is the reason for the Buneman modification of the v^(i) sequence considered by Stone [S2]. For the basic choices F_i = I, G_i = D_i^{-1}, define the Buneman sequences

    p^(1) = 0,   q^(1) = v,
    p^(i+1) = p^(i) + D_i^{-1}(Q_i p^(i) + q^(i)),
    q^(i+1) = v^(i+1) - D_{i+1} p^(i+1),

where D_i = D[H_i] and Q_i = D_i - H_i, so that v^(i) = D_i p^(i) + q^(i) throughout; the point of the modification is that q^(i) remains small and stability is preserved.

Theorem 4.3. Let P_1 = I, P_i = B[H_{i-1}]···B[H_1] for i ≥ 2. Then

    p^(i) = (I - P_i)x,   q^(i) = (D_i P_i - Q_i)x.

Remark. Although not expressed in this form, Buzbee et al. [B20] prove similar formulae for the special cases of odd-even reduction applied to Poisson's equation.
Proof. To begin the induction, p^(1) = 0 = (I - P_1)x and q^(1) = v = H_1 x = (D_1 P_1 - Q_1)x. Suppose p^(i) = (I - P_i)x, q^(i) = (D_i P_i - Q_i)x. Then

    p^(i+1) = (I - P_i)x + D_i^{-1}(Q_i(I - P_i) + D_i P_i - Q_i)x
            = (I - P_i + P_i - D_i^{-1} Q_i P_i)x
            = (I - B[H_i] P_i)x
            = (I - P_{i+1})x,

and

    q^(i+1) = v^(i+1) - D_{i+1} p^(i+1)
            = H_{i+1} x - D_{i+1}(I - P_{i+1})x
            = (D_{i+1} P_{i+1} - (D_{i+1} - H_{i+1}))x
            = (D_{i+1} P_{i+1} - Q_{i+1})x.   QED
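Theorem 4.3 can be verified numerically. In the sketch below (ours; a randomly generated, strictly diagonally dominant scalar tridiagonal stands in for A), the invariants p^(i) = (I - P_i)x and q^(i) = (D_i P_i - Q_i)x are asserted at every step:

```python
import numpy as np

def D(M):
    """Diagonal part of M (scalar blocks)."""
    return np.diag(np.diag(M))

n = 15
rng = np.random.default_rng(1)
A = (np.diag(4.0 + rng.random(n))
     + np.diag(-1.0 + 0.2 * rng.random(n - 1), 1)
     + np.diag(-1.0 + 0.2 * rng.random(n - 1), -1))
v = rng.random(n)
x = np.linalg.solve(A, v)

H, w = A.copy(), v.copy()          # H_1 = A, v^(1) = v
p, q = np.zeros(n), v.copy()       # p^(1) = 0, q^(1) = v
P = np.eye(n)                      # P_1 = I
for i in range(4):
    Di, Qi = D(H), D(H) - H
    # invariants of Theorem 4.3
    assert np.allclose(p, (np.eye(n) - P) @ x)
    assert np.allclose(q, (Di @ P - Qi) @ x)
    p = p + np.linalg.solve(Di, Qi @ p + q)       # p^(i+1)
    w = (2 * Di - H) @ np.linalg.solve(Di, w)     # v^(i+1)
    P = np.linalg.solve(Di, Qi) @ P               # P_{i+1} = B[H_i] P_i
    H = (2 * Di - H) @ np.linalg.solve(Di, H)     # H_{i+1}
    q = w - D(H) @ p                              # q^(i+1)
```

Since ||P_i|| decreases quadratically, p rapidly approaches x, which is the basis of the semidirect use of the p sequence discussed below.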
We note that ||D_i P_i - Q_i|| ≤ ||D_i|| ||P_i|| + ||Q_i||, so that ||q^(i)|| ≤ (||D_i|| ||P_i|| + ||Q_i||) ||x||. It also follows that p^(i) will be an approximation to x: ||x - p^(i)|| = ||P_i x|| ≤ ||P_i|| ||x||, and if β = ||B[A]|| < 1 then ||P_i|| ≤ β^(2^{i-1} - 1). Thus stopping the process early yields a semidirect method; Buzbee [B17] has taken advantage of this fact for Poisson's equation.

4.C. Odd-Even Reduction

4.C.1. The Basic Algorithm

Returning to the original linear system Ax = v, (4.1), the groundwork for the derivation of the odd-even reduction algorithms has now been laid. The basic algorithm collects the even-numbered equations of odd-even elimination (4.2) into a smaller block tridiagonal system A^(2) x^(2) = w^(2), where

    A^(2) = (-a_{2j} b_{2j-1}^{-1} a_{2j-1},
             b_{2j} - a_{2j} b_{2j-1}^{-1} c_{2j-1} - c_{2j} b_{2j+1}^{-1} a_{2j+1},
             -c_{2j} b_{2j+1}^{-1} c_{2j+1})_{N_2},

    x^(2) = (x_{2j})_{N_2},

    w^(2) = (v_{2j} - a_{2j} b_{2j-1}^{-1} v_{2j-1} - c_{2j} b_{2j+1}^{-1} v_{2j+1})_{N_2},   N_2 = ⌊N/2⌋.

This is the reduction step, and it is clearly a reduction since A^(2) is half the size of A^(1) = A. Once A^(2) x^(2) = w^(2) has been solved for x^(2) we can compute the remaining components of x^(1) = x by the back substitution

    x_{2j-1} = b_{2j-1}^{-1}(v_{2j-1} - a_{2j-1} x_{2j-2} - c_{2j-1} x_{2j}),   j = 1, ..., ⌈N/2⌉.

Since A^(2) is block tridiagonal we can apply the reduction recursively, to obtain a sequence of systems A^(i) x^(i) = w^(i), A^(i) = (a_j^(i), b_j^(i), c_j^(i))_{N_i}, x^(i) = (x_{j 2^{i-1}})_{N_i}, N_1 = N, N_{i+1} = ⌊N_i/2⌋ (note the change in notation: x_j^(i) = x_{j 2^{i-1}}). This procedure stops with A^(m) x^(m) = w^(m), m = ⌈log_2(N+1)⌉, since A^(m) consists of a single block. At this point the back substitution begins, culminating in x^(1) = x.

Before proceeding, we note that if N = 2^m - 1 then N_i = 2^{m+1-i} - 1. The algorithms mentioned earlier have usually been restricted to this particular choice of N, but this is not necessary. For the special case A = (a,b,a) there are a number of important simplifications that can be made; typically these involve scalings of the diagonal blocks during the reduction which effectively avoid the inversion of a single block. It is also not necessary to continue the reduction until A^(m) consists of a single block: A^(k) x^(k) = w^(k) can be solved more efficiently by some other method, which leads to various polyalgorithms, and to the semidirect methods of Section 7 in which the reduction is stopped with an approximate solution.
choice
- a2j_ix2j_2
to continue
can be solved more
polyalgorithms
in which
of blocks
the reduction
efficiently
by some other
and the semidirect
A (i) has been constructed,
taken from H i.
For example,
recursively
methods
method.
of Section
it is clear
in Figure
if
4.1a,
This
7.
that it
A (4) is
50
the (8,8) block
of H4.
It follows
immediately
that
IIA(i) II< IIHill ,
IIB[A(i)]II _ IIB[Hi]II, and
IIC[A(i)]III _ IIC[H(i)]II I •
of Theorem
have
Corollary
4. I we actually
4.1.
If
IIB[A]II _< I then
llA(i+l) l]_ ]IA(i) ll. If
and
As a corollary
IIB[A(i+I)]II _ IIB[A(i)]211
IIC[A]III < I then
IIC[A(i+l)]Ill
and
_ IIC[A(i) ]2111
IIA(i+l)II 1 _ IIA(i) II
l-
Recalling
Theorem
%(i+I)
Proof.
4.4.
the definition
If X(A) _ I then, with
_ (%(i)/(2
It will
b J (2)=
of Theorem
3.4,
let %(A) -- maxj(41_] lajllllb-llcj_
lj
%(i) = %(A(i)), we have
- _(i)))2.
suffice
b2 j (i-
to consider
i = I.
We have
_2j ) '
_2j = b21ja2jb2j-i IC2j - I + b21jc2jb2j+la2j+l
-I
a(2)
-I
j
= -a2jb2j_ la2j_ I ,
c J(2)
o
= -c2jb -lj
2 +IC2j+l '
IIo2jll _ IIb21jamj II IIb21j_Icmj_lll +
IIb21j+la2j+lll
ll(l-_2j)-lll< (I-
_),
_2) -I = 2/(2-
-i
llbmjC2jll
< 2(_/4)
= _2,
411 b(2)-la(2)j
j
II IIb(m)-l-(mj_l
cj-_ II
< 411(1 - _2j)-lll II
(I - _2j_2)-III (IIb21ja2jll IIb2j_ic2j_lll
i
I
)
• (IIb2j_la2j - iii IIb21j-2Cmj-2 II)
4(2_2
Thus
- _))2(_4)2
_(2) _ (_(I)/(2
= (_(2
- _(I) ) )2 •
- _))2.
QED
II)
51
This may be compared
which
to earlier
gives an equality
results
of the form
for constant
diagonal
_(2) = _2/( 2 _ _2).
matrices,
See also
Theorem
4.8 of IV3].
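In the constant diagonal case the bound of Theorem 4.4 is attained, which can be checked directly (a small numerical illustration of ours):

```python
# constant diagonal case of Theorem 4.4: the bound is an equality
a, b = -1.0, 4.0                     # A = (a, b, a)
lam1 = 4 * (a / b) ** 2              # lambda(A) = 4|b^-1 a| |b^-1 c| = 0.25
a2 = -a * a / b                      # interior blocks of A^(2)
b2 = b - 2 * a * a / b
lam2 = 4 * (a2 / b2) ** 2
assert abs(lam2 - (lam1 / (2 - lam1)) ** 2) < 1e-12   # equality
assert lam2 <= lam1 ** 2                              # quadratic decrease
```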
4.C.2. Relation to Block Elimination

By regarding odd-even reduction as a compact form of block elimination we are able to relate it to the LU factorization and to the Newton iteration. It is possible to express the reduction as block elimination on a permuted system (PAP^T)(Px) = (Pv) (cf. [B11], [E3], [W2]); in Section 5 we show how this formulation also yields quadratic convergence theorems. Specifically, let P be the permutation matrix which reorders the vector (1,2,...,N)^T so that the odd multiples of 2^0 come first, followed by the odd multiples of 2^1, the odd multiples of 2^2, etc. For N = 7, P(1,2,...,7)^T = (1,3,5,7,2,6,4)^T, and

    PAP^T = ( b1  .   .   .   c1  .   .  )
            ( .   b3  .   .   a3  .   c3 )
            ( .   .   b5  .   .   c5  a5 )
            ( .   .   .   b7  .   a7  .  )
            ( a2  c2  .   .   b2  .   .  )
            ( .   .   a6  c6  .   b6  .  )
            ( .   a4  c4  .   .   .   b4 ).
The block form of odd-even reduction is the factorization PAP^T = LU, where

    L = ( I                                                          )
        ( .       I                                                  )
        ( .       .       I                                          )
        ( .       .       .       I                                  )
        ( a2b1^-1 c2b3^-1 .       .       I                          )
        ( .       .       a6b5^-1 c6b7^-1 .              I           )
        ( .       a4b3^-1 c4b5^-1 .  a2^(2)(b1^(2))^-1  c2^(2)(b3^(2))^-1  I ),

    U = ( b1  .   .   .   c1      .       .       )
        ( .   b3  .   .   a3      .       c3      )
        ( .   .   b5  .   .       c5      a5      )
        ( .   .   .   b7  .       a7      .       )
        ( .   .   .   .   b1^(2)  .       c1^(2)  )
        ( .   .   .   .   .       b3^(2)  a3^(2)  )
        ( .   .   .   .   .       .       b1^(3)  ).

If we let b_j^(i) = ℓ_j^(i) u_j^(i) be an LU factorization with ℓ (u) lower (upper) triangular, then the ordinary LU factorization PAP^T = L'U' is

    L' = ( ℓ1                                                        )
         ( .       ℓ3                                                )
         ( .       .       ℓ5                                        )
         ( .       .       .       ℓ7                                )
         ( a2u1^-1 c2u3^-1 .       .       ℓ1^(2)                    )
         ( .       .       a6u5^-1 c6u7^-1 .             ℓ3^(2)      )
         ( .       a4u3^-1 c4u5^-1 .  a2^(2)(u1^(2))^-1  c2^(2)(u3^(2))^-1  ℓ1^(3) ),

    U' = ( u1  .   .   .   ℓ1^-1 c1  .         .              )
         ( .   u3  .   .   ℓ3^-1 a3  .         ℓ3^-1 c3       )
         ( .   .   u5  .   .         ℓ5^-1 c5  ℓ5^-1 a5       )
         ( .   .   .   u7  .         ℓ7^-1 a7  .              )
         ( .   .   .   .   u1^(2)    .         (ℓ1^(2))^-1 c1^(2) )
         ( .   .   .   .   .         u3^(2)    (ℓ3^(2))^-1 a3^(2) )
         ( .   .   .   .   .         .         u1^(3)         ).
This factorization gives rise to a related method, solving L'z = Pv first and then U'(Px) = z; it is easily seen that z consists of the blocks z_j^(i) = (ℓ_j^(i))^{-1} w_j^(i). We may now apply any of the theorems about Gaussian elimination to PAP^T and obtain corresponding conclusions about odd-even reduction, though these conclusions are weaker than those of Corollary 4.1. In particular, Theorem 3.2 applies to the permuted system with the above modification. It follows from Theorem 4.2 that A^(i) will be positive definite if A is, which allows a PAP^T = LL^T factorization as well as the conclusion that ρ(B[A^(i)]) < 1.
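The permutation and the resulting structure are easy to exhibit (our own sketch; `oe_permutation` is a hypothetical helper name). Note that the leading block of PAP^T, corresponding to the odd-numbered unknowns, is diagonal:

```python
import numpy as np

def oe_permutation(N):
    """Order 1..N as odd multiples of 1, then of 2, then of 4, ... (0-based)."""
    order, step = [], 1
    while step <= N:
        order.extend(range(step, N + 1, 2 * step))
        step *= 2
    return np.array(order) - 1

order = oe_permutation(7)
assert list(order + 1) == [1, 3, 5, 7, 2, 6, 4]      # the text's example

N = 7
A = (np.diag(np.full(N, 4.0)) + np.diag(np.full(N - 1, -1.0), 1)
     + np.diag(np.full(N - 1, -1.0), -1))
PAPt = A[np.ix_(order, order)]
D1 = PAPt[:4, :4]          # rows/columns of the odd-numbered unknowns
assert np.allclose(D1, np.diag(np.diag(D1)))
```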
The odd-even reduction is also a special case of the more general cyclic reduction defined by Hageman and Varga [H1] for the iterative solution of large sparse systems. Since PAP^T is in the 2-cyclic form

    PAP^T = ( D1   A12 )
            ( A21  D2  ),   D1 and D2 block diagonal,

we can premultiply by

    (  I            0 )
    ( -A21 D1^{-1}  I )

to obtain

    ( D1   A12                    )
    ( 0    D2 - A21 D1^{-1} A12   ).

It is clear from this formulation that A^(2) = D2 - A21 D1^{-1} A12; the companion matrix D1 - A12 D2^{-1} A21 corresponds to a reduction which retains the odd-numbered unknowns instead. When A is an M-matrix it follows from [H1] that ρ(B[A^(i+1)]) ≤ ρ^2(B[A^(i)]) < 1.

Norm bounds on L and U determine the stability of the reduction and back substitution phases. For the block LU factorization of A with the natural ordering, given in Section 3.A.2, the norms of L and U are bounded in terms of ||B[A]|| and ||C[A]||_1 when ||B[A]|| < 1 and ||C[A]||_1 < 1; for the permuted factorization PAP^T = LU the corresponding bounds grow, but only logarithmically.
By the construction of L and U, with n = max_j n_j and m reduction levels, we have

    ||L||_∞ ≤ 1 + max_{1≤i≤m} ||C[A^(i)]||_∞,
    ||L||_1 ≤ 1 + Σ_{i=1}^m ||C[A^(i)]||_1,
    ||U||_1 ≤ 1 + Σ_{i=1}^m ||B[A^(i)]||_1.

Since C[A^(i)] is block tridiagonal, ||C[A^(i)]||_∞ ≤ 2n ||C[A^(i)]||_1; similarly ||B[A^(i)]||_1 ≤ 2n ||B[A^(i)]||. If ||B[A]|| < 1 and ||C[A]||_1 < 1, we then have

    ||L||_1 ≤ 1 + m ||C[A]||_1,
    ||L||_∞ ≤ 1 + 2n ||C[A]||_1,
    ||U||_1 ≤ 1 + 2nm ||B[A]||,
    ||U||_∞ ≤ 1 + m ||B[A]||.

Much like the proof of Corollary 3.4 we also have, if ||B[A]|| < 1,

    b_j^(i+1) = b_{2j}^(i)(I - φ_{2j}^(i)),   ||φ_{2j}^(i)|| ≤ ||B[A^(i)]^2|| = θ_i,

whence

    ||b_j^(i+1)|| ≤ ||b_{2j}^(i)||(1 + θ_i),
    ||(b_j^(i+1))^{-1}|| ≤ ||(b_{2j}^(i))^{-1}|| / (1 - θ_i),

and

    cond(b_j^(i+1)) ≤ cond(b_{2j}^(i))(1 + θ_i)/(1 - θ_i).
Since our interest is in bounding cond(b_j^(i+1)) in terms of the cond(b_k), we must simplify the above expression. Let K_i = Π_{j=1}^i (1 + θ_j)/(1 - θ_j), so that by induction

    cond(b_j^(i+1)) ≤ [max_k cond(b_k)] K_i.

Using θ_{j+1} ≤ θ_j^2, we obtain K_i ≤ ((1 + θ_1)/(1 - θ_1))^i. This is a rather unattractive upper bound, since it may be quite large when θ_1 is close to 1. Actually the situation is not so bad, since in fact ||φ_{2j}^(i)|| ≤ (1/2)||B[A^(i)]||^2, so the factors (1 + θ_j)/(1 - θ_j) may be replaced by (2 + θ_j)/(2 - θ_j). If λ(A) ≤ 1 then ||φ_{2j}^(i)|| ≤ λ^(i)/2, λ^(i) = λ(A^(i)), and

    cond(b_j^(i+1)) ≤ [max_k cond(b_k)] Π_{j=1}^i (2 + λ^(j))/(2 - λ^(j)),

a product which is close to 1 when λ(A) is small. When A is positive definite it follows from various eigenvalue-interlacing theorems that cond(A^(i)) ≤ cond(A).

4.C.3. Back Substitution

Now let us consider error propagation in the back substitution. Suppose y_k^(i+1) = x_k^(i+1) + ε_k^(i+1), 1 ≤ k ≤ N_{i+1}, defines the computed solution to A^(i+1) x^(i+1) = w^(i+1).
In the back substitution we compute

    y_{2j-1}^(i) = (b_{2j-1}^(i))^{-1}(w_{2j-1}^(i) - a_{2j-1}^(i) y_{j-1}^(i+1) - c_{2j-1}^(i) y_j^(i+1)) + δ_{2j-1}^(i),
    y_{2j}^(i) = y_j^(i+1),

where δ_{2j-1}^(i) accounts for the local rounding errors. The errors ε^(i) = y^(i) - x^(i) then satisfy

    ε_{2j-1}^(i) = -(b_{2j-1}^(i))^{-1} a_{2j-1}^(i) ε_{j-1}^(i+1) - (b_{2j-1}^(i))^{-1} c_{2j-1}^(i) ε_j^(i+1) + δ_{2j-1}^(i),
    ε_{2j}^(i) = ε_j^(i+1).

Now define the vectors

    z_1 = (δ_1^(i), 0, δ_3^(i), 0, ...)^T,
    z_2 = (0, ε_1^(i+1), 0, ε_2^(i+1), 0, ...)^T.

We have ||z_2|| = ||ε^(i+1)|| by construction, and

    ε^(i) = B^(i) z_2 + z_1 + z_2,   B^(i) = B[A^(i)].

But in consequence of the construction we actually have

    ||ε^(i)|| ≤ max(||B^(i) z_2 + z_1||, ||z_2||),

since the odd-numbered components of ε^(i) are those of B^(i) z_2 + z_1 and the even-numbered components are those of z_2. If ||B[A]|| < 1 then ||B^(i) z_2|| ≤ ||B^(i)|| ||z_2|| < ||z_2||. Suppose ||ε^(m)|| ≤ ε and ||δ^(i)|| ≤ ε for all i = 1, ..., m-1. Then

    ||ε^(i)|| ≤ max(||B^(i)|| ||z_2|| + ||z_1||, ||z_2||)
             ≤ ||z_2|| + ||z_1||
             = ||ε^(i+1)|| + ||δ^(i)||
             ≤ (m - (i+1) + 1)ε + ε = (m - i + 1)ε   (by induction on i),
and ||ε^(1)|| ≤ mε, m = ⌈log_2(N+1)⌉. This is certainly a satisfactory rate of error propagation for the back substitution. Indeed, it is a somewhat pessimistic upper bound, since it has already been seen that errors in the computed solution of A^(i+1) x^(i+1) = w^(i+1) will be damped by ||B^(i)||; with the quadratic decrease of the B matrices we can more reasonably expect that errors made early in the process are damped out quickly.

4.C.4. Storage Requirements

On the matter of storage requirements, we note that odd-even reduction does generate fill-in, owing to the off-diagonal blocks created during the reduction. Block storage of the matrix A requires 3N - 2 blocks, and the off-diagonal fill-in blocks of A^(2), ..., A^(m) number Σ_{i=2}^m 2(N_i - 1) < 2N, so this is not an excessive increase. However, we have not considered the internal fill-in structure of the blocks, as this is highly dependent on the specific system and the ordering of A, as discussed in Section 3.A; nor have we considered temporary storage for intermediate computations.

We do note, however, that some economies of storage may be achieved. For example, in the transition from A^(1) to A^(2), the first equation of A^(2) x^(2) = w^(2) is generated from the first three equations of A^(1) x^(1) = w^(1). The first and third of these must be saved for the back substitution, but there is no reason to save the second once A^(2) x^(2) = w^(2) is formed; thus the first equation of A^(2) x^(2) = w^(2) may overwrite the second equation of A^(1) x^(1) = w^(1). Similar considerations hold throughout the reduction, but some modifications are needed for the CDC STAR-100 due to its restrictions on vector addressing (see Section 10.B).
The special scalings of odd-even reduction derived for separable elliptic PDE's allow an implicit representation of the blocks of A^(i) as polynomials or rational functions of a single matrix. Since the blocks of A are quite sparse, this is necessary in order to limit intermediate storage to a small multiple of the original storage. Unfortunately, there does not seem to be a similar procedure that can be applied to systems derived from nonseparable PDE's. The only recourse is to store the blocks as dense matrices, and on current computing systems this will limit them to small sizes.
4.C.5. Special Techniques

Now suppose that A = (-p_j, t_j + p_j + q_j, -q_j), where p_j, q_j, t_j ≥ 0, p_1 = q_N = 0; we would like to apply Babuska's tricks (Section 3.A.4) to the factorization induced by odd-even reduction. This only partially succeeds. First, we can represent

    A^(2) = (-p_j^(2), t_j^(2) + p_j^(2) + q_j^(2), -q_j^(2)),
    p_j^(2) = p_{2j} b_{2j-1}^{-1} p_{2j-1},
    t_j^(2) = t_{2j} + p_{2j} b_{2j-1}^{-1} t_{2j-1} + q_{2j} b_{2j+1}^{-1} t_{2j+1},
    q_j^(2) = q_{2j} b_{2j+1}^{-1} q_{2j+1},
    b_k = t_k + p_k + q_k.

Note that t_{2j} ≤ t_j^(2) ≤ b_{2j}. On the other hand, the technique of avoiding the back substitution by combining the LU and UL factorizations will not work. Let

    A' = PAP^T = (I + L)(D + U) = (I + U*)(D* + L*),
    (I + L) f = Pv,   (I + U*) f* = Pv,

so that (D + U)(Px) = f, (D* + L*)(Px) = f*, and

    (D + D* - D[A'])(Px) = f + f* - (L* + D[A'] + U)(Px).

The only way to have A' = L* + D[A'] + U, and hence a block diagonal system for Px, is if no fill-in occurs in either factorization, in which case L* = L[A'], U = U[A']. But fill-in does occur in odd-even reduction, so the method does not work in its entirety. However, it will still be advantageous to develop the t^(i) sequences.

The total reduction of Schröder and Trottenberg [S1], for matrices drawn from finite difference approximations to PDE's on a rectangular grid, is closely related to odd-even reduction but is derived quite differently; it is a method matched more closely to the PDE than to the matrix, though the algorithm is expressible in matrix form. Part of the problem seems to be that total reduction does not apply to general block tridiagonal systems. (See [S1] for some examples.) Numerical stability of total reduction is discussed in [S1a], and [S1] also considers a "partial reduction", which stops the process early enough to prevent instability and obtains approximations to the solution. Various convergence results are undoubtedly obtainable by application of our techniques, but we have not yet determined the proper setting for their analysis.
4.D. Parallel Computation

Finally, the utility of the odd-even methods for parallel or pipeline computation has been demonstrated in several studies [H3]. Stone [S2], Lambiotte and Voigt [L2], and Madsen and Rodrigue [M1] discuss odd-even reduction for tridiagonal systems; it is shown in [M1] to be the best method to use when N is large (> 256) on a particular pipeline computer, the CDC STAR-100. Jordan [J1] considers odd-even elimination, which is most useful for middle ranges of N (64 ≤ N ≤ 256); for small values of N the sequential LU factorization is best, owing to the effect of lower order terms engendered by the pipeline mechanism. The basic method given here is a block analogue of the method examined in [L2].
The inherent parallelism of the odd-even methods is in the formation of the equations of A^(i+1) x^(i+1) = w^(i+1) or H_{i+1} x = v^(i+1), and in the back substitution for odd-even reduction. It is readily seen from (4.2) that H_2 and v^(2) may be computed from H_1 and v^(1) by

    compute LU factors of b_k,                                    (1 ≤ k ≤ N)
    solve b_k [â_k  ĉ_k  v̂_k] = [a_k  c_k  v_k],                  (1 ≤ k ≤ N)
    b_k^(2) ← b_k - a_k ĉ_{k-1} - c_k â_{k+1},                    (1 ≤ k ≤ N)
    v_k^(2) ← v_k - a_k v̂_{k-1} - c_k v̂_{k+1},                    (1 ≤ k ≤ N)
    a_k^(2) ← -a_k â_{k-1},                                       (3 ≤ k ≤ N)
    c_k^(2) ← -c_k ĉ_{k+1},                                       (1 ≤ k ≤ N-2)

(Special end conditions are handled by storing zeros strategically or by shortening vector lengths slightly.) For odd-even reduction, A^(2) and w^(2) are computed by only generating the new equations for even k; the back substitution is then rather straightforward once x^(2) is computed.

The algorithm for odd-even reduction, with N = 2^m - 1, N_i = 2^{m+1-i} - 1, and all blocks n × n, is as follows:

    for i = 1, ..., m-1
        compute LU factors of b_{2k+1}^(i),                        (0 ≤ k ≤ N_{i+1})
        solve b_{2k+1}^(i) [â_{2k+1}  ĉ_{2k+1}  ŵ_{2k+1}]
              = [a_{2k+1}^(i)  c_{2k+1}^(i)  w_{2k+1}^(i)],        (0 ≤ k ≤ N_{i+1})
        b_k^(i+1) = b_{2k}^(i) - a_{2k}^(i) ĉ_{2k-1} - c_{2k}^(i) â_{2k+1},   (1 ≤ k ≤ N_{i+1})
        w_k^(i+1) = w_{2k}^(i) - a_{2k}^(i) ŵ_{2k-1} - c_{2k}^(i) ŵ_{2k+1},   (1 ≤ k ≤ N_{i+1})
        a_k^(i+1) = -a_{2k}^(i) â_{2k-1},                          (2 ≤ k ≤ N_{i+1})
        c_k^(i+1) = -c_{2k}^(i) ĉ_{2k+1},                          (1 ≤ k ≤ N_{i+1} - 1)
    solve b_1^(m) x_1^(m) = w_1^(m)
    for i = m-1, ..., 1
        solve b_{2k+1}^(i) x_{2k+1}^(i)
              = w_{2k+1}^(i) - a_{2k+1}^(i) x_k^(i+1) - c_{2k+1}^(i) x_{k+1}^(i+1),   (0 ≤ k ≤ N_{i+1})
        x_{2k}^(i) = x_k^(i+1),                                    (1 ≤ k ≤ N_{i+1})
The ith reduction and corresponding back substitution use (12 2/3 n^3 + O(n^2)) N_{i+1} + 12 n^3 arithmetic operations, and the total operation count for the algorithm is (12 2/3 N - 12m) n^3 + O(n^2 N + n^3). This may be compared to 4 1/3 N n^3 + O(n^2 N + n^3) operations for solution of Ax = v by the block LU factorization.
For a crude estimate of execution time on a pipeline computer, take τN + σ, τ << σ, as the execution time for a vector operation on vectors of length N, and t, τ < t << σ, as the execution time for a scalar operation [H3]. The total execution time for odd-even reduction is then roughly

    (12 2/3 n^3 + O(n^2))(τN + σm) + O(n^3 t),

versus

    (4 1/3 n^3 N + O(n^2 N + n^3)) t

for the LU factorization done completely in scalar mode. (In a more careful count the factor τN + σm is replaced by a sum of terms (τN_i + σ), and data manipulation must also be charged, but this does not change the qualitative conclusion.) Such a comparison shows that, for a fixed value of n, odd-even reduction becomes increasingly more effective than the LU factorization as N increases, and should maintain its utility for moderate N. The actual crossover points between the algorithms are implementation dependent and will be considered to a limited extent in Section 10.C; a more detailed and precise timing analysis is given in Section 10.C and Appendix B. A critical observation on the execution time of odd-even reduction is that roughly 50% of the total time is often spent in the first reduction step, from A^(1) = A to A^(2).
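The qualitative comparison can be illustrated with the crude model above; the constants below are hypothetical placeholders of ours, not STAR-100 measurements:

```python
import math

# vector op of length N costs tau*N + sigma (tau << sigma); scalar op costs t
tau, sigma, t = 1.0, 100.0, 10.0     # illustrative only

def oer_time(N, n=1):
    """Crude model for odd-even reduction: (12 2/3)n^3 vector ops, m levels."""
    m = int(math.log2(N + 1))
    return (38 / 3) * n**3 * (tau * N + sigma * m)

def lu_time(N, n=1):
    """Crude model for sequential block LU: (4 1/3)n^3 N scalar ops."""
    return (13 / 3) * n**3 * N * t

assert oer_time(2**12 - 1) < lu_time(2**12 - 1)   # large N: reduction wins
assert oer_time(2**3 - 1) > lu_time(2**3 - 1)     # small N: LU wins
```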
From the operation counts it is seen that roughly 60% of the high-order term n^3 N comes from the matrix multiplications and 30% from solving the 2n+1 linear systems per block used to factor the b_{2k+1}^(i)'s; special efforts to optimize the implementation (factoring, back substitutions, etc.) must take this fact into account, and relatively little is saved by treating the reduced systems differently or by having saved the factorizations.

The use of a synchronous parallel or pipeline computer also creates some constraints on the algorithm which are not so important on a sequential computer. The first restriction is that the blocks should all be the same size, in the interests of efficient vector operations. Secondly, as with the LU factorization, pivoting within the diagonal blocks must be considered. If A is strictly diagonally dominant or positive definite then no pivoting is required; the assumption ||B[A]|| < 1 only guarantees that no block pivoting is needed, so otherwise pivoting within the diagonal blocks will be required. Thirdly, the pivot selections will differ between blocks, which causes numerous inefficiencies in the simultaneous factorization of the diagonal blocks. In addition, the matrices must be represented to conform to the machine's requirements for data addressing and allocation: this is restrictive on the CDC STAR owing to its limitations on vector addressing [L1], while in the case of the Illiac IV the problem is the allocation of data among the processing elements, where communication between small blocks of data must also be considered.
4.E. Summary

We now summarize the results of this section. The odd-even elimination and reduction algorithms may be safely applied to the solution of certain block tridiagonal linear systems, and important quantities involved in the solution decrease quadratically to zero. The quadratic convergence may be explained by expressing the basic methods as Newton-like iterations which compute a block diagonal matrix. Variations of the methods have applications to PDE's, which form a natural class of systems that can be solved efficiently, and in some cases decrease storage requirements. The practicalities of parallel computation place some restrictions on the format of the methods.
5. UNIFICATION: ITERATION AND FILL-IN

Having presented several methods demonstrating quadratic convergence, we conclude by displaying them in a unified manner as particular instances of a general iteration. These results lead to an interpretation of the quadratic convergence in terms of iteration and fill-in, and apply to arbitrary matrices satisfying ||B[A]|| < 1, not only to the block tridiagonal case. We are also able to conclude that the odd-even methods are the only instances of the general iteration to display quadratic convergence.

Let us first collect the formulae for the block LU factorization, block Gauss-Jordan elimination, and odd-even elimination:

    LU: A_{i+1} = (I - L[A_i] E_i D[A_i]^{-1}) A_i,   A_1 = A.
    GJ: A_{i+1} = (I - (L[A_i] + U[A_i]) E_i D[A_i]^{-1}) A_i,   A_1 = A.
    OE: H_{i+1} = (I - (L[H_i] + U[H_i]) D[H_i]^{-1}) H_i,   H_1 = A.

Now consider the function F(X,Y,Z) = (2I - XY)Z, and observe that

    A_{i+1} = F(D[A_i] + L[A_i] E_i, D[A_i]^{-1}, A_i)               (LU),
    A_{i+1} = F(D[A_i] + (L[A_i] + U[A_i]) E_i, D[A_i]^{-1}, A_i)    (GJ),
    H_{i+1} = F(H_i, D[H_i]^{-1}, H_i)                               (OE).

Thus it is necessary to analyze sequences based on repeated applications of F. For fixed nonsingular Y we can define the residual matrices r(M) = I - MY, s(M) = I - YM, so that F(X,Y,Z) = (I + r(X))Z and

    r(F(X,Y,Z)) = r(Z) - r(X) + r(X) r(Z),
    s(F(X,Y,Z)) = s(Z) - s(X) + s(X) s(Z).
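The residual identities can be confirmed on random matrices (our own check, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
X, Y, Z = (rng.standard_normal((n, n)) for _ in range(3))

F = (2 * np.eye(n) - X @ Y) @ Z        # F(X,Y,Z) = (2I - XY)Z
r = lambda M: np.eye(n) - M @ Y        # r(M) = I - MY
s = lambda M: np.eye(n) - Y @ M        # s(M) = I - YM

assert np.allclose(r(F), r(Z) - r(X) + r(X) @ r(Z))
assert np.allclose(s(F), s(Z) - s(X) + s(X) @ s(Z))
```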
By taking X = Z and assuming ||r(Z_1)|| < 1 or ||s(Z_1)|| < 1, the stationary iteration Z_{i+1} = F(Z_i, Y, Z_i) converges quadratically to Y^{-1}, with r(Z_{i+1}) = r(Z_i)^2 and s(Z_{i+1}) = s(Z_i)^2. This is, of course, the Newton iteration. The related operator G(X,Y,Z) = Z + X(I - YZ) = Z + X s(Z) has previously been studied as a means of solving linear systems; Bauer [B8] observes that for Z_1 = X, Z_{i+1} = G(X,Y,Z_i), we have s(Z_{i+1}) = s(X) s(Z_i), Z_i = (I - r(X)^i) Y^{-1} = Y^{-1}(I - s(X)^i), and Z_{i+k} = G(Z_k, Y, Z_i); in particular Z_{2i} = G(Z_i, Y, Z_i), s(Z_{2i}) = s(Z_i)^2. This last formula also leads to the quadratic convergence of the Newton iteration [H8].

In this section we study the nonstationary iteration Z_1 = A, Z_{i+1} = F(X_i, Y_i, Z_i), where Y_i = D[Z_i]^{-1} and X_i consists of blocks taken from Z_i, including the diagonal. The selection of blocks is allowed to change from step to step, but we shall assume that the selection is predetermined and will not be altered adaptively. (The cost of such an adaptive procedure, even on a highly parallel computer, would be prohibitive.)

For the solution of a linear system Ax = v we take (Z_1, v_1) = (A, v), (Z_{i+1}, v_{i+1}) = (2I - X_i Y_i)(Z_i, v_i), with the sequence ending in or approaching (Z*, v*), where Z* is block diagonal; then x = (Z*)^{-1} v*. One necessary condition for consistency is that 2I - X_i Y_i = I + r_i(X_i) be nonsingular, which holds if ρ(r_i(X_i)) < 1. Here we use the notation r_i(M) = I - M Y_i, s_i(M) = I - Y_i M; note that r_i(X_i) = C[X_i] and s_i(X_i) = B[X_i], the notation having been changed from earlier sections to provide greater generality. In the context of an iteration converging to a block diagonal matrix, the condition ||B[A]|| < 1 (implying ||s_1(X_1)|| ≤ ||s_1(Z_1)|| < 1) states that A is sufficiently close to a particular block diagonal matrix, namely D[A].
We first generalize earlier results on linear methods (||s_i(Z_i)|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||) and then determine conditions under which we obtain a quadratic method (||s_i(Z_i)|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||^2).

Theorem 5.1. In the sequence Z_{i+1} = F(X_i, Y_i, Z_i), suppose that

    {(p,p) | 1 ≤ p ≤ N} ⊆ T_i ⊆ {(p,q) | 1 ≤ p,q ≤ N} = U,
    X_i = Σ_{T_i} E_p Z_i E_q,   Y_i = D[Z_i]^{-1},   s_i(M) = I - Y_i M.

Then ||s_1(Z_1)|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_{i+1})|| ≤ ||s_i(Z_i)||.

Proof. Suppose ||s_i(Z_i)|| < 1. Define Q by Z_{i+1} = Z_i - D[Z_i] Q, so

    Q = -Y_i Z_i + Y_i X_i Y_i Z_i = -s_i(X_i) + s_i(X_i) s_i(Z_i),

and let S = D[Q] = D[s_i(X_i) s_i(Z_i)], since D[s_i(X_i)] = 0. Because s_i(X_i) = Σ_{T_i} E_p s_i(Z_i) E_q, we have ||s_i(X_i)|| ≤ ||s_i(Z_i)|| and ||S|| ≤ ||s_i(Z_i)||^2 < 1. Thus D[Z_{i+1}] = D[Z_i](I - S), Y_{i+1} = (I - S)^{-1} Y_i, and

(5.1)    (I - S) s_{i+1}(Z_{i+1}) + S = s_i(Z_{i+1}).

Lemma 2.2 applies, so that ||s_i(Z_{i+1})|| < 1 implies ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_{i+1})||. But

    s_i(Z_{i+1}) = s_i(Z_i) + Q = s_i(Z_i) - s_i(X_i) + s_i(X_i) s_i(Z_i)
                 = Σ_{T_i'} E_p s_i(Z_i) E_q + Σ_{T_i} E_p s_i(Z_i) E_q s_i(Z_i),   T_i' = U - T_i,

so

    ρ_j(s_i(Z_{i+1})) ≤ ρ_j(Σ_{T_i'} E_p s_i(Z_i) E_q) + ρ_j(Σ_{T_i} E_p s_i(Z_i) E_q) ||s_i(Z_i)|| ≤ ρ_j(s_i(Z_i)).

Thus ||s_i(Z_{i+1})|| ≤ ||s_i(Z_i)|| < 1. QED
To see when the method is actually quadratic, we first establish that at least some blocks of s_{i+1}(Z_{i+1}) are bounded in norm by ||s_i(Z_i)||^2.

Corollary 5.1. With the hypotheses of Theorem 5.1,

    ||Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2.

Proof. From (5.1),

    (I - S) s_{i+1}(Z_{i+1}) + S = s_i(Z_{i+1}) = s_i(Z_i) - s_i(X_i) + s_i(X_i) s_i(Z_i),
    S = D[s_i(X_i) s_i(Z_i)],

and E_p(I - S) = (I - S) E_p, we obtain

    (I - S) Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q
        = Σ_{T_i} E_p s_i(Z_i) E_q - Σ_{T_i} E_p s_i(X_i) E_q + Σ_{T_i} E_p s_i(X_i) s_i(Z_i) E_q - S.

But Σ_{T_i} E_p s_i(Z_i) E_q = s_i(X_i) = Σ_{T_i} E_p s_i(X_i) E_q, so

    (I - S) Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q = Σ_{T_i} E_p s_i(X_i) s_i(Z_i) E_q - S.

By Lemma 2.1,

    ||Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||Σ_{T_i} E_p s_i(X_i) s_i(Z_i) E_q|| ≤ ||s_i(X_i) s_i(Z_i)|| ≤ ||s_i(Z_i)||^2.   QED

In certain cases Corollary 5.1 provides no useful information, since we could have Σ_{T_i} E_p s_{i+1}(Z_{i+1}) E_q = 0; block elimination is one example where this occurs. However, it is seen that if we are to have ||s_{i+1}(Z_{i+1})|| ≤ ||s_i(Z_i)||^2 then we must have

(5.2)    ||Σ_{T_i'} E_p s_{i+1}(Z_{i+1}) E_q|| ≤ ||s_i(Z_i)||^2.

This is certainly true if T_i' is the null set.
the H.l sequence,
(5.2) will
so we shall
not hold
assume
for any starting
now write
Inclusion
affect
practical
matrix
consequences:
in X.l of all blocks
on T., it follows
i
and not by design.
Corollary
5.2.
Proof.
a quadratic
Suppose
in Z.I known
that if (p,q)
accident,
yields
be in T..
l
it
= 0
For example,
we
as
Zi+ I = F(Xi,Yi,Zi).
With
though
if it is true that E Z.E
p lq
(p,q) must
algorithm
purposes,
(L[Ai] + U[Ai ])(_ij=IE j) , D[Ai ]-I' Ai )
--F(D[Ai'] +
the sequence
on T.l for notational
ZI, then
the Gauss-Jordan
Ai+l
and show that
in general.
We set one final condition
turns out to have
that T'l is non-empty
With
of Theorem
for arbitrary
that T'
_ _.
l
the additional
_ T. then E Z.E
l
p lq
the hypotheses
method
to be zero clearly
not
condition
can be zero only by
5.1
starting
Following
will
'
only the choice
matrices
the proof
T'i =
Z I.
of Corollary
5.1
we
obtain
(I-
S) _,E
P
s
= _,E
ip
which
implies,
l)E
i+l (Zi+
si(Zi)E
q
q
+ _,.E
ip
s (Xi)s (Zi)E
i
i
q
by Lemma 2.1,
p s i+l (Zi+ l)Eq II
II_,.E
l
]I_,El p s i (Zi)E q + S + _ '.
Ep s i (Xi)si(Zi)Epl I "
l
The presence
of nonzero
the 2nd order terms
(5.2) will hold
Ist order
(S + _,Ep
l
for arbitrary
terms
(___M,E.ri
pSi(Zi
si(Xi)si(Zi)Eq)
Z I satisfying
)Eq)
in
general
, and we cannot
IISl(Z I) II< I.
dominates
conclude
that
QED
Now let us consider what happens when block fill-in occurs during the odd-even algorithms for block tridiagonals.  For the off-diagonal (p,q) block, p ≠ q, we have E_p Z_i E_q = 0 but E_p Z_{i+1} E_q ≠ 0.  Since Z_{i+1} = 2Z_i - X_i Y_i Z_i,

    E_p Z_{i+1} E_q = E_p X_i (-Y_i Z_i) E_q = E_p X_i s_i(Z_i) E_q,

so that

    ||E_p Z_{i+1} E_q|| ≤ ||E_p X_i|| ||s_i(Z_i) E_q||.

From (5.1) and E_p S E_q = 0 we have

    (I - S) E_p S_{i+1}(Z_{i+1}) E_q = E_p s_i(X_i) s_i(Z_i) E_q,

which implies

    ||E_p S_{i+1}(Z_{i+1}) E_q|| ≤ ||S + E_p s_i(X_i) s_i(Z_i) E_q|| ≤ ||s_i(Z_i)||².

(We cannot conclude that ||E_p S_{i+1}(Z_{i+1}) E_q|| ≤ ||S_{i+1}(Z_{i+1})||², however.)

As shown by these inequalities, fill-in produces small blocks of Z_{i+1} when ||s_i(Z_i)|| < 1, and the corresponding blocks of S_{i+1}(Z_{i+1}) are quadratically small with respect to s_i(Z_i).  Thus one source of quadratic convergence is from block fill-in during the process.

In summary, we now have two explanations for quadratic convergence when the odd-even methods are applied to block tridiagonal matrices.  The first is that the method is a variation of the quadratic Newton iteration Z_{i+1} = F(X_i, Y_i, Z_i) for A^{-1}.  The second is that, in going from H_i to H_{i+1}, all the non-zero off-diagonal blocks of H_i are eliminated and all the non-zero off-diagonal blocks of H_{i+1} arise from block fill-in.  Each block of B[H_{i+1}] is therefore either zero or quadratic in B[H_i].
71
6.
HIGHER-ORDER
METHODS
For Poisson's
equation,
leads to considerable
algorithm.
(constant
b.
However,
A = (-I, b, -I),
simplification
The most
block
where
important
diagonals)
Sweet
[SI0] therefore
which
could handle
and a polynomial
cases
other
choices
and still
of odd-even
preserve
Bank's
constant
generalized
equation
uses
eralized
in order
to control
the stability
is analagous
to multiple
algorithm.
ordinary
This
technique
differential
In this
of odd-even
through
much
block
as before.
of this section
block
diagonals
algorithm
the same)
gen-
of a fast marching
shooting
methods
for
It must
considered
scaling,
yields
and hence
as practical
our purpose
higher-order
here
be emphasized,
reduction
properties
as yet another
though,
Sweet's
that
method
carry
the methods
for general
is to elaborate
for a family
convergence
our results
methods
by showing
variation
block
on the quadra-
that they are special
of methods.
Reduction
The idea behind
a block
diagonal
of odd-even
method
that it possesses
actually
Rather,
cases of convergence
p-Fold
Sweet's
are not intended
systems.
tic properties
6.A.
and show
The algorithm
a suitable
tridiagonal
we analyze
reduction,
through
natural.
reduction
marching
of (very nearly
in
equations.
section
properties.
one step
-I)
of N are much more
[B5] for Poisson's
reduction
reduction
of b (i) as a polynomial
a generalization
representation.
odd-even
N = 2m - I
are A (i) = (-I, b (i)
and the expression
described
these
in the Buneman
properties
in some applications
the choice
tridiagonal
the generalization
linear
system
of odd-even
involving
reduction
the unknowns
is to take
Xl,X2,...,
and
and to produce a new, smaller block tridiagonal system involving only the unknowns x_p, x_{2p}, ....  Once this system is solved, the remaining unknowns are computed by back substitution into the original equations.  The entire process will be called p-fold reduction; odd-even reduction is the case p = 2.

Let p ≥ 2 be a given integer, and define N_2 = ⌊N/p⌋.  Starting with Ax = v we generate the reduced system A^(2) x^(2) = w^(2), where A^(2) = (a_j^(2), b_j^(2), c_j^(2))_{N_2}, x^(2) = (x_{pj})_{N_2}, and the jth diagonal block has dimension n_{pj}.

For 1 ≤ k ≤ N_2 + 1, define

    Ã_k = | b_{kp-p+1}  c_{kp-p+1}                          |
          | a_{kp-p+2}  b_{kp-p+2}  c_{kp-p+2}              |
          |     ...         ...         ...                 |
          |                a_{kp-1}    b_{kp-1}             | .

When kp - 1 > N, either extend Ax = v with trivial equations or let Ã_k be redefined as a smaller matrix using the available blocks of A.  Next, partition A to have diagonal blocks Ã_1, b_p, Ã_2, b_{2p}, ..., b_{N_2 p}, Ã_{N_2+1}, and call the new system Ã x̃ = ṽ.  Three adjacent block rows of Ã are thus

    ( Ã_k                | (0 ... 0 c_{kp-1})ᵀ |                     )
    ( (0 ... 0 a_{kp})   |       b_{kp}        | (c_{kp} 0 ... 0)    )
    (                    | (a_{kp+1} 0 ... 0)ᵀ | Ã_{k+1}             ).

The p-fold reduction is obtained by performing a 2-fold reduction on Ã, and the back substitution is obtained by solving the block systems

    Ã_k (x_{kp-p+1}, ..., x_{kp-1})ᵀ
        = (v_{kp-p+1} - a_{kp-p+1} x_{kp-p}, v_{kp-p+2}, ..., v_{kp-1} - c_{kp-1} x_{kp})ᵀ.

By taking

    (α_{kp-p+1}, ..., α_{kp-1})ᵀ = Ã_k^{-1} (a_{kp-p+1}, 0, ..., 0)ᵀ,
    (β_{kp-p+1}, ..., β_{kp-1})ᵀ = Ã_k^{-1} (0, ..., 0, c_{kp-1})ᵀ,
    (φ_{kp-p+1}, ..., φ_{kp-1})ᵀ = Ã_k^{-1} (v_{kp-p+1}, ..., v_{kp-1})ᵀ,

the reduction step simplifies to

(6.1)    a_k^(2) = -a_{kp} α_{kp-1},
         b_k^(2) = b_{kp} - a_{kp} β_{kp-1} - c_{kp} α_{kp+1},
         c_k^(2) = -c_{kp} β_{kp+1},
         w_k^(2) = v_{kp} - a_{kp} φ_{kp-1} - c_{kp} φ_{kp+1},

and the back substitution becomes

    x_{kp-j} = φ_{kp-j} - α_{kp-j} x_{kp-p} - β_{kp-j} x_{kp},   1 ≤ j ≤ p-1.

Rather than computing the entire α and β vectors, a more efficient technique is to use LU and UL processes on the Ã blocks:

    for each k = 1, 2, ..., N_2 + 1,   (LU process on Ã_k)
        Â_{k,1} = a_{kp-p+1};  B_{k,1} = b_{kp-p+1};  Φ_{k,1} = v_{kp-p+1};
        for j = 2, ..., p-1
            solve  B_{k,j-1} [A_{k,j-1}  C_{k,j-1}  F_{k,j-1}]
                       = [Â_{k,j-1}  c_{kp-p+j-1}  Φ_{k,j-1}]
            Â_{k,j} = -a_{kp-p+j} A_{k,j-1}
            B_{k,j} = b_{kp-p+j} - a_{kp-p+j} C_{k,j-1}
            Φ_{k,j} = v_{kp-p+j} - a_{kp-p+j} F_{k,j-1}
        solve  B_{k,p-1} [α_{kp-1}  β_{kp-1}  φ_{kp-1}]
                   = [Â_{k,p-1}  c_{kp-1}  Φ_{k,p-1}]

    for each k = 1, 2, ..., N_2,   (UL process on Ã_{k+1})
        B_{k+1,-1} = b_{kp+p-1};  Ĉ_{k+1,-1} = c_{kp+p-1};  Φ_{k+1,-1} = v_{kp+p-1};
        for j = 2, ..., p-1
            solve  B_{k+1,-j+1} [A_{k+1,-j+1}  C_{k+1,-j+1}  F_{k+1,-j+1}]
                       = [a_{kp+p-j+1}  Ĉ_{k+1,-j+1}  Φ_{k+1,-j+1}]
            B_{k+1,-j} = b_{kp+p-j} - c_{kp+p-j} A_{k+1,-j+1}
            Ĉ_{k+1,-j} = -c_{kp+p-j} C_{k+1,-j+1}
            Φ_{k+1,-j} = v_{kp+p-j} - c_{kp+p-j} F_{k+1,-j+1}
        solve  B_{k+1,-p+1} [α_{kp+1}  β_{kp+1}  φ_{kp+1}]
                   = [a_{kp+1}  Ĉ_{k+1,-p+1}  Φ_{k+1,-p+1}]

The block tridiagonal system A^(2) x^(2) = w^(2) is now generated by (6.1) and solved by some procedure, and in the back substitution we take advantage of the earlier LU process:

    for each k = 1, 2, ..., N_2 + 1,
        x_{kp-1} = φ_{kp-1} - α_{kp-1} x_{kp-p} - β_{kp-1} x_{kp}
        for j = p-2, ..., 1
            x_{kp-p+j} = F_{k,j} - A_{k,j} x_{kp-p} - C_{k,j} x_{kp-p+j+1}.

6.B.  Convergence Rates

As observed in Section 2, if ||B[A]|| < 1 then, under the Ã partitioning for p-fold reduction, we have ||B[Ã]|| ≤ ||B[A]||.  By Corollary 4.1 it follows that p-fold reduction will yield ||A^(2)|| ≤ ||Ã|| = ||A|| and

    ||B[A^(2)]|| ≤ ||B[Ã]²|| ≤ ||B[A]||².
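The reduction (6.1) and its back substitution can be illustrated concretely in the scalar case n_j = 1, where the α, β, φ vectors for each block are solutions of small tridiagonal systems.  The following Python sketch (our own illustration for scalar blocks, not the general block routine; the helper names and the end-block convention N = p(N_2 + 1) - 1 are assumptions of the sketch) performs one p-fold reduction and checks it against a direct solve:

```python
# One p-fold reduction step for a scalar tridiagonal system Ax = v,
# following (6.1); an illustrative sketch for scalar blocks (n_j = 1).

def solve_tridiag(a, b, c, v):
    """Thomas algorithm for (a, b, c) x = v; 0-based lists, a[0] and
    c[-1] unused."""
    n = len(b)
    bb = b[:]
    w = v[:]
    for i in range(1, n):
        m = a[i] / bb[i - 1]
        bb[i] = b[i] - m * c[i - 1]
        w[i] = w[i] - m * w[i - 1]
    x = [0.0] * n
    x[-1] = w[-1] / bb[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (w[i] - c[i] * x[i + 1]) / bb[i]
    return x

def pfold_solve(a, b, c, v, p):
    """Solve (a, b, c) x = v by one p-fold reduction.  Unknowns 1..N
    are stored 0-based; x_0 = x_{N+1} = 0 are boundary values."""
    N = len(b)
    N2 = N // p
    assert N == p * (N2 + 1) - 1, "sketch assumes full end blocks"
    alpha = [0.0] * (N + 2)   # coupling of x_{kp-j} to x_{kp-p}
    beta = [0.0] * (N + 2)    # coupling of x_{kp-j} to x_{kp}
    phi = [0.0] * (N + 2)     # particular solution
    for k in range(1, N2 + 2):
        idx = list(range(k * p - p + 1, k * p))      # interior indices
        aa = [a[i - 1] for i in idx]
        bb = [b[i - 1] for i in idx]
        cc = [c[i - 1] for i in idx]
        e1 = [0.0] * len(idx); e1[0] = aa[0]         # a_{kp-p+1} e_1
        e2 = [0.0] * len(idx); e2[-1] = cc[-1]       # c_{kp-1} e_{p-1}
        vv = [v[i - 1] for i in idx]
        al = solve_tridiag(aa, bb, cc, e1)
        be = solve_tridiag(aa, bb, cc, e2)
        ph = solve_tridiag(aa, bb, cc, vv)
        for t, i in enumerate(idx):
            alpha[i], beta[i], phi[i] = al[t], be[t], ph[t]
    # the reduced tridiagonal system (6.1) in x_p, x_{2p}, ..., x_{N2*p}
    a2 = [0.0] * N2; b2 = [0.0] * N2; c2 = [0.0] * N2; w2 = [0.0] * N2
    for k in range(1, N2 + 1):
        j = k * p
        a2[k - 1] = -a[j - 1] * alpha[j - 1]
        b2[k - 1] = b[j - 1] - a[j - 1] * beta[j - 1] - c[j - 1] * alpha[j + 1]
        c2[k - 1] = -c[j - 1] * beta[j + 1]
        w2[k - 1] = v[j - 1] - a[j - 1] * phi[j - 1] - c[j - 1] * phi[j + 1]
    xr = solve_tridiag(a2, b2, c2, w2)
    # back substitution x_{kp-j} = phi - alpha x_{kp-p} - beta x_{kp}
    x = [0.0] * (N + 2)
    for k in range(1, N2 + 1):
        x[k * p] = xr[k - 1]
    for k in range(1, N2 + 2):
        for i in range(k * p - p + 1, k * p):
            x[i] = phi[i] - alpha[i] * x[k * p - p] - beta[i] * x[k * p]
    return x[1:N + 1]

# demonstration: p = 3 on the model matrix (-1, 4, -1), N = 14
a = [-1.0] * 14; b = [4.0] * 14; c = [-1.0] * 14
v = [float(j % 5 + 1) for j in range(14)]
x_red = pfold_solve(a, b, c, v, 3)
x_dir = solve_tridiag(a, b, c, v)
```

The reduced coefficients a2[0] and c2[N2-1] multiply the (zero) boundary values and are ignored by the solver, which keeps the sketch short.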
In fact, we can provide pth order convergence, although our first result is somewhat weaker than might be expected in comparison to earlier results.

We will actually consider a natural generalization of block elimination.  Let M_p be a nonsingular block (2p-1)-diagonal matrix,

    M_p = (m_{j,-p+1}, ..., m_{j,0}, ..., m_{j,p-1})_N.

We want to pick M_p such that M_p A is a block (2p+1)-diagonal matrix of the form

    M_p A = (r_j, 0, ..., 0, s_j, 0, ..., 0, t_j),

with p-1 zeros on each side of the diagonal.  The jth block row of M_p A is

    Σ_{k=-p+1}^{p-1} m_{j,k} [block row (j+k) of A],

so we require that

    m_{j,k-1} c_{j+k-1} + m_{j,k} b_{j+k} + m_{j,k+1} a_{j+k+1} = 0,
        |k| = 1, 2, ..., p-1,

where we take m_{j,-p} = 0, m_{j,p} = 0, and then

    r_j = m_{j,-p+1} a_{j-p+1},
    s_j = m_{j,-1} c_{j-1} + m_{j,0} b_j + m_{j,1} a_{j+1},
    t_j = m_{j,p-1} c_{j+p-1}.

A^(2) is constructed by selecting the intersections of block rows p, 2p, ..., and block columns p, 2p, ..., of M_p A:

    a_j^(2) = r_{jp},   b_j^(2) = s_{jp},   c_j^(2) = t_{jp},   w_j^(2) = (M_p v)_{jp}.

The special case p = 2, M_p = (-a_j b_{j-1}^{-1}, I, -c_j b_{j+1}^{-1}), has been considered in Section 4.A.  For p > 2, define

    d_{j,-p+1} = b_{j-p+1},
    d_{j,-k} = b_{j-k} - a_{j-k} d_{j,-k-1}^{-1} c_{j-k-1},   k = p-2, ..., 1,
    λ_{j,-k} = -a_{j-k} d_{j,-k-1}^{-1},                      k = p-2, ..., 0,

    d_{j,p-1} = b_{j+p-1},
    d_{j,k} = b_{j+k} - c_{j+k} d_{j,k+1}^{-1} a_{j+k+1},     k = p-2, ..., 1,
    μ_{j,k} = -c_{j+k} d_{j,k+1}^{-1},                        k = p-2, ..., 0.

Then we may take

    m_{j,-k} = λ_{j,0} λ_{j,-1} ⋯ λ_{j,-k+1},
    m_{j,0}  = I,
    m_{j,k}  = μ_{j,0} μ_{j,1} ⋯ μ_{j,k-1},   k = 1, 2, ..., p-1.

The d_{j,-k}'s are derived by an LU factorization process and the d_{j,k}'s by a UL process.  It follows from earlier results that if ||C[A]||_1 < 1 then ||m_{j,±k}||_1 ≤ ||C[A]||_1^k, so the diagonals of M_p decrease in size as one moves away from the main diagonal.

Our first theorem partially generalizes Theorem 4.1.

Theorem 6.1.  Let J = (-s_j^{-1} r_j, 0, ..., 0, -s_j^{-1} t_j), B = (-b_j^{-1} a_j, 0, -b_j^{-1} c_j), B = L + U, J = J_L + J_U.  Then ||L|| ≤ γ, ||U|| ≤ γ, ||B|| ≤ 2γ = β imply ||J|| ≤ β^p and ||J_L||, ||J_U|| ≤ β^p/2.

Proof.  Let R_0 = 0, R_k = L(I - R_{k-1})^{-1} U, T_0 = 0, T_k = U(I - T_{k-1})^{-1} L, k = 1, ..., p-1, and S = R_{p-1} + T_{p-1}, so that

    (I - S) J_L = L(I - R_{p-2})^{-1} L(I - R_{p-3})^{-1} L ⋯ L(I - R_0)^{-1} L,
    (I - S) J_U = U(I - T_{p-2})^{-1} U(I - T_{p-3})^{-1} U ⋯ U(I - T_0)^{-1} U.

Define ψ_0 = 0, ψ_k = γ²(1 - ψ_{k-1})^{-1}, so that ||R_k||, ||T_k|| ≤ ψ_k and

    ||J|| ≤ β^p / [2^{p-1} (1 - 2ψ_{p-1}) Π_{k=0}^{p-2} (1 - ψ_k)],
    ||J_L||, ||J_U|| ≤ β^p / [2^p (1 - 2ψ_{p-1}) Π_{k=0}^{p-2} (1 - ψ_k)].

Define δ_0 = 0, δ_1 = 1, δ_{k+1} = δ_k - γ² δ_{k-1}, so δ_k = det(γ, 1, γ)_{(k-1)×(k-1)} > 0.  Then ψ_k = γ² δ_k / δ_{k+1}, 1 - ψ_k = δ_{k+2} / δ_{k+1}, and Π_{k=0}^{p-2}(1 - ψ_k) = δ_p.  Define τ_k = δ_{k+1} - γ² δ_{k-1}, so τ_1 = 1, τ_2 = 1 - 2γ², τ_k = τ_{k-1} - γ² τ_{k-2}, and

    (1 - 2ψ_{p-1}) Π_{k=0}^{p-2} (1 - ψ_k) = (1 - 2ψ_{p-1}) δ_p = δ_{p+1} - γ² δ_{p-1} = τ_p.

By induction, τ_k(1/2) = 2^{-k+1}, and for γ ≤ 1/2, τ_k(γ) ≥ τ_k(1/2), so 2^{p-1} τ_p(γ) ≥ 1.  Thus ||J|| ≤ β^p and ||J_L||, ||J_U|| ≤ β^p/2.  QED
Corollary 6.1.  If ||B[A]|| < 1 and ||L[B[A]]||, ||U[B[A]]|| ≤ ½||B[A]||, then ||B[A^(2)]|| ≤ ||B[A]||^p.

In generalization of Theorem 4.4, there is the very surprising

Theorem 6.2.  With λ^(i) = λ(A^(i)), λ^(1) ≤ 1, and

    σ^(i) = (1 - √(1 - λ^(i))) / (1 + √(1 - λ^(i))),

we have

    λ^(i+1) ≤ 4(σ^(i))^p / (1 + (σ^(i))^p)²,

or, equivalently, σ^(i+1) ≤ (σ^(i))^p.

Proof.  It suffices to consider i = 1, so we will drop the superscripts.  Define ε_{j,k} = b_{j+k}^{-1} d_{j,k}, ε_{j,-k} = b_{j-k}^{-1} d_{j,-k}, k = 0, ..., p-1, so that ε_{j,±(p-1)} = I.  Now,

    r_j = (-a_j ε_{j,-1}^{-1})(-b_{j-1}^{-1} a_{j-1} ε_{j,-2}^{-1}) ⋯
              (-b_{j-p+2}^{-1} a_{j-p+2} ε_{j,-p+1}^{-1}) (b_{j-p+1}^{-1} a_{j-p+1}),
    t_j = (-c_j ε_{j,1}^{-1})(-b_{j+1}^{-1} c_{j+1} ε_{j,2}^{-1}) ⋯
              (-b_{j+p-2}^{-1} c_{j+p-2} ε_{j,p-1}^{-1}) (b_{j+p-1}^{-1} c_{j+p-1}),
    s_j = b_j (I - ω_j),
    ω_j = b_j^{-1} (a_j ε_{j,-1}^{-1} b_{j-1}^{-1} c_{j-1} + c_j ε_{j,1}^{-1} b_{j+1}^{-1} a_{j+1}),

with a_j^(2) = r_{jp}, b_j^(2) = s_{jp}, c_j^(2) = t_{jp}.  Since ||ε_{j,k}^{-1}|| ≤ (1 - λ ||ε_{j,k+1}^{-1}||/4)^{-1}, by defining E_1 = 1, E_k = (1 - λ E_{k-1}/4)^{-1}, we have ||ε_{j,p-k}^{-1}|| ≤ E_k; similarly ||ε_{j,-p+k}^{-1}|| ≤ E_k.  This yields ||ω_j|| ≤ λ E_{p-1}/2 and ||(I - ω_j)^{-1}|| ≤ 2/(2 - λ E_{p-1}).  To bound λ^(2), observe that

    4 ||b_j^(2)^{-1} a_j^(2)|| ||b_{j-1}^(2)^{-1} c_{j-1}^(2)||
        ≤ 4 (λ/4)^p (Π_{k=1}^{p-1} E_k)² (2/(2 - λ E_{p-1}))²,

so λ^(2) is bounded by the quantity on the right.  Now suppose that λ = 1.  It may be shown by induction that E_n = 2n/(n+1), yielding Π_{k=1}^{p-1} E_k = 2^{p-1}/p and (2/(2 - E_{p-1}))² = p², so that the bound simplifies to λ^(2) ≤ 1 = λ^p; here σ = 1.  Now suppose λ < 1.  It may be shown that

    E_n = (1 + σ)(1 - σ^n) / (1 - σ^{n+1}),

and with a little effort the bound simplifies to the desired result λ^(2) ≤ 4σ^p/(1 + σ^p)².  Note that 1 - 4σ^p/(1 + σ^p)² = ((1 - σ^p)/(1 + σ^p))², which gives σ^(2) ≤ σ^p.  QED

Note that we have not actually shown that λ^(i+1) ≤ (λ^(i))^p.  This point is now to be resolved.

Corollary 6.2.  If λ^(i) ≤ 1 then σ^(i+1) ≤ (σ^(i))^p and λ^(i+1) ≤ (λ^(i))^p, with strict inequality if 0 < λ^(i) < 1.

Proof.  The case λ^(i) = 1 was considered in Theorem 6.2.  Suppose 0 ≤ λ < 1, again considering only i = 1.  It is not hard to show that

    ((1 + σ)/2)^p (2/(1 + σ^p)) = (π_p(λ))^{-1},
    π_p(λ) = Σ_{i=0}^{⌊p/2⌋} (p choose 2i) (1 - λ)^i.

Since 4σ/(1 + σ)² = λ, we have

    λ^(2) ≤ 4σ^p/(1 + σ^p)² = λ^p [((1 + σ)/2)^p (2/(1 + σ^p))]² = λ^p (π_p(λ))^{-2}.

Now, π_p(1) = 1 and π_p'(λ) < 0 for 0 ≤ λ < 1, so π_p(λ) > 1 for 0 ≤ λ < 1 and λ^(2) < λ^p for this case.  QED
6.C.  Back Substitution

Error propagation in the back substitution for p-fold reduction depends on the particular implementation.  In the simplest form

    x_{kp-j} = φ_{kp-j} - α_{kp-j} x_{kp-p} - β_{kp-j} x_{kp},

errors in x_{kp-j} depend on newly generated rounding errors and on propagated errors from x_{kp-p} and x_{kp}.  We can write the computed solution as y_{kp-j} = x_{kp-j} + δ_{kp-j} to find

    δ_{kp-j} = -α_{kp-j} δ_{kp-p} - β_{kp-j} δ_{kp} + e_{kp-j}.

It follows from the analysis of the Gauss-Jordan algorithm (Corollary 3.5) and from explicit formulae for α_{kp-j}, β_{kp-j} that if ||B[A]|| < 1 then ||α_{kp-j}|| ≤ ||B[A]||^{p-j} and ||β_{kp-j}|| ≤ ||B[A]||^j, which leads to

    ||δ_{kp-j}|| ≤ ||B[A]||^{p-j} ||δ_{kp-p}|| + ||B[A]||^j ||δ_{kp}|| + ||e_{kp-j}||.

Errors in a newly computed component therefore depend most strongly on errors in its closest back-substituted component, and the "influence" decreases as the distance between components increases.  Moreover, we have

    ||δ_{kp-j}|| ≤ ||B[Ã]|| max(||δ_{kp-p}||, ||δ_{kp}||) + ||e_{kp-j}||.

If ε is a bound on the errors introduced at each stage, it follows that errors in the final computed result will be bounded by ε log_p N.

Slightly different results prevail in the back substitution based on the LU process:

    x_{kp-1} = φ_{kp-1} - α_{kp-1} x_{kp-p} - β_{kp-1} x_{kp},
    for j = 2, ..., p-1
        x_{kp-j} = F_{k,p-j} - A_{k,p-j} x_{kp-p} - C_{k,p-j} x_{kp-j+1}.

Errors in the computed solution y_{kp-j} = x_{kp-j} + δ_{kp-j} obey the rule

    δ_{kp-1} = -α_{kp-1} δ_{kp-p} - β_{kp-1} δ_{kp} + e_{kp-1},
    for j = 2, ..., p-1
        δ_{kp-j} = -A_{k,p-j} δ_{kp-p} - C_{k,p-j} δ_{kp-j+1} + e_{kp-j},

so the errors in x_{kp-j} depend on propagation from x_{kp-p} and x_{kp-j+1} rather than from x_{kp-p} and x_{kp}.  This implies a linear growth as in the LU factorization in conjunction with logarithmic growth as in odd-even reduction.  If we again suppose that ||e_{kp-j}|| ≤ ε, with ||δ_{kp-p}||, ||δ_{kp}|| ≤ η, it follows that

    ||δ_{kp-1}|| ≤ ||B[Ã]|| η + ε,
    ||δ_{kp-j}|| ≤ ||B[Ã]|| max(η, ||δ_{kp-j+1}||) + ε,

which leads to ||δ_{kp-j}|| ≤ η + (p-1)ε.  The end result is that errors in the final computed result will be bounded by εp log_p N.
7.  SEMIDIRECT METHODS

Briefly, a semidirect method is a direct method that is stopped before completion.*  In common usage, a direct method for the solution of Ax = v computes x exactly in a finite number of steps if exact arithmetic is used, while an iterative method computes a sequence of approximations x^(i) converging to x.  The semidirect methods considered here are direct methods which, while producing x exactly in a finite number of steps, also produce a sequence of intermediate approximations to x.  If the intermediate error is small enough then the computation is terminated long before its usual finite termination, and an approximate solution is returned.

The Parallel Gauss method ([H4] and Section 8.E) is an example of a semidirect method based on the quadratic convergence of the LU factorization, as is Malcolm and Palmer's method [M2] for A = (a, b, a), a and b positive definite with ρ(b^{-1}a) < 1.  The method of conjugate gradients is another example, since it can produce intermediate approximate results while retaining its usual finite termination properties [R2].  Here we consider semidirect methods based on the quadratic convergence properties of the odd-even methods (cf. Theorem 4.3); a truncated Buneman algorithm for Poisson's equation has already been analyzed by Buzbee [B17], and a similar technique could produce results like those obtained by Hockney [H5].

*The name semi-iterative has already been appropriated for other purposes [V3]; Stone [S2] initiated the name semidirect.
7.A.  Incomplete Elimination

Let us first consider the odd-even elimination process for Ax = v, which is

    H_1 = A,   v^(1) = v,
    H_{i+1} = (I + C[H_i]) H_i,   v^(i+1) = (I + C[H_i]) v^(i).

In Theorem 4.1 we proved that if ||B[A]|| < 1 then ||B[H_{i+1}]|| ≤ ||B[H_i]²||, so that the off-diagonal blocks decrease quadratically in magnitude relative to the diagonal blocks.  This motivates the approximate solution z^(i) = D[H_i]^{-1} v^(i).  Because we halt the process before completion, we call this technique incomplete odd-even elimination.

Theorem 7.1.  ||x - z^(i)|| / ||x|| ≤ ||B[H_i]||.

(7.1)

Proof.  Since v^(i) = H_i x = D[H_i](I - B[H_i])x, we have z^(i) = D[H_i]^{-1} v^(i) = x - B[H_i]x.  QED

Thus the B matrix determines the relative error of the approximation z^(i).  If β = ||B[A]|| < 1, 0 < ε < 1, m = ⌈log₂ N⌉, then

    k = min(m, 1 + max(0, ⌈log₂(log ε / log β)⌉))

elimination steps suffice to have ||B[H_k]|| ≤ ε.  For example, if β = ½ and ε = 2^{-20} ≈ 10^{-6}, then k = 6 and we should have N > 32 for incomplete elimination to be applied.  Since ||B[H_6]|| ≤ β^{2^5} ≈ 2.3 × 10^{-10} = 10^{-9.6}, the results of the incomplete elimination would be much better than required.

In the case n = 1, a_j = c_j = a, b_j = b, we have

(7.2)    x²/2 ≤ ||B[H_{i+1}]|| = x²/(2 - x²) ≤ x²,   x = ||B[H_i]||,  0 < x < 1,

so k must be larger than 1 + log₂(log(ε/2)/log(β/2)) in order to have ||B[H_k]|| ≤ ε.  (By induction, β^(k) ≥ 2(β/2)^{2^{k-1}}, and if 2(β/2)^{2^{k-1}} > ε then β^(k) > ε.)  A more refined lower bound is obtained by defining

(7.3)    T_1 = β,   T_{n+1} = T_n² / (2 - T_n²),

and computing the largest k such that T_k > ε.

Since B[H_i] is the matrix associated with the block Jacobi iteration for H_i x = v^(i), when ||B[H_k]|| is small we can use this iteration to improve the initial choice z^{(k,0)} = z^(k) = D[H_k]^{-1} v^(k).  If

    z^{(k,i+1)} = B[H_k] z^{(k,i)} + D[H_k]^{-1} v^(k),

then ||x - z^{(k,ℓ)}|| ≤ ||B[H_k]||^ℓ ||x - z^{(k,0)}||.  The parameters k and ℓ may now be chosen to minimize computation time subject to the constraint β^{2^{k-1} ℓ} ≤ ε.  Any of the other standard iterative methods may be used in place of the Jacobi iteration; for related results using the theory of non-negative matrices and regular splittings, see [H1].

We have remarked earlier that if the tridiagonal matrix A = (a_j, b_j, c_j) is irreducibly diagonally dominant and β^(i) = ||B[H_i]||, then β^(i+1) ≤ β^(i)², 1 ≤ i ≤ m-1, but equality is possible.  The example A = (-1, 2, -1), for which β^(i) = 1 for i < m and β^(m) = 0, shows that strict inequality is necessary for incomplete elimination to be effective.  However, strict inequality does not imply that β^(i) will be small for i < m: β^(1) will be very close to 1 for most second-order boundary value problems, a case of practical importance in differential equations.  With regard to the particular approximation z^(i) = D[H_i]^{-1} v^(i), we note that z_j^(i) depends only on v_ℓ for j - (2^{i-1} - 1) ≤ ℓ ≤ j + (2^{i-1} - 1).  Thus for N = 2^m - 1 and i < m, the center point z^(i)_{2^{m-1}} will be independent of the boundary data.  Although this is not the only possible approximation, complete odd-even elimination will be necessary to ensure convergence to the continuous solution.
7.B.  Incomplete Reduction

The incomplete reduction is analogous to incomplete odd-even elimination.  Complete odd-even reduction generates a sequence of block tridiagonal systems A^(i) x^(i) = w^(i), i = 1, ..., m, m = ⌈log₂ N⌉, solves A^(m) x^(m) = w^(m) exactly for x^(m), and back substitutes to finally obtain x = x^(1).  Incomplete reduction stops with A^(k) x^(k) = w^(k) for some k between 1 and m, solves this system approximately for y^(k) ≈ x^(k), and back substitutes to obtain y = y^(1) ≈ x.

We remind the reader that roughly half the actual work of complete reduction is performed in the first reduction step, a quarter in the second, and so forth, so the savings in incomplete reduction will be much less than in incomplete elimination, where the work per step is roughly constant.  In contrast to incomplete elimination, incomplete reduction must also deal with error propagation in the back substitution, and a different analysis is needed.  We have already shown that roundoff errors in complete reduction are well-behaved under certain assumptions; for more details, see Section 10.C.  Here we will see a number of interesting properties of incomplete reduction, including those observed in Section 4.D, and a superconvergence-like result: the approximation errors can grow at most logarithmically in the back substitution, and quite possibly the final errors will be dominated by roundoff rather than by the approximation.  This observation is due to an anonymous referee of [H2].

An initial approximation y^(k) may be computed from A^(k) and w^(k) according to our analysis of incomplete elimination.  The choice y^(k) = D[A^(k)]^{-1} w^(k) is a simple and useful one, and we have ||x^(k) - y^(k)|| / ||x^(k)|| ≤ ||B[A^(k)]||.  The parameter k should be determined for computational economy.  Here both the quadratic decrease of the B[A^(i)] and an important effect due to damping of errors in the back substitution prove useful in some (but certainly not all) situations.

Regardless of the source of the approximation y^(k), let us suppose that y_j^(k) = x_j^(k) + δ_j^(k).  In the back substitution, y^(i) is computed from y^(i+1) by

    y_{2j}^(i) = y_j^(i+1),
    y_{2j-1}^(i) = (b_{2j-1}^(i))^{-1} (w_{2j-1}^(i) - a_{2j-1}^(i) y_{j-1}^(i+1)
                       - c_{2j-1}^(i) y_j^(i+1)) + e_{2j-1}^(i),

with e_{2j-1}^(i) representing rounding errors.  We then have

    δ_{2j}^(i) = δ_j^(i+1),
    δ_{2j-1}^(i) = e_{2j-1}^(i) - (b_{2j-1}^(i))^{-1} (a_{2j-1}^(i) δ_{j-1}^(i+1)
                       + c_{2j-1}^(i) δ_j^(i+1)).

Theorem 7.2.  If ||e^(i)|| ≤ (1 - ||B[A^(i)]||) ||δ^(i+1)|| then ||δ^(i)|| = ||δ^(i+1)||.

Proof.  By construction, δ_{2j}^(i) = δ_j^(i+1), so ||δ^(i)|| ≥ ||δ^(i+1)||.  As in the analysis of roundoff error propagation for complete odd-even reduction, we have

    ||δ^(i)|| ≤ max(||δ^(i+1)||, ||e^(i)|| + ||B[A^(i)]|| ||δ^(i+1)||),

and the hypothesis implies that this expression is ||δ^(i+1)||, so that ||δ^(i)|| ≤ ||δ^(i+1)||.  QED

This theorem requires some explanation.  First suppose that no rounding errors are committed, e^(i) = 0, and all the error is from the approximation y^(k).  It follows that this error will not grow during the back substitution: ||x - y|| = ||δ^(k)||.  Since ||x^(k)|| ≤ ||x||, we conclude that

    ||x - y|| / ||x|| ≤ ||x^(k) - y^(k)|| / ||x^(k)||.

In fact, many components of x - y will be much smaller in size than ||x^(k) - y^(k)||, since the back substitution quickly damps out approximation errors component by component, creating a very pleasing superconvergence-like effect.  In Figure 7.1 we illustrate this for the tridiagonal system (-1, 4, -1)x = v, N = 31, k = m-1 = 4, v chosen so that x_j = 1, and y_j^(k) = (b_j^(k))^{-1} w_j^(k)..

[Figure 7.1.  log₁₀ |x_j - y_j| vs. component number j.]

Now suppose that rounding errors are committed, ||e^(i)|| ≤ ε.  If we have essentially solved the system A^(k) x^(k) = w^(k) exactly, so that ||δ^(k)|| ≤ νε for some moderate constant ν, then the error propagation analysis is the same as for complete reduction.  That is, we have the upper bound ||δ^(i)|| ≤ (k - i + ν)ε, 1 ≤ i ≤ k.

The remaining case, ||δ^(k)|| >> ε, is when the semidirect method should be viewed as being in competition with an iterative method.  The hypothesis of Theorem 7.2, ||e^(i)|| ≤ (1 - ||B[A^(k)]||) ||δ^(k)||, is an attempt to formalize the condition ||δ^(k)|| >> ε.  As long as the hypothesis remains true, the difference between x^(i) and y^(i) can entirely be attributed to approximation error.  As soon as the hypothesis is violated it is possible for the errors to grow, but we will still be able to say that ||x - y|| = ||δ^(k)|| + O(kε).  Although the superconvergence effect will be moderated somewhat by roundoff, its qualitative effects will still be felt and we can expect this bound on accumulated error to be an overestimate.
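The Figure 7.1 experiment can be reproduced in outline (our own sketch, scalar case): reduce (-1, 4, -1)x = v, N = 31, down to level k = 4, take y^(4) = D[A^(4)]^{-1} w^(4), and back substitute.  The error in the final y is then largest at the retained center component and damped toward the ends:

```python
# Incomplete odd-even reduction for (-1, 4, -1) x = v, N = 31, stopped
# at level k = 4 with y^(4) = D[A^(4)]^{-1} w^(4); v is chosen so that
# x_j = 1, as in the Figure 7.1 experiment (illustrative sketch).

def reduce_step(a, b, c, w):
    """One odd-even reduction step (scalar case, N odd)."""
    N = len(b); N2 = N // 2
    a2 = [0.0] * N2; b2 = [0.0] * N2; c2 = [0.0] * N2; w2 = [0.0] * N2
    for k in range(N2):
        j = 2 * k + 1
        a2[k] = -a[j] * a[j - 1] / b[j - 1]
        b2[k] = b[j] - a[j] * c[j - 1] / b[j - 1] - c[j] * a[j + 1] / b[j + 1]
        c2[k] = -c[j] * c[j + 1] / b[j + 1]
        w2[k] = w[j] - a[j] * w[j - 1] / b[j - 1] - c[j] * w[j + 1] / b[j + 1]
    return a2, b2, c2, w2

def back_step(a, b, c, w, y2):
    """Recover the odd-indexed unknowns of this level from y2."""
    N = len(b)
    y = [0.0] * N
    for j, t in enumerate(y2):
        y[2 * j + 1] = t
    for q in range(0, N, 2):
        left = y[q - 1] if q > 0 else 0.0
        right = y[q + 1] if q + 1 < N else 0.0
        y[q] = (w[q] - a[q] * left - c[q] * right) / b[q]
    return y

N = 31
a = [-1.0] * N; b = [4.0] * N; c = [-1.0] * N
v = [4.0 - (1.0 if j > 0 else 0.0) - (1.0 if j < N - 1 else 0.0)
     for j in range(N)]                        # v = A x for x = 1
levels = []
aa, bb, cc, ww = a, b, c, v
for i in range(3):                             # A^(1) -> A^(4)
    levels.append((aa, bb, cc, ww))
    aa, bb, cc, ww = reduce_step(aa, bb, cc, ww)
y = [ww[j] / bb[j] for j in range(len(bb))]    # y^(4)
for aa, bb, cc, ww in reversed(levels):
    y = back_step(aa, bb, cc, ww, y)
err = [abs(t - 1.0) for t in y]                # largest at the center
```

In exact arithmetic the carried coarse components keep their level-4 error while each back-substituted component is damped by a factor of roughly ||B[A^(i)]||, which is the superconvergence-like effect described above.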
7.C.  Applicability of the Methods

We close with some remarks on the practical application of the quadratic convergence properties in semidirect methods.  In Buzbee's analysis [B17] of the Buneman algorithm for the differential equation (-Δ + d)u = f on a rectangle with sides ℓ_1 and ℓ_2, d a nonnegative constant, it is shown that if d ℓ_2² + π² ℓ_2²/ℓ_1² is sufficiently large then a truncated Buneman algorithm will be useful.  Geometrically this requires that the rectangle be long and thin if d is small.  By directly considering the block tridiagonal matrix equation it is possible to reach a similar conclusion for more general elliptic equations.

The finite difference approximation to Poisson's equation on a rectangle, with equal mesh spacing in both directions, M nodes in one direction and N in the other, M ≤ N, and the matrix properly ordered, serves as a simple model.  With this approximation A = (-I, P_M, -I)_N, P_M = (-1, 4, -1)_M, and it is readily verified that ||B[A]|| = 2||Q_M||, Q_M = P_M^{-1}, where

    ||Q_M|| = ½ (1 - 1/(D_m - D_{m-1})),    M = 2m,
    ||Q_M|| = ½ (1 - 1/(2D_m - D_{m-1})),   M = 2m+1,
    D_{-1} = 0,  D_0 = 1,  D_1 = 4,  D_n = 4D_{n-1} - D_{n-2}.

The following table gives ||B[A]|| for M = 1(1)6, along with estimates of the number k of odd-even reductions needed to obtain ||B[A^(k)]|| < 10^{-8}: lower bounds from (7.2) and (7.3), and the upper bound from (7.1).

    M    ||B[A]||          lower bound (7.2)   lower bound (7.3)   upper bound (7.1)
    1    1/2   = .50              5                   5                   6
    2    2/3   ≈ .67              6                   6                   7
    3    6/7   ≈ .86              6                   7                   8
    4    10/11 ≈ .91              6                   7                   9
    5    25/26 ≈ .96              6                   8                  10
    6    40/41 ≈ .98              6                   8                  11

These figures again lead to the conclusion that the best use of the semidirect methods for elliptic equations will be with long thin regions.  In the next section we consider some iterative methods which use matrix splittings and orderings to take advantage of this property.
8.  ITERATIVE METHODS

In this section we consider applications of our theorems to iterative methods for the solution of a finite difference approximation to a general nonseparable elliptic equation.  Our emphasis will be on methods for a parallel computer, and of particular interest will be methods which employ, as part of each iterative step, the direct or semi-direct solution of a block tridiagonal linear system.  The structure of the block partitioning will often be such that the methods already considered can be used efficiently.

8.A.  Use of Elimination

Let us first note that several computations can be simplified by an elimination step.  Consider the block Jacobi iteration for Ax = v,

    x^(i+1) = B[A] x^(i) + D[A]^{-1} v.

When ||B[A]|| < 1 the iteration converges; for elliptic equations the convergence will often be quite slow.  Suppose that the structure of A can be simplified by an elimination step, forming the system A'x = v'.  Then ||B[A']|| ≤ ||B[A]||, and since this system will be handled more efficiently, the subsequent iteration will not converge any slower.  Continuing with the Jacobi iteration, suppose A is also positive definite; this condition is satisfied in many applications to elliptic equations.  Then the Jacobi semi-iterative (J-SI, or Chebychev accelerated) method [V3],

    x^(i+1) = ν_{i+1} (B[A] x^(i) + D[A]^{-1} v - x^(i-1)) + x^(i-1),
    ν_1 = 1,   ν_2 = 2/(2 - ρ²),   ν_{i+1} = (1 - ρ² ν_i / 4)^{-1},   ρ = ρ(B[A]),

will converge at a much faster rate and is a natural candidate for parallel
computation.  If we use β = ||B[A]|| as an upper bound on ρ and use the sequence of parameters

    ν*_1 = 1,   ν*_2 = 2/(2 - β²),   ν*_{i+1} = (1 - β² ν*_i / 4)^{-1},

then the J-SI iteration for A' will not be slower than that for A.  By "preprocessing" Ax = v we can expect that the actual computation time will be smaller; unfortunately a complete proof of this claim is not at hand.

Juncosa and Mullikin [J2] discuss the effect of elimination on iterative methods when A = I - M, M ≥ 0, m_rr = 0, and Σ_s m_rs ≤ 1 with strict inequality for at least one value of r.  Their particular interest for elliptic boundary value problems is to order the boundary nodes first, followed by the interior nodes, and then to eliminate the boundary nodes.  This would be useful for domains with curved or irregular boundaries.  It is shown that, with A' = I - M' representing the system after eliminating one point, ρ(M') ≤ ρ(M); thus the point Jacobi iteration will not be slower.  Also, if G = (I - L[M])^{-1} U[M] is the matrix describing the Gauss-Seidel iteration, then ρ(G') ≤ ρ(G).  As in the proof of Corollary 3.2 we obtain corresponding results for block elimination, though not under the more general assumption ||B[A]|| < 1.  Similar results follow from work on G-matrices [B4].
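The Jacobi and J-SI recurrences above can be sketched directly (our own scalar illustration; β = ||B[A]|| = ½ for the model matrix (-1, 4, -1) serves as the upper bound on ρ):

```python
# Block Jacobi and J-SI (Chebychev-accelerated Jacobi) iterations for a
# scalar tridiagonal system, using the parameter sequence
# nu_1 = 1, nu_2 = 2/(2 - beta^2), nu_{i+1} = (1 - beta^2 nu_i / 4)^{-1}.

def jacobi_apply(a, b, c, x, v):
    """One Jacobi sweep: returns B[A] x + D[A]^{-1} v."""
    N = len(b)
    return [(v[j]
             - (a[j] * x[j - 1] if j > 0 else 0.0)
             - (c[j] * x[j + 1] if j < N - 1 else 0.0)) / b[j]
            for j in range(N)]

def jsi_solve(a, b, c, v, beta, iters):
    N = len(b)
    x_prev = [0.0] * N
    x = jacobi_apply(a, b, c, x_prev, v)        # nu_1 = 1
    nu = 2.0 / (2.0 - beta * beta)              # nu_2
    for _ in range(iters - 1):
        jx = jacobi_apply(a, b, c, x, v)
        x_new = [nu * (jx[j] - x_prev[j]) + x_prev[j] for j in range(N)]
        x_prev, x = x, x_new
        nu = 1.0 / (1.0 - beta * beta * nu / 4.0)
    return x

N = 31
a = [-1.0] * N; b = [4.0] * N; c = [-1.0] * N
v = [2.0] * N
x = jsi_solve(a, b, c, v, 0.5, 30)
res = max(abs(b[j] * x[j]
              + (a[j] * x[j - 1] if j > 0 else 0.0)
              + (c[j] * x[j + 1] if j < N - 1 else 0.0) - v[j])
          for j in range(N))
```

After a few dozen sweeps the residual is at roundoff level, whereas the plain Jacobi iteration with ρ ≈ ½ would still carry a visible error.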
8.B.  Splittings

The block Jacobi iteration is created by "splitting off" the block diagonal portion of A and using the iteration D[A] x^(i+1) = (D[A] - A) x^(i) + v.  Many other iterative and semidirect methods are also defined by a splitting A = M_1 - M_2,

    M_1 x^(i+1) = M_2 x^(i) + v.

When M_1 is a block tridiagonal matrix, various direct methods can be used to solve the system for x^(i+1).

A particularly effective procedure discussed by Concus and Golub [C4] is readily adapted to parallel computation.  The equation

    -∇·(a(x,y) ∇u) = f

on a rectangle R, with a(x,y) ≥ α > 0 and sufficiently smooth, undergoes the change of variable w(x,y) = a(x,y)^{1/2} u(x,y) and scaling by a^{-1/2} to yield the equivalent equation

    -Δw + p(x,y) w = q

on R, where p(x,y) = a^{-1/2} Δ(a^{1/2}) and q = a^{-1/2} f.  The shifted iteration

    (-Δ + K) w_{n+1} = (K - p) w_n + q,   K a constant,

is solved in the discrete form (-Δ_h + KI) w^(n+1) = (KI - P) w^(n) + Q, where P is a diagonal matrix.  The system (-Δ_h + KI) b = c may be solved by the fast Poisson methods, and Chebychev acceleration is applicable, so the overall method is highly parallel and rapidly convergent.
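A one-dimensional analogue of the shifted iteration can be sketched as follows (our own illustration: p(x) variable, the constant-coefficient solve done by a Thomas tridiagonal solver standing in for a fast Poisson solver, and K taken as the mean of min p and max p, an assumption of the sketch):

```python
# 1-D sketch of the shifted iteration
#   (-Delta_h + K I) w^(n+1) = (K I - P) w^(n) + Q,  P diagonal,
# with the constant-coefficient solve playing the role of a fast solver.

def thomas(a, b, c, v):
    """Tridiagonal solve; 0-based lists, a[0] and c[-1] unused."""
    n = len(b)
    bb = b[:]
    w = v[:]
    for i in range(1, n):
        m = a[i] / bb[i - 1]
        bb[i] = b[i] - m * c[i - 1]
        w[i] = w[i] - m * w[i - 1]
    x = [0.0] * n
    x[-1] = w[-1] / bb[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (w[i] - c[i] * x[i + 1]) / bb[i]
    return x

N = 63
h = 1.0 / (N + 1)
p = [1.0 + (h * (j + 1)) ** 2 for j in range(N)]    # variable coefficient
q = [1.0] * N
K = (min(p) + max(p)) / 2.0                         # shift constant
lo = [-1.0 / h ** 2] * N
di = [2.0 / h ** 2 + K] * N
up = [-1.0 / h ** 2] * N
w = [0.0] * N
for _ in range(20):                                 # shifted iteration
    rhs = [(K - p[j]) * w[j] + q[j] for j in range(N)]
    w = thomas(lo, di, up, rhs)
# compare with the direct solution of (-Delta_h + P) u = q
dd = [2.0 / h ** 2 + p[j] for j in range(N)]
u = thomas(lo, dd, up, q)
err = max(abs(w[j] - u[j]) for j in range(N))
```

Since max_j |K - p_j| is small relative to the smallest eigenvalue of -Δ_h plus K, the contraction factor is tiny and the iteration converges in a handful of steps.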
8.C.  Multi-line Orderings

There are a number of classical iterative schemes which involve the direct solution of block tridiagonal systems.  The underlying technique of the multi-line iterative methods is to divide a square grid, or a long and broad rectangular grid, into a set of long and thin rectangles.  A discrete form of the elliptic equation is then solved over each subrectangle as part of the iteration.  The ordering of unknowns and partitioning for 2-line iteration on a 6×6 grid is shown in Figure 8.1, and a diagram of the resulting matrix is shown in Figure 8.2.

[Figure 8.1.  The 2-line ordering of a 6×6 grid.]

[Figure 8.2.  Structure of the matrix under the 2-line ordering.]

By partitioning the grid into groups of a few lines we obtain block tridiagonal systems that can be solved well by general block
methods, particularly the semidirect block methods.  The multi-line block SOR method is probably the simplest and should be considered as a benchmark for other methods.  In the splitting A = M_1 - M_2 for a k-line iteration on an n×n grid, n = kn', M_1 consists of n' decoupled block tridiagonal matrices with k×k blocks, and the block dimension is N = nn' = n²/k.  By storing zeros in strategic locations, M_1 may be regarded as one block tridiagonal matrix, so that systems involving M_1 can be solved essentially as for the general case on a parallel or pipeline computer.  Lambiotte [L1] points out that, because the blocks are decoupled, only log n odd-even reductions are needed rather than log N.  When incomplete odd-even reduction is used, the number of lines may be chosen according to the guidelines given in Section 7.C for Poisson's equation.

[Figure 8.3.  The single spiral ordering of a 6×6 grid.]

[Figure 8.4.  Structure of the matrix under the single spiral ordering.]

8.D.  Spiral Orderings

Another ordering of unknowns useful for computational purposes is the single spiral order diagrammed in Figures 8.3 and 8.4.  Geometrically, the spiral order produces a very long and thin region out of the grid, and the central portion of the resulting matrix A is tridiagonal.  The single spiral order has been used with the SOR method [B9], [E2]; here another splitting is considered.  In the splitting A = M_1 - M_2 we take M_1 to be the tridiagonal center of A.  For an n×n grid M_1 is n²×n² and has ||B[M_1]|| about ½; thus incomplete odd-even reduction can be used quite efficiently to solve systems involving M_1, and Chebychev acceleration of the iteration is applicable.  A related splitting takes M_1 to be the block diagonal of peripheral blocks, which are essentially periodic tridiagonal matrices [B10]; the convergence rate of the peripheral block iteration has been found experimentally to be similar to that of the 2-line iteration.
The double spiral order (Figures 8.5 and 8.6) is an attempt to combine features of the single spiral order with the advantages of a 2-cyclic ordering.  For an n×n grid, n odd, one spiral contains ½(n² + 2n - 1) points and the other contains ½(n² - 2n + 1) points.  Multi-spiral orders can also be defined.

For the single and double spiral orders, when n is even the trailing submatrix of A should be included in the implicit part of the iteration, M_1.  This can be done at little extra cost, and ensures that each row of M_2 has at most two non-zero elements.  When n is odd a similar modification of the ordering should be made.

The mapping from the natural (row-by-row) ordering to the single spiral ordering is defined by a permutation of the vector T = (1, 2, ..., n²) into a new vector S = (S(1), S(2), ..., S(n²)); point (i,j) of an n×n grid is unknown (i-1)n + j in the natural ordering.  For simplicity let n = 2m - 1.  Given T, two vectors forming the coordinates (i,j) of each unknown are computed by taking quotient and remainder:

    q(k) = ⌊(k-1)/n⌋,       (1 ≤ k ≤ n²),
    i(k) = 1 + q(k),        (1 ≤ k ≤ n²),
    j(k) = k - n q(k),      (1 ≤ k ≤ n²).

Define

    p(i,j) = min(i, j, 2m-i, 2m-j),   (1 ≤ i, j ≤ n).

The point (i,j) is part of peripheral number p(i,j), 1 ≤ p(i,j) ≤ m.
Figure 8.5. The spiral ordering of a 6×6 grid.

Figure 8.6. The double spiral ordering.
Peripheral p has 8(m-p) points in it, except for peripheral m, which contains the single point (m,m). The peripheral ordering is per. 1, per. 2, ..., per. m, where each peripheral is taken in clockwise order: top, rhs, bottom, lhs. Define

   f(i,j) = Σ_{k=1}^{p-1} 8(m-k) = 4(2m-p)(p-1),    p = p(i,j),

the number of elements contained in peripherals 1, 2, ..., p(i,j) - 1, and

   g(i,j) = f(i,j) + (i-p) + (j-p) + 1               (if i ≤ j),
   g(i,j) = f(i,j) + 8(m-p) - (i-p) - (j-p) + 1      (if i > j).

The peripheral ordering is then defined by S(k) = g(i(k), j(k)). The vectors q, i, j, p, f, g and S are easily evaluated, and S may be generated quite efficiently on the CDC STAR by using the built-in vector instructions, though these should be used sparingly since they are expensive. Once S is generated, the necessary permutation of the elements of A and v can be performed directly. Of course, the actual reordering of a vector may be performed on a parallel or pipeline computer, but the implementation of an iterative method based on an "unnatural" ordering involves much more attention to machine capabilities than is presented here; our purpose is only to show the potential of some other methods, and a comparison based on actual computation has not been performed. For a more complete developmental discussion see Lambiotte [L1].
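The quotient/remainder and peripheral formulas above can be sketched in Python (a modern illustration, not part of the thesis; the function name is ours). It builds the permutation S directly from p, f and g:

```python
def peripheral_ordering(n):
    """Peripheral ordering of an n x n grid, n = 2m-1 odd.

    Returns the list S with S[k-1] = g(i(k), j(k)): the position, in
    the peripheral ordering, of natural unknown k = (i-1)*n + j.
    Peripherals are traversed from the outside in, each clockwise
    (top row, right-hand side, bottom row, left-hand side)."""
    assert n % 2 == 1, "formulas assume n = 2m - 1"
    m = (n + 1) // 2
    S = []
    for k in range(1, n * n + 1):
        q = (k - 1) // n                         # quotient
        i, j = 1 + q, k - n * q                  # coordinates of unknown k
        p = min(i, j, 2 * m - i, 2 * m - j)      # peripheral number
        f = 4 * (2 * m - p) * (p - 1)            # points in peripherals 1..p-1
        if i <= j:                               # top row and right-hand side
            g = f + (i - p) + (j - p) + 1
        else:                                    # bottom row and left-hand side
            g = f + 8 * (m - p) - (i - p) - (j - p) + 1
        S.append(g)
    return S
```

For n = 3 this gives S = (1, 2, 3, 8, 9, 4, 7, 6, 5): the outer peripheral is numbered clockwise 1..8 and the center point (2,2) becomes 9.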
8.E. Parallel Gauss

We close our section on iterative methods for block tridiagonal systems with some brief remarks on the Parallel Gauss iteration due to Traub [T2], with extensions by Heller, Stevenson and Traub [H4]. For simplicity we consider the normalized tridiagonal form A = (a_j, 1, c_j). In the block LU factorization A = (ℓ_j, 1, 0)(0, d_j, c_j), let D = diag(d_1, ..., d_N). Since D = I - L[A]D⁻¹U[A], the Parallel Gauss iteration is

   D^(i) = I - L[A](D^(i-1))⁻¹U[A],

where D^(0) is some initial approximation to D. This forms the first of three parts of the Parallel Gauss method; the other two parts are Jacobi iterations to solve the block bidiagonal L and U systems. It is shown in [H4] that, with λ = λ(A) = 4 max_j |a_j c_{j-1}| < 1, we have linear convergence,

   ||D - D^(i)|| ≤ σ ||D - D^(i-1)||,    σ = (1 - √(1-λ))/(1 + √(1-λ)).

Techniques for setting up the Parallel Gauss iterations and estimates of the relevant constants are given in [H4]. As Stone [S2] points out, this linear convergence is much less attractive than the quadratic convergence of odd-even reduction. However, in certain situations (parabolic equations, discussed here) it is necessary to solve a sequence of related tridiagonal systems, and this would hopefully provide suitable initial estimates.

Let A' = (a'_j, 1, c'_j) and suppose that D' has already been computed. Since d'_j - d_j = a_j c_{j-1}/d_{j-1} - a'_j c'_{j-1}/d'_{j-1}, by Theorem 3.4 we have

   ||D' - D|| = max_j |d'_j - d_j| ≤ 1 - ½(√(1-λ) + √(1-λ')),

and we may take D' as an initial estimate for the Parallel Gauss iteration on A.
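In scalar terms the iteration reads d_j^(i) = 1 - a_j c_{j-1}/d_j-1^(i-1), with every component updated simultaneously from the previous sweep. A minimal sketch (ours, not the thesis's; sequential Python stands in for the parallel update):

```python
def parallel_gauss_diag(a, c, d0, iters):
    """Parallel Gauss iteration for the pivots of a normalized
    tridiagonal A = (a_j, 1, c_j): every d_j is updated at once from
    the previous sweep, so each sweep is fully parallel across j.
    a[0] and c[-1] are unused; d_1 = 1 is always the exact first pivot."""
    n = len(a)
    d = list(d0)
    for _ in range(iters):
        d = [1.0] + [1.0 - a[j] * c[j - 1] / d[j - 1] for j in range(1, n)]
    return d

def lu_diag(a, c):
    """True pivots d_j of the LU factorization, computed sequentially."""
    n = len(a)
    d = [1.0]
    for j in range(1, n):
        d.append(1.0 - a[j] * c[j - 1] / d[j - 1])
    return d
```

Since d_1 = 1 is always exact, component j becomes exact after j-1 sweeps; before that, the error contracts linearly with the ratio σ above.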
The Parallel Gauss iteration is thus most useful when good initial approximations are available. When λ and λ' are not too close to 1, the above bound on ||D' - D|| will be small and D' will be a good starting value. When λ and λ' are close to 1 the bound will be large, but in this case the iteration will probably not be used anyway, due to slow convergence.

For tridiagonal matrices we have d'_1 - d_1 = 0 and

   d'_j - d_j = a_j (d'_{j-1} - d_{j-1}) (d'_{j-1})⁻¹ d_{j-1}⁻¹ c_{j-1} + (d'_{j-1})⁻¹ (a_j c_{j-1} - a'_j c'_{j-1}),

so that

   |d'_j - d_j| ≤ ((1 - √(1-λ))/(1 + √(1-λ'))) |d'_{j-1} - d_{j-1}| + 2|a_j c_{j-1} - a'_j c'_{j-1}|/(1 + √(1-λ')).

Let Δ = max_j |a_j c_{j-1} - a'_j c'_{j-1}| ≤ (λ + λ')/4, a = (1 - √(1-λ))/(1 + √(1-λ')), b = 2Δ/(1 + √(1-λ')), and δ_j = |d'_j - d_j|. Then δ_1 = 0 and δ_j ≤ a δ_{j-1} + b, so by induction

   δ_j ≤ (1 + a + ... + a^{j-2}) b < b/(1-a).

Taking √(1-λ*) = (√(1-λ) + √(1-λ'))/2, λ* being an intermediate value between λ and λ', we have approximately δ_j ≤ Δ/√(1-λ*). When λ* is not so close to 1, a small value of Δ may give more information than the bound derived from the theorem; if λ* is close to 1 the bound will be large, but here too the iteration should not be used.

Let us now consider the parabolic equation

   u_t = (a u_x)_x,    a = a(t,x) > 0,    u = u(t,x),

for 0 ≤ x ≤ 1, 0 < t, with u(0,x), u(t,0), u(t,1) specified. Let u^i_j ≈ u(ik, jh), k = Δt, h = Δx, ρ = k/h². The Crank-Nicolson method for advancing the solution one time step must solve the tridiagonal system

   (I + (ρ/2) P^{i+1}) u^{i+1} = (I - (ρ/2) P^i) u^i,

where P^i = (-a^i_{j-1/2}, a^i_{j-1/2} + a^i_{j+1/2}, -a^i_{j+1/2}).
We therefore have

   A = (-ρ a^{i+1}_{j-1/2}/2,  1 + ρ(a^{i+1}_{j-1/2} + a^{i+1}_{j+1/2})/2,  -ρ a^{i+1}_{j+1/2}/2);

let us now drop the superscript i+1 for convenience. A will always be strictly diagonally dominant, and we would like to know when we also have λ(A) < 1. Because a(t,x) is differentiable, the mean value theorem lets us define b_j and b̄_j by

   a_{j-1/2} + a_{j+1/2} = 2a_{j-1/2} + h b_j,    b_j = a_x((i+1)k, x̃_j),
   a_{j-1/2} + a_{j+1/2} = 2a_{j+1/2} - h b̄_j,   b̄_j = a_x((i+1)k, x̄_j).

After some algebraic manipulation it is seen that

   λ(A) = max_j ρ² a²_{j-1/2} (1 + ρ a_{j-1/2} + ρh b_j/2)⁻¹ × (1 + ρ a_{j-1/2} - ρh b̄_{j-1}/2)⁻¹.

In order to guarantee λ(A) ≤ 1 it is sufficient to assume that h and k satisfy (ρ/2) h |a_x(t,x)| ≤ 1. If m_1 = max |a_x(t,x)|, this becomes the condition

   m_1 k ≤ 2h.

Suppose we want the more stringent condition λ(A) ≤ (1 + c)⁻², c > 0. This may be satisfied if we choose h and k such that, for each j,

   c ρ a_{j-1/2} ≤ 1 + ρh b_j/2,    c ρ a_{j-1/2} ≤ 1 - ρh b̄_j/2.

With m_0 = max a(t,x), a sufficient condition is c ρ m_0 ≤ 1 - ρh m_1/2, or

(8.1)   k ≤ 2h²/(2c m_0 + h m_1).

For c = 0, (8.1) reduces to the previous condition on h and k, while for c = h we have k ≤ 2h/(2m_0 + m_1) = O(h).

Now, one of the best features of the Crank-Nicolson method is that k may be chosen to be O(h), and we have just seen that a small value of c still allows this choice. Conversely, if k = rh satisfies (8.1) and r m_1 < 2, then c ≤ h(1 - r m_1/2)/(r m_0) = O(h). When h and k are chosen in the usual way (k = O(h)), λ(A) will be close to 1, and the Parallel Gauss iteration cannot be expected to be very effective for this particular application; odd-even reduction would be preferred. When we try to pick h and k so that the method will be effective (c >> h) we are necessarily restricted to small time steps. Thus the Parallel Gauss method appears to be less attractive than originally believed. However, once several time steps have been performed, the changes in A, and hence Δ, may be sufficiently small to warrant use of Parallel Gauss and its variations.
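Condition (8.1) is easy to evaluate; a small helper (ours, with hypothetical names) makes the two limiting cases visible:

```python
def cn_step_bound(h, c, m0, m1):
    """Largest time step k allowed by (8.1): k <= 2h^2/(2*c*m0 + h*m1).
    Here m0 = max a(t,x) and m1 = max |a_x(t,x)|; the parameter c
    controls how far lambda(A) <= (1+c)^-2 is pushed below 1
    (c = 0 recovers the weaker condition m1*k <= 2h)."""
    return 2.0 * h * h / (2.0 * c * m0 + h * m1)
```

With m_0 = m_1 = 1 and h = 0.01, c = 0 gives k ≤ 2h/m_1 = 0.02, while c = h gives k ≤ 2h/(2m_0 + m_1): both are O(h), as noted above.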
9. APPLICATIONS

In this section we discuss two applications of practical importance in which the assumption ||B[A]|| < 1 is not met; in the second case it is shown how to get around this problem.

9.A. Curve Fitting

Cubic spline interpolation of the data (x_i, f_i), 1 ≤ i ≤ N+1, gives rise to the tridiagonal system [C6, p. 238]

   h_i s_{i-1} + 2(h_i + h_{i-1}) s_i + h_{i-1} s_{i+1} = 3(f[x_{i-1},x_i] h_i + f[x_i,x_{i+1}] h_{i-1}),

where h_i = x_{i+1} - x_i. We have A = (h_i, 2(h_i + h_{i-1}), h_{i-1}) and ||B[A]|| = 1/2, which is quite satisfactory. However, more general problems of forming piecewise polynomial approximations need not behave so nicely, even though they might give rise to block tridiagonal matrices. Cox [C7] presents an algorithm for least-squares curve fitting with piecewise cubic polynomials, which must solve a bordered system of block form in which the normal-equation blocks H_j alternate on the diagonal with zero blocks, bordered by the constraint blocks C_j^T and C_j.
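The claim ||B[A]|| = 1/2 for the spline system holds for every interior row regardless of the knot spacing, since the off-diagonal entries h_i and h_{i-1} sum to exactly half the diagonal 2(h_i + h_{i-1}). A quick check (our sketch, not from the thesis):

```python
def spline_offdiag_ratio(x):
    """Row-wise ratio of off-diagonal to diagonal mass in the cubic
    spline system  h_i s_{i-1} + 2(h_i + h_{i-1}) s_i + h_{i-1} s_{i+1}.
    The ratio is (h_i + h_{i-1}) / (2(h_i + h_{i-1})) = 1/2 for any
    knot spacing, i.e. ||B[A]|| = 1/2."""
    h = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    ratios = []
    for i in range(1, len(h)):           # interior rows only
        diag = 2.0 * (h[i] + h[i - 1])
        off = h[i] + h[i - 1]            # |subdiagonal| + |superdiagonal|
        ratios.append(off / diag)
    return ratios
```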
The v vectors contain the coefficients of the Chebychev polynomial expansion of the piecewise polynomial, and the remaining vectors contain multipliers derived through the constrained least-squares problem. The C_j's always have full rank, and if there are sufficiently many data points between adjacent pairs of knots then the H_j's will all be positive definite. When insufficient data is available the H_j's will not have full rank, and B[A] is not even defined for the most obvious partitioning. It seems that the condition ||B[A]|| < 1 will not hold for any reasonable partitioning when the number of knots is large; in this case a straightforward band elimination may be used.

9.B. Finite Element Approximations to Differential Equations

In Section 2.A we noted that matrices arising from finite element approximations to differential equations frequently do not satisfy ||B[A]|| < 1, and that some other assumptions must often be made, such as the assumption that A is positive definite, to establish numerical stability in a direct method or convergence in an iterative method. The following example, admittedly a simple one, shows how an easily implemented change of variables can yield a new system A'x' = v' such that ||B[A']|| < 1 for some norm || · ||. Even if the change of variables is not actually used in the computation, it serves to establish numerical stability, since it shows that elimination without pivoting is stable for the original problem.

Let us begin with Hermite cubic elements for the equation -u'' = 1 on the unit interval. Strang and Fix [S4, p. 59] derive the 4th order accurate equations
(9.1)
   -(6/5)(u_{j+1} - 2u_j + u_{j-1})/h² + (1/5)(u'_{j+1} - u'_{j-1})/(2h) = 1,
   -(1/5)(u_{j+1} - u_{j-1})/(2h) + (1/5)u'_j - (1/30)(u'_{j+1} - 2u'_j + u'_{j-1}) = 0.

Taking the natural choice x_j = (u_j, u'_j)^T, we obtain the block equation

   a x_{j-1} + b x_j + c x_{j+1} = (1, 0)^T.

The matrix A = (a, b, c) is positive definite, but ||B[A]|| = 1/4 + 3/(4h). The problem is simply that the unknowns are out of scale. A second set of unknowns, which are in scale, can be derived from a more natural choice,

   x_j = (u_j - (h/2)u'_j, u_j + (h/2)u'_j)^T ≈ (u_{j-1/2}, u_{j+1/2})^T,

suggested by the fact that the Hermite cubics can be obtained from B-splines by coalescing adjacent knots. We now have the equation

   a x_{j-1} + b x_j + c x_{j+1} = (h² · 1, h · 0)^T,

obtained by multiplying the first equation of (9.1) by h² and the second by h,
where

(9.2)
   a = (1/2) [ -1     -7/5  ]     b = (1/2) [ 12/5    12/5 ]     c = (1/2) [ -7/5    -1   ]
             [ 1/6     1/30 ],              [ -8/15   8/15 ],              [ -1/30   -1/6 ].

The blocks are independent of h, and we have ||B[A]|| = 1 with ρ_k(B[A]) < 1 when suitable boundary conditions are included in the first and last block rows. (Actually, this assumes Dirichlet boundary conditions; Neumann conditions give ρ_k(B[A]) = 1.)

For the more general equation -u'' + Qu = f, Q a constant, the following equations are derived [S4, p. 59]:

   (1/(30h))(-36 u_{j-1} - 3h u'_{j-1} + 72 u_j - 36 u_{j+1} + 3h u'_{j+1})
      + (Qh/420)(54 u_{j-1} + 13h u'_{j-1} + 312 u_j + 54 u_{j+1} - 13h u'_{j+1}) = F_j,

   (1/30)(3 u_{j-1} - h u'_{j-1} + 8h u'_j - 3 u_{j+1} - h u'_{j+1})
      + (Qh²/420)(-13 u_{j-1} - 3h u'_{j-1} + 8h u'_j + 13 u_{j+1} - 3h u'_{j+1}) = F'_j.

Again taking x_j = (u_j - (h/2)u'_j, u_j + (h/2)u'_j)^T and suitably scaling the equations, they become

   a x_{j-1} + b x_j + c x_{j+1} = (hF_j, F'_j)^T,

where now the blocks are O(h²) perturbations of the blocks in (9.2). Omitting details, for h sufficiently small this gives

   ||B[A]|| = 1 - γ Qh² + O(h⁴)

for a positive constant γ.
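As a numerical check of the two norms quoted for -u'' = 1 (our sketch, using exact rational arithmetic and the block entries given above; the helper names are ours):

```python
from fractions import Fraction as F

def inv2(M):
    """Inverse of a 2x2 matrix of Fractions."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def jacobi_norm(a, b, c):
    """||B[A]||_inf for constant blocks: max row-abs-sum of the pair
    (-b^-1 a, -b^-1 c)."""
    bi = inv2(b)
    P, Q = mul2(bi, a), mul2(bi, c)
    return max(sum(abs(P[i][j]) + abs(Q[i][j]) for j in range(2))
               for i in range(2))

h = F(1, 100)
# natural unknowns (u_j, u'_j): block coefficients of (9.1)
a_n = [[-F(6, 5) / h**2, -1 / (10 * h)], [1 / (10 * h), -F(1, 30)]]
b_n = [[F(12, 5) / h**2, F(0)], [F(0), F(4, 15)]]
c_n = [[-F(6, 5) / h**2, 1 / (10 * h)], [-1 / (10 * h), -F(1, 30)]]
# rescaled unknowns (u_j - h/2 u'_j, u_j + h/2 u'_j): blocks (9.2)
half = F(1, 2)
a_s = [[-half, -half * F(7, 5)], [half * F(1, 6), half * F(1, 30)]]
b_s = [[half * F(12, 5), half * F(12, 5)], [-half * F(8, 15), half * F(8, 15)]]
c_s = [[-half * F(7, 5), -half], [-half * F(1, 30), -half * F(1, 6)]]
```

With h = 1/100 the natural unknowns give ||B[A]|| = 1/4 + 3/(4h) = 75.25, while the rescaled unknowns give exactly 1.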
Since rescaling the nodal unknowns is equivalent to a change of basis in the approximating function subspace, and the process carries over in general, it is natural to ask whether the technique should be applicable to more general finite element approximations — for example, to spline approximations, in which collapsing knots yield the Hermite basis. Strictly in terms of modifying the linear equations, the underlying problems are the following:

1. Given an arbitrary block tridiagonal matrix A, find a block diagonal matrix E such that ||B[AE]|| < 1, yielding a system solvable by block elimination.

2. In 1., if ||B[A]|| < 1, find E to minimize ||B[AE]||, or at least find E such that ||B[AE]|| < ||B[A]||.

These problems are closely related to the optimal diagonal scaling of a matrix to minimize its condition number, and as such are outside the scope of this thesis.
10. IMPLEMENTATION OF ALGORITHMS

This section contains a simplified comparison of algorithms for block tridiagonal linear systems for use on a vector computer, later specialized to the CDC STAR-100 and n×n blocks with n = 2. The comparison will be in two parts: first for a very idealized model of a vector computer, in which vectors may be defined as storage locations in arithmetic progression, and second for a model in which vectors must be defined as contiguous storage locations (increment = 1), as required on STAR. The first model uses simplified and idealized execution times, ignoring, among other things, the data manipulation required by the stringent conditions on vector operands in the second model. This double comparison allows us to pinpoint the effect of machine constraints on the choice of algorithms. All problems are assumed to be core-contained.

The implementation of algorithms for tridiagonal linear systems, n = 1, is adequately discussed elsewhere ([H4], [J1], [L2], [M1], [S2]), and these references may be considered as background material for this section. For a description of the STAR facilities necessary for this discussion, see Section 2.B and Appendix A.

10.A. Choice of Algorithms (Generalities)

The salient features of our analysis may be applied to block tridiagonal systems in general; we first make some general remarks and later give a detailed comparison for a small uniform block size. Sometimes it may be possible to choose the data layout and algorithm in order to reduce computing costs or storage requirements, and when computing costs are high (such as for one enormous problem, or a large problem solved repeatedly) an extra effort must be made to choose the most efficient algorithm, or to estimate the computational error.
We feel that it is not at all difficult to choose good algorithms in many general cases. On a standard sequential computer, band or profile elimination methods are appropriate; on a k-parallel computer a combination of sequential and vector methods should be used, and on a vector computer vector methods should be used. In our opinion, some form of block LU factorization and odd-even reduction should be implemented as a set of subroutines for general problems, or used as a module in the iterative solution of large sparse problems as indicated in Section 8. Options for special scalings or pivoting strategies should be included for specific problems which, for efficiency or stability, are better treated in a different manner.

These remarks apply directly to algorithms for small-block problems. Implementations for moderate-sized blocks can be greatly simplified by a restriction to a small uniform block size and the insertion of special cases to deal with the block size restriction, though the efficiency gained may or may not be worth the effort involved. In Lambiotte's opinion [L1], the band methods represent a good choice on the CDC STAR for moderate-sized blocks, say larger than 8×8, but this choice is heavily influenced by the vector addressing restrictions of the STAR (cf. our remarks in Section 10.B). Either the LU factorization or odd-even reduction may also be useful in a cross-over polyalgorithm, or for use as a semidirect method; the best choice between them is not clear at this time.
The third basic issue is the use of block tridiagonal algorithms within iterative methods for elliptic equations, as discussed in Section 8. There are several kinds of algorithms to choose from, and all have considerable inherent parallelism. Known results for parallel or pipeline computers deal mainly with the fast Poisson solvers [E3] and iterations based on the splittings of Section 8; Brandt's multi-level adaptive technique [B13] must also be seriously considered. The data requirements of such problems can be so large that secondary storage must be used, and the choice of data transfers between main and secondary memory will heavily influence the choice of algorithm. Complete implementation studies have not been made, but flexibility and ease of implementation suggest that the block tridiagonal solvers be developed accordingly, since they are useful in other contexts as well.

Finally, we note that the hypermatrix scheme [R4], which has been extensively exploited in ASKA, the structural analysis system developed at the University of Stuttgart [N1] and considered for vector computers by Noor and Voigt, is a partitioned elimination scheme in which block sizes are chosen to minimize transfers between levels of memory. Suppose π and π' are partitions such that π refines π', and let A_ππ and A_π'π' be the matrix partitioned accordingly; elimination is performed on A_π'π' x_π' = v_π'. When A_ππ is block tridiagonal (see Section 2.A), so is A_π'π', and the diagonal blocks of A_π'π' will themselves be block tridiagonal of low order. As noted in Sections 2.A and 3.A.2, if ||B[A_π]|| < 1 then ||B[A_π']|| ≤ ||B[A_π]||, and pivoting will not be needed. The factorization can then proceed with the bulk of A in secondary memory and only a few blocks of A_π'π' resident in main memory, which is precisely what is desired.
10.B. Storage Requirements

Data layout is an essential part of planning for implementation on any parallel or vector computer, since an algorithm's computational speed can be severely decreased if data cannot be accessed conveniently. On a vector computer this means, of course, that conceptual vectors must conform to the machine's definition of a vector operand. We consider some consequences of this requirement for implementation on the CDC STAR-100, assuming that the problem is core-contained.
Storage on the STAR consists of 256 registers and either 512K or 1M words of main memory, for 64-bit words. A sizable disc storage is also available, but will not be considered here. A vector consists of s consecutive storage locations, 1 ≤ s < 64K.

Our data arrangement for block tridiagonal linear systems with uniform n×n blocks is illustrated in fig. 10.1 for the case n = 2, N = 7. Define m = ⌈log₂ N⌉, N* = 2^m, and if N < N* then extend Ax = v with N* - N trivial equations x_i = 0. The matrix A and vector v can then be held in three vectors of length n²N* and one of length nN*. Taking the subdiagonal blocks as an example, the n² components of a_k are a_{ij,k}, 1 ≤ i, j ≤ n, 1 ≤ k ≤ N*, and the storage vector a is defined by

   a((i-1)N*n + (j-1)N* + k) = a_{ij,k}.

Thus a total of (3n² + n)N* locations are needed to store the original system. We do not consider the cost of setting up the storage vectors, since each of our algorithms will use the same arrangement.
Fig. 10.1. Data layout for n = 2, N = 7 (the vectors a, b, c, v with components a_{ij,k}, b_{ij,k}, c_{ij,k}, v_{i,k}; entries of the trivial extension equations are set to 0).
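The mapping a((i-1)N*n + (j-1)N* + k) = a_{ij,k} can be sketched as follows (our illustration; 0-based Python lists stand in for STAR storage):

```python
def pack(blocks, n, Nstar):
    """Store N* blocks of size n x n in one storage vector a with
    a[(i-1)*Nstar*n + (j-1)*Nstar + (k-1)] = a_{ij,k} (1-based i,j,k).
    All block entries (i,j) for k = 1..N* are contiguous, so each block
    position forms a machine vector of length N*."""
    a = [0.0] * (n * n * Nstar)
    for k, blk in enumerate(blocks, start=1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                a[(i - 1) * Nstar * n + (j - 1) * Nstar + (k - 1)] = blk[i - 1][j - 1]
    return a
```

For n = 2, N* = 4 the first four entries of a are the (1,1) entries of the four blocks, the next four the (1,2) entries, and so on.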
This particular data layout has been chosen because it makes the odd-even reduction algorithm more efficient for the true STAR model, while the other competitive algorithms (LU factorization, odd-even elimination) are mostly unaffected by the choice of data layout and can do without the extra storage for trivial equations.

Since the STAR forces vector lengths to be < 64K, we must have n²N* < 64K = 65,536. For a given value of n, this restricts the system size as follows:

   n       |     2     3     4     5     6     7     8
   64K/n²  | 16384  7281  4096  2621  1820  1337  1024
   max N*  | 16384  4096  4096  2048  1024  1024  1024
These will probably not be serious restrictions in most cases. A more important restriction on system size is our assumption that the problem will remain core-contained. Assuming a 512K main store, and supposing that temporary storage amounts to T times the original storage, plus the program, we obtain the following restrictions on N from (T + 1)(3n² + n)N ≤ 512K:

   n      |     2     3     4     5     6     7     8
   T = 1  | 18724  8738  5041  3276  2299  1702  1310
   T = 2  | 12483  5825  3360  2184  1533  1134   873
   T = 3  |  9362  4369  2520  1638  1149   851   655

with the corresponding restrictions on N*:

   n      |     2     3     4     5     6     7     8
   T = 1  | 16384  8192  4096  2048  2048  1024  1024
   T = 2  |  8192  4096  2048  2048  1024  1024   512
   T = 3  |  8192  4096  2048  1024  1024   512   512
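The two tables follow directly from (T + 1)(3n² + n)N ≤ 512K; a short computation (our sketch) reproduces them:

```python
def max_N(n, T, memory=512 * 1024):
    """Largest N with (T+1)*(3n^2+n)*N <= memory words
    (program space ignored), per the core-containment restriction."""
    return memory // ((T + 1) * (3 * n * n + n))

def max_Nstar(n, T, memory=512 * 1024):
    """Largest power of two N* = 2^m compatible with max_N."""
    N, p = max_N(n, T, memory), 1
    while 2 * p <= N:
        p *= 2
    return p
```

For example max_N(2, 1) = 18724 and max_Nstar(8, 2) = 512, matching the tables.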
The amount of temporary storage depends on the algorithm and the extent to which A and v are overwritten. It can range from practically no temporary storage (other than O(n²) registers) for the LU factorization, to T ≤ 2 for the basic odd-even reduction, and T ≤ m for an inefficient implementation of odd-even elimination.

We now consider the basic algorithms with this data layout in mind.

The LU factorization is executed mostly in scalar mode on STAR, so the major considerations are the speed of scalar arithmetic and the costs of register-memory transfer. The only data manipulation needed is to load/store scalars between main memory and the register file, and to a considerable extent this may be overlapped with the arithmetic operations. Our estimates of execution time are based on an assembly language cycle count, counting arithmetic and load/stores and assuming minimal overlap; the programs are untested, and actual run times will probably be greater. (Times given are for n = 2; for details, see Appendices A, B.) We suggest four possible implementations: Version 1, a one-time complete solution, and Version 2, factor + solve, in three variants:

   Version 1, all scalar mode:  time ≥ 508N - 362.
   Version 2, factor + solve:
      version a, time ≥ 324N + 3421;
      version b, time ≥ 374N + 1769;
      version c, time ≥ 420N - 280   (use for N ≲ 38),
                 time ≥ 303N + 1039  (use for N ≳ 38).
The three versions of factorization use increasingly more vector operations and increasingly fewer scalar operations. If more than one system must be solved with the same matrix A, it is best to use Version 2; the execution time comparison in Section 10.C will use Version 1.

Odd-even elimination is executed in vector mode. An important feature of this algorithm is that the conceptual vectors all correspond to storage locations, and hence no additional data manipulation is needed (cf. [M1] for the tridiagonal case).

Odd-even reduction is also executed in vector mode. In a model computer whose vector operands may be defined by arithmetic progressions, no additional data manipulation is needed. STAR, however, requires vectors to be contiguous storage locations, and we need two basic data manipulation operations: (1) separate a vector into its even and odd components, and (2) rejoin the two parts of a separated vector. The first is accomplished with two STAR compress instructions, the second with a STAR merge instruction. The odd-even separation must be applied to the vectors a, b, c, and v before each reduction step can be performed; we illustrate the separation for b in fig. 10.2. Having set up a control vector z consisting of n²N*/2 bit pairs 10 (a single vector operation on STAR), we execute two compresses, the first controlled by z and the second by its complement:

   b₁(j) = b(2j-1),             (1 ≤ j ≤ ½n²N*),
   b₁(½n²N* + j) = b(2j),       (1 ≤ j ≤ ½n²N*).

The cost for each compress is n²N* + (1/16)(n²N*) + 92, and the total cost to
Fig. 10.2. Odd-even separation for n = 2, N = 7, N* = 8 (initial vector b, control vector z, separated vector b₁, and the final state holding components of b^(1), b^(2), b^(3)).
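The separate/rejoin pair can be mimicked with list operations (a sketch of the data movement only; STAR's compress and merge are bit-controlled vector instructions, and the cost model is not represented here):

```python
def compress(v, z):
    """STAR-style compress: keep v[i] where control bit z[i] is set."""
    return [x for x, bit in zip(v, z) if bit]

def odd_even_separate(b):
    """Split b into (odd-indexed, even-indexed) parts, 1-based, using
    two compresses with complementary control vectors."""
    z = [i % 2 == 0 for i in range(len(b))]      # the 10 bit-pair pattern
    return compress(b, z), compress(b, [not bit for bit in z])

def merge(odd, even):
    """STAR-style merge: interleave the two halves back together."""
    out = []
    for x, y in zip(odd, even):
        out += [x, y]
    return out
```

The merge undoes the separation exactly, which is what the back substitution relies on.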
odd-even separate a, b, c, and v is (17/8)(3n² + n)N* + 736. Our initial extension of A to N* block equations makes it possible to perform this separation step using only 8 vector compress operations, independent of n. If A had not been extended, we would need either 2(3n² + n) vector operations to compress the vectors one component subvector at a time, or n² vector operations to generate a modified control vector from z(1:N) followed by the original 8 compress operations. Lambiotte [L1] rejected the use of odd-even reduction for n > 1 partly on this basis, since the natural approach requires many short vectors; the net effect of our extended data layout is to reduce the number of vector startups, and hence the startup costs, despite the increased vector lengths and storage.

In general, for the ith reduction we have N*_i = 2^{m+1-i}, the cost of the separation is (17/8)(3n² + n)N*_i + 736, and the total cost for m-1 reductions is (17/4)(3n² + n)(N* - 2) + 736(m-1). For the case n = 2 this yields 59.5N* + 736m - 855. We will see shortly that this is roughly half as much time as is spent on the arithmetic, so the overhead of dealing with a restrictive vector definition can be considerable. The number of reductions is also increased, from log N to m = ⌈log₂ N⌉, and this may be a more important consideration in some cases.

Data manipulation is also needed for the back substitution, which computes the odd-indexed components of x^(i) from x^(i+1), the even-indexed components of x^(i). This is simplified by saving copies of A^(i) in appropriate form for vector operations (see fig. 10.2 for the final state of b).
Having computed x^(i+1) = the even components of x^(i), we compute the odd components by vector operations and merge the two sets into x^(i). The merge operation on STAR is comparatively more expensive than the compress operation, but here we are dealing only with vectors consisting of components of x, and not blocks of A. The cost of the merge at stage i is 3nN*_i + 123, and the total cost for m-1 reductions is 6n(N* - 2) + 123(m-1); for n = 2 this yields 12N* + 123m - 147.

Our data arrangement generalizes readily to p-fold reduction, with m = ⌈log_p N⌉. The odd-even separation generalizes to separation into p vectors, which requires p compress instructions per storage vector, at cost (p + 1)(3n² + n)N*_i + 368p per reduction. The back substitution requires p - 1 merge instructions per stage to go from x^(i+1) to x^(i), merging p vectors into one, at cost roughly 3nN*_i + 123(p-1) per stage; suitable control vectors can readily be generated. It is clear from our discussion that these costs are minimized by p = 2.

Jacobi iterations: data manipulation is unnecessary when A and v, or B[A] and D[A]⁻¹v, are stored according to our component subvector scheme.

10.C. Comparative Timing Analysis

For block tridiagonal systems with 2×2 blocks, we close with an execution time comparison of direct, semidirect and iterative methods for the CDC STAR-100. This will expand the crude comparison given in Section 4.D, based on timing statements for vector operations. Essential considerations include vector startup costs for operations on short vectors, the cost of various data manipulations, and the substantial penalty of the many scalar operations in the LU factorization.
Section 10.B began our discussion of the restrictive definitions of vector operands; our initial analysis is based on the following conventions, which simplify the discussion:

1. Algorithms may consist of any sequence of vector operations, where vector operands are storage locations in arithmetic progression; pivoting is never necessary.

2. Execution times consist of arithmetic costs only. We ignore
   a. address calculation, loop control and scalar storage costs,
   b. storage management,
   c. load/store costs for scalar operands,
   d. instruction overlap capabilities, and convergence tests or termination criteria for iterations.

Our second, more realistic analysis demands that vector operands be consecutive storage locations, so we must also consider the data manipulation costs of Section 10.B. Algorithms and timing derivations are given in Appendix B; here we present the main results and comparisons.

Execution times for the basic direct methods are estimated as follows (m = ⌈log₂ N⌉):

1. block LU factorization, using scalar operations only (LU):
      1012N - 784;

2. block odd-even elimination, using vector operations only (OEE):
      94.5Nm + 12999m - 93.5N - 1901;
3. block odd-even reduction, using vector operations only (OER):
      106.5N + 14839m - 14717.5;

4. p-fold reduction, using vector operations only:
      roughly 145N + 19206(log₂ p)m_p - (19351p - 19579), with m_p = ⌈log_p N⌉.

We can first dispose of p-fold reduction with p ≥ 3: LU or OER is always cheaper. For the remaining methods, execution time ratios are

                  N = 1023    N = 8191
   OER : LU         .23         .13
   OER : OEE        .24         .11

We also note that vector startups are the dominant term for OEE when N ≲ 137, and the dominant term for OER when N ≲ 1532. OEE uses O(N log N) operations vs. O(N) for the other methods; it is faster than LU for 455 ≲ N ≤ 1023, but it is always slower than OER, and soon becomes quite inefficient.

Now consider combining the two methods OER and LU. A polyalgorithm is straightforward: if m ≤ 6 then use LU, otherwise do m - 5 reductions, solve the remaining block system by LU, and back substitute. This reduces the time for OER to 106.5N + 14839m - 46908.5 by reducing the number of vector startups; for N = 1023 it represents a savings of 13%. Our conclusion is that the OER-LU polyalgorithm is the best direct method of those considered here.
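Using the execution time formulas as reconstructed above (n = 2; the function names are ours), the ratio table can be recomputed:

```python
from math import ceil, log2

def t_lu(N):
    """Scalar block LU factorization time estimate, n = 2."""
    return 1012 * N - 784

def t_oee(N):
    """Odd-even elimination time estimate, vector mode, n = 2."""
    m = ceil(log2(N))
    return 94.5 * N * m + 12999 * m - 93.5 * N - 1901

def t_oer(N):
    """Odd-even reduction time estimate, vector mode, n = 2."""
    m = ceil(log2(N))
    return 106.5 * N + 14839 * m - 14717.5
```

For example t_oer(1023) = 242,622, and t_oer/t_lu rounds to .23 at N = 1023 and .13 at N = 8191.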
The importance of these remarks for p-fold reduction is seen by comparing the cost of individual eliminations or reductions with the total cost of the algorithm. For odd-even elimination, each elimination step has roughly equal cost, about 1/m of the total. For odd-even reduction the cost per reduction decreases as the algorithm proceeds (cost for the kth reduction = 106.5(2^{m-k}) + 14839 when N = 2^m - 1); as a fraction of the total cost,

   N = 1023, m = 10 (total cost 242,622):    29%  17%  12%  ...
   N = 8191, m = 13 (total cost 1,050,531):  43%  22%  12%  7%  4%  ...

For small N, when startup costs are dominant, the reduction costs are approximately equal, while for large N, when true arithmetic costs are dominant, the kth reduction costs slightly more than 1/2^k of the whole, the last reductions being only a small part of the total. This has immediate consequences for the effectiveness of semidirect methods, since the omitted reductions cost only a small part of the total when N is large.

For our comparison of semidirect and iterative methods, the goal will be to approximate x with a relative error of 1/N, that is, to reduce the initial error by a factor of 1/N. Thus if we perform either k incomplete odd-even eliminations or k incomplete reductions, we want ||B[H_{k+1}]|| ≤ 1/N or ||B[A^(k+1)]|| ≤ 1/N, assuming ||B[A]|| < 1; Theorem 7.1 is then applied to produce the approximate solution. Taking β = ||B[A]||, we have ||B[H_{k+1}]||, ||B[A^(k+1)]|| ≤ β^(2^k), and demanding β^(2^k) ≤ 1/N yields

   k ≥ log₂((log₂ N) / (-log₂ β)).

The execution time for k eliminations is

   94.5Nk + 12999k + 10.5N - 105(2^k) + 1205,

and the time for k reductions is

   106.5N + 14839k - 96(2^{m-k}) + 1287.
124
I06.5N +
Typical
savings
14839k-
for incomplete
96(2 m-k) + 1287.
reduction
over complete
= .9, pick k >- log21og2N
= 12_
N = 8191,
=
Comparison
pick k -> log21og2N
consider
direct methods.
= 12_
be reduced
pick k _ log21og2N
N = 1023, k = 4, savings
= 33_
N = 8191,
= 16_
iterative
k = 4, savings
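The bound on k is easy to evaluate; a minimal Python sketch of the rule above (the function name is ours):

```python
import math

def reduction_steps(N, beta):
    """Smallest k with beta**(2**k) <= 1/N, i.e. the bound
    k >= log2(log2 N / (-log2 beta)), where beta = ||B[A]||."""
    return math.ceil(math.log2(math.log2(N) / -math.log2(beta)))
```

For β = 1/2, 2/3 and .9 with N = 1023 this gives k = 4, 5 and 7, matching the choices in the list above.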
We next consider iterative methods and their combination with incomplete reduction. One step of the block Jacobi iteration costs 22.5N + 3031, while one step of the block Jacobi semi-iteration (J-SI) costs 26.5N + 3408, plus overhead to estimate parameters such as ρ = ρ(B[A]). Execution time per iteration may be reduced by directly computing B[A] and D[A]⁻¹v as a preprocessing step (cost = 38.5N + 4367, roughly the cost of 3 iterations): the Jacobi step then costs 12N + 1840, and the J-SI step 16N + 2217. The preprocessing should be used when many steps are demanded, while the original form is better when only a few steps are needed.

A conservative estimate of the number of J-SI iterations needed to reduce the initial error by a factor of 1/N is derived in the notes to Appendix B: for ρ = .9, 1.5m iterations; for ρ = 2/3, .75m iterations; for ρ = 1/2, .5m iterations. The iterative solution therefore costs O(N log N) vs. O(N) for the better direct methods, and J-SI by itself is never competitive with the OER-LU polyalgorithm.

As suggested in Section 7.A, we can combine incomplete odd-even reduction with Jacobi iterations. If k reduction steps are used, followed by ℓ Jacobi iterations and back substitution, then k and ℓ should be chosen so that β^(2^k · ℓ) ≤ 1/N, or

(*)    k + log₂ℓ ≥ log₂((log₂N) / (-log₂β)).

The time for the OER-J method would then be

    106.5N + 14839k - 96(2^(m-k)) + 1287 + ℓ(19(2^(m-k)) + 2615),

and it only remains to minimize the time subject to (*). For β = 2/3, N = 1023, the resulting values are k = 2, ℓ = 5, which represent a savings of 37% over OER; for β = 2/3, N = 8191, the resulting values are k = 3, ℓ = 3, which represent a savings of 15% over OER.

Of course, to determine k and ℓ we need to know ||B[A]|| rather than an estimate. ||B[A]|| can be computed directly from A at a cost of 46.5N + 3881, or from B[A] at a cost of 15N + 308. Since half the rows of B[A] are available as part of the first odd-even reduction, we recognize that an estimate can be computed from these rows at a cost of 7.5N + 308. While this reduces the cost considerably, for large N even the cheapest computation would consume much of the benefit gained from using incomplete reduction, which points out the importance of an analytic estimate rather than a computational one.

This ends our initial comparison of algorithms. We now make the more realistic assumptions that vector operands must be consecutive storage locations and that scalar operations may overlap in execution, and we consider the cost of loop control, storage management, and scalar load/store operations. Rather than simply recompute the figures presented earlier, which will be done in a few cases, we try to indicate where the algorithms spend most of their time. The net effect of the machine constraints is to lower the time estimate for the LU factorization substantially and to raise the time estimates for odd-even elimination and reduction. The revised estimates should give a better indication of actual runtime on STAR, but are probably all underestimates. The changes from the first set of timings are (m = ⌈log₂N⌉):

1. LU factorization (LU):  600N - 150  (a 40% decrease, due to use of instruction overlap);
2. odd-even elimination (OEE):  98.5Nm + 13403m - 101.5N - 2261  (a 4% increase, due to overhead costs);
3. odd-even reduction (OER):  183N + 16233m - 16108  (a 10-70% increase due to overhead costs; larger N incurs larger increases).

The revised time ratios show the effect of decreasing the LU time and increasing the OER time:

                N = 1023    N = 8191    limit, N → ∞
    OER : LU      .54         .34          .31
    OER : OEE     .32         .17          0.

The LU factorization is now always faster than OEE, and is faster than OER for N < 312. Vector startup costs are dominant in OEE for N < 152, and in OER for N ≤ 798. The OER-LU polyalgorithm should perform m-6 reductions, solve the remaining system by LU, and back substitute; the time is 183N + 16233m - 70677, yielding a 16% saving over OER for N = 1023 but only a 3% saving for N = 8191.

The execution time for odd-even elimination is allocated as follows:

                        N = 1023    N = 8191
    arithmetic             96%         96%
    overhead                4%          4%
    results                87%         98%
    vector startups        13%          2%

Analysis of the total time for odd-even reduction shows the greater importance of overhead operations:

    N = 1023          arithmetic    overhead
      results             33%          23%        56%
      startups            40%           4%        44%
                          73%          27%

    N = 8191          arithmetic    overhead
      results             52%          37%        88%
      startups            11%           1%        12%
                          62%          38%

    limit, N → ∞      arithmetic    overhead
      results             58%          42%       100%
      startups             -            -          -
                          58%          42%

The increase in time for OER due to overhead is 37% for N = 1023, 61% for N = 8191, and 72% in the limit as N → ∞. As a consequence of the data layout of Section 10.B, overhead costs cannot be ignored, and probably cannot be decreased: the strong requirement that machine vectors consist of contiguous storage locations leads to much "wasted" effort, and the individual vectors are quite small. The time spent in each vector operation, as a fraction of total cost:

    operation        N = 1023    N = 8191    limit, N → ∞
    divide              12%         12%          12%
    multiply            43%         35%          32%
    add/subtract        18%         15%          14%
    compress            20%         29%          33%
    merge                4%          6%           7%
    transmit             3%          3%           3%

The compress operation assumes greater importance as N increases, while the divide remains a constant fraction, reflecting its low startup time. As noted earlier, most of OER's time is spent in the first few reductions and subsequent back substitutions.
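The constrained choice of (k, ℓ) for incomplete reduction plus Jacobi iterations can be reproduced from the timing formulae quoted above; a small Python sketch (function names are ours; the brute-force search is only an illustration, not the thesis's procedure):

```python
import math

def oer_time(N, m):
    # complete odd-even reduction, n = 2, STAR data, arithmetic model
    return 106.5 * N + 14839 * m - 14717.5

def oer_j_time(N, m, k, l):
    # k incomplete reductions + l Jacobi iterations, n = 2, STAR data
    return (106.5 * N + 14839 * k - 96 * 2 ** (m - k) + 1287
            + l * (19 * 2 ** (m - k) + 2615))

def best_k_l(N, beta):
    # minimize the OER-J time subject to constraint (*):
    #   k + log2(l) >= log2(log2 N / (-log2 beta)),  i.e.  2**k * l >= need
    m = round(math.log2(N + 1))
    need = math.log2(N) / -math.log2(beta)
    best = None
    for k in range(1, m):
        l = math.ceil(need / 2 ** k)
        t = oer_j_time(N, m, k, l)
        if best is None or t < best[0]:
            best = (t, k, l)
    return best

t, k, l = best_k_l(1023, 2 / 3)
saving = 1 - t / oer_time(1023, 10)
```

For β = 2/3 this recovers k = 2, ℓ = 5 with a saving of about 37% at N = 1023, and k = 3, ℓ = 3 with about 15% at N = 8191, as stated in the text.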
The cost of the ith reduction as a fraction of the total OER cost, under the revised timings:

    i =                  1     2     3     4     5     6
    N = 1023            33%   19%   12%    8%    7%    6%
    N = 8191            45%   23%   12%    6%    4%    2%
    limit, N → ∞        50%   25%   13%    6%    3%    2%

Of the total time, 86% is spent in the reduction phase and 14% in the back substitution.
One of the effects of the OER-LU polyalgorithm is to reduce the importance of vector startup costs, as seen below. Of course, when startup costs are already small, not much improvement can be made.

                        N = 1023    N = 8191
    scalar time            14%          2%
    vector time            86%         98%
      results              63%         91%
      startups             23%          7%
Incomplete reduction shows similar savings over OER under the revised timings (β = ||B[A]||):

                 N = 1023         N = 8191
    β           k    savings     k    savings
    .9          7      10%       7       5%
    2/3         5      21%       5       9%
    1/2         4      27%       4      13%
Incomplete reduction plus Jacobi iterations (savings over OER, revised timings):

                   N = 1023              N = 8191
    β           k    ℓ    savings     k    ℓ    savings
    .9          3    9      22%       4    6       8%
    2/3         2    5      36%       2    6      16%
    1/2         1    5      47%       2    4      21%
As seen earlier, the semidirect methods are best applied when the omitted reductions make up a significant part of the OER time. There is considerable advantage to using incomplete reduction plus Jacobi iterations, roughly doubling the savings obtained by simple incomplete reduction.
APPENDIX A

VECTOR COMPUTER INSTRUCTIONS

This section describes those features of the CDC STAR-100 needed for our comparison of algorithms in Section 10. The information given here is derived from CDC reference manuals [C2] and timing measurements made at Lawrence Livermore Laboratory, and by no means represents a complete discussion of the STAR's facilities. Execution time is given as the number of 40 nanosecond cycles needed for each instruction, and should be accurate to within 5% except as noted, though actual performance may be degraded somewhat due to memory conflicts. All times are for full-word (64 bit) floating point instructions.

A vector on STAR consists of N consecutive storage locations, 1 ≤ N ≤ 65,536. The vector length restriction is usually not serious, but the need for consecutive storage locations is crucial. Other vector computers, the TI ASC in particular, do not have this restriction. Both machines have the usual scalar instructions as well as vector instructions.
Useful features of vector instructions on STAR include:

broadcast constants - Most vector instructions allow a scalar (the broadcast constant, held in a register) to be transmitted repeatedly to form one or both of the vector operands. Use of this facility decreases execution time in a few cases.

control vectors - Bit vectors can modify the effect of vector instructions, such as suppressing storage to selected components of the result vector, or determining which components of the vector operands are to be used in an operation. Use in vector data manipulation (compress and merge) is explained later.

offset - Modification of the base address of vector operands.

sign control - Certain instructions allow the sign of the operands to be modified before use in an operation.

Individual scalar instructions follow. Three classes of execution time are given. Issue time is the number of cycles after which the next instruction can issue. Shortstop time is the time after which the result is first available, computed but not yet placed in a register; the shortstop result can be used only at the precise time given. Result-to-register time is the number of cycles after issue when the result is placed in a register; if the shortstop cannot be used, the result must be taken from the register. Overlapped execution is possible, but the exact effect on execution time must be determined by actual testing, which we have not done.
STAR scalar instructions:

    instruction    operation                issue    shortstop    result to register
    62             add, normalized            2         6               11
    66             subtract, normalized       2         6               11
    6B             multiply, significant      2        10               15
    6F             divide, significant        3         -               43
    73             square root                3         -               71
    78             transmit (register)        2         4                9
transmit
load/store
tions may be stacked
is processed
required
time is the number
STAR
scalar
issue when
in a register.
Result-to-register
the result
of cycles
instructions.
in the load/store
sequentially
at a rate
unit
Up
to three
at any given
of 19 cycles
load/store
time.
instruc-
The stack
per load and 16 cycles
133
per store minimum,
been referenced
references
assuming
it remains
to the ba_k
to other memory
and reload
and the conflict
of 66 cycles
further
affect
references
load requires
are required
issue
result
to register
at
to store an item
19
7F
s tore
3
-
16
Execution
time for a branch
on the particular
the target
slightly
instruction,
instruction
pessimistic
if we
whether
from 8 to 84
a branch
take 40 cycles
is actu-
stack,
etc.
to be the time
branch.
Execution
time followed
by sequential
appearance
time followed
Vector
part of the startup
by processing
instructions
time.
for instructions
time consists
of result
may not be overlapped
of omitted
below,
items
D8 - DC may vary
of an initial
vector
time of operands
In the table
and V is the number
Times
can range
is in the instruction
instructions.
generated.
ioad/s tore
unit bus__
28
operation
tor.
will
for a single
bank has
this time
3
for a typical
vector(s)
during
load
depending
startup
Once a memory
7E
We will be only
up
for 31 cycles;
to finish
operation
ally made, whether
vector
conflicts.
it.
instructions.
cycles,
Start
and a minimum
STAR
instruction
branch
busy
are delayed
banks.
least 31 cycles,
no memory
components,
if a scalar
except
result
or
is
for a negligible
N is the length
of the input
due to use of a control
as much
start-
as +204.
vec-
STAR vector instructions:

    instruction    operation                time             effect
    82             add, normalized          N/2 + 71         c_i ← a_i + b_i
    86             subtract, normalized     N/2 + 71         c_i ← a_i - b_i
    8B             multiply, significant    N + 159          c_i ← a_i × b_i
    8F             divide, significant      2N + 167         c_i ← a_i ÷ b_i
    93             square root              2N + 155         c_i ← √a_i
    98             transmit                 N/2 + 91         c_i ← a_i
    99             absolute value           N/2 + 91         c_i ← |a_i|
    D8             maximum                  6N - 2V + 95     c ← max_i a_i
    DA             summation                6.5N - 2V + 122  c ← Σ a_i
    DB             product                  6N - V + 118     c ← Π a_i
    DC             inner product            6N - V + 130     c ← Σ a_i b_i

The compress instruction transmits selected components of one vector into a shorter result vector. Given a vector a and a control vector z, the instruction essentially executes the code

    j ← 1
    for i = 1, 2, ..., N
        if z_i = 1 then { c_j ← a_i ; j ← j+1 }

The result vector c has length Σ z_i (= bitsum(z)). The merge instruction "shuffles" two vectors into a third, according to the control vector:

    length(c) = length(z) = length(a) + length(b)
    j ← 1 ; k ← 1
    for i = 1, 2, ..., length(z)
        if z_i = 1 then { c_i ← a_j ; j ← j+1 }
                  else { c_i ← b_k ; k ← k+1 }

In the table below, M1 = length(z), M2 = length(c), and R = the number of input items broadcast from a register.

    instruction    operation                 time
    BC             compress a → c per z      M1/2 + M2/2 + 92
    BD             merge a,b → c per z       3M2 - 2R + 123

Execution times for these two instructions may vary by as much as ±20%.
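The pseudocode above translates directly into executable form; a minimal Python model of the two data-manipulation instructions (list-based, illustration only):

```python
def compress(a, z):
    """STAR BC: c_j <- a_i for each i with z_i = 1, in order."""
    return [ai for ai, zi in zip(a, z) if zi == 1]

def merge(a, b, z):
    """STAR BD: one result component per control bit, taking the
    next component of a where z_i = 1 and of b where z_i = 0."""
    ita, itb = iter(a), iter(b)
    return [next(ita) if zi == 1 else next(itb) for zi in z]
```

For example, `compress([10, 20, 30, 40], [1, 0, 1, 0])` yields `[10, 30]`, and `merge([1, 2], [9, 8], [1, 0, 0, 1])` yields `[1, 9, 8, 2]` — the interleaving used to reassemble x in odd-even reduction.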
APPENDIX B

SUMMARY OF ALGORITHMS AND TIMINGS

We collect the algorithms discussed in the main text and present timing formulae in general terms for a vector computer. This is then specialized to block tridiagonal systems with 2 × 2 blocks, using timing information for the CDC STAR-100. A comparison of methods based on the latter set of timings is given in Section 10.C. It must be emphasized that our time estimates have not been verified by direct testing on STAR.

The algorithms solve Ax = v, A = (a_j, b_j, c_j)_N, with blocks of uniform size n × n. The basic block operations and their costs are (S = add or subtract, M = multiply, D = divide):

1. factor an n × n block b
       cost = (1/2)(n² - n)D + (1/6)(2n³ - 3n² + n)(M + S)
2. solve bg = f, where b has been factored and f is n × n'
       cost = n'(nD + (n² - n)(M + S))
3. product of n × n and n × n' blocks
       cost = n'(n²M + (n² - n)S)
4. difference of two n × n' blocks
       cost = n'(nS)

Depending on context, the symbols S, M, D can mean either a single scalar operation or a vector operation with vector length specified by the algorithm. For each algorithm we give a code in terms of blocks of A, since the translation to individual scalar or vector operations is generally obvious.
In those cases where additional detail seems to be necessary, it is provided. Two basic sets of timings are given, five formulae in all for each algorithm:

1. counting arithmetic operations only, assuming vectors can be any sequence of storage locations in arithmetic progression, and ignoring instruction overlap, pivoting, loop control, termination tests for iterations, address calculation, load/store costs for scalar operands, and any other forms of storage management:
   (a) for systems with n × n blocks, in general terms for a vector computer
   (b) for systems with n × n blocks, with STAR data
   (c) for systems with 2 × 2 blocks, with STAR data

2. a more realistic model of STAR, requiring vectors to be in contiguous storage locations, allowing overlap of scalar operations to some extent, and counting loop control and load/store costs, but still ignoring pivoting, termination tests for iterations, and address calculations:
   (a) for systems with n × n blocks, with STAR data
   (b) for systems with 2 × 2 blocks, with STAR data.

Timings 1(c) and 2(b) are used in Section 10.C. Additional notes are provided in many cases.

A few remarks on the timings are necessary. Since we eventually want estimates for systems with 2 × 2 blocks, the only loop control that will be counted is that on the main loops; operations on n × n matrices or n-vectors are assumed to be expanded into straightline code. The address calculations that have been ignored would be executed in scalar mode, and can sometimes be completely overlapped with arithmetic operations. For the odd-even methods their effect would be to increase the O(m) terms, m = ⌈log₂N⌉, but since the O(m) terms consist of vector startup costs and are already large, this probably represents only a small relative increase. In the LU factorization the address calculations could destroy the efficiency of the scalar code. While the figures are therefore underestimates, we believe that our basic conclusions would not be altered by including the address calculations.
1. Block LU factorization, using mostly scalar operations

Algorithm:

    d_1 = b_1 ; f_1 = v_1
    for i = 2, ..., N
        solve d_{i-1}(u_{i-1}  g_{i-1}) = (c_{i-1}  f_{i-1})
        d_i = b_i - a_i u_{i-1}
        f_i = v_i - a_i g_{i-1}
    solve d_N x_N = f_N
    for i = N-1, ..., 1
        x_i = g_i - u_i x_{i+1}

Storage handling: if desired we can overwrite d_i → b_i, u_i → c_i, and x_i → g_i → f_i → v_i; the d_i, u_i and g_i are saved for subsequent solutions. Variations of the basic algorithm either save ℓ_i = a_i d_{i-1}⁻¹ or use the factorization A = (ℓ_j, I, 0)(0, d_j, 0)(0, I, u_j), with a corresponding increase in execution time due to load/store handling.

Version 1, minimum runtime. Per step, the factorization and back substitution loops together require 4n² + 2n loads and n² + 2n stores; a one-time initialization needs n² + n loads and n stores. At 19 cycles per load and 16 per store,

    total time ≥ {19(4n² + 2n) + 16(n² + 2n)}(N-1) + 19(n² + n) + 16n
               = (92n² + 70n)(N-1) + 19n² + 35n ;

for n = 2, total time ≥ 508N - 362.

Version 2, factor + solve, using the factorization A = (ℓ_j, I, 0)(0, d_j, 0)(0, I, u_j) for A.

Version 2a. The factorization loop stores d_j only, and computes ℓ_j = a_j d_{j-1}⁻¹ and u_j = d_j⁻¹ c_j by vector operations, vector length = N. The factorization requires n²(3N - 2) loads, so

    total time ≥ 19n²(3N-2) + 16n²N + (1/2)(5n² - n)D + (1/6)(14n³ - 15n² + n)(M+S),

where D = τ_D N + σ_D, etc.; for n = 2, total time ≥ 323.5N + 3421.

Version 2b. The factorization loop stores d_i and u_i, computed by vector operations, vector length = N:

    total time ≥ 19n²(3N-2) + 16n²(2N-1) + (1/2)(3n² - n)D + (1/6)(8n³ - 9n² + n)(M+S);

for n = 2, total time ≥ 373.5N + 1769.

Version 2c. The factorization loop stores ℓ_i, d_i and u_i:

    total time ≥ 19n²(3N-2) + 16n²(3N-2);  for n = 2, total time > 420N - 280.

Solution of Ax = v, given the factorization of A. The forward elimination loop loads ℓ_i, v_i and stores f_i, with v_1 loaded initially; solve d_j g_j = f_j by vector operations, vector length = N. The back substitution loads x_N initially, and its loop loads u_i, g_i and stores x_i:

    total time ≥ 19(2n²(N-1) + 2nN) + 16(2nN) + (1/2)(n² + n)D + (1/6)(2n³ + 3n² - 5n)(M+S);

for n = 2, total time ≥ 302.5N + 1039.

Timings:

1. Version 1, for the initial comparison: arithmetic operations without overlap, load/stores ignored.
   (a) {(1/2)(3n² + n)t_D + (1/6)(14n³ + 9n² - 5n)(t_M + t_S)}N - {n²t_D + (2n³ + n²)(t_M + t_S)}
   (b) with STAR data: (70n³ + 114n² - 2n)N - (60n³ + 76n²)
   (c) for n = 2, with STAR data: 1012N - 784.

2. Version 1, for the second comparison, allowing overlap and including load/stores. In the absence of facilities for direct testing on STAR, a scalar code was written for the special case n = 2 and time estimates were computed from that code. The details are tedious, and well illustrated in [L2], so they will not be given here:

    factorization:       initialize (loads + loop set up)   200 cycles
                         main loop                           400(N-1)
    back substitution:   initialize                          250
                         loop                                200(N-1)

    total time estimated to be 600N - 150 cycles.

This timing is probably optimistic, due to unforeseen memory conflicts.

Notes:

1. Gaussian elimination taking advantage of the block structure, using scalar operations, requires the same execution time as the block LU factorization.

2. If u_j, 1 ≤ j ≤ N-1, is saved in the factorization, then ||B[A_N]|| may be computed following the first loop at a cost of ((n-1)τ₊ + τ_max)(nN) + ((n-1)σ₊ + σ_max), using the vector maximum instruction on STAR. This yields 13N + 166 for n = 2.
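A direct transcription of the recurrences above, specialized to 1 × 1 blocks so that scalar division stands in for the block solves (illustration only; no pivoting, and the function name is ours):

```python
def lu_tridiag_solve(a, b, c, v):
    """The block LU recurrence of Algorithm 1 with 1x1 blocks.
    a[i], b[i], c[i]: sub-, main-, super-diagonal (a[0] unused)."""
    N = len(b)
    u = [0.0] * N
    g = [0.0] * N
    d, f = b[0], v[0]                 # d_1 = b_1, f_1 = v_1
    for i in range(1, N):
        u[i - 1] = c[i - 1] / d       # solve d_{i-1} u_{i-1} = c_{i-1}
        g[i - 1] = f / d              # solve d_{i-1} g_{i-1} = f_{i-1}
        d = b[i] - a[i] * u[i - 1]
        f = v[i] - a[i] * g[i - 1]
    x = [0.0] * N
    x[N - 1] = f / d                  # solve d_N x_N = f_N
    for i in range(N - 2, -1, -1):    # back substitution
        x[i] = g[i] - u[i] * x[i + 1]
    return x
```

For the model problem (-1, 2, -1) with v = (0, 0, 0, 5) this returns x = (1, 2, 3, 4).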
2. Odd-even elimination, using vector operations solely

We give the algorithm and timing for N = 2^m - 1; the timing for other values of N is approximated by setting m = ⌈log₂N⌉. The timing formulae are derived in terms of N and m.

Algorithm:  a_j^(1) = a_j, etc.

    for i = 1, 2, ..., m   (r = 2^(i-1))                                  vector lengths
        for each j = 1, 2, ..., N
            solve b_j^(i)(α_j β_j φ_j) = (a_j^(i) c_j^(i) v_j^(i))             N
              (take α_j = β_j = φ_j = 0 if j < 1 or j > N)
            b_j^(i+1) = b_j^(i) - a_j^(i) β_{j-r} - c_j^(i) α_{j+r}            N, N-r, N-r
            v_j^(i+1) = v_j^(i) - a_j^(i) φ_{j-r} - c_j^(i) φ_{j+r}            N-r
            a_j^(i+1) = -a_j^(i) α_{j-r}                                       N-2r
            c_j^(i+1) = -c_j^(i) β_{j+r}                                       N-2r
    for each j = 1, 2, ..., N
        solve b_j^(m+1) x_j = v_j^(m+1)

Storage handling: temporary vectors are needed for α_j, β_j, φ_j and for the factorization of b_j^(i) computed at each stage, and we can overwrite b_j^(i+1) → b_j^(i), a_j^(i+1) → a_j^(i), etc.

Timing:

1. arithmetic only
   (a) {(1/2)(5n² + n)τ_D + (1/6)(38n³ + 3n² - 5n)τ_M + (1/6)(38n³ - 9n² - 5n)τ_S}Nm
     + {(1/2)(5n² + n)σ_D + (1/6)(38n³ + 3n² - 5n)σ_M + (1/6)(38n³ - 9n² - 5n)σ_S}m
     - {(1/2)(3n² - n)τ_D + (1/6)(46n³ - 3n² + 5n)τ_M + (1/6)(46n³ - 27n² + 5n)τ_S}N
     + {(2n³)τ_M + (2n³ - 2n²)τ_S + (1/2)(n² + n)σ_D + (1/6)(-10n³ + 3n² - 5n)σ_M + (1/6)(-10n³ + 15n² - 5n)σ_S}
   (b) with STAR data:
       (1/4)(38n³ + 19n² - n)Nm + (1/6)(8740n³ + 2343n² - 649n)m - (1/4)(46n³ + n² + n)N - (1/6)(2282n³ - 2037n² + 649n)
   (c) for n = 2, with STAR data: 94.5Nm + 12999m - 93.5N - 1901.

2. With the overwriting a_j^(i+1) → a_j^(i), c_j^(i+1) → c_j^(i), it is necessary to develop the products one row at a time in a temporary vector and then use a transmit operation on the whole row; for specifics see the discussion of odd-even reduction to follow. The added cost for loop control and transmit instructions is

    40m + Σ_{i=1}^{m-1} 2n((1/2)n(N - 2^i) + 91) = 40m + n²(N(m-2) + 1) + 182n(m-1),

or 4Nm + 404m - 8N - 360 for n = 2.
   (a) with STAR data, the total time is
       (1/4)(38n³ + 23n² - n)Nm + (1/6)(8740n³ + 2343n² + 443n + 240)m - (1/4)(46n³ + 9n² + n)N - (1/6)(2282n³ - 2043n² + 1741n)
   (b) for n = 2, with STAR data, the total time is 98.5Nm + 13403m - 101.5N - 2261.

Notes:

1. ||B[A]|| may be computed in the first time through the loop at a cost of ((2n-1)τ₊ + τ_max)(nN) + ((2n-1)σ₊ + σ_max); this yields 15N + 308 for n = 2 on STAR. ||B[H_i]|| may be computed at the appropriate time through the loop at a cost of 15(N - 2^(m-i)) + 308.
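A 1 × 1-block model of the sweep above (Python, illustration only; each sweep with stride r pushes the couplings to distance 2r, so after m sweeps the system is diagonal):

```python
import math

def _nbr(xs, j):
    # neighbor access with the convention that out-of-range terms vanish
    return xs[j] if 0 <= j < len(xs) else 0.0

def odd_even_elimination(a, b, c, v):
    """Scalar model of odd-even elimination (OEE).  a, b, c hold the
    sub-, main- and super-diagonal entry of each row (a[0] = c[N-1] = 0)."""
    N = len(b)
    m = math.ceil(math.log2(N + 1))
    r = 1
    for _ in range(m):
        al = [a[j] / b[j] for j in range(N)]
        be = [c[j] / b[j] for j in range(N)]
        ph = [v[j] / b[j] for j in range(N)]
        # all right-hand sides use the old a, b, c, v
        a, b, c, v = (
            [-a[j] * _nbr(al, j - r) for j in range(N)],
            [b[j] - a[j] * _nbr(be, j - r) - c[j] * _nbr(al, j + r) for j in range(N)],
            [-c[j] * _nbr(be, j + r) for j in range(N)],
            [v[j] - a[j] * _nbr(ph, j - r) - c[j] * _nbr(ph, j + r) for j in range(N)],
        )
        r *= 2
    return [v[j] / b[j] for j in range(N)]  # final diagonal solve
```

On the model problem (-1, 2, -1), v = (0, 0, 0, 5), this returns x = (1, 2, 3, 4).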
3. Odd-even reduction, using vector operations

As for odd-even elimination, the basic algorithm and timing formulae are set up for N = 2^m - 1, using m = ⌈log₂N⌉ in the timing; for general N, m = ⌈log₂(N+1)⌉ is actually the proper choice. The storage handling for the algorithm is described in Section 4.D, and the timings are derived using the data layout of Section 10.B, even though the data manipulation operations are not counted in separate detailed form. The compress operations take the control vector z = (1 0 1 0 ...), n²N* bits, whose generation costs but one instruction on STAR; the 8 compress operations at stage i cost (1/2)(3n² + n)N_{i+1}* + 736.

Algorithm: set A^(1) = A, w^(1) = v; N_i = 2^(m+1-i) - 1, N_i* = 2^(m+1-i).

    for i = 1, 2, ..., m-1
        factor b_{2j+1}^(i),  0 ≤ j ≤ N_{i+1}
            (overwrites b_{2j+1}^(i); vector lengths N_{i+1}*;
             cost = (1/2)(n² - n)D + (1/6)(2n³ - 3n² + n)(M + S))
        solve b_{2j+1}^(i)(α_{2j+1} β_{2j+1} φ_{2j+1}) = (a_{2j+1}^(i) c_{2j+1}^(i) w_{2j+1}^(i)),  0 ≤ j ≤ N_{i+1}
            (need a temporary vector; vector lengths N_{i+1}*;
             cost = (2n + 1)(nD + (n² - n)(M + S)))
        b_j^(i+1) = b_{2j}^(i) - a_{2j}^(i) β_{2j-1} - c_{2j}^(i) α_{2j+1},  1 ≤ j ≤ N_{i+1}
            (overwrites b_{2j}^(i); need a temporary vector, length N_{i+1}; cost = 2n³(M + S))
        w_j^(i+1) = w_{2j}^(i) - a_{2j}^(i) φ_{2j-1} - c_{2j}^(i) φ_{2j+1}
            (overwrites w_{2j}^(i); need a temporary vector; cost = 2n²(M + S);
             for simplicity, take vector lengths = N_{i+1}*)
        a_j^(i+1) = -a_{2j}^(i) α_{2j-1},  2 ≤ j ≤ N_{i+1}
        c_j^(i+1) = -c_{2j}^(i) β_{2j+1},  1 ≤ j ≤ N_{i+1} - 1
            (each product is developed one scalar row at a time, k = 1, ..., n, ℓ = 1, ..., n,
             in a temporary vector t_{ℓj}, and placed with n transmits;
             cost = n³M + (n³ - n²)S plus n transmits; for simplicity, take vector lengths = N_{i+1}*)
    solve b_1^(m) x_1^(m) = w_1^(m)  in scalar mode,
        cost = (1/2)(n² + n)D + (1/6)(2n³ + 3n² - 5n)(M + S)
    for i = m-1, ..., 1
        x_{2j+1}^(i) = φ_{2j+1} - α_{2j+1} x_j^(i+1) - β_{2j+1} x_{j+1}^(i+1),  0 ≤ j ≤ N_{i+1}
            (x_j^(i+1) → temporary vector: cost = transmit, vector length nN_{i+1}*;
             arithmetic cost = 2n²(M + S))
        merge x_{2j+1}^(i) and the temporary vector into x^(i): cost = 3nN_i* + 123

Timings:

1. arithmetic only
   (a) {(1/2)(5n² + n)τ_D + (1/6)(38n³ + 15n² - 5n)τ_M + (1/6)(38n³ + 3n² - 5n)τ_S}(N-1)
     + {(1/2)(5n² + n)σ_D + (1/6)(38n³ + 15n² - 5n)σ_M + (1/6)(38n³ + 3n² - 5n)σ_S}(m-1)
     + {(1/2)(n² + n)t_D + (1/6)(2n³ + 3n² - 5n)(t_M + t_S)}
   (b) with STAR data:
       (1/4)(38n³ + 31n² - n)(N-1) + (1/6)(8740n³ + 5103n² - 649n)(m-1) + (10n³ + 38n² - 2n)
   (c) for n = 2, with STAR data: 106.5N + 14839m - 14717.5.

2. with loop control and storage manipulation (compress, merge, transmit)
   (a) for general n, add 80m + (1/4)(55n² + 43n)(N-1) + (182n + 950)(m-1)
   (b) for n = 2, add 76.5N + 1394m - 1390.5, so that the total time = 183N + 16233m - 16108.

Notes:

1. An estimate of ||B[A]|| may be computed in the first time through the loop at a cost of ((2n-1)τ₊ + τ_max)(Nn/2) + ((2n-1)σ₊ + σ_max); this yields 7.5N + 308 for n = 2 on STAR. The maximum row sum is taken over every other block row only; the true value of ||B[A]|| could be found by computing the remaining rows of B[A], but this is more expensive and simply defeats the purpose of the estimate.
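The structure of the reduction — eliminate the odd-indexed unknowns, solve the half-size system, back substitute — is easiest to see in a recursive 1 × 1-block sketch (Python, illustration only, for N = 2^m - 1):

```python
def cyclic_reduction(a, b, c, v):
    """Scalar odd-even (cyclic) reduction for N = 2**m - 1.
    a, b, c: sub-, main- and super-diagonal entries per row
    (a[0] = c[N-1] = 0); returns the solution x."""
    N = len(b)
    if N == 1:
        return [v[0] / b[0]]
    # alpha/beta/phi play the role of the solves against the odd pivots
    al = [a[j] / b[j] for j in range(N)]
    be = [c[j] / b[j] for j in range(N)]
    ph = [v[j] / b[j] for j in range(N)]
    idx = range(1, N, 2)              # rows kept for the half-size system
    a2 = [-a[j] * al[j - 1] for j in idx]
    c2 = [-c[j] * be[j + 1] for j in idx]
    b2 = [b[j] - a[j] * be[j - 1] - c[j] * al[j + 1] for j in idx]
    v2 = [v[j] - a[j] * ph[j - 1] - c[j] * ph[j + 1] for j in idx]
    y = cyclic_reduction(a2, b2, c2, v2)
    x = [0.0] * N
    for t, j in enumerate(idx):
        x[j] = y[t]
    for j in range(0, N, 2):          # back substitute the eliminated rows
        xm = x[j - 1] if j - 1 >= 0 else 0.0
        xp = x[j + 1] if j + 1 < N else 0.0
        x[j] = ph[j] - al[j] * xm - be[j] * xp
    return x
```

On the model problem (-1, 2, -1) with N = 7 and v = (0, ..., 0, 8) this returns x = (1, 2, ..., 7).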
4. p-fold reduction, p ≥ 3

The algorithm and timings are given for N = p^m - 1, p ≥ 3; for general N, take m = ⌈log_p N⌉, though m = ⌈log_p(N+1)⌉ is actually the proper choice. Note that m ≈ ⌈log₂N⌉ / log₂p.

Algorithm (see Section 6.A for details); N_i = p^(m+1-i) - 1:

    for i = 1, ..., m-1
        for each k = 1, ..., N_{i+1} + 1:  LU process
        for each k = 1, ..., N_{i+1}:      UL process
        generate A^(i+1), w^(i+1)
    solve b_1^(m) x_1^(m) = w_1^(m)
    for i = m-1, ..., 1
        for each k = 1, ..., N_{i+1} + 1:  back substitute for x_{kp-j}, 1 ≤ j ≤ p-1

Timing:

1. (a) for general n:
       {(5n² + n)τ_D + (1/3)(26n³ + 3n² - 5n)τ_M + (1/3)(26n³ - 3n² - 5n)τ_S}(N - p + 1)
     + {(5n² + n)σ_D + (1/3)(26n³ + 3n² - 5n)σ_M + (1/3)(26n³ - 3n² - 5n)σ_S}(m-1)(p-1)
     + {(1/2)(n² + n)t_D + (1/6)(2n³ + 3n² - 5n)(t_M + t_S)}
   (b) with STAR data:
       (1/2)(26n³ + 21n² - n)(N - p + 1) + (1/3)(5980n³ + 2769n² - 649n)(m-1)(p-1) + (10n³ + 38n² - 2n)
   (c) using STAR data, n = 2:
       145N + (19206m - 19351)(p-1) + 228 = 145N + 19206m(p-1) - (19351p - 19579),
       or, with m ≈ ⌈log₂N⌉ / log₂p,  145N + 19206(p-1)⌈log₂N⌉ / log₂p - (19351p - 19579).

2. Data manipulation costs are outlined in Section 10.B. We have not derived the complete timing formulae, as the method is expected to be too slow.

Notes:

1. In the arithmetic timing, for p ≥ 3, p-fold reduction is always cheaper than (p+1)-fold reduction, for any n. Comparison of 2-fold (odd-even) and 3-fold reduction yields an N(n) such that if N ≥ N(n) then odd-even reduction is faster. Although we have not determined N(n), similar results should hold when overhead costs are included in the time estimates, and the block LU factorization will probably be faster than either reduction for N < N(n).
5. Mixed odd-even reduction and LU factorization

Algorithm: do k odd-even reductions, solve A^(k+1) x^(k+1) = w^(k+1) by the LU factorization, and back substitute.

Timing (based on N = 2^m - 1):

1. (a) {(1/2)(5n² + n)τ_D + (1/6)(38n³ + 15n² - 5n)τ_M + (1/6)(38n³ + 3n² - 5n)τ_S}(N + 1 - 2^(m-k))
     + {(1/2)(5n² + n)σ_D + (1/6)(38n³ + 15n² - 5n)σ_M + (1/6)(38n³ + 3n² - 5n)σ_S}k
     + {(1/2)(3n² + n)t_D + (1/6)(14n³ + 9n² - 5n)(t_M + t_S)}(2^(m-k) - 1)
     - {n²t_D + (2n³ + n²)(t_M + t_S)}
   (b) The timing difference between odd-even reduction and the mixed algorithm will be an expression of the form λ_n(m-k) - μ_n(2^(m-k)) - ν_n, and k should be chosen to maximize this expression.
   (c) For n = 2 using STAR data, we have λ_n = 14839, μ_n = 905.5, ν_n = 13028, k = m-5, yielding 106.5N + 14839m - 46908.5.

2. For n = 2 and STAR data, the polyalgorithm time is 183N + 16233k + 417(2^(m-k)) + 33, and the optimal choice is k = m-6, yielding 183N + 16233m - 70677.
6. Incomplete reduction

Algorithm: do k odd-even reductions, solve D[A^(k+1)] y^(k+1) = w^(k+1), and back substitute to obtain y = y^(1) ≈ x = x^(1).

Timing (based on N = 2^m - 1):

1. (a) {(1/2)(5n² + n)τ_D + (1/6)(38n³ + 15n² - 5n)τ_M + (1/6)(38n³ + 3n² - 5n)τ_S}(N + 1 - 2^(m-k))
     + {(1/2)(5n² + n)σ_D + (1/6)(38n³ + 15n² - 5n)σ_M + (1/6)(38n³ + 3n² - 5n)σ_S}k
     + {(1/2)(n² + n)τ_D + (1/6)(2n³ + 3n² - 5n)(τ_M + τ_S)}(2^(m-k) - 1)
     + {(1/2)(n² + n)σ_D + (1/6)(2n³ + 3n² - 5n)(σ_M + σ_S)}
   (b) for general n, STAR data:
       (1/4)(38n³ + 31n² - n)N + (1/6)(8740n³ + 5103n² - 649n)k - (9n³ + 6n²)(2^(m-k) - 1) + (1/6)(460n³ + 1191n² - 649n)
   (c) using STAR data, n = 2:  106.5N + 14839k - 96(2^(m-k)) + 1287.

2. (a) for general n, STAR data:
       (1/4)(38n³ + 86n² + 42n)N + (1/6)(8740n³ + 5103n² + 443n + 5700)k - (1/4)(36n³ + 79n² + 51n)(2^(m-k) - 1) + (1/6)(460n³ + 1191n² - 1651n) + 80(k+1)
   (b) for n = 2, STAR data:  183N + 16233k - 176.5(2^(m-k)) + 1113.5.
7. Jacobi iteration and semi-iteration

Basic step (identical for both vector computer models):

    for each j = 1, ..., N
        solve b_j y_j^(i+1) = v_j - a_j y_{j-1}^(i) - c_j y_{j+1}^(i)

Timing:

1. (a) (1/2)(n² + n)(τ_D N + σ_D) + (1/6)(2n³ + 15n² - 5n)((τ_M + τ_S)N + (σ_M + σ_S))
1. (b) = 2. (a): using STAR data, (1/4)(2n³ + 19n² - n)N + (1/6)(460n³ + 3951n² - 649n)
1. (c) = 2. (b): for n = 2, STAR data: 22.5N + 3031.

Optional preprocessing (times for n = 2, STAR data):

(1) solve b_j(β_j γ_j φ_j) = (a_j c_j v_j), cost = 38.5N + 4367. The basic step is then
        y_j^(i+1) = φ_j - β_j y_{j-1}^(i) - γ_j y_{j+1}^(i),  cost = 12N + 1840.
(2) factor b_j, cost = 3.5N + 367. The basic step is unchanged from the original, but its cost becomes 19N + 2634.

Semi-iteration. Let ŷ^(m) be the result of a Jacobi step from y^(m); the basic step is

    y^(m+1) = ω_{m+1}(ŷ^(m) - y^(m-1)) + y^(m-1),   ω_{m+1} = 1/(1 - (ρ²/4)ω_m),   ρ = ρ(B[A]).

Timing, n = 2, using STAR data: 26.5N + 3408, or 16N + 2217 with the first preprocessing. This basic step is also used in the mixed incomplete reduction/Jacobi algorithm.

Notes:

1. Varga [V3, p. 139] shows that for Jacobi semi-iteration, with e^(i) = y^(i) - x and ω_b = 2/(1 + √(1 - ρ²)),

    ||e^(i)|| ≤ (2(ω_b - 1)^(i/2) / (1 + (ω_b - 1)^i)) ||e^(0)||.

A conservative estimate for the number of iterations needed to obtain ||e^(n)|| ≤ (1/N)||e^(0)|| is n = 2 log N / (-log(ω_b - 1)). It follows that the execution time for an iterative solution is O(N log N) vs. O(N) for the better direct methods.
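The basic step and the ω recurrence above can be sketched directly (Python, 1 × 1 blocks, illustration only; function names are ours, and ρ is assumed known):

```python
def jacobi_step(a, b, c, v, y):
    # one block Jacobi step: solve b_j y_j = v_j - a_j y_{j-1} - c_j y_{j+1}
    N = len(b)
    g = lambda j: y[j] if 0 <= j < N else 0.0
    return [(v[j] - a[j] * g(j - 1) - c[j] * g(j + 1)) / b[j] for j in range(N)]

def jacobi_si(a, b, c, v, rho, iters):
    """Jacobi semi-iteration with omega_1 = 1, omega_2 = 2/(2 - rho**2),
    omega_{m+1} = 1/(1 - (rho**2/4) omega_m); rho = rho(B[A])."""
    N = len(b)
    y_prev = [0.0] * N
    y = jacobi_step(a, b, c, v, y_prev)      # first step, omega_1 = 1
    w = 2.0 / (2.0 - rho * rho)
    for _ in range(iters - 1):
        yh = jacobi_step(a, b, c, v, y)
        y, y_prev = [w * (yh[j] - y_prev[j]) + y_prev[j] for j in range(N)], y
        w = 1.0 / (1.0 - 0.25 * rho * rho * w)
    return y
```

For the model problem (-1, 2, -1) with N = 3, ρ = cos(π/4), a few dozen iterations reproduce the direct solution to well below the 1/N target accuracy.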
8. Incomplete reduction plus Jacobi iterations

Algorithm: do k odd-even reductions; solve D[A^(k+1)] y^(k+1) = w^(k+1); do ℓ Jacobi iterations to improve y^(k+1); and back substitute.

As a result of solving D[A^(k+1)] y^(k+1) = w^(k+1) we already have the factors of D[A^(k+1)]. The cost is therefore (using STAR data, n = 2):

1. (c)  106.5N + 14839k - 96(2^(m-k)) + 1287 + ℓ(19(2^(m-k)) + 2615)
2. (b)  183N + 16233k - 176.5(2^(m-k)) + 1113.5 + ℓ(19(2^(m-k)) + 2615).
REFERENCES

[B1] I. Babuška, "Numerical stability in mathematical analysis," IFIP Congress 1968, North-Holland, Amsterdam, pp. 11-23.

[B2] I. Babuška, Numerical stability in the solution of tridiagonal matrices, Rept. BN-609, Inst. for Fluid Dynamics and Appl. Math., Univ. of Md., 1969.

[B3] I. Babuška, "Numerical stability in problems of linear algebra," SIAM J. Numer. Anal., vol. 9, 1972, pp. 53-77.

[B4] D. W. Bailey and D. E. Crabtree, "Bounds for determinants," Lin. Alg. Appl., vol. 2, 1969, pp. 303-309.

[B5] R. E. Bank, "Marching algorithms and block Gaussian elimination," in [B16, pp. 293-307].

[B6] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, R. A. Stokes, "The Illiac IV computer," IEEE Trans. on Comp., vol. C-17, 1968, pp. 746-757.

[B7] G. M. Baudet, Asynchronous iterative methods for multiprocessors, Dept. of Computer Science, Carnegie-Mellon Univ., Nov. 1976.

[B8] F. L. Bauer, "Der Newton-Prozess als quadratisch konvergente Abkürzung des allgemeinen linearen stationären Iterationsverfahrens 1. Ordnung (Wittmeyer-Prozess)," Z. Angew. Math. Mech., vol. 35, 1955, pp. 469-470.

[B9] A. Benson and D. J. Evans, "The successive peripheral block overrelaxation method," J. Inst. Maths. Applics., vol. 9, 1972, pp. 68-79.

[B10] A. Benson and D. J. Evans, "Successive peripheral overrelaxation and other block methods," J. Comp. Phys., vol. 21, 1976, pp. 1-19.

[B11] G. Birkhoff and A. George, "Elimination by nested dissection," in [T1, pp. 221-269].

[B12] W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, D. L. Slotnick, "The Illiac IV system," Proc. IEEE, vol. 60, 1972, pp. 369-388.

[B13] A. Brandt, Multi-level adaptive techniques I. The multi-grid method, Rept. RC 6026, IBM T. J. Watson Res. Cen., Yorktown Heights, N.Y., 1976.

[B14] J. L. Brenner, "A bound for a determinant with dominant main diagonal," Proc. AMS, vol. 5, 1954, pp. 631-634.

[B15] C. G. Broyden, "Some condition number bounds for the Gaussian elimination process," J. Inst. Maths. Applics., vol. 12, 1973, pp. 273-286.

[B16] J. R. Bunch and D. J. Rose, eds., Sparse Matrix Computations, Academic Press, N.Y., 1976.

[B17] B. L. Buzbee, Application of fast Poisson solvers to the numerical approximation of parabolic problems, Rept., Sci. Lab., 1972.
LA-4950-T,
Los Alamos
[BI8]
B.L.
Buzbee and A. Carasso,
bolic problems for preceding
pp. 237-266.
[BI9]
B.L.
Buzbee, F. W. Dorr, J. A. George and G. H. Golub, "The
direct solution of the discrete Poisson equation on irregular
regions," SIAM J. Numer. Anal., vol. 8, 1971, pp. 722-736.
[B20]
B.L.
Buzbee, G. H. Golub and C. W. Nielson, "On direct methods
for solving Poisson's equations,"
SlAM J. Numer. Anal., vol. 7,
1970, pp. 627-656.
[CI]
A. Carasso, "The abstract backward
Anal., vol. 2, 1971, pp. 193-212.
[C2 ]
Control Data Corporation,
CDC STAR-100 Instruction
Arden Hills, Minn., Jan. 1976; in conjunction with
made at Lawrence Livermore Laboratory.
[C3]
A.K.
[C4]
P. Concus and G. H. Golub, "Use of fast direct methods for the efficient solution of nonseparable
elliptic equations,"
SlAM J. Numer.
Anal., vol. I0, 1973, pp. 1103-1120.
[C5]
P. Concus, G. H. Golub and D. P. O'Leary, "A generalized conjugate
gradient method for the numerical solution of elliptic partial differential equations,"
in [BI6, pp. 309-332].
[C6]
S.D. Conte and C. de Boor,
Hill, N. Y., 1972.
[C7]
M.G.
Cox, "Curve fitting with piecewise polynomials,"
Maths. Applics., vol. 8, 1971, pp. 36-52.
[C8]
Cray Research,
[DI]
M.A.
Diamond and D. L. V. Ferreira, "On a cyclic reduction method
for the solution of Poisson's equations," SlAM J. Numer. Anal.,
vol. 13, 1976, pp. 54-70.
[D2]
F.W.
Dorr, "The direct solution of the discrete Poisson equation
on a rectangle,"
SlAM Review, vol. 12, 1970, pp. 248-263.
Cline,
personal
Inc.,
"On the numerical computation
of paratimes," Math. Comp., vol. 27, 1973,
beam equation,"
communication,
"Cray-I
SlAM J. Math.
execution times,
measurements
1974.
Elementary
Computer,"
Numerical
Chippewa
Analysis,
Falls,
McGraw-
J. Inst.
Wis.,
1975.
159
[D3]
F.W.
Dorr, "An example of ill-conditioning
in the numerical solution of singular perturbation problems," Math. Comp., vol. 25,
1971, pp. 271-283.
[El]
S.C. Eisenstat, M. H. Schultz and A. H. Sherman, "Applications
of
an element model for Gaussian elimination,"
in [BI6, pp. 85-96].
[E2]
D.J.
Evans, "A new iterative method for the solution of sparse
systems of linear difference equations,"
in [R6, pp. 89-100].
[E3]
R.K.
Even and Y. Wallach, "On the direct solution of Dirichlet's
problem in two dimensions,"
Computing, vol. 5, 1970, pp. 45-56.
[FI]
D.C.
Feingold and R. S. Varga, "Block
and generalizations
of the Gerschgorin
Math., vol. 12, 1962, pp. 1241-1250.
[F2]
D. Fischer, G. Golub, O. Hald, C. Leiva and O. Widlund, "On
Fourier-Toeplitz
methods for separable elliptic problems," Math.
Comp., vol. 28, 1974, pp. 349-368.
IF3]
G. von Fuchs, J. R. Roy and E. Schrem, "Hypermatrix solution
large sets of symmetric positive definite linear equations,"
Math. in Appl. Mech. and Engng., vol. I, 1972, pp. 197-216.
of
Comp.
[GI]
J.A. George, "Nested
SIAM J. Numer. Anal.,
mesh,"
[HI]
L.A.
Hageman and R. S. Varga, "Block iterative methods for cyclically reduced matrix equations," Numer. Math., vol. 6, 1964, pp. 106119.
[H2]
D.E.
Heller, "Some aspects of the cyclic reduction algorithm
block tridiagonal
linear systems," SlAM J. Numer. Anal., vol.
1976, pp. 484-496.
[H3]
D.E.
Heller, A survey of parallel algorithms
algebra, Dept. of Comp. Sci., Carnegie-Mellon
Review, to appear.
[H4]
D.E.
Heller, D. K. Stevenson and J. F. Traub, "Accelerated
iterative methods for the solution of tridiagonal
systems on parallel
computers,"
J.ACM, vol. 23, 1976, pp. 636-654.
[H5]
R.W.
Hockney, "A fast direct solution of Poisson's equation
Fourier analysis," J.ACM, vol. 12, 1965, pp. 95- 113.
[H6]
R.W.
Hockney, "The potential calculation and some applications,"
Methods in Computational
Physics, vol. 9, 1970, pp. 135-211.
[H7 ]
E. Horowitz,
perturbation
pp. 816-825.
diagonally dominant matrices
circle theorem," Pacific J.
dissection of a regular finite
vol. I0, 1973, pp. 345-363.
element
in numerical
Univ., 1976;
for
13,
linear
SlAM
using
"The application of symbolic mathematics
to a singular
problem," Proc. ACM Annual Conf., 1972, vol. II,
160
[H8]
A.S.
Householder,
Blaisdell, N. Y.,
The Theory
1964.
[Jl]
T.L.
Jordan, A new parallel algorithm for diagonally
tridiagonal matrices, Los Alamos Sci. Lab., 1974.
IJ2]
M.L.
Juncosa and T. W. Mullikin, "On the increase of convergence
rates of relaxation procedures for elliptic partial difference
equations,"
J.ACM, vol. 7, 1960, pp. 29-36.
[KI]
W. Kahan, Conserving confluence curbs ill-condition,
Comp. Sci., Univ. of Calif., Berkeley, 1972.
[K2]
H.T.
Kung, "On computing reciprocals
Math., vol. 22, 1974, pp. 341-348.
[K3]
H.T.
Kung and J. F. Traub, All algebraic
fast, Dept. of Comp. Sci., Carnegie-Mellon
ILl]
J.J. Lambiotte,
Jr., The solution of linear systems of equations
on a vector computer, Dissertation,
Univ. of Virginia, 1975.
[L2]
J.J. Lambiotte,
Jr. and R. G. Voigt, "The solution of tridiagonal
linear systems on the CDC STAR-100 computer, " ACM Trans. Math
Software, vol. I, 1975, pp. 308-329.
[MI]
N. Madsen and G. Rodrigue, A comparison of direct methods for tridiagonal systems on the CDC STAR-100, Lawrence Livermore Lab.:,
1975.
[M2]
M.A.
Malcolm and J. Palmer, "A fast direct method for solving
CACM
vol . 17 , 1974 , pp
class of tridiagonal
linear systems, " __,
17.
[M3]
W • Miller, "Software for roundoff analysis, " ACM Trans.
Software, vol. I, 1975, pp. 108-128.
IN1]
A.K.
Noor and S. J. Voigt, "Hypermatrix scheme for finite element
systems on CDC STAR-100 computer, " Computers and Structures, vol
5, 1975, pp. 287-296.
[01]
J.M. Ortega and R. G. Voigt, Solution of partial differentia]:
equations on vector computers, ICASE, Hamton, Va., 1977.
[02]
A.M.
Ostrowski, "_ber die Determinanten
mit _berwiegender
Hauptdiagonale," Comment. Math. Helv., vol. I0, 1937, pp. 69-96.
[PI]
G • Peters and J. H. Wilkinson, "On the stability of Gauss-Jordan
elimination with pivoting," CACM, vol. 18, 1975, pp. 20-24.
JR1]
J.K.
Reid, ed., Larse
Press, London, 1970.
Sparse
of Matrices
in Numerical
of power
Analysis,
dominant
Dept.
series,
of
" Numer.
functions can be computed
Univ., July 1976.
Sets of Linear
Equations,
a
14-
Math
Academic
161
JR2 ]
J.K. Reid, "On the method of conjugate gradients for the solution of large sparse systems of linear equations,"
in [RI,
pp. 231-254 ].
[R3]
F. Robert, "Blocks-H-matrices
tives classiques par blocks,"
pp. 223-265.
[R4]
D.J.
Rose and J. R. Bunch, "The role of partitioning
in the numerical solution of sparse systems," in JR6, pp. 177-187].
JR5]
D.J.
Rose and G. F. Whitten, "A recursive
strategies,"
in [BI6, pp. 59-83].
[R6]
D.J.
Rose and R. A. Willoughby,
eds., Sparse Matrices
Applications,
Plenum Press, N. Y., 1972.
JR7]
J.B.
Rosser, The direct solution of difference analogs of
Poisson's equation, rep. MRC-TSR-797,
Math. Res. Cen., Madison,
Wis., 1967.
[SI]
J. Schr_der and U. Trottenberg,
"Reduktionsverfahren
fur Differenzengleichungen
bei Randwertaufgaben
I," Numer. Math., vol. 22,
1973, pp. 37-68.
[Sla]
J. SchrSder, U. Trottenberg
and H. Reutersberg,
"Reduktionsverfahren f_r Differenzengleichungen
bei Randwertaufgaben
II," Numer.
Math., vol. 16, 1976, pp. 429-459.
[$2]
H.S.
Math.
[$3]
H.S.
Stone, ed., Introduction
to Computer Architecture,
Research Associates,
Palo Alto, Calif., 1975.
[$4]
G. Strang and G. J. Fix, An Analysis of the Finite
Prentice-Hail,
Englewood Cliffs, N. J., 1973.
[$5]
P.N.
et convergence des methods iteraLin. Alg. Appl., vol. 2, 1960,
Stone, "Parallel tridiagonal equation
Software, vol. I, 1975, pp. 289-307.
Swarztrauber,
"The direct
equation on the surface
1974, pp. 46-54.
solution
of a sphere,"
analysis
of dissection
solvers,"
and Their
ACM Trans.
Element
of the discrete
J. Comp.
Science
Phys.,
Method,
Poisson
vol.
15,
[$6]
P.N.
Swarztrauber,
"A direct method for the discrete solution
separable elliptic equations," SlAM J. Numer. Anal., vol. II,
1974, pp. 1136- 1150.
of
[$7]
P.N.
Swarztrauber and R. A. Sweet, "The direct solution of the
discrete Poisson equation on a disc," SlAM J. Numer. Anal., vol.
I0, 1973, pp. 900-907.
[$8]
P. Swarztrauber and R. Sweet, Efficient FORTRAN subprograms
for
the solution of elliptic partial differential
equations, Rept.
NCAR-TN/IA-109,
Nat. Cen. for Atmospheric
Res., Boulder, Col.,
1975.
162
[$9]
R.A.
tion
428.
Sweet, "Direct methods for the solution of Poisson's equaon a staggered grid," J. Comp. Phys., vol. 12, 1973, pp. 422-
[Sl0]
R.A.
Sweet, "A generalized
cyclic reduction
Numer. Anal., vol. II, 1974, pp. 506-520.
[SIll
R.A.
Sweet, "A cyclic reduction algorithm for block tridiagonal
systems of arbitrary dimension," contributed
paper, Symp. on
Sparse Matrix Computations,
Argonne Nat. Lab., Sept. 1975.
[TI]
J.F. Traub, ed., Complexity of Sequential
Algorithms,
Academic Press, N. Y., 1973.
IT2]
J.F. Traub, "Iterative solution of tridiagonal
or vector computers,"
in [TI, pp. 49-82].
IT3]
J.F.
Traub, ed., Analytic
Press, N. Y., 1976.
IV1]
J.M.
Varah, "On the solution of block tridiagonal
systems arising
from certain finite difference equations," Math. Comp., vol. 26,
1972, pp. 859-868.
IV2 ]
J.M.
Varah, "A comparison of some numerical methods for two-point
boundary value problems," Math. Comp., vol. 28, 1974, pp. 743-755.
IV3]
R.S.
Varga, Matrix Iterative
Cliffs, N. J., 1962.
IV4]
R.S.
Varga, "On recurring theorems on diagonal
Alg. Appl., vol. 13, 1976, pp. 1-9.
[WI]
W.J.
Watson, "The TI ASC, a highly modular and flexible supercomputer architecture,"
AFIPS Fall 1972, AFIPS Press, Montvale,
N. J., vol. 41, pt. I, pp. 221-229.
[W2]
O.B.
Widlund, "On the use of fast methods for separable finite
difference equations for the solution of general elliptic problems,"
in JR6, pp. 121-131].
[W3]
J.H. Wilkinson, "Error analysis of direct methods
sion," J.ACM, vol. 8, 1961, pp. 281-330.
[W4]
W.A.
Fall
777.
[yl]
D.Y.Y.
analytic
Computational
Analysis,
algorithm,"
and Parallel
systems
Complexity,
SlAM J.
Numerical
on parallel
Academic:
Prentice-Hall,
Englewood
dominance,"
of matrix
Lin.
inver-
Wulf and C. G. Bell, "C.mmp, a multi-mini-processor,"
AFIPS
1972, AFIPS Press, Montvale, N. J., vol. 41, pt. 2, pp. 765-
Yun, "Hensel meets Newton - algebraic
setting," in IT3, pp. 205-215].
constructions
in an