Semi-Discrete Matrix-Free Formulation of 3D Elastic Full Waveform Inversion Modeling
Stephen Moore, Devi Sudheer Chunduri, Sergiy Zhuk, Tigran Tchrakian,
Ewout van den Berg, Albert Akhriev, Alberto Costa Nogueira Junior, Andrew
Rawlinson, Lior Horesh
IBM Research
Abstract. Full waveform inversion (FWI) is an emerging subsurface imaging technique used to locate oil and gas bodies. The key challenges that hinder its adoption by industry are both algorithmic and computational in nature, including storage, communication, and processing of large-scale system matrices, which impose cardinal impediments upon computational scalability. In this work we present a complete matrix-free algorithmic formulation of a 3D elastic time domain spectral element solver for both the forward and adjoint wave-fields, as part of a greater cloud-based FWI framework. We discuss computational optimisation (SIMD vectorisation, use of Many Integrated Core architectures, etc.) and present scaling results for two HPC systems, namely an IBM Blue Gene/Q and an Intel-based system equipped with Xeon Phi coprocessors.
1 Introduction
In the field of exploration geophysics, Full Waveform Inversion (FWI) is regarded
as the state-of-the-art seismic imaging technique, the goal of which is to detect
and characterize subsurface geological structures such as ore minerals, hydrocarbons, geothermal reservoirs and aquifers. As the approach utilizes a comprehensive representation of the interaction between wave physics and subsurface
properties, it offers unique advantages in terms of generality, fidelity, complexity
and robustness, and is hence capable of imaging arbitrarily heterogeneous compressional and shear wave velocity profiles of the subsurface constituents [1]. As
a wavefield propagates through a medium of heterogeneous elastic properties,
waves are reflected off different material regions in the subsurface. Data acquisition is performed by recording the response of the subsurface to either active
(e.g. via airguns, waterguns, vibrator trucks, boomers, explosives) or passive excitation of the subsurface. Data are collected using an array of geophones deployed
according to a predefined survey design [2], where the data itself takes the form of seismic reflection traces (seismograms).
By creating a computer model of the surveyed region (with a choice of subsurface material properties, or ‘earth model’) and assuming that a governing
equation such as the acoustic or elastic wave equation is representative of the
physics, a ‘forward’ simulation of the wavefield subject to the same sources can
be performed. From this simulation, synthetically generated seismograms can
be created and the discrepancy between the experimental and computational
seismograms can be assessed. The level of misfit can be quantified in various
ways depending upon the assumptions taken regarding the noise characteristics.
The misfit functional (or noise model) is instrumental in directing the non-linear
optimization process of inversion in updating the material properties. Since the
inverse problem is by nature ill-posed, it is essential to incorporate a-priori information (e.g. structure, sparsity) as well as constraints (e.g. positivity) so as to constrain the solution space. Thus, the inversion’s objective is to find a feasible solution that minimizes a compromise between the aforementioned data misfit and a-priori information. At this point one may infer that the current a-posteriori subsurface model estimate is a sufficient approximation to reality, or
alternatively, devise means to quantify uncertainty. 3D and 4D formulations of the elastic FWI problem are notoriously large-scale, implying that only matrix-free inversion schemes are computationally tractable. From an inversion perspective, this means that both Jacobian and Hessian approximations can only be accessed implicitly, in a matrix-vector product fashion [3].
Of key computational importance in the update of the subsurface model is
the gradient of the misfit functional, which is computed through a combination
of a forward wavefield and an ‘adjoint’ wavefield [4,5]. The former is computed by marching forward in time and the latter by marching
backward in time. Some of the important computational bottlenecks are associated with storage, communication, and processing of large-scale system matrices.
To alleviate these problems, in this study, a complete matrix-free formulation
is introduced. Specifically, we propose a hybrid optimize/discretize strategy for
subsurface model (Lamé parameters) estimation in the displacement-stress formulation of the 3D elastic wave equation. Our formulation relies upon a high-order spectral element discretization of the wave equation, which yields a system of ODEs. The latter, the so-called Spectral Element Model (SEM), is continuous
in time and defines the dynamics of the projection coefficients representing the
solution of the elastic wave equation in a polynomial basis. We further introduce
a derivation of the adjoint equations for the SEM, which enables the incorporation of a generic data misfit functional measuring, continuously in time, the discrepancy between the simulated wavefield and the observed data.
The key contributions of the study are a scalable computational strategy and
a novel approach for matrix-free evaluation of the adjoint wavefield using a hybrid distributed-shared memory implementation. The formulation enables computation
of the adjoint wavefield, and consequently the gradient (or a proximal function thereof) of the objective function, while avoiding storage of the forward displacement field in memory. This algorithmic setup grants scalability in both weak
and strong scaling tests. This solver is one component of a greater effort aimed
at the development of a cloud-based FWI framework.
2 Methods
The elastic wave equation [1] is a statement of Newton’s second law of motion
applied to a continuous solid material and is defined as
$$\rho\,\partial_{tt}u_i = \partial_{x_j}\sigma_{ij} + f_i \qquad (1)$$
where x ∈ Ω ⊂ R^3 denotes a computational solid domain with boundary Γ, t ∈ [0, T], i, j ∈ [1, 2, 3], ρ(x) is the mass density, u(x, t) is the displacement vector field, σ(x, t) is the stress tensor field, and f(x, t) is a seismic source modeled in the form of a point source f(x, t) = θ(t)δ(x − x_s) applied at x_s using a standard Ricker wavelet θ(t). Approximating the earth model with a linear
isotropic relation (i.e. Hooke’s law), the stress tensor is related to displacement
gradients as
$$\sigma_{ij} = \lambda\,\delta_{ij}\,\partial_{x_k}u_k + \mu\left(\partial_{x_i}u_j + \partial_{x_j}u_i\right) \qquad (2)$$
where λ(x) and µ(x) are the first and second Lamé parameters respectively.
In fact, µ can be physically interpreted as the shear modulus and λ = κ −
2/3µ, where κ is the bulk modulus. The elastic wave equation is subject to the
standard initial conditions u_i = ∂_t u_i = 0, free surface boundary conditions σ_ij n_j = 0, and Clayton and Engquist absorbing boundary conditions σ_ij n_j = −ρ v ∂_t u_i respectively, where n_j denotes the normal to the boundary Γ and v denotes the elastic wave speed (the P wave speed for displacement components parallel to n_j and the S wave speed for components perpendicular to n_j). Using Galerkin projection, the weak form of the elastic wave equation is
$$\int_{\Omega}\psi\,\rho\,\partial_{tt}u_i\,d\Omega + \int_{\Omega}\partial_{x_j}\psi\,\sigma_{ij}\,d\Omega = \int_{\Omega}\psi\,f_i\,d\Omega + \int_{\Gamma}\psi\,\sigma_{ij}\,d\Gamma_j \qquad (3)$$
where ψ denotes the choice of basis functions.
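As a concrete illustration of how the constitutive relation (2) is evaluated pointwise, the following minimal C sketch (our own illustration; the array layout and function name are assumptions, not the authors' code) computes the isotropic stress tensor from the displacement gradients and the local Lamé parameters at a single quadrature point.

  /* Minimal sketch of Hooke's law, eq. (2), at one quadrature point.
   * grad_u[i][j] holds du_i/dx_j; lambda and mu are the local Lame parameters.
   * Illustrative only; not the authors' implementation.                      */
  void stress_from_gradients(const double grad_u[3][3],
                             double lambda, double mu,
                             double sigma[3][3])
  {
      /* divergence of the displacement field, d(u_k)/d(x_k) */
      const double div_u = grad_u[0][0] + grad_u[1][1] + grad_u[2][2];

      for (int i = 0; i < 3; ++i)
          for (int j = 0; j < 3; ++j)
              sigma[i][j] = (i == j ? lambda * div_u : 0.0)
                          + mu * (grad_u[i][j] + grad_u[j][i]);
  }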
2.1 Discretization
Using the spectral element method for the spatial discretization, the computational domain is defined as a tessellation of Ne hexahedral elements (Figure 1(c))
with the displacement field approximated as ui (x, t) ≈ ψpqr (x)ui,pqr (t), such
that
$$\int_{\Omega_e}\psi_{abc}\,\rho\,\psi_{pqr}\,\partial_{tt}u_{i,pqr}\,d\Omega + \int_{\Gamma_e}\psi_{ab}\,\rho v\,\psi_{pq}\,\partial_{t}u_{i,pq}\,d\Gamma + \int_{\Omega_e}\partial_{x_j}\psi_{abc}\,\sigma_{ij}\,d\Omega = \int_{\Omega_e}\psi_{abc}\,f_i\,d\Omega \qquad (4)$$
In this specific formulation the basis functions ψpqr are taken to be a tensor
product of the family of N-th order Lagrange polynomials ψ_pqr = ℓ_p ℓ_q ℓ_r (Figure
1(b)). After applying a geometric transformation to a reference element ξi ∈
[−1, +1] (Figure 1(a)), performing the integration with Gauss-Legendre-Lobatto
(GLL) quadrature (Figure 1(c)) [6], and then performing global assembly, we obtain the system
of ODEs
$$M\ddot{u} + C\dot{u} + F(m, u) = s \qquad (5)$$
Fig. 1. Illustrations of (a) the transformation of a hexahedral element from x to ξ
coordinates, (b) a family of 8th order Lagrange polynomials defined for ξi ∈ [−1, +1],
(c) an 8th order spectral element depicting the locations of the GLL quadrature/nodal
points.
where u is the global set of displacements and m the global set of Lamé parameters λ and µ defined for GLL points from all spectral elements in the grid. The
mass M and damping C matrices, force F and source s vectors are given by
$$M = \sum_{e=1}^{N_e}\sum_{f,g,h=1}^{N+1} w_f\,w_g\,w_h\,\psi_{abc}\,\rho\,\psi_{pqr}\,J\big|_{\xi_{1,f},\xi_{2,g},\xi_{3,h}}$$
$$C = \sum_{e=1}^{N_e}\sum_{f,g=1}^{N+1} w_f\,w_g\,\psi_{ab}\,\rho v_i\,\psi_{pq}\,J\big|_{\xi_{1,f},\xi_{2,g}}$$
$$F = \sum_{e=1}^{N_e}\sum_{f,g,h=1}^{N+1} w_f\,w_g\,w_h\,\partial_{\xi_k}\psi_{abc}\,\partial_{\xi_k}x_j\,\sigma_{ij,pqr}\,J\big|_{\xi_{1,f},\xi_{2,g},\xi_{3,h}}$$
$$s = \sum_{e=1}^{N_e}\sum_{f,g,h=1}^{N+1} w_f\,w_g\,w_h\,\psi_{abc}\,f_i\,J\big|_{\xi_{1,f},\xi_{2,g},\xi_{3,h}} \qquad (6)$$
where J defines the Jacobian of the transformation mapping to the reference element, and wf denote the weights associated with the GLL quadrature. Due to
the cardinal interpolation properties of the Lagrange polynomials ℓ_a(ξ_f) = δ_af
and the specific choice of GLL quadrature, the mass and damping matrices
(which don’t involve spatial derivatives of the Lagrange polynomials) are diagonal and hence trivially assembled and stored as 1D arrays. It is worthwhile
mentioning that F(m, u) ≡ K(m)u, where K is the well-known (sparse, but non-diagonal) stiffness matrix of solid mechanics, which depends linearly on m, that is K(m_1 + m_2) = K(m_1) + K(m_2). Finally, in order to perform the time integration, an explicit form of the Newmark method [7] is used:
$$u_{n+1} = u_n + \Delta t\,\dot{u}_n + \tfrac{1}{2}\Delta t^2\,\ddot{u}_n$$
$$\hat{\dot{u}}_{n+1} = \dot{u}_n + \tfrac{1}{2}\Delta t\,\ddot{u}_n$$
$$\ddot{u}_{n+1} = \left(M + \tfrac{1}{2}\Delta t\,C\right)^{-1}\left(s_{n+1} - C\,\hat{\dot{u}}_{n+1} - F(u_{n+1})\right)$$
$$\dot{u}_{n+1} = \hat{\dot{u}}_{n+1} + \tfrac{1}{2}\Delta t\,\ddot{u}_{n+1} \qquad (7)$$
where it is important to note that, due to the diagonality of M and C, the inverse required for the update of the acceleration ü_{n+1} in (7) is trivial and can in fact be precomputed and stored as a 1D array once M and C have been assembled.
In order to update the earth model, a generic data misfit functional relating
the solution of the discrete forward elastic model to the observed data is defined
as
$$J(m) = \int_0^T \sum_{s=1}^{N_s} r\big(d(x_s, t) - V u(t)\big)\,dt \qquad (8)$$
where r(x) > 0 for x ≠ 0 is a noise model quantifying the error between V u and the observed quantities d, and V maps the discrete wavefield u into the space of observations, or equivalently, maps the grid of GLL points approximating Ω into the set of sensor locations {x_1, . . . , x_{N_s}}. Estimation of the earth model m given the data d by variational approaches (gradient descent or quasi-Newton methods) requires the computation of ∂_{m_j} J. The latter is achieved by introducing the following adjoint equation
$$M\ddot{u}^{\dagger} - C\dot{u}^{\dagger} + F(m, u^{\dagger}) = V^{T}(d - Vu), \qquad u^{\dagger}(T) = \dot{u}^{\dagger}(T) = 0 \qquad (9)$$
where u solves (5). We also define β_j as a solution of the following scalar ODE
$$\dot{\beta}_j = u^{\dagger}(t)^{T}\,\partial_{m_j}K\,u(t), \qquad \beta_j(0) = 0 \qquad (10)$$
Finally, through integration by parts, it is easy to prove that
$$\partial_{m_j} J(m) = \beta_j(T). \qquad (11)$$
The key drawback of the approach for computing ∂_{m_j} J presented above is that computation of β_j requires the storage of either u or u†
in the memory, as the former is computed forward in time and integration of
the adjoint model (10) requires the latter to be computed backwards in time.
Storage of either discrete wavefield at each GLL point at every time step would
prohibitively limit scalability of the code, as multiple runs of the forward and
adjoint time marching will be required for computation of each ∂mj J. Alternatively, we propose that during the first run (i.e. solution of (5)) the residual
r is computed and stored at each time step. As the number of sensors Ns is
significantly smaller than the total number of GLL points in the grid, storage
of the wavefield at these points for every time step is trivial. The second run
solves (9) backward in time solely for the purpose of computing u†(0) and u̇†(0), which are then stored. Finally, (5), (9), and (10) are simultaneously marched forward until time T, at which point ∂_{m_j} J can be evaluated. Note that (9) is solved forward in time by using the stored u†(0) and u̇†(0) as the initial conditions.
This approach requires little storage, but assumes that time integration can be
performed sufficiently fast.
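The three-pass procedure described above can be summarized by the following C sketch; the driver and pass routines are hypothetical placeholders introduced here for illustration only and do not correspond to the authors' code.

  /* Hypothetical pass routines, declared only to make the sketch self-contained;
   * they stand for the three solver passes described in the text.              */
  void forward_solve_store_residuals(double t0, double t1);
  void adjoint_solve_backward(double t1, double t0);
  void march_forward_accumulate_beta(double t0, double t1, int n_param, double *grad);

  void gradient_of_misfit(double T, int n_param, double *grad)
  {
      /* Pass 1: forward solve of (5) from 0 to T, storing only the residuals
       * d(x_s, t) - V u(t) at the Ns sensor locations at every time step.    */
      forward_solve_store_residuals(0.0, T);

      /* Pass 2: adjoint solve of (9) backward from T to 0, driven by the
       * stored residuals; only u_dag(0) and u_dag_dot(0) are kept.           */
      adjoint_solve_backward(T, 0.0);

      /* Pass 3: march (5), (9) and (10) forward together until T; by (11)
       * the accumulated beta_j(T) are the gradient components.               */
      march_forward_accumulate_beta(0.0, T, n_param, grad);
  }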
2.2 Parallel Implementation
The spectral element and Newmark discretization methods allow for a relatively
simple explicit time marching scheme, avoiding the need for sparse matrix data structures or the solution of linear systems.

Fig. 2. An illustration of a computational domain (left) that is discretized into spectral elements and decomposed amongst multiple compute nodes (middle), where a given process's portion of elements (right) is further separated into interior (blue) and interface (red) elements.

The parallel implementation of this
solver involves a hybrid distributed-shared memory approach implemented with
MPI and OpenMP. The distributed memory aspect involves a standard domain
decomposition approach where individual processes are responsible for the time
marching over a unique subset of spectral elements (Figure 2). This approach
implies that GLL points shared by spectral elements on neighbouring processes
will be duplicated and so when the forces F(m, u) are assembled at each time step
in the forward or adjoint solve, the forces need to be exchanged and summed
between any processes that contain a duplicated GLL point. In order to hide
the latency of this data exchange, a process's elements can be grouped into
interior and interface elements (Figure 2), processing the interface elements first,
exchanging forces on duplicate GLL points with non-blocking MPI sends and
receives, and processing the interior elements while the messages are in transit.
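A minimal sketch of this overlap is given below, assuming one send and one receive buffer per neighbouring process; the buffer management and names are ours, not the authors' code. The non-blocking exchange is posted after the interface elements have been processed, the interior elements are then processed, and only afterwards does the caller wait and accumulate the received contributions.

  #include <mpi.h>

  /* Post the non-blocking exchange of interface forces with all neighbouring
   * processes and return the number of requests, so that the caller can
   * process the interior elements before calling MPI_Waitall.               */
  int post_force_exchange(int num_neighbours, const int *neighbour,
                          const int *count, double **send_buf,
                          double **recv_buf, MPI_Request *reqs)
  {
      int nreq = 0;
      for (int p = 0; p < num_neighbours; ++p) {
          MPI_Isend(send_buf[p], count[p], MPI_DOUBLE, neighbour[p], 0,
                    MPI_COMM_WORLD, &reqs[nreq++]);
          MPI_Irecv(recv_buf[p], count[p], MPI_DOUBLE, neighbour[p], 0,
                    MPI_COMM_WORLD, &reqs[nreq++]);
      }
      /* The caller now processes the interior elements, calls
       * MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE), and sums the received
       * contributions into the duplicated GLL points.                       */
      return nreq;
  }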
The shared memory parallelization is applied mainly to the computation of
the forces, as this is by far the most computationally intensive part of the algorithm. The computation involves the use of multiple threads to process the different spectral elements within an interior or interface region, where an OpenMP
thread loops over the GLL points in an element, computes the displacement gradients and stresses at each GLL point, performs the integration, and updates the
forces array, which is shared by all threads. The update involves a synchronized
write to the memory, which is achieved using OpenMP locks by maintaining
a lock for each shared GLL point; a more flexible and scalable approach than using either critical or named critical sections. Pseudocode for the evaluation of the forces is presented in Algorithm 1.

  for region = interface, interior
    #pragma omp for
    for e = 1:N_e                       // elements in region
      for p,q,r = 1:N+1                 // each GLL point in element e
        Compute displacement gradients
        Compute stresses
        Integrate
        omp_set_lock at GLL point pqr
        Update global forces array
        omp_unset_lock at GLL point pqr
      end
    end
    if region == interface
      MPI_Isend then MPI_Irecv forces on duplicate GLL points
    end
  end
  MPI_Waitall

Algorithm 1: Pseudocode for computation of forces at each time step.

2.3 Experimental Setup

Experiments were performed on two different systems, namely an IBM Blue Gene/Q and the Intel-based ‘Stampede’ supercomputer. Each Blue Gene/Q compute node contains a 16-core 1.6GHz A2 processor with 16GB of DDR3
memory. Each core supports four-way Symmetric Multithreading (SMT), and our implementation uses 64 threads per node. Each core has a Quad Floating
Point Unit (FPU) supporting 4-way double precision SIMD operations. The code
was compiled with the IBM XLC compiler. The Stampede compute nodes contain two 8-core Intel Xeon E5-2680 (Sandy Bridge) processors at 2.7GHz, one 61-core Intel Xeon Phi SE10P (Knights Corner) coprocessor at 1.09GHz, and 32GB of DRAM.
The Sandy Bridge core has 256-bit vector registers and can be programmed using the 256-bit vector intrinsics. The Xeon Phi coprocessor is a Many Integrated Core (MIC) architecture, the basis of which is a light-weight x86 core with in-order instruction processing, coupled with heavy-weight 512-bit SIMD registers and instructions. With these two features the Phi die can support 61 cores and can execute 8-wide double precision SIMD instructions. The code was compiled and
run using Intel Composer XE 2013 and Intel MPI Library 4.1.
The nature of the algorithm is such that most computations required at a
GLL point, such as the displacement gradients in the ξi and xi coordinates for
example:
$$\frac{\partial u_i}{\partial \xi_1} = \sum_{f=1}^{N+1} u_i\,\left.\frac{\partial \ell_f}{\partial \xi_1}\right|_{\xi_{1,p}}, \qquad \frac{\partial u_i}{\partial x_1} = \frac{\partial u_i}{\partial \xi_1}\frac{\partial \xi_1}{\partial x_1} + \frac{\partial u_i}{\partial \xi_2}\frac{\partial \xi_2}{\partial x_1} + \frac{\partial u_i}{\partial \xi_3}\frac{\partial \xi_3}{\partial x_1} \qquad (12)$$
can be performed in a vectorizable fashion for i ∈ [1, 2, 3], with similar expressions for displacement gradients in other directions, and with similar operations
for the computation of stresses and the integration. Taking advantage of this fact, we used QPX intrinsics on Blue Gene/Q, such as vec_mul and vec_madd, to perform multiply and multiply-add operations respectively, using 3 of the 4 available execution slots. Despite not using the full capacity of the
floating point unit, this approach significantly improves performance without
requiring development of a more complex algorithm. With the Sandy Bridge
core the same approach is also used, but with the AVX intrinsics. With the MIC architecture, on the other hand, the ability to perform 8-wide double precision SIMD instructions means that this approach would make far less efficient use of the SIMD registers. To test its full capacity, a restriction was made to 7th order Lagrange polynomials in the discretization (hence N + 1 = 8) and the algorithm for the computation of forces was restructured so that, rather than vectorizing the computation of quantities over i ∈ [1, 2, 3], the terms over the Lagrange polynomial index f ∈ [1, N + 1] in (12) were vectorized with KNC multiply and reduction intrinsics such as _mm512_mul_pd and _mm512_reduce_add_pd respectively.
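As an illustration of the two vectorization strategies, the plain C kernel below (the data layout and names are our assumptions, not the authors' kernels) evaluates ∂u_i/∂ξ_1 at one GLL point as in (12). On Blue Gene/Q and Sandy Bridge the outer loop over the component index i is vectorized, whereas on the MIC the inner sum over the polynomial index f (of length N + 1 = 8) is vectorized and reduced.

  #define NP 8   /* N + 1 GLL points per direction for 7th order polynomials */

  /* dl[f][p] is the derivative of the f-th 1D Lagrange polynomial at GLL node p;
   * u[i][f][q][r] is the i-th displacement component at element node (f, q, r). */
  void grad_u_xi1(const double dl[NP][NP],
                  const double u[3][NP][NP][NP],
                  int p, int q, int r, double du_dxi1[3])
  {
      for (int i = 0; i < 3; ++i) {        /* loop vectorized with QPX/AVX intrinsics */
          double sum = 0.0;
          for (int f = 0; f < NP; ++f)     /* loop vectorized and reduced on the MIC  */
              sum += u[i][f][q][r] * dl[f][p];
          du_dxi1[i] = sum;
      }
  }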
3 Results & Discussion
Fig. 3. Strong scaling results for the forward solver presenting (a) speedup and (b)
parallel efficiency for an internode run, testing the distributed memory parallelization
on Blue Gene/Q.
To test the scalability of the solver, both strong and weak scaling runs were
performed on each system for a simulation of the forward wavefield only (since
the nature of the algorithm means that the adjoint and β fields will scale in
exactly the same way). On each Blue Gene/Q compute node, one MPI process was instantiated and multiple threads were spawned across the 16 cores. To
test the distributed memory parallelization, Figures 3(a) and 3(b) present the
speedup and parallel efficiency for a grid comprising approximately 5.5 billion
degrees of freedom which is distributed over an increasing number of compute nodes.
As can be observed, the solver scales well out to 2048 compute nodes (which
is in fact two Blue Gene/Q ‘racks’).

Fig. 4. Strong scaling results for the forward solver presenting (a) speedup and (b) parallel efficiency for an intranode run, testing the shared memory parallelization on Blue Gene/Q.

To test the shared memory parallelization
Figures 4(a)-4(b) present the speedup and parallel efficiency within a single
compute node for a grid comprising approximately 6.4 million degrees of freedom.
As can be observed, the solver scales reasonably well up to 16 threads per node. With 32 and 64 threads per node, however, the hardware threading capability is being utilized, which yields a further improvement but decreasing parallel efficiency, as expected. To test the weak scalability, Figure 5 presents the run time for 500
time steps with grid portions comprising 6.4 million degrees of freedom per MPI
process, implying that with 1024 compute nodes the total grid size comprises
approximately 6.5 billion degrees of freedom. As can be observed, the solver performs reasonably well, requiring only approximately 13% more run time for a grid roughly one thousand times larger.
Figures 6(a) and 6(b) present the speedup and parallel efficiency for the grid
comprising approximately 5.5 billion degrees of freedom on Stampede. As can
be observed, the solver scales reasonably well out to 512 compute nodes. The speedup on Stampede is less than that on Blue Gene/Q for large node counts; the reason could be that the topology of the nodes allocated on Stampede is in general irregular, whereas on Blue Gene/Q the nodes allocated for a job have a structured topology, resulting in good MPI communication performance.
Fig. 5. Weak scaling results for the forward solver presenting the run time for 500 time steps on Blue Gene/Q.

Fig. 6. Strong scaling results for the forward solver presenting (a) speedup and (b) parallel efficiency for an internode run, testing the distributed memory parallelization on Stampede.

As previously mentioned, Stampede compute nodes are equipped with Intel Xeon Phi (MIC) accelerators. In general, there are two ways in which the MIC
accelerator could be used. Once a portion of the grid is allocated to a node, the elements in the domain could be distributed between the host and the MIC; for instance, the host operates on the interface elements, which involve MPI communication, while the processing of the interior elements is offloaded to the MIC. However, this approach incurs a high overhead of data transfer between host and MIC at each time step to update the displacement gradients and the global force array. The other approach is to assign a portion of the grid to the MIC, with an MPI process running on each MIC. This can be accomplished using the symmetric MPI methodology between host and MIC. In this approach it becomes crucial to balance the compute load between CPU and MIC. Since the core computation in the code involves irregular memory accesses, and hence poor last level cache locality, the memory access problem becomes even more severe on the MIC, where many threads compete for the memory controller. In our experiments we found that, on average, the code runs 3 times slower on the MIC (using 120 threads) than on the host (using 16 cores). The approach used to balance the load is therefore to run 3 MPI tasks on the CPU, each using 5 threads, and 1 MPI task on the MIC. Figure 7(a) presents strong scaling results for both cases for a grid comprising approximately 115 million degrees of freedom: using only the host to run 3 MPI processes with 5 OpenMP threads per process, and additionally running 1 MPI task with 120 threads on the MIC. As the results show, using the MIC results in around a 34% improvement in performance.
Fig. 7. Strong scaling results for the forward solver comparing CPU and symmetric
CPU+MIC, presenting (a) runtime for an internode run testing the distributed memory
parallelization and (b) speedup for a MIC intranode run testing the shared memory
parallelization on Stampede.
To test the shared memory parallelization, Figure 7(b) presents the speedup within a MIC for a grid comprising approximately 32 million degrees of freedom. As can be observed, the solver scales reasonably well up to 60 threads per MIC. With 120 and 180 threads per MIC, the hardware threading capability (the MIC supports up to 4 hardware threads per core) is utilized, which shows an improvement but decreasing parallel efficiency. Since the code uses 512-bit vector intrinsics, and since the vector unit is shared across the threads in a core, there is no improvement when 240 threads are used.
4 Conclusions
A communication-avoiding, matrix-free formulation and parallelization strategy for 3D elastic spectral element forward and adjoint solvers has been introduced.
The formulation presents a viable strategy to overcome one of the key computational bottlenecks associated with full waveform simulation and inversion. The
proposed algorithmic framework requires minimal storage and communication
and shows remarkable scalability on large computational domains. Future work
will include the addition of the variational approach (using ∂mj J to update the
earth model and thereby completing the FWI algorithm) and integrating the
solver with the cloud-based delivery model.
References

1. Fichtner, A.: Full Seismic Waveform Modelling and Inversion. Advances in Geophysical and Environmental Mechanics and Mathematics. Springer (2010)
2. Haber, E., Van Den Doel, K., Horesh, L.: Optimal design of simultaneous source encoding. Inverse Problems in Science and Engineering (ahead-of-print) (2014) 1-18
3. Horesh, L., Schweiger, M., Arridge, S., Holder, D.: Large-scale non-linear 3D reconstruction algorithms for electrical impedance tomography of the human head. In: World Congress on Medical Physics and Biomedical Engineering 2006, Springer (2007) 3862-3865
4. Chavent, G.: Analyse fonctionnelle et identification de coefficients répartis dans les équations aux dérivées partielles. IRIA (1971)
5. Fichtner, A., Bunge, H.P., Igel, H.: The adjoint method in seismology - II. Applications: traveltimes and sensitivity functionals. Physics of the Earth and Planetary Interiors 157 (2006) 105-123
6. Parter, S.V.: On the Legendre-Gauss-Lobatto points and weights. J. Sci. Comput. 14(4) (December 1999) 347-355
7. Kane, C., Marsden, J.E., Ortiz, M., West, M.: Variational integrators and the Newmark algorithm for conservative and dissipative mechanical systems. Internat. J. Numer. Methods Engrg 49 (1999) 1295-1325