
Parallel Algorithms for Short-Range Molecular Dynamics
What is Molecular Dynamics?
Biophysics Application Project
Emiliano Ippoliti
September 14, 2011
Contents

1 Introduction
  1.1 The role of computer experiments
  1.2 What is molecular dynamics?
  1.3 Some historical notes
  1.4 Limitations
    1.4.1 Use of classical forces
    1.4.2 Realism of forces
    1.4.3 Time and size limitations

2 The basic machinery
  2.1 Modeling the physical system
    2.1.1 Lennard-Jones potential
    2.1.2 Potential truncation and long-range corrections
  2.2 Periodic boundary conditions
    2.2.1 The minimum image criterion
  2.3 Time integration algorithm
  2.4 The Verlet algorithm
    2.4.1 Velocity Verlet algorithm
    2.4.2 Leap-frog algorithm
    2.4.3 Error terms

3 Simple statistical quantities
  3.1 Potential energy
  3.2 Kinetic energy
  3.3 Total energy
  3.4 Temperature
  3.5 Radial distribution function
  3.6 MSD and VACF
  3.7 Pressure

4 Computational aspects of MD
  4.1 MD algorithms
  4.2 Long- vs short-range force model
  4.3 Neighbor lists and link-cell method
  4.4 Parallel algorithms

5 Computational aspects of our model
  5.1 Reduced units
  5.2 Simulation parameters

6 Atom-decomposition algorithm
  6.1 Expand and fold operations
  6.2 The simpler algorithm
  6.3 Using Newton's 3rd law
  6.4 Load-balance

7 Force-decomposition algorithm
  7.1 The simpler algorithm
  7.2 Using Newton's 3rd law
  7.3 Load-balance

8 Spatial-decomposition algorithm
  8.1 The communication scheme
  8.2 The simpler algorithm
    8.2.1 Communication scaling
    8.2.2 Computational scaling
    8.2.3 Data structures' update
  8.3 Using Newton's 3rd law
  8.4 Load-balance

References
1 Introduction

1.1 The role of computer experiments
Computer experiments play a very important role in science today. In the past, the physical sciences
were characterized by an interplay between experiment and theory. In experiment, a system is
subjected to measurements, and results, expressed in numeric form, are obtained. In theory,
a model of the system is constructed, usually in the form of a set of mathematical equations.
The model is then validated by its ability to describe the system behavior in a few selected
cases, simple enough to allow a solution to be computed from the equations. In many cases,
this implies a considerable amount of simplification in order to eliminate all the complexities
invariably associated with real world problems, and make the problem solvable.
In the past, theoretical models could be easily tested only in a few simple special circumstances. So, for instance, in condensed matter physics a model for intermolecular forces in a
specific material could be verified in a diatomic molecule, or in a perfect, infinite crystal. Even
then, approximations were often required to carry out the calculation. Unfortunately, many physical problems of extreme interest (both academic and practical) fall outside the realm of these
"special circumstances". Among them, one could mention the physics and chemistry of organic molecules, involving a large number of degrees of freedom.
The advent of high-speed computers, which came into use in the 1950s, altered the picture by inserting a new element between experiment and theory: the computer experiment. In a computer experiment, a model is still provided by theorists, but the calculations are carried out by the machine following a "recipe" (the algorithm, implemented in a suitable
programming language). In this way, complexity can be introduced (still with caution!) and
more realistic systems can be investigated, opening a road towards a better understanding of real
experiments.
Needless to say, the development of computer experiments substantially altered the traditional relationship between theory and experiment. On the one hand, computer simulations increased the demand for accuracy of the models. For instance, a molecular dynamics simulation allows us to evaluate the melting temperature of a material modeled by means of a certain interaction law. This is a difficult test for the theoretical model to pass, and a test that was not available in the past. Therefore, simulation "brings the models to life", disclosing critical areas and providing
suggestions to improve them.
On the other hand, simulation can often come very close to experimental conditions, to the extent that computer results can sometimes be compared directly with experimental results. When
this happens, simulation becomes an extremely powerful tool not only to understand and interpret the experiments at the microscopic level, but also to study regions which are not accessible
experimentally, or which would imply very expensive experiments, such as under extremely high
pressure.
Last but not least, computer simulations make it possible to realize thought experiments: things that are impossible to do in reality, but whose outcome greatly increases our understanding of phenomena. Imagination and creativity are important qualities for the computer simulator!
1.2 What is molecular dynamics?
We call molecular dynamics (MD) a computer simulation technique where the time evolution
of a set of interacting atoms is followed by integrating their equations of motion. In molecular
dynamics we follow the laws of classical mechanics, most notably Newton's 2nd law:

    F_i = m_i a_i    (1)

for each atom i in a system of N atoms. Here, m_i is the atom mass, a_i = d^2 r_i / dt^2 its acceleration, and F_i the force acting upon it due to the interactions with other atoms. Therefore, molecular dynamics is a deterministic technique: given an initial set of positions and velocities, the subsequent time evolution is in principle^1 completely determined. In more pictorial terms, atoms will "move" in the computer, bumping into each other, wandering around (if the system is fluid), oscillating in waves in concert with their neighbors, and so on, in a way quite similar to what atoms in a real substance would do.

^1 In practice, the finiteness of the integration time step and arithmetic rounding errors will eventually cause the computed trajectory to deviate from the true trajectory.
The computer calculates a trajectory in a 6N-dimensional phase space (3N positions and 3N momenta). However, such a trajectory is usually not particularly relevant by itself. Molecular
dynamics is a statistical mechanics method: it is a way to obtain a set of configurations distributed
according to some statistical distribution function or statistical ensemble. According to statistical
physics, physical quantities are represented by averages over configurations distributed according
to a certain statistical ensemble. A trajectory obtained by molecular dynamics provides such a
set of configurations. Therefore, a measurement of a physical quantity by simulation is simply
obtained as an arithmetic average of the various instantaneous values assumed by that quantity
during the MD run.
Statistical physics is the link between the microscopic behavior and thermodynamics. In the
limit of very long simulation times, one could expect the phase space to be fully sampled and
in that limit this averaging process would yield the thermodynamic properties. In practice, the
runs are always of finite length and one should exert caution to estimate when the sampling may
be good ("system at equilibrium") or not. In this way, MD simulations can be used to measure
thermodynamic properties.
1.3 Some historical notes
Molecular dynamics is concerned with simulating the motion of molecules to gain a deeper
understanding of several physical phenomena that derive from molecular interactions. These
studies include not only the motion of many molecules as in a fluid, but also the motion of a
single large molecule consisting of hundreds or thousands of atoms, as in a protein.
Computers are a critically important tool for these studies because there simply is no other
way to trace the motion of a large number of interacting particles. The earliest of these computations were done in the 1950s by Berni Alder and Tom Wainwright at Lawrence Livermore
National Laboratory [1]. They studied the distribution of molecules in a liquid, using a model in
which the molecules are represented as "hard spheres" that interact like billiard balls. Using the
fastest computer at that time, an IBM 704, they were able to simulate the motions of 32 and 108
molecules in computations requiring 10 to 30 hours. Now it is possible to perform hard sphere
computations on systems of over a billion particles.
Another crucial paper appeared in 1967 [2]. Loup Verlet introduced the concept of the neighbor list and the Verlet time integration algorithm, which we will meet in the next lessons, to calculate the phase diagram of argon and to compute correlation functions to test theories of the liquid state. Verlet studied a molecular model more realistic than the hard-sphere one, known as the Lennard-Jones model (the same model we will study in this course).
The Lennard-Jones model was also employed by Lawrence Hannon, George Lie, and Enrico Clementi in 1986 at IBM Kingston to study the flow of fluids [3]. In these computations the fluid was represented by ~10^4 interacting molecules. Even though this is minuscule compared with
the number of molecules in a gram of water, the behavior of the flow was like that in a real fluid.
Another class of molecular dynamics computations is concerned with the internal motion of
molecules, especially proteins and nucleic acids, vital components of biological systems. The
goal in this case is to gain a better understanding of the function of these molecules in biochemical reactions. Interestingly, it has been found that quantum effects do not have much influence
on the overall dynamics of proteins except at low temperatures. Thus, classical mechanics is sufficient to model the motions, but still the computational power required for following the motion
of a large molecule is enormous. For example, simulating the motion of a 1,500-atom molecule, a small protein, for a time interval of 10^-7 seconds is a 6-hour computation on a modern Intel Xeon quad-core processor.
Martin Karplus and Andrew McCammon, in an interesting article, "The Molecular Dynamics of Proteins" [4], describe a discovery concerning the molecule myoglobin that could only have
been made through molecular dynamics. The interest in myoglobin comes about because it stores
oxygen in biological systems. Whales, for example, are able to stay under water for long periods
of time because of a large amount of myoglobin in their bodies. It was known that the oxygen
molecule binds to a particular site in the myoglobin molecule but it was not understood how the
binding could take place. X-ray crystallography shows that large protein molecules tend to be folded up into compact three-dimensional structures, and in the case of myoglobin the oxygen
sites were known to lie in the interior of such a structure. There did not seem to be any way
that an oxygen atom could penetrate the structure to reach the binding site. Molecular dynamics
provided the answer to this puzzle. The picture of a molecule provided by X-ray crystallography
shows the average positions of the atoms in the molecule. In fact these atoms are in continuous
motion, vibrating about positions of equilibrium. Molecular dynamics simulation of the internal
motion of the molecule showed that a path to the site, wide enough for an oxygen atom, could
open up for a short period of time.
Scientific visualization is particularly important for understanding the results of a molecular
dynamics simulation. The millions of numbers representing a history of the positions and velocities of the particles are not a very revealing picture of the motion. How can one recognize
the formation of a vortex in this mass of data or the nature of the bending and stretching of a
large molecule? How can one gain new insights? Pictures and animations enable the scientist to
literally see the formation of vortices, to view protein bending, and thus to gain new insights into
the details of these phenomena.
Computations that involve following the motion of a large number of interacting particles,
whether the particles are atoms in a molecule, molecules of a fluid or solid, or even particles in the discrete model of a vibrating string or membrane, are similar in the following respect. They
involve a long series of time steps, at each of which Newton’s laws are used to determine the new
positions and velocities from the old positions and velocities and the forces. The computations
are quite simple but there are many of them. To achieve accuracy, the time steps must be quite
small, and therefore many are required to simulate a significantly long real time interval. In
computational chemistry the time step is typically about a femtosecond (10^-15 seconds), and the total computation may represent ~10^-8 seconds, which could cost about 100 hours of computer time. The amount of computation at each step can be quite extensive. In a system consisting of N particles the force computations at each step may involve O(N^2) operations. Thus it is
easy to see that these computations can consume a large number of machine cycles. In addition,
animation of the motion of the system can make large demands on memory.
1.4 Limitations

Molecular dynamics is a very powerful technique but it has, of course, limitations. We quickly
examine the most important of them.
1.4.1 Use of classical forces
One could immediately ask: how can we use Newton’s law to move atoms, when everybody
knows that systems at the atomistic level obey quantum laws rather than classical laws, and that
Schrödinger's equation is the one to be followed?
A simple test of the validity of the classical approximation is based on the de Broglie thermal
wavelength [5] defined as
    Λ = sqrt( 2πħ^2 / (M k_B T) )    (2)
where M is the atomic mass and T the temperature. The classical approximation is justified if Λ ≪ a, where a is the mean nearest-neighbor separation. If one considers, for instance, liquids at the triple point, Λ/a is of the order of 0.1 for light elements such as Li and Ar, decreasing further for heavier elements. The classical approximation is poor for very light systems such as H2, He,
Ne.
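As a quick numerical check of eq. (2) (a sketch added here for illustration, not part of the original notes), the following C program evaluates Λ for liquid argon near its triple point; the mass and temperature values used are indicative assumptions.

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double hbar = 1.054571817e-34;          /* J s  */
    const double kB   = 1.380649e-23;             /* J/K  */
    /* Indicative values for liquid argon near the triple point (assumptions). */
    const double M = 39.948 * 1.66053907e-27;     /* atomic mass in kg */
    const double T = 84.0;                        /* temperature in K  */

    /* de Broglie thermal wavelength, eq. (2). */
    double lambda = sqrt(2.0 * 3.14159265358979323846 * hbar * hbar / (M * kB * T));
    printf("de Broglie thermal wavelength: %.2e m\n", lambda);   /* ~3e-11 m */
    return 0;
}

Compared with a mean nearest-neighbor separation of a few angstroms, this gives Λ/a of the order of 0.1, as quoted above.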
Moreover, quantum effects become important in any system when T is sufficiently low. The
drop in the specific heat of crystals below the Debye temperature [6] and the anomalous behavior of the thermal expansion coefficient are well-known examples of measurable quantum effects in
solids.
Molecular dynamics results should be interpreted with caution in these regions.
1.4.2 Realism of forces
How realistic is a molecular dynamics simulation?
In molecular dynamics, atoms interact with each other. These interactions originate forces
that act upon atoms, and atoms move under the action of these instantaneous forces. As the atoms
move, their relative positions change and forces change as well.
The essential ingredient containing the physics is therefore constituted by the forces. A
simulation is realistic, i.e. it mimics the behavior of the real system, only to the extent that
interatomic forces are similar to those that real atoms (or, more exactly, nuclei) would experience
when arranged in the same configuration.
As we will see in the next section, forces are usually obtained as minus the gradient of a potential energy function depending on the positions of the particles. The realism of the simulation
therefore depends on the ability of the potential chosen to reproduce the behavior of the material
under the conditions at which the simulation is run.
The problem of selecting, or constructing, potentials (the so-called force-field problem) is
still under study by the scientific community and it is beyond the scope of these short notes.
1.4.3 Time and size limitations
Typical MD simulations can nowadays be performed on systems containing hundreds of thousands, or perhaps millions, of atoms, and for simulation times ranging from a few nanoseconds to more than one microsecond. While these numbers are certainly respectable, one may still run into
conditions where time and/or size limitations become important.
A simulation is "safe" from the point of view of its duration when the simulation time is much
longer than the relaxation time of the quantities we are interested in. However, different properties have different relaxation times. In particular, systems tend to become slow and sluggish
in the proximity of phase transitions, and it is not uncommon to find cases where the relaxation
time of a physical property is orders of magnitude larger than times achievable by simulation.
A limited system size can also constitute a problem. In this case one has to compare the size
of the MD cell with the correlation lengths of the spatial correlation functions of interest. Again,
correlation lengths may increase or even diverge in proximity of phase transitions, and the results
are no longer reliable when they become comparable with the box length.
This problem can be partially alleviated by a method known as finite size scaling. This consists of computing a physical property A using several boxes with different sizes L, and then fitting the results to a relation

    A(L) = A_0 + c / L^n

using A_0, c, n as fitting parameters. A_0 then corresponds to lim_{L→∞} A(L), and should therefore be taken as the most reliable estimate for the "true" physical quantity.
2 The basic machinery

2.1 Modeling the physical system
The main ingredient of a simulation is a model for the physical system. For a molecular dynamics
simulation this amounts to choosing the potential, a function U(r_1, ..., r_N) of the positions of
the nuclei, representing the potential energy of the system when the atoms are arranged in that
specific configuration. This function is translationally and rotationally invariant, and is usually
constructed from the relative positions of the atoms with respect to each other, rather than from
the absolute positions.
Forces are then derived as minus the gradients of the potential with respect to the atomic positions:

    F_i = -∇_{r_i} U(r_1, ..., r_N)    (3)

This form implies the presence of a conservation law for the total energy E = K + U, where K is the instantaneous kinetic energy:

    K = Σ_i (1/2) m_i v_i^2    (4)

where v_i is the velocity of particle i.
The simplest choice for U is to write it as a sum of pairwise interactions:
    U(r_1, ..., r_N) = Σ_i Σ_{j>i} φ(|r_i - r_j|)    (5)
The clause j > i in the second summation has the purpose of considering each atom pair only
once. In the past most potentials were constituted by pairwise interactions, but this is no longer
the case. It has been recognized that the two-body approximation is very poor for many relevant
systems, such as metals and semiconductors. Various kinds of many-body potentials are now in
common use in condensed matter simulation. The development of accurate potentials represents
an important research line.
In this course, however, we will take into account only the most commonly used pairwise
interaction model: the Lennard-Jones pair potential, which has been shown to correctly describe monoatomic liquid systems such as liquid argon, as we will see in the next section.
2.1.1 Lennard-Jones potential
The Lennard-Jones 12-6 potential (LJ) is given by the expression
    φ_LJ(r) = 4ε [ (σ/r)^12 - (σ/r)^6 ]    (6)

for the interaction potential between a pair of atoms. The total potential of a system containing many atoms is then given by (5). This potential has an attractive tail at large r (i.e. a negative value, since we put the arbitrary zero of the energy at the value of the potential when the two particles are infinitely far apart), reaches a minimum around r = 1.122σ, and is strongly repulsive at shorter distances, passing through 0 at r = σ and increasing steeply as r is decreased further (Fig. 1).

Figure 1: The Lennard-Jones 12-6 potential given by (6).
The term ~1/r^12, dominating at short distance, models the repulsion between atoms when they are brought very close to each other. Its physical origin is related to the Pauli principle: when the electronic clouds surrounding the atoms start to overlap, the energy of the system
increases abruptly. The exponent 12 was chosen exclusively on a practical basis: equation (6)
is particularly easy to compute. In fact, on physical grounds an exponential behavior would be
more appropriate.
The term ~1/r^6, dominating at large distance, constitutes the attractive part. This is the term that gives cohesion to the system. A 1/r^6 attraction originates from van der Waals dispersion forces, caused by dipole-dipole interactions due in turn to fluctuating dipoles. These are rather weak interactions, which however dominate the bonding character of closed-shell systems, that is, rare gases such as Ar or Kr. Therefore, these are the materials that a LJ potential can mimic fairly well^2. The parameters ε and σ are chosen to fit the physical properties of the material.

^2 Many-body effects in the interaction are, however, also present in these systems and, in all cases, pair potentials more accurate than LJ have been developed for rare gases.
On the other hand, a LJ potential is not at all adequate to model situations with open shells,
where strong localized bonds may form (as in covalent systems), or where there is a delocalized
"electron sea" in which the ions sit (as in metals). In these systems the two-body interaction scheme itself fails very badly. We will not deal with these cases further.
However, regardless of how well it is able to model actual physical systems, the LJ 12-6 potential nowadays constitutes an extremely important model system. There is a vast body
of papers that investigated the behavior of atoms interacting via LJ on a variety of different
geometries (solids, liquids, surfaces, clusters, two-dimensional systems, etc). One could say that
LJ is the standard potential to use for all the investigations where the focus is on fundamental
issues, rather than studying the properties of a specific material. The simulation work done on LJ
systems helped us (and still does) to understand basic points in many areas of condensed matter physics, and consequently of biophysics, and for this reason the importance of LJ cannot be overstated.
When using the LJ potential in simulations, it is customary to work in a system of units where σ = 1 and ε = 1. All the codes in this course follow this convention.
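As an illustration of eqs. (5)-(6) in these reduced units, a minimal sketch of a pair energy/force routine might look as follows; the function name and calling convention are our own choice, not taken from the course codes.

/* Lennard-Jones 12-6 pair interaction in reduced units (sigma = epsilon = 1).
   Given the squared separation r2, return the pair energy and store in
   *fscale the factor such that the force on atom i due to atom j is
   F_i = fscale * (r_i - r_j), i.e. fscale = 24 (2 r^-14 - r^-8). */
double lj_pair(double r2, double *fscale)
{
    double ir2  = 1.0 / r2;
    double ir6  = ir2 * ir2 * ir2;
    double ir12 = ir6 * ir6;

    *fscale = 24.0 * (2.0 * ir12 - ir6) * ir2;
    return 4.0 * (ir12 - ir6);
}

Working with the squared distance avoids a square root in the inner loop, which is where an MD code spends most of its time.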
2.1.2 Potential truncation and long-range corrections
The potential in eq. (6) has an infinite range. In practical applications, it is customary to establish
a cutoff radius Rc and disregard the interactions between atoms separated by more than Rc. This results in simpler programs and enormous savings of computer resources, because the number of atomic pairs separated by a distance r grows as r^2 and quickly becomes huge.
A simple truncation of the potential creates a new problem though: whenever a particle pair
"crosses" the cutoff distance, the energy makes a little jump. A large number of these events is likely to spoil energy conservation in a simulation. To avoid this problem, the potential is often shifted in order to vanish at the cutoff radius^3:

    φ(r) = φ_LJ(r) - φ_LJ(R_c)  if r ≤ R_c ;    φ(r) = 0  if r > R_c    (7)
Physical quantities are of course affected by the potential truncation. The effects of truncating a full-ranged potential can be approximately estimated by treating the system as a uniform
(constant density) continuum beyond Rc. For a bulk system (periodic boundary conditions along
each direction, see next section) this usually amounts to a constant additive correction. For example, the potential tail (attractive) brings a small additional contribution to the cohesive energy,
and to the total pressure.
Commonly used truncation radii for the LJ potential are 2.5σ and 3.2σ. It should be mentioned that these truncated Lennard-Jones models are so popular that they have acquired a value of their own as reference models for generic two-body systems (as also discussed at the end of the
previous section). In many cases, there is no interest in evaluating truncation corrections because
the truncated model itself is the subject of the investigation. This is not our case, and we will use
long-range correction terms in our code. Moreover, we choose to adopt the smaller commonly
used truncation radius (2.5σ).
^3 Shifting eliminates the energy discontinuity, but not the force discontinuity. Some researchers altered the form
of the potential near Rc to obtain a smoother junction, but there is no standard way to do that.
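A minimal sketch of the truncated and shifted potential of eq. (7), again in reduced units and with our own naming; the course codes may organize this differently.

/* Lennard-Jones energy truncated and shifted at the cutoff Rc, eq. (7),
   in reduced units.  The constant shift removes the jump in the energy at
   the cutoff; the force is unaffected by it. */
#define RCUT 2.5                        /* cutoff radius in units of sigma */

static double lj_energy(double r2)      /* plain 4 (1/r^12 - 1/r^6) */
{
    double ir6 = 1.0 / (r2 * r2 * r2);
    return 4.0 * ir6 * (ir6 - 1.0);
}

double lj_shifted(double r2)
{
    double rc2 = RCUT * RCUT;
    if (r2 > rc2) return 0.0;
    return lj_energy(r2) - lj_energy(rc2);
}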
2.2 Periodic boundary conditions
What should we do at the boundaries of our simulated system? One possibility is doing nothing
special: the system simply terminates, and atoms near the boundary would have fewer neighbors
than atoms inside. In other words, the sample would be surrounded by surfaces.
Unless we really want to simulate a cluster of atoms, this situation is not realistic. No matter
how large the simulated system is, its number of atoms N would be negligible compared with
the number of atoms contained in a macroscopic piece of matter (of the order of 10^23), and the ratio between the number of surface atoms and the total number of atoms would be much larger than in reality, causing surface effects to be much more important than they should be.
A solution to this problem is to use periodic boundary conditions (PBC). When using PBC,
particles are enclosed in a box, and we can imagine that this box is replicated to infinity by rigid
translation in all the three Cartesian directions, completely filling the space. In other words, if one
of our particles is located at position r in the box, we assume that this particle really represents
an infinite set of particles located at
r + la + mb + nc,
(l, m, n = −∞, ∞),
where l, m, n are integer numbers, and we have indicated with a, b, c the vectors corresponding
to the edges of the box. All these "image" particles move together, and in fact only one of them
is represented in the computer program.
The key point is that now each particle i in the box should be thought of as interacting not only
with other particles j in the box, but also with their images in nearby boxes. That is, interactions
can "go through" box boundaries. In fact, one can easily see that (a) we have virtually eliminated
surface effects from our system, and (b) the position of the box boundaries has no effect (i.e. a
translation of the box with respect to the particles leaves the forces unchanged).
Apparently, the number of interacting pairs increases enormously as an effect of PBC. In
practice, this is not true, because potentials like the LJ one usually have a short interaction range.
The minimum image criterion discussed next simplifies things further, reducing to a minimum
the level of additional complexity introduced in a program by the use of PBC.
Finally, it is worth pointing out that even if PBC allow us to describe an infinite system of particles that approaches a real system more and more closely, the PBC system is still not an exact representation of reality: in fact, PBC introduce an artificial symmetry, translational invariance,
not present in the real system.
2.2.1 The minimum image criterion
Let us suppose that we are using a potential with a finite range, like the LJ one: when separated by a distance equal to or larger than a cutoff distance Rc, two particles do not interact with each other.
Let us also suppose that we are using a box whose size is larger than 2Rc along each Cartesian
direction.
When these conditions are satisfied, it is obvious that at most one among all the pairs formed
by a particle i in the box and the set of all the periodic images of another particle j will interact.
To demonstrate this, let us suppose that i interacts with two images j1 and j2 of j. Two images must be separated by a translation vector bringing one box into another, whose length is at least 2Rc by hypothesis. In order to interact with both j1 and j2, i should be within a distance Rc from each of them, which is impossible since they are separated by more than 2Rc.
When we are in these conditions, we can safely use the minimum image criterion: among
all possible images of a particle j, select the closest, and throw away all the others. In fact, only
the closest is a candidate to interact: all the others certainly do not.
These operating conditions greatly simplify the setup of an MD program, and are commonly
used. Of course, one must always make sure that the box size is at least 2Rc along all the
directions where PBCs are in effect.
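For a cubic box, the minimum image separation along one Cartesian direction can be obtained with a single rounding operation; this is a sketch under the assumption box ≥ 2Rc discussed above, with a function name of our own.

#include <math.h>

/* Minimum-image component of the separation dx = x_i - x_j for periodic
   boundary conditions with box length box along this axis.  Valid as long
   as box >= 2*Rc, as discussed above. */
double minimum_image(double dx, double box)
{
    return dx - box * rint(dx / box);   /* subtract the nearest whole number of box lengths */
}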
2.3 Time integration algorithm
The engine of a molecular dynamics program is its time integration algorithm, required to integrate the equations of motion of the interacting particles and follow their trajectory. Time integration algorithms are based on finite difference methods, where time is discretized on a finite
grid, the time step ∆t being the distance between consecutive points on the grid. Knowing the
positions and some of their time derivatives at time t (the exact details depend on the type of
algorithm), the integration scheme gives the same quantities at a later time t + ∆t. By iterating
the procedure, the time evolution of the system can be followed for long times.
Of course, these schemes are approximate and there are errors associated with them. In
particular, one can distinguish between
• Truncation errors, related to the accuracy of the finite difference method with respect to the
true solution. Finite difference methods are usually based on a Taylor expansion truncated
at some term, hence the name. These errors do not depend on the implementation: they
are intrinsic to the algorithm.
• Round-off errors, related to the particular implementation of the algorithm, for instance the finite number of digits used in computer arithmetic.
Both errors can be reduced by decreasing ∆t. For large ∆t truncation errors dominate, but
they decrease quickly as ∆t is decreased. For instance, the Verlet algorithm discussed in the
next section has a truncation error proportional to ∆t^4 for each integration time step. Round-off errors decrease more slowly with decreasing ∆t, and dominate in the small ∆t limit. Using 64-bit precision (corresponding to "double precision" on the majority of today's workstations) helps
to keep round-off errors at a minimum.
The popular Verlet integration method, which will be adopted in the codes of this course, is presented in the section below. For more detailed information on time integration algorithms, the interested reader is referred to ref. [7] for a general survey, and to ref. [8] for a
deeper analysis.
2.4 The Verlet algorithm
In molecular dynamics, the most commonly used time integration algorithm is probably the
so-called Verlet algorithm [2]. The basic idea is to write two third-order Taylor expansions
for the positions r(t), one forward and one backward in time. Calling v the velocities, a the
accelerations, and b the third derivatives of r with respect to t, one has:
    r(t + ∆t) = r(t) + v(t)∆t + (1/2)a(t)∆t^2 + (1/6)b(t)∆t^3 + O(∆t^4)    (8)
    r(t - ∆t) = r(t) - v(t)∆t + (1/2)a(t)∆t^2 - (1/6)b(t)∆t^3 + O(∆t^4)    (9)
Adding the two expressions gives
    r(t + ∆t) = 2r(t) - r(t - ∆t) + a(t)∆t^2 + O(∆t^4)    (10)
This is the basic form of the Verlet algorithm. Since we are integrating Newton’s equations, a(t)
is just the force divided by the mass, and the force is in turn a function of the positions r(t):
    a(t) = -(1/m) ∇U(r(t))    (11)
As one can immediately see, the truncation error of the algorithm when evolving the system by
∆t is of the order of ∆t^4, even if third derivatives do not appear explicitly. This algorithm is at
the same time simple to implement, accurate and stable, explaining its large popularity among
molecular dynamics simulators.
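A minimal sketch of one step of the basic Verlet update, eq. (10), for a flat array of coordinates; the names are ours and the accelerations are assumed to have been computed from the current positions.

/* One step of the basic Verlet algorithm, eq. (10), for n scalar coordinates.
   r     : positions at time t
   r_old : positions at time t - dt (overwritten with the positions at time t)
   a     : accelerations F/m at time t */
void verlet_step(int n, double *r, double *r_old, const double *a, double dt)
{
    for (int i = 0; i < n; i++) {
        double r_new = 2.0 * r[i] - r_old[i] + a[i] * dt * dt;
        r_old[i] = r[i];    /* current positions become the "old" ones */
        r[i]     = r_new;
    }
}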
2.4.1 Velocity Verlet algorithm
A problem with this version of the Verlet algorithm is that velocities are not directly generated.
While they are not needed for the time evolution, their knowledge is sometimes necessary. Moreover, they are required to compute the kinetic energy K, whose evaluation is necessary to test
the conservation of the total energy E = K + U.^4 This is one of the most important tests to verify that an MD simulation is proceeding correctly. One could compute the velocities from the positions by using

    v(t) = [r(t + ∆t) - r(t - ∆t)] / (2∆t)    (12)
However, the error associated with this expression is of order ∆t^2 rather than ∆t^4. In fact, from (8) and (9) we have:

    v(t) = [r(t + ∆t) - r(t - ∆t)] / (2∆t) + (1/6)b(t)∆t^2 + O(∆t^3)    (13)
To overcome this difficulty, some variants of the Verlet algorithm have been developed. They
give rise to exactly the same trajectory, and differ in what variables are stored in memory and
at what times. For example, an even better implementation of the same basic algorithm is the
so-called velocity Verlet scheme, where positions, velocities and accelerations at time t + ∆t are
obtained from the same quantities at time t in the following way:
    r(t + ∆t) = r(t) + v(t)∆t + (1/2)a(t)∆t^2
    v(t + ∆t/2) = v(t) + (1/2)a(t)∆t
    a(t + ∆t) = -(1/m) ∇U(r(t + ∆t))
    v(t + ∆t) = v(t + ∆t/2) + (1/2)a(t + ∆t)∆t
Note how we need 9N memory locations to save the N positions, velocities and accelerations,
but we never need to have simultaneously stored the values at two different times for any one of
these quantities.

^4 One of the most important features that makes the Verlet algorithm so popular is the fact that the conservation of energy is very well respected, even at large time steps.
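A possible arrangement of the velocity Verlet update as a sketch; compute_forces is a placeholder for whatever routine fills the acceleration array from the current positions, and the flat-array layout is an assumption of this sketch.

/* One velocity Verlet step for n scalar coordinates (placeholder names).
   compute_forces(r, a, n) is assumed to overwrite a[] with the accelerations
   -(1/m) dU/dr evaluated at the current positions. */
void velocity_verlet_step(int n, double *r, double *v, double *a, double dt,
                          void (*compute_forces)(const double *, double *, int))
{
    for (int i = 0; i < n; i++) {
        r[i] += v[i] * dt + 0.5 * a[i] * dt * dt;   /* full position update     */
        v[i] += 0.5 * a[i] * dt;                    /* first half of velocity   */
    }
    compute_forces(r, a, n);                        /* a(t + dt) from r(t + dt) */
    for (int i = 0; i < n; i++)
        v[i] += 0.5 * a[i] * dt;                    /* second half of velocity  */
}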
2.4.2 Leap-frog algorithm
In the codes developed in this course another variant of the Verlet algorithm will be employed,
which is very similar to the previous one: the so-called leap-frog algorithm. Leap-frog integration is equivalent to calculating positions and velocities at interleaved time points, interleaved in such a way that they "leapfrog" over each other. For example, the position is known at integer time steps and the velocity is known at integer-plus-half time steps. The equations for the leap-frog algorithm can be written
    r(t + ∆t) = r(t) + v(t + ∆t/2)∆t    (14)
    v(t + ∆t/2) = v(t - ∆t/2) + a(t)∆t    (15)
Note that at the start (t = 0) of the leap-frog iteration, at the first step, computing v(∆t/2) from eq. (15) with t = 0 requires the velocity at t = -∆t/2. At first sight this could give problems, because the initial conditions are known only at the initial time (t = 0). Usually, the initial velocity is used in place of the v(-∆t/2) term: the error on the first time step is then of order O(∆t^2). This is not considered a problem because, in a simulation over a large number of time steps, the error on the first time step is only a negligibly small fraction of the total error, which at time t = n∆t is of the order O(e^{Ln∆t} ∆t^2).
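The corresponding leap-frog step, eqs. (14)-(15), can be sketched as follows; v is assumed to hold the half-step velocities v(t - ∆t/2) on entry and v(t + ∆t/2) on exit, and the names are our own.

/* One leap-frog step, eqs. (14)-(15), for n scalar coordinates.
   On entry: r at time t, v at time t - dt/2, a at time t.
   On exit : r at time t + dt, v at time t + dt/2. */
void leapfrog_step(int n, double *r, double *v, const double *a, double dt)
{
    for (int i = 0; i < n; i++) {
        v[i] += a[i] * dt;      /* v(t + dt/2) = v(t - dt/2) + a(t) dt */
        r[i] += v[i] * dt;      /* r(t + dt)   = r(t) + v(t + dt/2) dt */
    }
}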
2.4.3 Error terms
The local error in position of the Verlet integrator is O(∆t^4), as described above, and the local error in velocity is O(∆t^2).
The global error in position, in contrast, is O(∆t^2), and the global error in velocity is also O(∆t^2). These can be derived by noting the following:

    error(r(t_0 + ∆t)) = O(∆t^4)
and from (10)
    r(t_0 + 2∆t) = 2r(t_0 + ∆t) - r(t_0) + a(t_0 + ∆t)∆t^2 + O(∆t^4).
Therefore
    error(r(t_0 + 2∆t)) = 2 error(r(t_0 + ∆t)) + O(∆t^4) = 3 O(∆t^4).
In deriving the above equation, note that the errors on r(t_0 + ∆t) and r(t_0) must not be added in absolute value, as one does with "random" errors; the minus sign has to be kept, since the error of the first term depends on the error of the second according to a precise functional relation, and consequently they always go in "the same direction". Similarly to the previous calculation, we have:

    error(r(t_0 + 3∆t)) = 6 O(∆t^4)
    error(r(t_0 + 4∆t)) = 10 O(∆t^4)
    error(r(t_0 + 5∆t)) = 15 O(∆t^4),
which can be generalized by induction to
    error(r(t_0 + n∆t)) = [n(n + 1)/2] O(∆t^4).
If we consider the global error in position between r(t) and r(t + T), where T = n∆t, it is clear that

    error(r(t_0 + T)) = ( T^2/(2∆t^2) + T/(2∆t) ) O(∆t^4),
and therefore, the global (cumulative) error over a constant interval of time is given by
    error(r(t_0 + T)) = O(∆t^2).
Because the velocity is determined in a non-cumulative way^5 from the positions in the Verlet integrator, the global error in velocity is also O(∆t^2).
In molecular dynamics simulations, the global error is typically far more important than the
local error, and the Verlet integrator is therefore known as a second-order integrator.
^5 In the Verlet algorithm the velocity plays no part in the integration of the equations of motion. However,
velocities often are necessary for the calculation of certain physical quantities like the kinetic energy. This deficiency
can either be dealt with using the Velocity Verlet (or Leap-frog) algorithm, or estimating the velocity using the
position terms by equation (12), which has a local error O(∆t^2). Since from this equation the velocity depends only on the positions and not on the velocities evaluated at previous times, the expression "non-cumulative way" is used
for the determination of the velocity in the Verlet algorithm.
3 Simple statistical quantities
Measuring quantities in molecular dynamics usually means performing time averages of physical
properties over the system trajectory. Physical properties are usually a function of the particle
coordinates and velocities. So, for instance, one can define the instantaneous value of a generic
physical property A at time t:
    A(t) = f(r_1(t), ..., r_N(t), v_1(t), ..., v_N(t))
and then obtain its average
    ⟨A⟩ = (1/N_T) Σ_{t=1}^{N_T} A(t)

where t is an index which runs over the time steps from 1 to the total number of steps N_T.
There are two equivalent ways to do this in practice:
1. A(t) is calculated at each time step by the MD program while running. The sum Σ_t A(t)
is also updated at each step. At the end of the run the average is immediately obtained by
dividing by the number of steps. This is the preferred method when the quantity is simple
to compute and/or particularly important. An example is the system temperature.
2. Positions (and possibly velocities) are periodically dumped in a ”trajectory file” while the
program is running. A separate program, running after the simulation program, processes
the trajectory and computes the desired quantities. This approach can be very demanding
in terms of disk space: dumping positions and velocities using 64-bit precision takes 48
bytes per step and per particle. However, it is often used when the quantities to compute
are complex, or combine different times as in dynamical correlations, or when they are
dependent on other additional parameters that may be unknown when the MD program is
run. In these cases, the simulation is run once, but the resulting trajectory can be processed
over and over.
Our codes will give examples of both strategies.
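A minimal sketch of the first strategy: an accumulator updated once per step and read out at the end of the run (the type and function names are invented for illustration).

/* Simple accumulator for the time average <A> (strategy 1 above). */
typedef struct { double sum; long count; } Accumulator;

void acc_add(Accumulator *acc, double value)   /* call once per time step */
{
    acc->sum += value;
    acc->count += 1;
}

double acc_mean(const Accumulator *acc)        /* <A> at the end of the run */
{
    return acc->count > 0 ? acc->sum / acc->count : 0.0;
}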
3.1 Potential energy
The average potential energy U is obtained by averaging its instantaneous value, which is usually
obtained straightforwardly at the same time as the force computation is made. For instance, in
the case of two-body interactions
    U(t) = Σ_i Σ_{j>i} φ(|r_i(t) - r_j(t)|)    (16)
Even if not strictly necessary to perform the time integration (forces are all that is needed), knowledge of U is required to verify energy conservation. This is an important check to do in any MD
simulation.
We mentioned in the previous chapter that when one deals with short-range interactions, the
potential and force are usually truncated at some cutoff distance Rc for computational expediency. Consequently the effective potential in our case will be:
    φ(r_ij) = φ_LJ(r_ij)  if r_ij ≤ R_c ;    φ(r_ij) = 0  if r_ij > R_c    (17)
Such a truncation implies that certain physical quantities, like the potential energy and the pressure, have to be corrected in order to obtain a value closer to the one that would be obtained with the exact potential. An approximate correction can be obtained by assuming that the spatial correlations beyond the cutoff distance are unity. The student is referred to Refs. [7] and [9] for these so-called "standard long-range corrections" (sLRC). It should be noted that this is not a good assumption in inhomogeneous fluids. We will use the following expression for the long-range correction to be applied to each term of the outermost sum in (16):
    φ_LRC = (1/2) 4πρ ∫_{R_c}^{∞} dr_ij r_ij^2 φ_LJ(r_ij) = 8περσ^3 [ (1/9)(σ/R_c)^9 - (1/3)(σ/R_c)^3 ]    (18)

where ρ is the bulk number density (N/V).
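Putting eqs. (16)-(18) together, a sketch of the instantaneous potential energy with cutoff and tail correction could look as follows; reduced units, a cubic box with the minimum image convention, and all function and argument names are assumptions of this sketch.

#include <math.h>

/* Instantaneous potential energy, eqs. (16)-(18): truncated LJ pair sum plus
   the standard long-range (tail) correction added once per atom.
   Reduced units (sigma = epsilon = 1), cubic box of side L with periodic
   boundaries; x, y, z are coordinate arrays of length n, rho = n / L^3. */
double potential_energy(int n, const double *x, const double *y, const double *z,
                        double L, double Rc, double rho)
{
    const double pi = 3.14159265358979323846;
    double rc2 = Rc * Rc, U = 0.0;

    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            dx -= L * rint(dx / L);              /* minimum image convention */
            dy -= L * rint(dy / L);
            dz -= L * rint(dz / L);
            double r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < rc2) {
                double ir6 = 1.0 / (r2 * r2 * r2);
                U += 4.0 * ir6 * (ir6 - 1.0);    /* phi_LJ, eq. (17) */
            }
        }
    }
    /* Tail correction, eq. (18), one term per atom of the outer sum. */
    double rc3 = Rc * Rc * Rc;
    double u_tail = 8.0 * pi * rho * (1.0 / (9.0 * rc3*rc3*rc3) - 1.0 / (3.0 * rc3));
    return U + n * u_tail;
}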
3.2 Kinetic energy
The instantaneous kinetic energy is of course given by
    K(t) = (1/2) Σ_i m_i [v_i(t)]^2    (19)
and is therefore extremely easy to compute. We will call K its average on the run.
3.3 Total energy
The total energy E = K + U is a conserved quantity in Newtonian dynamics. However, it is
common practice to compute it at each time step in order to check that it is indeed constant with
time. In other words, during the run energy flows back and forth between kinetic and potential,
causing K(t) and U (t) to fluctuate while their sum remains fixed.
In practice, there can be small fluctuations in the total energy, typically of the order of, say, one part in 10^4 or less. These fluctuations are usually caused by errors in the time integration (see
section 2.3), and can be reduced in magnitude by reducing the time step if considered excessive.
Ref. [8] contains an in-depth analysis of total energy fluctuations using various time integration
algorithms.
Slow drifts of the total energy are also sometimes observed in very long runs. Such drifts
could also originate from an excessive ∆t. Drifts are more disturbing than fluctuations because
the thermodynamic state of the system is also changing together with the energy, and therefore
time averages over the run do not refer to a single state. If drifts over long runs tend to occur,
they can be prevented, for instance by breaking the long run into smaller pieces and restoring the
energy to the nominal value between one piece and the next. A common mechanism to adjust
the energy is to modify the kinetic energy via rescaling of velocities.
A final word of caution: while one may be tempted to achieve "perfect" energy conservation
by reducing the time step as much as desired, working with an excessively small time step may
result in waste of computer time. A practical compromise would probably allow for small energy
fluctuations and perhaps slow energy drifts, as a price to pay to work with a reasonably large ∆t.
See also [7].
3.4 Temperature
The temperature T is directly related to the kinetic energy by the well-known equipartition formula, assigning an average kinetic energy kB T /2 per degree of freedom:
    K = (3/2) N k_B T    (20)
An estimate of the temperature is therefore directly obtained from the average kinetic energy
K (see 3.2). For practical purposes, it is also common to define an "instantaneous temperature" T(t), proportional to the instantaneous kinetic energy K(t) by a relation analogous
to the one above.
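A sketch of eqs. (19)-(20) in reduced units (m = 1, kB = 1); the names and the flat-array layout are our own assumptions.

/* Instantaneous kinetic energy, eq. (19), and temperature, eq. (20),
   in reduced units (m = 1, kB = 1); vx, vy, vz are arrays of length n. */
double kinetic_energy(int n, const double *vx, const double *vy, const double *vz)
{
    double K = 0.0;
    for (int i = 0; i < n; i++)
        K += 0.5 * (vx[i]*vx[i] + vy[i]*vy[i] + vz[i]*vz[i]);
    return K;
}

double instantaneous_temperature(int n, double K)
{
    return 2.0 * K / (3.0 * n);   /* from K = (3/2) N kB T */
}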
3.5 Radial distribution function
One of the most important static properties that can be used to characterize liquids in general
and LJ liquids in particular is the radial distribution function (rdf) so defined:
    g(r) = [1 / (ρ 4πr^2 dr)] Σ_{ij} ⟨ δ(r - |r_i - r_j|) ⟩    (21)
where δ(x) is the function: δ(0) = 1, and δ(x) = 0 for x ≠ 0. The rdf is important for three
reasons:
1. for pairwise additive potentials such as the LJ one, knowledge of the rdf is sufficient information to
calculate thermodynamic properties, particularly the energy and pressure;
2. there are very well developed integral-equation theories that permit estimation of the rdf
for a given molecular model;
3. the rdf can be measured experimentally, using neutron-scattering techniques.
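In practice the rdf is estimated by histogramming pair distances over many stored configurations; the following is a sketch of the accumulation step (reduced units, cubic box with periodic boundaries; all names are our own), with the usual ideal-gas normalization indicated in a comment.

#include <math.h>

/* Accumulate one configuration into a radial distribution function histogram.
   hist has nbins bins of width dr covering 0 <= r < nbins*dr. */
void rdf_accumulate(int n, const double *x, const double *y, const double *z,
                    double L, int nbins, double dr, long *hist)
{
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = x[i]-x[j], dy = y[i]-y[j], dz = z[i]-z[j];
            dx -= L * rint(dx / L);
            dy -= L * rint(dy / L);
            dz -= L * rint(dz / L);
            double r = sqrt(dx*dx + dy*dy + dz*dz);
            int bin = (int)(r / dr);
            if (bin < nbins) hist[bin] += 2;   /* the pair counts for both i and j */
        }
    }
    /* After nconf configurations, bin k is normalized by the ideal-gas count:
       g(r_k) = hist[k] / (nconf * n * rho * (4/3) pi [(r_k + dr)^3 - r_k^3]). */
}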
3.6 Mean square displacement and velocity autocorrelation function
Two of the most important dynamic properties used to characterize liquids are the mean square displacement (MSD) and the velocity autocorrelation function (VACF).
The MSD of atoms in a simulation can be easily computed by its definition
    ∆r^2(t) = ⟨ |r(t) - r(0)|^2 ⟩    (22)

where ⟨...⟩ here denotes an average over all the atoms (or over all the atoms in a given subclass). When using periodic boundary conditions, care must be taken not to count the "jumps" made by particles when they are refolded into the box as contributions to diffusion.
The MSD contains information on the atomic diffusivity. If the system is solid, MSD saturates to a finite value, while if the system is liquid, MSD grows linearly with time. In this
case it is useful to characterize the system behavior in terms of the slope, which is the diffusion
coefficient D:
    D = lim_{t→∞} (1/2dt) ⟨ |r(t) - r(0)|^2 ⟩    (23)
where d is the dimensionality of the system (usually 2 or 3).
The VACF is defined as
    C(t) = ⟨ v(t) · v(0) ⟩    (24)
and it is related to D by the equation:
    D = (1/d) ∫_0^∞ dt ⟨ v(t) · v(0) ⟩    (25)
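A sketch of the MSD of eq. (22) computed from unwrapped coordinates stored at the time origin; the array names are assumptions of this sketch.

/* Mean square displacement, eq. (22), from unwrapped coordinates, i.e.
   coordinates that are not refolded into the box, so that the periodic
   "jumps" do not contaminate the diffusion estimate. */
double mean_square_displacement(int n,
                                const double *x,  const double *y,  const double *z,
                                const double *x0, const double *y0, const double *z0)
{
    double msd = 0.0;
    for (int i = 0; i < n; i++) {
        double dx = x[i]-x0[i], dy = y[i]-y0[i], dz = z[i]-z0[i];
        msd += dx*dx + dy*dy + dz*dz;
    }
    return msd / n;
}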
3.7 Pressure
The measurement of the pressure in a molecular dynamics simulation is based on the Clausius
virial function
    W(r_1, ..., r_N) = Σ_{i=1}^{N} r_i · F_i^TOT    (26)

where F_i^TOT is the total force acting on atom i. Its statistical average ⟨W⟩ will be obtained, as usual, as an average over the molecular dynamics trajectory:

    ⟨W⟩ = lim_{t→∞} (1/t) ∫_0^t dτ Σ_{i=1}^{N} r_i(τ) · m_i a_i(τ)
where use has been made of Newton’s law (1). Integrating by parts,
    ⟨W⟩ = - lim_{t→∞} (1/t) ∫_0^t dτ Σ_{i=1}^{N} m_i |v_i(τ)|^2.
This is twice the average kinetic energy, therefore by the equipartition law of statistical mechanics
    ⟨W⟩ = -dN k_B T    (27)
where k_B is the Boltzmann constant.
Now, one may think of the total force acting on a particle as composed of two contributions:
    F_i^TOT = F_i + F_i^EXT    (28)

where F_i is the internal force (arising from the interatomic interactions), and F_i^EXT is the external force exerted by the container's walls. If the particles are enclosed in a parallelepipedic container of sides L_x, L_y, L_z, volume V = L_x L_y L_z, and with the coordinate origin on one of its corners, the part ⟨W^EXT⟩ due to the container can be evaluated using the definition (26):

    ⟨W^EXT⟩ = L_x(-P L_y L_z) + L_y(-P L_x L_z) + L_z(-P L_x L_y) = -dP V

where -P L_y L_z is, for instance, the external force F_x^EXT applied by the yz wall along the x direction to particles located at x = L_x, etc. Eq. (27) can then be written
    ⟨ Σ_{i=1}^{N} r_i · F_i ⟩ - dP V = -dN k_B T

or

    P V = N k_B T + (1/d) ⟨ Σ_{i=1}^{N} r_i · F_i ⟩    (29)
This important result is known as the virial equation. All the quantities except the pressure P
are easily accessible in a simulation, and therefore (29) constitutes a way to measure P. Note how (29) reduces to the well-known equation of state of the perfect gas if the particles are non-interacting.
In the case of pairwise interactions via a potential φ(r), it is left as an exercise to the reader
to verify that (29) becomes
    P V = N k_B T - (1/d) ⟨ Σ_i Σ_{j>i} r_ij (dφ/dr)|_{r_ij} ⟩    (30)

This expression has the additional advantage over (29) of being naturally suited for use when periodic boundary conditions are present: it is sufficient to take them into account in the definition of r_ij.
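A sketch of how (30) can be used in practice: the pair virial Σ r_ij · F_ij is accumulated during the force loop and converted to a pressure at the end (reduced units, kB = 1; the fscale convention matches the earlier Lennard-Jones sketch in section 2.1.1, and the names are our own).

/* Pressure from the pair virial, eq. (30), in reduced units (kB = 1).
   virial_sum is the accumulated sum over pairs of r_ij . F_ij for one
   configuration; with F_ij = fscale * (r_i - r_j) it can be gathered in the
   force loop simply as  virial_sum += fscale * r2. */
double pressure(int n, double T, double V, double virial_sum)
{
    double rho = (double)n / V;
    return rho * T + virial_sum / (3.0 * V);
}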
As for the potential energy, the expression for the pressure also has to be corrected if a truncated potential is used in the numerical simulation. By using the same assumption mentioned in paragraph 3.1, one can derive the following correction, corresponding to the contribution of each pair ij beyond the cutoff, to be applied to the pressure formulas above:
    P_LRC = (1/d) (1/2) 4πρ^2 ∫_{R_c}^{∞} dr_ij r_ij^2 (r_ij · F_ij) = 8περ^2 σ^3 [ (4/9)(σ/R_c)^9 - (2/3)(σ/R_c)^3 ]    (31)
4 Computational aspects of MD

4.1 MD algorithms
Classical MD is commonly used for simulating the properties of liquids, solids, and molecules.
Each of the N atoms or molecules in the simulation is treated as a point mass and Newton’s
equations are integrated to compute their motion. From the motion of the ensemble of atoms
a variety of useful microscopic and macroscopic information can be extracted, such as transport coefficients, phase diagrams, and structural or conformational properties. The physics of the
model is contained in a potential energy functional for the system from which individual force
equations for each atom are derived.
MD simulations are typically not memory intensive since only vectors of atom information
are stored. Computationally, the simulations are "large" in two domains:
1. the number of atoms
2. the number of time steps
The length scale for atomic coordinates is Angstroms; in three dimensions many thousands or
millions of atoms must usually be simulated to approach even the submicron scale. In liquids and
solids the time step size is constrained by the demand that the vibrational motion of the atoms
be accurately tracked. This limits time steps to the femtosecond scale and so tens or hundreds of
thousands of time steps are necessary to simulate even picoseconds of ”real” time. Because of
these computational demands considerable effort has been expended by researchers to optimize
MD and even to build special-purpose hardware for performing MD simulations. The current
state-of-the-art is such that simulating ten- to hundred-thousand atom systems for nanoseconds
takes hours of CPU time on machines such as the Jugene or Juropa supercomputers at JSC.
Since MD computations are inherently parallel, as we will see in the next pages, there has been considerable effort by researchers in recent years to exploit this parallelism on various kinds of machines. Experience has shown that the message-passing model of programming^6
for multiple-instruction/multiple-data (MIMD) parallel machines is the only one that provides
enough flexibility to implement all the data structure and computational enhancements that are
commonly exploited in MD codes on serial (and vector) machines.
Increasingly efficient algorithms have been developed to perform specific kinds of classical MD. In this course we will focus on some of these algorithms, which are appropriate for
a general class of MD problems that has two salient characteristics:
1. The first characteristic is that forces are limited in range, meaning each atom interacts
only with other atoms that are geometrically nearby. Solids and liquids are often modeled
this way due to electronic screening effects or simply to avoid the computational cost of
including long-range Coulombic forces. For short-range MD, the computational effort per
time step scales as N , the number of atoms.
^6 MPI is a library specification for the message-passing model, which has become the de facto standard for message-passing programming.
2. The second characteristic is that the atoms can undergo large displacements over the duration of the simulation. This could be due to diffusion in a solid or liquid, or conformational
changes in a biological molecule. The important feature from a computational standpoint
is that each atom’s neighbors change as the simulation progresses. While the algorithms
we discuss could also be used for fixed-neighbor simulations (e.g. all atoms remain on
lattice sites in a solid), it is a harder task to continually track the neighbors of each atom
and maintain efficient O(N ) scaling for the overall computation on a parallel machine.
Moreover, one wants the algorithms to work well on problems with small numbers of atoms,
not just for large problems where parallelism is often easier to exploit. This is because the vast
majority of MD simulations are performed on systems of a few hundred to several thousand
atoms where N is chosen to be as small as possible while still accurate enough to model the
desired physical effects. The computational goal in these calculations is to perform each time
step as quickly as possible. This is particularly true in non-equilibrium MD where macroscopic
changes in the system may take significant time to evolve, requiring millions of time steps to
model. Thus, it is often more useful to be able to perform a 1,000,000 time step simulation of a
1000 atom system fast rather than 1000 time steps of a 1,000,000 atom system, though the O(N )
scaling means the computational effort is the same for both cases. To this end, we consider model
sizes as small as a few hundred atoms in this course.
On the other hand, for very large MD problems, it is important that the parallel algorithms
are scalable to larger and faster parallel machines, i.e. they scale optimally with respect to N and
P (the number of processors).
4.2 Long- vs short-range force model
The computational task in classical MD simulation is to integrate the set of coupled differential
equations (Newton’s equations) that in general can be written:
    m_i dv_i/dt = Σ_j F_2(r_i, r_j) + F_3(r_i, r_j, r_k) + ...    (32)

    dr_i/dt = v_i
where m_i is the mass of atom i, r_i and v_i are its position and velocity vectors, F_2 is a force function describing pairwise interactions between atoms, F_3 describes three-body interactions,
and many-body interactions can be added. The force terms are derivatives of energy expressions
(the potentials) in which the potential energy of atom i is typically written as a function of the
positions of itself and other atoms. In practice, only one or a few terms in equation (32) are kept, and F_2, F_3, etc. are constructed so as to include many-body and quantum effects. To the extent that the approximations are accurate, these equations give a full description of the time evolution of the system.^7
7 Thus, the great computational advantage of classical MD, as compared to so-called ab initio electronic structure calculations, is that the dynamic behavior of the atomic system is described empirically, without having to solve the computationally expensive but "exact" Schrödinger equation at each time step.
The force terms in equation (32) are typically non-linear functions of the distance rij between pairs of atoms and may be either long-range or short-range in nature. For long-range forces, such as Coulombic interactions in an ionic solid or a biological system, each atom interacts with all the others. Directly computing these forces scales as N² and is too costly for large N. Various approximate methods, valid under different conditions, overcome this difficulty. They include particle-mesh algorithms [10], which scale as f(M)N where M is the number of mesh points, hierarchical methods [11], which scale as N log(N), and fast-multipole methods [12], which scale as N. Modern parallel implementations of these algorithms have improved their range of applicability for many-body simulations, but because of their expense, long-range force models are not dealt with in this course.
By contrast, short-range force models, and in particular the Lennard-Jones one introduced in paragraph 2.1.1, are our main concern. They are chosen either because electronic screening effectively limits the range of influence of the interatomic forces being modeled, or simply to truncate the long-range interactions and lessen the computational load. In either case, the summations in equation (32) are restricted to atoms within some small region surrounding atom i. As we have seen in paragraph 2.1.2, this is typically implemented using a cutoff distance Rc, outside of which all interactions are ignored. The work to compute forces now scales linearly with N. Notwithstanding this savings, the vast majority of the computation time in a short-range force MD simulation is spent evaluating the force terms in equation (32), even when (as in the LJ model) only the F2 term has to be evaluated. The time integration typically requires only 2-3% of the total time. To evaluate the sums efficiently requires knowing which atoms are within the cutoff distance Rc at every time step. The key is to minimize the number of neighboring atoms that must be checked for possible interactions, since calculations performed on neighbors at a distance r > Rc are wasted computation. There are two basic techniques used to accomplish this on serial (and vector) machines. We discuss them since our parallel algorithms incorporate similar ideas.
4.3 Neighbor lists and link-cell method
The first idea for minimizing the number of neighboring atoms that must be checked for possible interactions is that of neighbor lists, originally proposed by Verlet [2]. For each atom, a list is maintained of nearby atoms. Typically, when the list is formed, all neighboring atoms within an extended cutoff distance Rs = Rc + δ are stored. The list is used for a few time steps to calculate all force interactions. Then it is rebuilt before any atom could have moved from a distance r > Rs to r < Rc. Though δ is always chosen to be small relative to Rc, its optimal value depends on the parameters (e.g. temperature, diffusivity, density) of the particular simulation. The advantage of the neighbor list is that, once it is built, examining it for possible interactions is much faster than checking all atoms in the system.
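To make the bookkeeping concrete, the following C sketch builds such a list with a simple O(N²) pair check, using the extended cutoff Rs and storing each pair only once; periodic boundaries are omitted for brevity, and the names (NeighborList, build_neighbor_list, etc.) are our own illustrative choices, not taken from any particular code.

#include <stdlib.h>

/* Illustrative Verlet neighbor-list build: all pairs within the extended cutoff Rs
   are stored; the list is then reused for several time steps before being rebuilt. */
typedef struct {
    int *idx;     /* flattened neighbor indices                       */
    int *first;   /* neighbors of atom i are idx[first[i]..first[i+1]-1] */
} NeighborList;

void build_neighbor_list(int n, const double (*r)[3], double Rs, NeighborList *nl)
{
    double Rs2 = Rs * Rs;
    int count = 0;
    nl->first = malloc((n + 1) * sizeof(int));
    nl->idx   = malloc((size_t)n * n * sizeof(int));   /* generous upper bound */

    for (int i = 0; i < n; i++) {
        nl->first[i] = count;
        for (int j = i + 1; j < n; j++) {              /* each pair stored once */
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            if (dx*dx + dy*dy + dz*dz < Rs2)           /* Rs = Rc + delta */
                nl->idx[count++] = j;
        }
    }
    nl->first[n] = count;
}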
The second technique commonly used for speeding up MD calculations is known as the link-cell method [13]. At every time step, all the atoms are binned into 3-D cells of side length ℓ, where ℓ = Rc or slightly larger. This reduces the task of finding the neighbors of a given atom to checking 27 bins: the bin the atom is in and the 26 surrounding ones. Since binning the atoms only requires O(N) work, the extra overhead associated with it is acceptable for the savings of only
having to check a local region for neighbors.
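A minimal C sketch of the binning step is given below, using the classic head/next linked-cell arrays; it assumes an orthorhombic box with coordinates already wrapped into [0, box), and the function and variable names are ours, for illustration only.

#include <math.h>
#include <stdlib.h>

/* Bin atoms into 3-D cells of side >= Rc.  head[c] holds one atom of cell c,
   next[i] chains the remaining atoms of that cell (-1 terminates the chain). */
void bin_atoms(int n, const double (*r)[3], const double box[3], double Rc,
               int ncell[3], int **head_out, int **next_out)
{
    for (int d = 0; d < 3; d++) {
        ncell[d] = (int)floor(box[d] / Rc);   /* cell side box[d]/ncell[d] >= Rc */
        if (ncell[d] < 1) ncell[d] = 1;
    }
    int ntot = ncell[0] * ncell[1] * ncell[2];
    int *head = malloc(ntot * sizeof(int));
    int *next = malloc(n * sizeof(int));
    for (int c = 0; c < ntot; c++) head[c] = -1;

    for (int i = 0; i < n; i++) {
        int ix = (int)(r[i][0] / box[0] * ncell[0]);
        int iy = (int)(r[i][1] / box[1] * ncell[1]);
        int iz = (int)(r[i][2] / box[2] * ncell[2]);
        int c  = (ix * ncell[1] + iy) * ncell[2] + iz;
        next[i] = head[c];                    /* push atom i onto cell c's list */
        head[c] = i;
    }
    *head_out = head;
    *next_out = next;
}

To find the neighbors of an atom one then scans only the 27 cells surrounding its own cell, following the next[] chain of each.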
The fastest serial short-range MD algorithms use a combination of neighbor lists and link-cell binning. In the combined method, atoms are only binned once every few time steps for the purpose of forming the neighbor lists. In this case atoms are binned into cells of size ℓ ≥ Rs. At intermediate time steps the neighbor lists alone are used in the usual way to find neighbors within a distance Rc of each atom. This is a significant savings over a conventional link-cell method, since there are far fewer atoms to check in a sphere of volume 4πRs³/3 than in a cube of volume 27Rc³. Additional savings can be gained due to Newton's 3rd law:

$$\mathbf{F}_{ij} = -\mathbf{F}_{ji} \qquad\qquad (33)$$

by only computing a force once for each pair of atoms (rather than once for each atom in the pair). In the combined method this is done by only searching half the surrounding bins of each atom to form its neighbor list. This has the effect of storing atom j in atom i's list, but not atom i in atom j's list, thus halving the number of force computations that must be done.
4.4 Parallel algorithms
In the last 20 years there has been considerable interest in devising parallel MD algorithms.
The natural parallelism in MD is that the force calculations and velocity/position updates can be
done simultaneously for all atoms. To date, two basic ideas have been exploited to achieve this
parallelism. The goal in each is to divide the force computations in equation (32) evenly across
the processors so as to extract maximum parallelism. All the algorithms that have been proposed
or implemented so far are variations on these two methods.
In the first class of methods a pre-determined set of force computations is assigned to each
processor. The assignment remains fixed for the duration of the simulation. The simplest
way of doing this is to give a subgroup of atoms to each processor. We call this method an
atom-decomposition of the workload, since the processor computes forces on its atoms no matter
where they move in the simulation domain. More generally, a subset of the force loops inherent
in equation (32) can be assigned to each processor. We term this a force-decomposition. Both
of these decompositions are analogous to a "Lagrangian gridding" in a fluids simulation where the grid cells (computational elements) move with the fluid (atoms in MD). By contrast, in the
second general class of methods, which we call a spatial-decomposition of the workload, each
processor is assigned a portion of the physical simulation domain. Each processor computes only
the forces on atoms in its sub-domain. As the simulation progresses, processors exchange atoms
as they move from one sub-domain to another. This is analogous to an "Eulerian gridding" for a fluids simulation where the grid remains fixed in space as the fluid moves through it.
Within the two classes of methods for parallelization of MD, a variety of algorithms have
been proposed and implemented by various researchers. The details of the algorithms vary
widely from one parallel machine to another since there are numerous problem-dependent and
machine-dependent trade-offs to consider, such as the relative speeds of computation and communication.
Atom-decomposition methods, also called replicated-data methods because identical copies of the atom information are stored on all processors, have often been used in short-range MD simulations of molecular systems. This is because the duplication of information makes for straightforward computation of additional three- and four-body force terms. Parallel implementations of older biological MD programs such as CHARMM [14] and GROMOS [15] use this technique.
Spatial-decomposition methods, also called geometric methods, are now more common in the literature because they are well suited to very large MD simulations and to simulations involving long-range interactions. More recent parallel implementations of state-of-the-art biological MD programs such as AMBER [16], GROMACS [17] and NAMD [18] use this technique.
5 Computational aspects of our model
5.1 Reduced units
Unit systems are invented to make physical laws look simple and numerical calculations easy.
Take Newton's law: F = ma. In the SI unit system, this means that if an object of mass x (kg) is undergoing an acceleration of y (m/s²), the force on the object must be xy (N).
However, there is nothing intrinsically special about the SI unit system. One (kg) is simply the mass of a platinum-iridium prototype in a vacuum chamber in Paris. If one wishes, one can define his or her own mass unit, e.g. (k̃g), which, say, is 1/7 of the mass of the Paris prototype: 1 (kg) = 7 (k̃g).
If (k̃g) is one's choice of the mass unit, what about the rest of the unit system? One really has to make a decision here: either keep all the other units unchanged and only make the (kg) → (k̃g) transition, or change some other units along with the (kg) → (k̃g) transition.
Imagine making the first choice, i.e. keeping all the other units of the SI system unchanged, including the force unit (N), and only changing the mass unit from (kg) to (k̃g). That is all right, except that in the new unit system Newton's law must be re-expressed as F = ma/7, because if an object of mass 7x (k̃g) is undergoing an acceleration of y (m/s²), the force on the object is xy (N).
There is nothing inherently wrong with the F = ma/7 expression, which is just a recipe
for computation — a correct one for the newly chosen unit system. Fundamentally, F = ma/7
and F = ma describe the same physical law. But it is true that F = ma/7 is less elegant than
F = ma. No one likes to memorize extra constants if they can be reduced to unity by a sensible
choice of units. The SI unit system is sensible, because (N) is picked to work with other SI units
to satisfy F = ma.
How may we have a sensible unit system but with (k̃g) as the mass unit? Simple: just define (Ñ) = (N)/7 as the new force unit. The (m)-(s)-(k̃g)-(Ñ) unit system is sensible because the simplest form of F = ma is preserved. Thus we see that when a certain unit in a sensible unit system is altered, other units must also be altered correspondingly in order to constitute a new sensible unit system, which keeps the algebraic forms of all fundamental physical laws unaltered8.

8 A notable exception is the conversion between the SI and Gaussian unit systems in electrodynamics, during which a non-trivial factor of 4π comes up.
In science people have formed deep-rooted conventions about the simplest algebraic forms of physical laws, such as F = ma, K = mv²/2, E = K + U, etc. Although nothing forbids one from modifying the constant coefficients in front of each expression, one is better off not doing so. Fortunately, as long as one uses a sensible unit system, these algebraic expressions stay invariant.
Now, imagine we derive a certain composite law from a set of simple laws. On one side,
we start with and consistently use a sensible unit system A. On the other side, we start with
and consistently use another sensible unit system B. Since the two sides use exactly the same
algebraic forms, the resultant algebraic expression must also be the same, even though for a
given physical instance, a variable takes on two different numerical values on the two sides as
different unit systems are adopted. This means that the final algebraic expression describing the physical phenomenon must satisfy a certain concerted scaling invariance with respect to its dependent variables, corresponding to any feasible transformation between sensible unit systems. This strongly limits the form of the possible algebraic expressions describing physical phenomena, which is the basis of dimensional analysis.
In the MD world, there is also another reason why people prefer to choose "non-conventional" units. Typical simulation sizes in molecular dynamics are very small, up to about 1000 atoms. As a consequence, most of the extensive quantities are small in magnitude when measured in macroscopic units. There are two possibilities to overcome this problem: either work with atomic-scale units (ps, a.m.u., nm) or make all the observable quantities dimensionless with respect to their characteristic values. The second approach is more popular. The scaling is done with the model parameters, e.g. in the case of the LJ model: size σ, energy ε, mass m. So the common recipe is: one chooses a value of the atom/molecule pair potential arbitrarily (ε) and then the other quantities (say the total energy E) are given in terms of this reference value (E* = E/ε). The other quantities are scaled similarly. For example, one defines the dimensionless
$$\begin{aligned}
\text{distance}\quad & r^* = \frac{r}{\sigma}, & (34)\\
\text{energy}\quad & E^* = \frac{E}{\epsilon}, & (35)\\
\text{temperature}\quad & T^* = \frac{k_B T}{\epsilon}, & (36)\\
\text{pressure}\quad & P^* = \frac{P \sigma^3}{\epsilon}, & (37)\\
\text{time}\quad & t^* = \frac{t}{\sigma\sqrt{m/\epsilon}}, & (38)\\
\text{force}\quad & F^* = \frac{F \sigma}{\epsilon}, & (39)\\
\text{density}\quad & \rho^* = \frac{\rho \sigma^3}{m} & (40)
\end{aligned}$$
and so on.
These are the so-called reduced units for the LJ potential model that we will adopt in our
codes.
If we write the LJ potential (6) in dimensionless form,

$$\phi^*_{LJ}(r^*) = 4\left[\left(\frac{1}{r^*}\right)^{12} - \left(\frac{1}{r^*}\right)^{6}\right] \qquad\qquad (41)$$
we see that it is parameter independent; consequently, all the properties must also be parameter independent. If a potential only has a couple of parameters, this scaling has a lot of advantages: the potential evaluation can be very efficient in reduced units and, as the results are always the same, they can be transferred to different systems with straightforward
scaling by using the model parameters σ, ε and m. This is equivalent to selecting unit values for the parameters, and it is convenient to report system properties in this form, e.g. P*(ρ*).
For reference, Tab. 1 shows the values of the LJ model parameters for some noble gases, which are well described by such a physical model.

         σ (Å)    ε/kB (K)   m (a.m.u.)
    Ne   2.78     37.29      20.18
    Ar   3.405    119.8      39.95
    Kr   3.680    166.6      83.80
    Xe   3.924    257.4      131.29

Table 1: LJ model parameters for some noble gases.
5.2 Simulation parameters
As mentioned several times in the previous chapters, we will study a model of N identical point-like atoms in a box with periodic boundary conditions applied in all directions, where the atoms interact with each other only through the Lennard-Jones pairwise potential (41) in dimensionless reduced units. The (negative) derivative of (41) with respect to the relative distance r between two particles gives the F2 term in equation (32) (F3 and higher-order terms are ignored).
The N atoms are simulated in a 3D parallelepiped with periodic boundary conditions at the
Lennard-Jones state point defined by the reduced density:
ρ∗ = 0.8442
and reduced temperature:
T ∗ = 0.72
This is a liquid state near the Lennard-Jones triple point9.

9 A triple point is a point on the phase diagram at which three distinct thermodynamic phases coexist in equilibrium. For the LJ model without truncation, Mastny and de Pablo found it at T* = 0.694 and ρ* = 0.96 [19].
The simulation will begin with the atoms on a face-centered cubic (fcc) lattice (Fig. 2) with randomized velocities. The fcc configuration represents a minimal-energy configuration (ground state), and consequently it would be stable as a solid crystal at T = 0, i.e. without kinetic energy. Since we will provide kinetic energy (T ≠ 0) to the system, it quickly melts, evolving to its natural liquid state. The equilibration phase, in this very simple model, happens very quickly.
A roughly uniform spatial density persists for the duration of the simulation.
The simulation is run at constant N , volume V , and total energy E, a statistical sampling
from the microcanonical ensemble.
Force computations using the potential in equation (41) are truncated at a cutoff distance Rc* = 2.5. With this cutoff, each atom in our simulation has on average 55 neighbors. If neighbor lists are used, an extended cutoff length Rs* = 2.8 is defined (encompassing about 78 atoms) for forming the neighbor lists, and the lists are updated every 20 time steps. The cost of creating the neighbor lists every 20 time steps is amortized over the per-time-step timing.

Figure 2: Face-centered cubic (fcc) lattice cell, used as the initial configuration of our system.
The integration time step we are going to use in the leap-frog algorithm is
∆t∗ = 0.00462.
In the case of Ar, the constant needed to convert times from reduced to conventional units is, according to (38) and the values of the parameters from Tab. 1,

$$\sigma\sqrt{\frac{m}{\epsilon}} = 2.156\times 10^{-12}\ \mathrm{s},$$

thus the reduced time unit is 2.156 ps. This is roughly the timescale of one atomic oscillation period in Ar. The integration time step for Ar corresponds to

$$\Delta t = \Delta t^* \times 2.156\ \mathrm{ps} \simeq 0.01\ \mathrm{ps}.$$

In other words, we choose an integration time step of about 1/100 of the atomic oscillation period so as to accurately describe even this elementary atomic motion.
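The conversion above is easy to reproduce; the short C program below (an illustration only, with the physical constants hard-coded) recomputes the Ar time unit and the size of the integration step from the values of Tab. 1.

#include <math.h>
#include <stdio.h>

/* Convert a few reduced (LJ) quantities to conventional units for argon,
   using sigma, epsilon/kB and m from Tab. 1. Illustrative sketch only. */
int main(void)
{
    const double kB    = 1.380649e-23;    /* J/K */
    const double amu   = 1.66053907e-27;  /* kg  */
    const double sigma = 3.405e-10;       /* m   (3.405 Angstrom)       */
    const double eps   = 119.8 * kB;      /* J   (epsilon/kB = 119.8 K) */
    const double m     = 39.95 * amu;     /* kg  */

    double tau     = sigma * sqrt(m / eps);  /* reduced time unit, eq. (38) */
    double dt_star = 0.00462;                /* reduced integration time step */
    double T_star  = 0.72;                   /* reduced temperature of our state point */

    printf("time unit   = %.3e s (about 2.156 ps)\n", tau);
    printf("time step   = %.3e s (about 0.01 ps)\n", dt_star * tau);
    printf("temperature = %.1f K\n", T_star * 119.8);   /* T = T* epsilon/kB */
    return 0;
}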
6 Atom-decomposition algorithm
The first parallel algorithm that we will take into account is the atom-decomposition (AD) algorithm. In this algorithm each of the P processors is assigned a group of N/P atoms at the
beginning of the simulation. Atoms in a group need not have any special spatial relationship
to each other. For ease of exposition, we assume N is a multiple of P , though it is simple to
relax this constraint. A processor will compute forces on only its N/P atoms and will update
their positions and velocities for the duration of the simulation no matter where they move in the
physical domain. This is an atom-decomposition of the computational workload.
A useful construct for representing the computational work involved in the algorithm is the
N × N force matrix F . The (ij) element of F represents the force on atom i due to atom j.
Note that F is sparse due to short-range forces and skew-symmetric, i.e. Fij = −Fji due to
Newton’s 3rd law. We also define x and f as vectors of length N which store the position and
total force on each atom. For a 3-D simulation, xi would store the three coordinates of atom i.
With these definitions, the AD algorithm assigns each processor a sub-block of F which consists
of N/P rows of the matrix, as shown in Figure 3. If z indexes the processors from 0 to P − 1,
then processor Pz computes matrix elements in the Fz sub-block of rows. It also is assigned the
corresponding sub-vectors of length N/P denoted by xz and fz .
Figure 3: The division of the force matrix among P processors in the atom-decomposition algorithm. Processor Pz is assigned N/P rows of the matrix and the corresponding xz piece of the position vector. In addition, it must know the entire position vector x to compute the matrix elements in its own rows.
For the LJ potential, the computation of matrix element Fij requires only the two atom positions xi and xj. To compute all the elements in Fz, processor Pz will need the positions of many atoms owned by other processors. This implies that at every time step each processor must receive updated atom positions from all the other processors, an operation called all-to-all communication. Various algorithms have been developed for performing this operation efficiently on different parallel machines and architectures. Usually, the MPI library implementation of the all-to-all communication operations takes advantage of such optimized algorithms. In this course we will implement such operations by using the MPI library. Moreover, we will also try to implement them by using only point-to-point communication operations, to compare with the MPI implementation. In particular, we will resort to an idea outlined in [20] that is simple, portable, and works well on a variety of machines.
6.1 Expand and fold operations
Following Fox’s nomenclature [20], we term the all-to-all communication procedure described
above an expand operation. Each processor allocates memory of length N to store the entire
x vector. At the beginning of the expand, processor Pz has xz , an updated piece of x of length
N/P . Each processor needs to acquire all the other processor’s pieces, storing them in the correct
places in its copy of x. Figure 4a illustrates the steps that accomplish this for an 8-processor
example. The processors are mapped consecutively to the sub-pieces of the vector. In the first
communication step, each processor partners with an adjacent processor in the vector and they
exchange sub-pieces. Processor 2 partners with 3. Now, every processor has a contiguous piece
of x that is of length 2N/P . In the second step, each processor partners with a processor two
positions away and exchanges its new piece (2 receives the shaded sub-vectors from 0). Each
processor now has a 4N/P-length piece of x. In the last step, each processor exchanges an N/2-length piece of x with a processor P/2 positions away (2 exchanges with 6); the entire vector now resides on each processor.
The expand operation can also be accomplished by using the MPI_Allgather collective operation provided by the MPI library.
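As an illustration, the C sketch below implements the expand both ways: once with the MPI_Allgather collective, and once with the point-to-point recursive-doubling exchange of Figure 4a (assuming P is a power of two, N a multiple of P, and three doubles per atom); the function names are ours.

#include <mpi.h>

/* Expand with the collective: every rank contributes its N/P positions (3 doubles per
   atom) and ends up with the full position vector, ordered by rank. */
void expand_collective(const double *xz, double *x, int n_per_proc, MPI_Comm comm)
{
    MPI_Allgather(xz, 3 * n_per_proc, MPI_DOUBLE,
                  x,  3 * n_per_proc, MPI_DOUBLE, comm);
}

/* Expand with point-to-point messages only (recursive doubling, Figure 4a).
   On entry only this rank's sub-vector of x is valid; on exit x is complete. */
void expand_p2p(double *x, int n_per_proc, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int piece = 3 * n_per_proc;                        /* doubles per sub-vector */

    for (int step = 1; step < P; step *= 2) {
        int partner       = rank ^ step;               /* processor "step" positions away   */
        int my_start      = (rank    / step) * step;   /* first sub-piece I currently hold  */
        int partner_start = (partner / step) * step;   /* first sub-piece the partner holds */
        MPI_Sendrecv(x + my_start      * piece, step * piece, MPI_DOUBLE, partner, 0,
                     x + partner_start * piece, step * piece, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}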
Figure 4: Expand and fold operations among eight processors, each of which requires three steps. (a) In the expand, processor 2 receives successively longer shaded sub-vectors from processors 3, 0, and 6; (b) in the fold, processor 2 receives successively shorter shaded sub-vectors from processors 6, 0, and 3.
A communication operation that is essentially the inverse of the expand will also prove useful in the atom-decomposition algorithm. Assume each processor has stored new force values throughout its copy of the force vector f. Processor Pz needs to know the N/P values in fz,
where each of the values is summed across all P processors. This is known as a fold operation
and is outlined in Figure 4b. In the first step each processor exchanges half the vector with a
processor it partners with that is P/2 positions away. Note that each processor receives the half
that it is a member of and sends the half it is not a member of (processor 2 receives the shaded
first half of the vector from 6). Each processor sums the received values with its corresponding
retained sub-vector. This operation is recursed, halving the length of the exchanged data at each
step.
The fold operation can also be accomplished by using the MPI_Reduce_scatter collective operation (or a combination of MPI_Reduce and MPI_Scatter) provided by the MPI library.
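A corresponding C sketch of the fold is given below: the one-line collective version uses MPI_Reduce_scatter_block, while fold_p2p follows the recursive-halving exchange of Figure 4b (again assuming P is a power of two and three doubles per atom); the names are ours.

#include <stdlib.h>
#include <mpi.h>

/* Fold with a collective: the full force vector f (with partial sums) is reduced and
   each rank receives only its own summed N/P piece in fz. */
void fold_collective(const double *f, double *fz, int n_per_proc, MPI_Comm comm)
{
    MPI_Reduce_scatter_block(f, fz, 3 * n_per_proc, MPI_DOUBLE, MPI_SUM, comm);
}

/* Fold with point-to-point messages only (recursive halving, Figure 4b). */
void fold_p2p(double *f, double *fz, int n_per_proc, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int piece = 3 * n_per_proc;
    double *recvbuf = malloc((size_t)(P / 2) * piece * sizeof(double));

    int lo = 0;                                         /* start of the block of sub-pieces I retain */
    for (int half = P / 2; half >= 1; half /= 2) {
        int partner = rank ^ half;                      /* partner "half" positions away       */
        int keep_lo = (rank & half) ? lo + half : lo;   /* half that contains my own piece     */
        int send_lo = (rank & half) ? lo : lo + half;   /* half that I give away               */
        MPI_Sendrecv(f + send_lo * piece, half * piece, MPI_DOUBLE, partner, 0,
                     recvbuf,             half * piece, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int k = 0; k < half * piece; k++)          /* sum received values into retained half */
            f[keep_lo * piece + k] += recvbuf[k];
        lo = keep_lo;
    }
    for (int k = 0; k < piece; k++)                     /* the single retained sub-piece is fz */
        fz[k] = f[lo * piece + k];
    free(recvbuf);
}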
Costs for a communication algorithm are typically quantified by the number of messages and the total volume of data sent and received. On both these counts the expand and fold of Figure 4 are optimal: each processor performs log2(P) sends and receives and exchanges N − N/P data values. Each processor also performs N − N/P additions in the fold. A drawback is that the algorithms require O(N) storage on every processor. Alternative methods for performing all-to-all communication require less storage at the cost of more sends and receives. This is usually not a good trade-off for MD simulations, because quite large problems can be run with the few Gbytes of local memory available on current-generation processors.
We now describe two versions of an AD algorithm which use expand and fold operations.
6.2 The simpler algorithm
The first algorithm is simpler and does not take advantage of Newton’s 3rd law. We call this
algorithm A1: it is outlined in Figure 5 with the dominating term(s) in the computation or communication cost of each step listed on the right. We assume at the beginning of the time step
that each processor knows the current positions of all N atoms, i.e. each has an updated copy
of the entire x vector. Step (1) of the algorithm is to construct neighbor lists for all the pairwise
interactions that must be computed in block Fz . Typically this will only be done once every few
time steps. If the ratio of the physical domain diameter S to the extended force cutoff length Rs
is relatively small, it is quicker for Pz to construct the lists by checking all N²/P pairs in its Fz block. When the simulation is large enough that 4 or more bins can be created in each dimension,
it is quicker for each processor to bin all N atoms, then check the 27 surrounding bins of each of
its N/P atoms to form the lists. This checking scales as N/P but has a large coefficient, so the
overall scaling of the binned neighbor list construction is recorded as N/P + N .
In step (2) of the algorithm, the neighbor lists are used to compute the non-zero matrix elements in Fz . As each pairwise force interaction is computed, the force components are summed
into fz so that Fz is never actually stored as a matrix. At the completion of the step, each processor knows the total force fz on each of its N/P atoms. This is used to update their positions
and velocities in step (4). (A step (3) will be added to the other algorithm.) Finally, in step (5)
the updated atom positions in xz are shared among all P processors in preparation for the next
time step via the expand operation of Figure 4a. As discussed above, this operation scales as N ,
the volume of data in the position vector x.
(1)  Construct neighbor lists of non-zero interactions in Fz
        (S < 4Rs)  All pairs                                        N²/P
        (S ≥ 4Rs)  Binning                                          N/P + N
(2)  Compute elements of Fz, summing results into fz                N/P
(4)  Update atom positions in xz using fz                           N/P
(5)  Expand xz among all processors, result in x                    N

Figure 5: Single time step of atom-decomposition algorithm A1 for processor Pz.
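A rough C skeleton of one A1 time step is shown below; the neighbor-list and force routines are assumed to exist elsewhere (their names are hypothetical), first = z·N/P is the global index of the processor's first atom, and the leap-frog update is written out explicitly.

#include <mpi.h>

/* Assumed to be provided elsewhere (hypothetical helpers). */
void build_local_neighbor_lists(const double *x, int N, int n_local, int first);
void compute_forces_on_own_atoms(const double *x, double *f, int n_local, int first);

/* One time step of algorithm A1 for processor Pz (sketch).
   x: positions of all N atoms (3N doubles), kept current on every processor;
   v, f: velocities and forces of this processor's N/P atoms only. */
void a1_time_step(double *x, double *v, double *f, int N, int n_local, int first,
                  double dt, double mass, int rebuild_lists, MPI_Comm comm)
{
    if (rebuild_lists)                                       /* step (1), every few steps */
        build_local_neighbor_lists(x, N, n_local, first);

    compute_forces_on_own_atoms(x, f, n_local, first);       /* step (2): rows of Fz only */

    for (int i = 0; i < n_local; i++)                        /* step (4): leap-frog update */
        for (int d = 0; d < 3; d++) {
            v[3*i + d] += dt * f[3*i + d] / mass;
            x[3*(first + i) + d] += dt * v[3*i + d];
        }

    /* step (5): expand xz among all processors so everyone again holds the full x */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  x, 3 * n_local, MPI_DOUBLE, comm);
}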
6.3 Taking advantage of Newton's 3rd law
As mentioned in the previous paragraph, algorithm A1 ignores Newton's 3rd law. If different processors own atoms i and j, as is usually the case, both processors compute the (ij) interaction and store the resulting force on their own atom. This can be avoided, at the cost of more communication, by using a modified force matrix G which references each pairwise interaction only once.
There are several ways to do this by striping the force matrix [21]; we choose instead to form G
as follows. Let Gij = Fij except that Gij = 0 when i > j and i + j is even, and likewise Gij = 0
when i < j and i + j is odd. Conceptually, G is colored like a checkerboard with red squares
above the diagonal set to zero and black squares below the diagonal also set to zero. A modified
AD algorithm A2 that uses G to take advantage of Newton’s 3rd law is outlined in Figure 6.
(1)  Construct neighbor lists of non-zero interactions in Gz
        (S < 4Rs)  All pairs                                        N²/2P
        (S ≥ 4Rs)  Binning                                          N/2P + N
(2)  Compute elements of Gz, doubly summing results into local copy of f    N/2P
(3)  Fold f among all processors, result is fz                      N
(4)  Update atom positions in xz using fz                           N/P
(5)  Expand xz among all processors, result in x                    N

Figure 6: Single time step of atom-decomposition algorithm A2 for processor Pz, which takes advantage of Newton's 3rd law.
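The checkerboard rule that defines G reduces to a one-line test; the small helper below (our own name, shown only for illustration) returns whether the (ij) interaction is kept, and exactly one of (i, j) and (j, i) passes it, so each pair is computed once.

/* Returns 1 if G_ij is kept (non-zero by the checkerboard rule), 0 otherwise. */
int g_stores_pair(int i, int j)
{
    if (i == j) return 0;
    if (i < j)  return (i + j) % 2 == 0;   /* above the diagonal: keep when i + j is even */
    else        return (i + j) % 2 == 1;   /* below the diagonal: keep when i + j is odd  */
}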
Step (1) is the same as in algorithm A1 except only half as many neighbor list entries are
made by each processor since Gz has only half the non-zero entries of Fz . This is reflected in
the factors of two included in the scaling entries. For neighbor lists formed by binning, each
processor must still bin all N atoms, but need only check half the surrounding bins of each of its
N/P atoms. In step (2) the neighbor lists are used to compute elements of Gz . For an interaction
between atoms i and j, the resulting forces on atoms i and j are summed into both the i and
j locations of force vector f . This means each processor must store a copy of the entire force
vector, as opposed to just storing fz as in algorithm A1. When all the matrix elements have been
computed, f is folded across all P processors using the algorithm in Figure 4b. Each processor
ends up with fz , the total forces on its atoms. Steps (4) and (5) then proceed the same as in A1.
Note that implementing Newton's 3rd law essentially halves the computation cost in steps (1) and (2), at the expense of doubling the communication cost. There are now two communication
steps (3) and (5), each of which scale as N . This will only be a net gain if the communication
cost in A1 is less than a third of the overall run time. This is usually not the case on large numbers
of processors, so in practice we almost always choose A1 instead of A2 for an AD algorithm.
However, for small P or expensive force models, A2 can be faster.
6.4 Load-balance
Each processor will have an equal amount of work if each Fz or Gz block has roughly the same
number of non-zero elements. This will be the case if the atom density is uniform across the
simulation domain. However non-uniform densities can arise if, for example, there are free
surfaces, so that some atoms border on vacuum, or phase changes are occurring within a liquid or solid. This is only a problem for load-balance if the N atoms are ordered in a geometric sense, as is typically the case. Then a group of N/P atoms near a surface, for example, will have fewer neighbors than groups in the interior. This can be overcome by randomly permuting the atom ordering at the beginning of the simulation, which is equivalent to permuting rows and columns of F or G. This ensures that every Fz or Gz will have roughly the same number of non-zeros.
A random permutation also has the advantage that the load-balance will likely persist as atoms
move about during the simulation. Note that this permutation needs only to be done once, as a
pre-processing step before beginning the dynamics.
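A possible pre-processing step is sketched below: a Fisher-Yates shuffle that produces the random atom ordering; every processor must call it with the same seed so that they all agree on the permutation (the names and interface are ours, for illustration).

#include <stdlib.h>

/* perm[k] = original index of the atom placed at position k of the new ordering. */
void permute_atom_order(int N, int *perm, unsigned int seed)
{
    srand(seed);                            /* identical seed on every processor */
    for (int k = 0; k < N; k++) perm[k] = k;
    for (int k = N - 1; k > 0; k--) {
        int m = rand() % (k + 1);           /* random slot in [0, k] */
        int tmp = perm[k];
        perm[k] = perm[m];
        perm[m] = tmp;
    }
}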
In summary, the AD algorithms divide the MD force computation and integration evenly
across the processors (ignoring the O(N ) component of binned neighbor list construction which
is usually not significant). However, the algorithms require global communication, as each processor must acquire information held by all the other processors. This communication scales as N, independently of P, so it limits the number of processors that can be used effectively. The chief advantage of the algorithms is their simplicity. Steps (1), (2) and (4) can be implemented by simply modifying the loops and data structures in a serial or vector code to treat N/P atoms instead of N. The expand and fold communication operations (3) and (5) can be treated as black-box routines (for example by using the MPI operations) and inserted at the proper locations in the code. Few other changes are typically necessary to parallelize an existing code.
7 Force-decomposition algorithm
Our next parallel MD algorithm is based on a block-decomposition of the force matrix rather than the row-wise decomposition used in the previous chapter. We call this a force-decomposition (FD) of the workload. As we shall see, this improves the O(N) scaling of the communication cost to O(N/√P). Block-decompositions of matrices are common in linear algebra algorithms for parallel machines, even if they are not common for short-range MD simulations. The assignment of sub-blocks of the force matrix to processors with a row-wise (calendar) ordering of the processors is depicted in Figure 7. We assume for ease of exposition that P is an even power of 2 and that N is a multiple of P, although again it is straightforward to relax these constraints. As before, we let z index the processors from 0 to P − 1; processor Pz owns and will update the N/P atoms stored in the sub-vector xz.
Figure 7: The division of the permuted force matrix F′ among 16 processors in the force-decomposition algorithm. Processor P6 is assigned a sub-block F′6 of size N/√P by N/√P. To compute its matrix elements it must know the corresponding N/√P-length pieces xα and x′β of the position vector x and of the permuted position vector x′.
To reduce communication (explained below), the block-decomposition in Figure 7 is actually of a permuted force matrix F′, which is formed by rearranging the columns of F in a particular way. If we order the xz pieces in row-order, they form the usual position vector x, which is shown as a vertical bar at the left of the figure. Were we to have x span the columns as in Figure 3, we would form the force matrix as before. Instead, we span the columns with a permuted position vector x′, shown as a horizontal bar at the top of Figure 7, in which the xz pieces are stored in column-order. Thus, in the 16-processor example shown in the figure, x stores each processor's piece in the usual order (0, 1, 2, 3, 4, . . . , 14, 15) while x′ stores them as (0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15). Now the (ij) element of F′ is the force on atom i in vector x due to atom j in the permuted vector x′.
The F′z sub-block owned by each processor Pz is of size (N/√P) × (N/√P). To compute the matrix elements in F′z, processor Pz must know one N/√P-length piece of each of the x and x′ vectors, which we denote as xα and x′β. As these elements are computed, they will be accumulated into the corresponding force sub-vectors fα and f′β. The Greek subscripts α and β each run from 0 to √P − 1 and reference the row and column position occupied by processor Pz. Thus for processor 6 in the figure, xα consists of the x sub-vectors (4,5,6,7) and x′β consists of the x′ sub-vectors (2,6,10,14).
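In an MPI implementation, the row and column groupings map naturally onto sub-communicators; the C sketch below (our own names, assuming P is a perfect square) computes α and β for processor z and builds one communicator per row and per column with MPI_Comm_split.

#include <math.h>
#include <mpi.h>

/* Processor z sits in row alpha and column beta of a sqrt(P) x sqrt(P) grid. */
void fd_setup(MPI_Comm world, int *alpha, int *beta,
              MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int z, P;
    MPI_Comm_rank(world, &z);
    MPI_Comm_size(world, &P);
    int q = (int)(sqrt((double)P) + 0.5);   /* sqrt(P) processors per row and per column */

    *alpha = z / q;                         /* row index    */
    *beta  = z % q;                         /* column index */

    /* same color -> same communicator; the key fixes the rank ordering inside it */
    MPI_Comm_split(world, *alpha, *beta,  row_comm);   /* all processors of row alpha   */
    MPI_Comm_split(world, *beta,  *alpha, col_comm);   /* all processors of column beta */
}

With this mapping, processor 6 in the 16-processor example gets α = 1 and β = 2, so its xα is made of the x sub-vectors (4,5,6,7) and its x′β of the sub-vectors (2,6,10,14), as in Figure 7.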
7.1 The simpler algorithm
Our first FD algorithm, F1, is outlined in Figure 8. As before, each processor has updated copies of the needed atom positions xα and x′β at the beginning of the time step. In step (1) the neighbor lists are constructed. Again, for small problems this is most quickly done by checking all N²/P possible pairs in F′z. For large problems, the N/√P atoms in x′β are binned, then the 27 surrounding bins of each atom in xα are checked. The checking procedure scales as N/√P, so the scaling of the binned neighbor list construction is still N/√P. In step (2) the neighbor lists are used to compute the matrix elements in F′z. As before, the elements are summed into a local copy of fα as they are computed, so F′z never needs to be stored in matrix form. In step (3) a fold operation is performed within each row of processors so that processor Pz obtains the total forces fz on its N/P atoms. Although the fold algorithm used is the same as in the preceding chapter, there is a key difference. In this case the vector fα being folded is only of length N/√P, and only the √P processors in one row participate in the fold. Thus this operation scales as N/√P instead of N as in the AD algorithm.
In step (4), fz is used by Pz to update the N/P atom positions in xz. Steps (5a-5b) share these updated positions with all the processors that will need them for the next time step. These are the processors which share a row or column with Pz. First, in (5a), the processors in row α perform an expand of their xz sub-vectors so that each acquires the entire xα. As with the fold, this operation scales as the N/√P length of xα instead of as N as it did in algorithms A1 and A2. Similarly, in step (5b), the processors in each column β perform an expand of their xz. As a result they all acquire x′β and are ready to begin the next time step.
It is in step (5) that using a permuted force matrix F′ saves extra communication. The permuted form of F′ causes xz to be a component of both xα and x′β for each Pz. This would not be the case if we had block-decomposed the original force matrix F by having x span the columns instead of x′. Then, in Figure 7, the xβ for P6 would have consisted of the sub-vectors (8,9,10,11), none of whose components are known by P6. Thus, before performing the expand in step (5b), processor 6 would first need to acquire one of these 4 components from another processor (in the transpose position in the matrix), requiring an extra O(N/P) communication step. The transpose-free version of the FD algorithms presented here was motivated by a matrix permutation for parallel matrix-vector multiplication discussed in reference [22].
(1)  Construct neighbor lists of non-zero interactions in F′z
        (S < 4Rs)  All pairs                                        N²/P
        (S ≥ 4Rs)  Binning                                          N/√P
(2)  Compute elements of F′z, storing results in fα                 N/P
(3)  Fold fα within row α, result is fz                             N/√P
(4)  Update atom positions in xz using fz                           N/P
(5a) Expand xz within row α, result in xα                           N/√P
(5b) Expand xz within column β, result in x′β                       N/√P

Figure 8: Single time step of force-decomposition algorithm F1 for processor Pz.
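With row and column communicators like those sketched earlier in this chapter, the communication steps of F1 reduce to three calls; the fragment below is an illustrative C sketch (names ours, n = N/P atoms per processor, uniform piece sizes) and omits the force computation and position update.

#include <mpi.h>

/* Communication skeleton of one F1 time step for processor Pz (sketch).
   f_row holds this processor's copy of f_alpha; x_row and x_col will receive
   x_alpha and x'_beta; xz and fz are the processor's own N/P-atom pieces. */
void f1_communication(double *f_row, double *fz, double *xz,
                      double *x_row, double *x_col, int n,
                      MPI_Comm row_comm, MPI_Comm col_comm)
{
    /* step (3): fold f_alpha within the row; Pz keeps only its summed piece fz */
    MPI_Reduce_scatter_block(f_row, fz, 3 * n, MPI_DOUBLE, MPI_SUM, row_comm);

    /* ... step (4): update xz from fz (not shown) ... */

    /* step (5a): expand xz within the row, producing x_alpha */
    MPI_Allgather(xz, 3 * n, MPI_DOUBLE, x_row, 3 * n, MPI_DOUBLE, row_comm);

    /* step (5b): expand xz within the column, producing x'_beta */
    MPI_Allgather(xz, 3 * n, MPI_DOUBLE, x_col, 3 * n, MPI_DOUBLE, col_comm);
}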
7.2 Taking advantage of Newton's 3rd law
As with algorithm A1, algorithm F1 does not take advantage of Newton's 3rd law; each pairwise force interaction is computed twice. Algorithm F2 avoids this duplicated effort by checkerboarding the force matrix as in the preceding chapter. Specifically, the checkerboarded matrix G is permuted in the same way as F was, to form G′. Note that now the total force on atom i is the sum of all matrix elements in row i minus the sum of all elements in column i. The modified FD algorithm F2 is outlined in Figure 9. Step (1) is the same as in F1 except that half as many interactions are stored in the neighbor lists. Likewise, step (2) requires only half as many matrix elements to be computed. For each (ij) element, the computed force components are now summed into two force sub-vectors instead of one. The force on atom i is summed into fα in the location corresponding to row i. Likewise, the force on atom j is summed into f′β in the location corresponding to column j. Steps (3a-3c) accumulate these forces so that processor Pz ends up with the total force on its N/P atoms. First, in step (3a), the √P processors in column β fold their local copies of f′β. The result is f′z. Each element of this N/P-length sub-vector is the sum of an entire column of G′. Next, in step (3b), the row contributions to the forces are summed by performing a fold of the fα vector within each row α. The result is fz, each element of which is the sum across a row of G′. Finally, in step (3c) the column and row contributions are subtracted, element by element, to yield the total forces fz on the atoms owned by processor Pz. The processor can now update the positions and velocities of its atoms; steps (4) and (5) are identical to those of F1.
In the FD algorithms, exploiting Newton's 3rd law again halves the computation required in steps (1) and (2). However, the communication cost in steps (3) and (5) does not double: there are 4 expands and folds required in F2 versus 3 in F1. Thus, in practice, it is usually faster to use algorithm F2, with its reduced computational cost and slightly increased communication cost, rather than F1. The key point is that all the expand and fold operations in F1 and F2 scale as N/√P rather than as N, as was the case in algorithms A1 and A2.
(1)  Construct neighbor lists of non-zero interactions in G′z
        (S < 4Rs)  All pairs                                        N²/2P
        (S ≥ 4Rs)  Binning                                          N/2√P
(2)  Compute elements of G′z, storing results in fα and f′β         N/2P
(3a) Fold f′β within column β, result is f′z                        N/√P
(3b) Fold fα within row α, result is fz                             N/√P
(3c) Subtract f′z from fz, result is total fz                       N/P
(4)  Update atom positions in xz using fz                           N/P
(5a) Expand xz within row α, result in xα                           N/√P
(5b) Expand xz within column β, result in x′β                       N/√P

Figure 9: Single time step of force-decomposition algorithm F2 for processor Pz, which takes advantage of Newton's third law.
When running on large numbers of processors, this significantly reduces the time the FD algorithms spend on communication as compared to the AD algorithms.
7.3 Load-balance
Finally, the issue of load-balance is a more serious concern for the FD algorithms. Processors will have equal work to do only if all the matrix blocks F′z or G′z are uniformly sparse. If the atoms are ordered geometrically this will not be the case, even for problems with uniform density. This is because such an ordering creates a force matrix with diagonal bands of non-zero elements. As in the AD case, a random permutation of the atom ordering produces the desired effect. Only now the permutation should be done as a pre-processing step for all problems, even those with uniform atom densities.
In summary, algorithms F1 and F2 divide the MD computations evenly across processors, as did the AD algorithms. But the block-decomposition of the force matrix means each processor only needs O(N/√P) information to perform its computations. Thus the communication and memory costs are reduced by a factor of √P versus algorithms A1 and A2. The FD strategy retains the simplicity of the AD technique; F1 and F2 can be implemented using the same "black-box" communication routines as A1 and A2. The FD algorithms also need no geometric information about the physical problem being modeled to perform optimally. In fact, for load-balancing purposes the algorithms intentionally ignore such information by using a random atom ordering.
8 Spatial-decomposition algorithm
In our final parallel algorithm the physical simulation domain is subdivided into small 3-D boxes,
one for each processor. We call this a spatial-decomposition (SD) of the workload. Each processor computes forces on and updates the positions and velocities of all atoms within its box at each
time step. Atoms are reassigned to new processors as they move through the physical domain. In
order to compute forces on its atoms, a processor need only know positions of atoms in nearby
boxes. The communication required in the SD algorithm is thus local in nature as compared to
global in the AD and FD cases.
The size and shape of the box assigned to each processor will depend on N , P , and the
aspect ratio of the physical domain, which we assume to be a 3-D rectangular parallelepiped.
Within these constraints the number of processors in each dimension is chosen so as to make
each processor’s box as “cubic” as possible. This is to minimize communication since in the
large N limit the communication cost of the SD algorithm will turn out to be proportional to the
surface area of the boxes. An important point to note is that in contrast to the link-cell method
described in Section 4.3, the box lengths may now be smaller or larger than the force cutoff
lengths Rc and Rs .
Each processor in our SD algorithm maintains two data structures, one for the N/P atoms in its box and one for atoms in nearby boxes. In the first data structure, each processor stores complete information: positions, velocities, neighbor lists, etc. This data is stored in a linked list to allow insertions and deletions as atoms move to new boxes. In the second data structure only atom positions are stored. Inter-processor communication at each time step keeps this information current.
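In C, the two data structures could look roughly like the sketch below (type and field names are our own invention, kept deliberately simple).

/* First data structure: the N/P atoms owned by this box, in a linked list so that
   atoms can be inserted or deleted as they migrate between boxes. */
typedef struct OwnedAtom {
    double x[3], v[3];          /* position and velocity                   */
    int    global_id;           /* identity that travels with the atom     */
    int   *neighbors;           /* neighbor list (indices into local data) */
    int    n_neighbors;
    struct OwnedAtom *next;
} OwnedAtom;

/* Second data structure: positions of nearby atoms owned by other processors,
   refreshed by communication every time step. */
typedef struct {
    OwnedAtom *owned;           /* atoms inside this processor's box       */
    double   (*ghost_x)[3];     /* positions of atoms from nearby boxes    */
    int        n_ghost;
} BoxData;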
8.1 The communication scheme
The communication scheme we use to acquire this information from processors owning the
nearby boxes is shown in Figure 10. The first step (a) is for each processor to exchange information with adjacent processors in the east/west dimension. Processor 2 fills a message buffer with
atom positions it owns that are within a force cutoff length Rs of processor 1’s box.10 If s < Rs ,
where s is the box length in the east/west direction, this will be all of processor 2’s atoms; otherwise it will be those nearest to box 1. Now each processor sends its message to the processor
in the westward direction (2 sends to 1) and receives a message from the eastward direction.
Each processor puts the received information into its second data structure. Now the procedure
is reversed with each processor sending to the east and receiving from the west. If s > Rs , all
needed atom positions in the east-west dimension have now been acquired by each processor. If
s < Rs , the east-west steps are repeated with each processor sending more needed atom positions to its adjacent processors. For example, processor 2 sends processor 1 atom positions from
box 3 (which processor 2 now has in its second data structure). This can be repeated until each
processor knows all atom positions within a distance Rs of its box, as indicated by the dotted
boxes in the figure. The same procedure is now repeated in the north/south dimension (see step
(b) of Figure 10). The only difference is that the messages sent to the adjacent processor now contain not only atoms the processor owns (in its first data structure), but also any atom positions in its second data structure that are needed by the adjacent processor. For s = Rs, this has the effect of sending 3 boxes' worth of atom positions in one message, as shown in (b). Finally, in step (c) the process is repeated in the up/down dimension. Now atom positions from an entire plane of boxes (9 in the figure) are sent in each message.

10 The reason for using Rs instead of Rc will be made clear later in this chapter.
Figure 10: Method by which a processor acquires nearby atom positions in the spatial-decomposition algorithm. In six data exchanges all atom positions in adjacent boxes in the (a) east/west, (b) north/south, and (c) up/down directions can be communicated.
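The six exchanges map naturally onto a 3-D Cartesian communicator (created beforehand with MPI_Cart_create); the C sketch below shows one send/receive pair per direction, with the message packing and unpacking hidden behind hypothetical helpers that stand in for the list-driven buffering described in the text.

#include <mpi.h>

/* Hypothetical helpers: fill the send buffer with positions within Rs of the given
   face, and append received positions to the second data structure. */
int  pack_positions_near_face(int dim, int dir, double *buf);
void unpack_ghost_positions(const double *buf, int ndoubles);

void exchange_ghost_positions(MPI_Comm cart, double *sendbuf, double *recvbuf,
                              int max_doubles)
{
    for (int dim = 0; dim < 3; dim++) {             /* (a) east/west, (b) north/south, (c) up/down */
        for (int dir = -1; dir <= 1; dir += 2) {    /* first one way along the dimension, then the other */
            int src, dest;
            MPI_Cart_shift(cart, dim, dir, &src, &dest);

            int nsend = pack_positions_near_face(dim, dir, sendbuf);
            MPI_Status status;
            MPI_Sendrecv(sendbuf, nsend,       MPI_DOUBLE, dest, 0,
                         recvbuf, max_doubles, MPI_DOUBLE, src,  0,
                         cart, &status);

            int nrecv;
            MPI_Get_count(&status, MPI_DOUBLE, &nrecv);
            unpack_ghost_positions(recvbuf, nrecv);
        }
    }
}

If s < Rs, the inner exchange would simply be repeated a few more times per dimension, as described above, still involving only the six neighboring processors.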
There are several key advantages to this scheme, all of which reduce the overall cost of
communication in our algorithm.
1. First, for s ≥ Rs the needed atom positions from all 26 surrounding boxes are obtained in just 6 data exchanges. Moreover, if the parallel machine has a mesh or torus topology (as, for example, the IBM Blue Gene/P supercomputer at JSC), the processors can be mapped to the boxes in such a way that all 6 of these processors will be directly connected to the center processor. Thus message passing will be fast and contention-free.
2. Second, when s < Rs so that atom information is needed from more distant boxes, this
occurs with only a few extra data exchanges, all of which are still with the 6 immediate
neighbor processors. This is an important feature of the algorithm that enables it to perform
well even when large numbers of processors are used on relatively small problems.
3. A third advantage is that the amount of data communicated is minimized. Each processor
acquires only the atom positions that are within a distance Rs of its box. Fourth, all of
the received atom positions can be placed as contiguous data directly into the processor’s
second data structure. No time is spent rearranging data, except to create the buffered
messages that need to be sent.
4. Finally, as will be discussed in more detail below, this message creation can be done very
quickly. A full scan of the two data structures is only done once every few time steps, when
the neighbor lists are created, to decide which atom positions to send in each message. The
scan procedure creates a list of atoms that make up each message. During all the other
time steps, the lists can be used, in lieu of scanning the full atom list, to directly index the
referenced atoms and buffer up the messages quickly. This is the equivalent of a gather
operation on a vector machine.
8.2 The simpler algorithm
We now outline our SD algorithm S1 in Figure 11. Box z is assigned to processor Pz , where z
runs from 0 to P −1 as before. Processor Pz stores the atom positions of its N/P atoms in xz and
the forces on those atoms in fz . Steps (1a-1c) are the neighbor list construction, performed once
every few time steps. This is somewhat more complex than in the other algorithms because, as
discussed above, it includes the creation of lists of atoms that will be communicated at every time
step. First, in step (1a) the positions, velocities, and any other identifying information of atoms
that are no longer inside box z are deleted from xz (first data structure) and stored in a message
buffer. These atoms are exchanged with the 6 adjacent processors via the communication pattern
of Figure 10. As the information is routed through each dimension, processor Pz checks for new atoms that are now inside its box boundaries, adding them to its xz. Next, in step (1b), all atom
positions within a distance Rs of box z are acquired by the communication scheme described
above. As the different messages are buffered by scanning through the two data structures, lists
of included atoms are made. The lists will be used in step (5). The scaling factor ∆ for steps (1a)
and (1b) will be explained below.
When steps (1a) and (1b) are complete, both of the processor’s data structures are current.
Neighbor lists for its N/P atoms can now be constructed in step (1c). If atoms i and j are both in
box z (an inner-box interaction), the (ij) pair is only stored once in the neighbor list. If i and j are
in different boxes (a two-box interaction), both processors store the interaction in their respective
neighbor lists. If this were not done, processors would compute forces on atoms they do not own
and communication of the forces back to the processors owning the atoms would be required. A
modified algorithm that performs this communication to avoid the duplicated force computation
of two-box interactions is discussed below. When s, the length of box z, is less than two cutoff
distances, it is quicker to find neighbor interactions by checking each atom inside box z against
all the atoms in both of the processor’s data structures. This scales as the square of N/P . If
s > 2Rs , then with the shell of atoms around box z, there are 4 or more bins in each dimension.
In this case, as with the algorithms of the preceding sections, it is quicker to perform the neighbor
list construction by binning. All the atoms in both data structures are mapped to bins of size Rs .
The surrounding bins of each atom in box z are then checked for possible neighbors.
Processor Pz can now compute all the forces on its atoms in step (2) using the neighbor lists.
(1a) Move necessary atoms to new boxes                                      ∆
(1b) Make lists of all atoms that will need to be exchanged                 ∆
(1c) Construct neighbor lists of interacting pairs in box z                 N/2P + ∆
        (s < 2Rs)  All pairs
        (s ≥ 2Rs)  Binning
(2)  Compute forces on atoms in box z, doubly storing results in fz         N/2P + ∆
(4)  Update atom positions in xz in box z using fz                          N/P
(5)  Exchange atom positions across box boundaries with neighboring processors   ∆
        (s < Rs)   Send N/P positions to many neighbors
        (s ≳ Rs)   Send N/P positions to nearest neighbors
        (s ≫ Rs)   Send positions near box surface to nearest neighbors

Figure 11: Single time step of spatial-decomposition algorithm S1 for processor Pz.
When the interaction is between two atoms inside box z, the resulting force is stored twice in fz, once for atom i and once for atom j. For two-box interactions, only the force on the processor's own atom is stored. After computing fz, the atom positions are updated in step (4). Finally, these updated positions must be communicated to the surrounding processors in preparation for the next time step. This occurs in step (5), using the communication pattern of Figure 10 and the previously created lists. The amount of data exchanged in this operation is a function of the relative values of the force cutoff distance and the box length, and is discussed in the next subsection. Also, we note that on the time steps on which neighbor lists are constructed, step (5) does not have to be performed, since step (1b) has the same effect.
8.2.1 Communication scaling
The communication operations in algorithm S1 occur in steps (1a), (1b) and (5). The communication in the latter two steps is identical. The cost of these steps scales as the volume of data exchanged. For step (5), if we assume a uniform atom density, this is proportional to the physical volume of the shell of thickness Rs around box z, namely (s + 2Rs)³ − s³. Note there are roughly N/P atoms in a volume of s³, since s³ is the size of box z. There are three cases to consider:
1. If s < Rs, data from many neighboring boxes must be exchanged and the operation scales as 8Rs³.

2. If s ≳ Rs, the data in all 26 surrounding boxes is exchanged and the operation scales as 27N/P.

3. If s ≫ Rs, only atom positions near the 6 faces of box z will be exchanged. The communication then scales as the surface area of box z, namely 6Rs(N/P)^(2/3).
These cases are explicitly listed in the scaling of step (5). Elsewhere in Figure 11, we use the
term ∆ to represent whichever of the three is applicable for a given N , P , and Rs . We note that
step (1a) involves less communication since not all the atoms within a cutoff distance of a box
face will move out of the box. But this operation still scales as the surface area of box z, so we
list its scaling as ∆.
8.2.2 Computational scaling
The computational portion of algorithm S1 is in steps (1c), (2) and (4). All of these scale as N/P
with additional work in steps (1c) and (2) for the atoms that are neighboring box z and stored in
the second data structure. The number of these atoms is proportional to ∆ so it is included in
the scaling of those steps. The leading term in the scaling of steps (1c) and (2) is listed as N/2P
as in algorithms A2 and F2, since inner-box interactions are only stored and computed once for
each pair of atoms in algorithm S1. Note that as s grows large relative to Rs, as it will for very large simulations, the ∆ contribution to the overall computation time decreases and the overall scaling of algorithm S1 approaches the optimal N/2P. In essence, each processor spends nearly
all its time working in its own box and only exchanges a relatively small amount of information
with neighboring processors to update its boundary conditions.
8.2.3 Data structures' update
An important feature of algorithm S1 is that the data structures are only modified once every few
time steps when neighbor lists are constructed. In particular, even if an atom moves outside box
z’s boundaries it is not reassigned to a new processor until step (1a) is executed. Processor Pz
can still compute correct forces for the atom so long as two criteria are met:
• The first is that an atom does not move farther than s between two neighbor list constructions.
• The second is that all nearby atoms within a distance Rs , instead of Rc , must be updated
every time step.
The alternative is to move atoms to their new processors at every time step. This has the advantage that only atoms within a distance Rc of box z need to be exchanged at all time steps when
neighbor lists are not constructed. This reduces the volume of communication since Rc < Rs .
However, now the neighbor list of a reassigned atom must also be sent. The information in the
neighbor list is atom indices referencing local memory locations where the neighbor atoms are
stored. If atoms are continuously moving to new processors, these local indices become meaningless. To overcome this, one can assign a global index (1 to N ) to each atom that moves with
the atom from processor to processor. A mapping of global index to local memory must then be
stored in a vector of size N by each processor, or the global indices must be sorted and searched to find the correct atoms when they are referenced in a neighbor list. The former solution limits the size of the problems that can be run; the latter incurs an extra cost for the sort and search operations.
On the other hand, the S1 version of the code exploits an idea of Tamayo and Giles [23] which makes it less complex and reduces the computational and communication overhead. This does not affect the timings for simulations with large N, but it improves the algorithm's performance for medium-sized problems.
8.3 Taking advantage of Newton's 3rd law
A modified version of S1 that takes full advantage of Newton’s 3rd law can also be devised, call it
algorithm S2. If processor Pz acquires atoms only from its west, south, and down directions (and
sends its own atoms only in the east, north, and up directions), then each pairwise interaction
need only be computed once, even when the two atoms reside in different boxes. This requires
sending computed force results back in the opposite directions to the processors who own the
atoms, as a step (3) in the algorithm. This scheme does not reduce communication costs, since
half as much information is communicated twice as often, but does eliminate the duplicated
force computations for two-box interactions. Two points are worth noting. First, the overall
savings of S2 over S1 is small, particularly for large N . Only the ∆ term in steps (1c) and
(2) is saved. Second, the performance of SD algorithms for large systems can be improved by
optimizing the single-processor force computation in step (2). As with vector machines, this
requires that more attention be paid to data structures and loop orderings in the force and neighbor-list construction routines to achieve high single-processor flop rates. Implementing S2 requires
special-case coding for atoms near box edges and corners to ensure all interactions are counted
only once [24], which can hinder this optimization process. For all these reasons we will not
implement an S2 code.
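Although we do not implement S2, the extra step (3) it requires can be illustrated with a short C/MPI sketch (hypothetical names and buffer layout, shown for the east/west direction only; the other two directions are handled analogously). After processor Pz has computed the pair forces involving the atoms it received from its west neighbor, it returns the accumulated force contributions to that neighbor and, in the same exchange, receives the contributions its east neighbor computed for the atoms Pz had sent there:

    #include <stdlib.h>
    #include <mpi.h>

    /* Reverse communication of step (3) in an S2-style scheme, x direction only.
       f_ghost holds the forces accumulated on the nghost atoms received from the
       west; f and sendlist identify the nsend owned atoms that were sent east.
       Both sides are assumed to pack atoms in the order they were originally sent. */
    void return_forces_x(double *f_ghost, int nghost,
                         double (*f)[3], const int *sendlist, int nsend,
                         int west, int east, MPI_Comm comm)
    {
        double *f_back = malloc(3 * (size_t)nsend * sizeof(double));

        /* Send ghost forces back west; receive forces for our own atoms from east. */
        MPI_Sendrecv(f_ghost, 3 * nghost, MPI_DOUBLE, west, 0,
                     f_back,  3 * nsend,  MPI_DOUBLE, east, 0,
                     comm, MPI_STATUS_IGNORE);

        /* Add the returned contributions into the owned atoms' force array. */
        for (int i = 0; i < nsend; i++) {
            int j = sendlist[i];
            f[j][0] += f_back[3 * i];
            f[j][1] += f_back[3 * i + 1];
            f[j][2] += f_back[3 * i + 2];
        }
        free(f_back);
    }

Positions travel in one direction and forces travel back in the other, so the same total amount of data still crosses each box face, consistent with the remark above that S2 does not reduce the communication volume.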
8.4 Load-balance
Finally, the issue of load-balance is an important concern in any SD algorithm. Algorithm S1
will be load-balanced only if all boxes have a roughly equal number of atoms (and surrounding
atoms). This will not be the case if the physical atom density is non-uniform. Additionally, if the
physical domain is not a rectangular parallelepiped, it can be difficult to split into P equal-sized
pieces. Sophisticated load-balancing algorithms have been developed to partition an irregular
physical domain or non-uniformly dense clusters of atoms, but they create sub-domains which
are irregular in shape or are connected in an irregular fashion to their neighboring sub-domains.
In either case, the task of assigning atoms to sub-domains and communicating with neighbors
becomes more costly and complex. If the physical atom density changes over time during the
MD simulation, the load-balance problem is compounded. Any dynamic load-balancing scheme
requires additional computational overhead and data movement.
In summary, the SD algorithm, like the AD and FD ones, evenly divides the MD computations across all the processors. Its chief benefit is that it takes full advantage of the local
nature of the interatomic forces by performing only local communication. Thus, in the large
N limit, it achieves optimal O(N/P ) scaling and is clearly the fastest algorithm. However,
this is only true if good load-balance is also achievable. Since its performance is sensitive to
the problem geometry, algorithm S1 is more restrictive than A2 and F2, whose performance is
geometry-independent. A second drawback of algorithm S1 is its complexity: it is more difficult
to implement efficiently than the simpler AD and FD algorithms. In particular, the communication scheme requires extra coding and bookkeeping to create messages and access data received
from neighboring boxes. In practice, integrating algorithm S1 into an existing serial MD code
can require a substantial reworking of data structures and code.
References
[1] B. J. Alder, T. E. Wainwright, J. Chem. Phys. 27, p. 1208 (1957).
[2] L. Verlet, Phys. Rev. 159, p. 98 (1967).
[3] L. Hannon, G. C. Lee, E. Clementi, J. Sci. Computing 1(2), p. 145 (1986).
[4] M. Karplus, J. A. McCammon, Scientific American April 1986, p. 42 (1986).
[5] See, for example, J. P. Hansen and I. R. McDonald, Theory of simple liquids, 2nd Ed.,
Academic, 1986. A classic book on liquids, with a chapter devoted to computer simulation.
[6] See, for instance, N. W. Ashcroft, N. D. Mermin, Solid state physics, Brooks Cole, 1976.
[7] M. P. Allen and D. J. Tildesley, Computer simulation of liquids, Oxford, 1987. This is a
classic how-to book containing a large number of technical details on a vast array of techniques.
J. M. Haile, Molecular dynamics simulation, Wiley, 1992. An excellent general introduction to molecular dynamics. Not as thorough as the previous one in explaining techniques,
but very deep in explaining the fundamentals, including "philosophical" aspects. Probably
the best starting point. A good amount of Fortran 77 code is included.
[8] H. J. C. Berendsen and W. F. van Gunsteren, in Molecular dynamics simulation of
statistical-mechanical systems, G. Ciccotti and W. G. Hoover (Eds.), North-Holland, 1986.
[9] D. Frenkel and B. Smit, Understanding Molecular Simulation, 2nd ed., Academic, San
Diego, 2002.
[10] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger,
New York, 1988.
[11] J. E. Barnes and P. Hut, Nature 324, p. 446 (1986).
[12] L. Greengard and V. Rokhlin, J. Comp. Phys. 73, p. 325 (1987).
[13] R. W. Hockney, S. P. Goel and J. W. Eastwood, J. Comp. Phys. 14, p. 148 (1974).
[14] http://www.charmm.org
[15] http://www.igc.ethz.ch/GROMOS/index
[16] http://ambermd.org
[17] http://www.gromacs.org
[18] http://www.ks.uiuc.edu/Research/namd
[19] E. A. Mastny, J. J. de Pablo, J. Chem. Phys., 127, p. 104504 (2007).
[20] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker,
Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs,
NJ (1988).
[21] H. Schreiber, O. Steinhauser and P. Schuster, Parallel Computing 18, p. 557-573 (1992).
[22] J. G. Lewis and R. A. van de Geijn, Distributed memory matrix-vector multiplication and
conjugate gradient algorithms. In Proc. Supercomputing '93, p. 484-492. IEEE Computer Society Press (1993).
[23] P. Tamayo and R. Giles, A parallel scalable approach to short-range molecular dynamics
on the CM-5. In Proc. Scalable High Performance Computing Conference-92, p. 240-245.
IEEE Computer Society Press (1992).
[24] D. Brown, J. H. R. Clarke, M. Okuda and T. Yamazaki, A domain decomposition parallelization strategy for molecular dynamics simulations on distributed memory machines.
Comp. Phys. Comm. 73, p. 67-80 (1993).