TECHNIQUES FOR DEVELOPING CORRECT, FAST, AND ROBUST
IMPLEMENTATIONS OF DISTRIBUTED PROTOCOLS
BY
AAMOD ARVIND SANE
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1998
Urbana, Illinois
© Copyright by
Aamod Arvind Sane
1998
TECHNIQUES FOR DEVELOPING CORRECT, FAST, AND ROBUST
IMPLEMENTATIONS OF DISTRIBUTED PROTOCOLS
Aamod Arvind Sane, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1998
Roy H. Campbell, Advisor
A distributed system must satisfy three requirements: it should correctly implement process interactions to realize desired behavior, it should exhibit satisfactory performance,
and it should have a robust software architecture that accommodates changing requirements. This thesis presents research that addresses each of these concerns.
The thesis presents new techniques for designing protocols that coordinate process
interactions. The specification technique allows designers to design protocols by top-down refinement. Refinement steps divide the original protocol into sub-protocols that
have smaller state spaces than the original protocol. Therefore, the divided protocols
can be automatically verified without encountering state-space explosion. The complete
protocol is synthesized by composing the divided protocols.
The thesis also shows how protocols can be tailored for improved performance. A new
technique for designing high-performance distributed shared memory consistency protocols is presented. The technique optimizes consistency protocols by using information
about previous memory accesses to anticipate future communication. Such anticipation
allows communication to overlap with computation, resulting in improved application
performance.
Finally, the thesis presents a software architecture for implementing systems with
interacting distributed objects. The architecture allows systems to be incrementally extended with new objects and new operations, including operations over objects on remote
systems. This is achieved using design patterns, and a novel scheme for incremental construction of state machines. The architecture was used to build a virtual memory system
that is smoothly extended to support distributed shared memory.
TABLE OF CONTENTS
Chapter
1 Introduction
  1.1 Contributions
  1.2 Thesis Outline
2 A Protocol Design Technique
  2.1 Goal
    2.1.1 The Problem
    2.1.2 Our Solution
    2.1.3 Summary
  2.2 Background and Related Work
    2.2.1 Verification Systems
    2.2.2 High-Level Service Specification
    2.2.3 Synthesis Methods
    2.2.4 Our Approach
  2.3 The Synthesis Method
    2.3.1 Synthesis
    2.3.2 Process and System
    2.3.3 Automata
    2.3.4 Automata and Processes
    2.3.5 Protocols
    2.3.6 Protocol Synthesis
  2.4 Specifying Coordination
    2.4.1 Constraint-Rule Specifications
    2.4.2 Action-Rule Specifications
    2.4.3 Observation-Rule Specifications
    2.4.4 Proving Implementation
  2.5 Implementing Constraints, Actions, and Observations
    2.5.1 Synthesizing Constraint Rules
    2.5.2 Synthesizing Action Rules
    2.5.3 Observations via Memory and Messages
  2.6 Summary
3 Distributed Shared Memory Consistency Protocols
  3.1 Goal
    3.1.1 The Problem
    3.1.2 Our Solution
  3.2 Background and Related Work
    3.2.1 Sequential Consistency
    3.2.2 Beyond Sequential Consistency
    3.2.3 Synchronization in Distributed Shared Memory
    3.2.4 Our Approach
  3.3 Coordinated Memory
    3.3.1 Adaptive Barriers
    3.3.2 Other Adaptive Constructs
  3.4 Designing Consistency Protocols
    3.4.1 Consistency Specification
    3.4.2 Adaptive Barrier
    3.4.3 Summary
  3.5 Implementation and Performance
    3.5.1 Experimental Platform
    3.5.2 Applications
  3.6 Summary
4 A Software Architecture
  4.1 Goal
    4.1.1 The Problem
    4.1.2 Our Solution
  4.2 Background and Related Work
    4.2.1 Basic Objects
    4.2.2 Interactions
    4.2.3 Operations
  4.3 Why the New Architecture
    4.3.1 Examples
    4.3.2 Why Change is not Easy
  4.4 What Needs to be Redesigned
    4.4.1 Data Structures and Synchronization
    4.4.2 Interactions
    4.4.3 A Solution
  4.5 Architecture of the Virtual Memory System
    4.5.1 Exporting Functionality
    4.5.2 Organizing the Internals
    4.5.3 Concurrency Control
    4.5.4 Operations Using Object-Oriented State Machines
    4.5.5 Implementing Remote Interactions
    4.5.6 Dynamic Page Distribution
  4.6 Summary
5 Conclusion
  5.1 Summary
  5.2 Future Research
Bibliography
Chapter 1
Introduction
This thesis presents techniques for the design and implementation of protocols that coordinate the actions of concurrent processes in a distributed system. The design of novel
memory consistency protocols for a distributed shared memory system illustrates the
application of these techniques. The system is implemented using a new software architecture for designing object-oriented systems with concurrent and distributed operations.
Protocols are difficult to design because systems of interacting concurrent processes
exhibit a large number of behaviors. Therefore, computer-aided methods are used for protocol design. Currently, such methods can be classified as either verification methods
or synthesis methods. Verification methods let users model the protocols in a suitable
language, and check that the model obeys desired properties by exhaustive search of the
system state space. But the detailed, low-level models often result in very large state
spaces. The search is made tractable by exploiting patterns in the state space to reduce the states actually examined. Even so, many practical protocols remain beyond the
reach of exhaustive search. Synthesis methods avoid building complex low-level models.
Instead, they translate high-level specifications to low-level implementations. But these
methods often require manual proofs, or are useful only in restricted cases such as peer-to-peer communication protocols. Ideally, we would like a design method that combines
the clarity of high-level specifications of synthesis methods with the automated checking
characteristic of verification methods.
In this thesis, we develop such a design method. We introduce an approach for
dividing the task of protocol design into several steps. The division produces protocols
that have small state spaces either because they are abstract or because they implement
parts of the original protocol. Therefore, their correctness can be easily established
using verification tools. We then show how to implement the divided protocols so that
the complete protocol can be synthesized by combining the divided protocols. We have
applied the synthesis method to guide the implementation of new distributed shared
memory consistency protocols. A distributed shared memory (DSM) system simulates
shared memory over networked computers. DSM systems allow programs designed for
shared memory multiprocessors to be used over networked computers.
DSM systems use local memories of the networked computers as caches for the simulated shared memory. Just like shared memory multiprocessors, caches in DSM systems
replicate the shared data for efficiency, but then require protocols to ensure that the
replicas remain consistent.
In this thesis, we develop consistency protocols that allow DSM systems to operate
efficiently over wide-area networks characterized by high-latency, high-bandwidth interconnections. A protocol that performs well over a wide-area network must be able to
utilize the bandwidth to overcome latency. Our protocols gain their efficiency using information about process synchronization and past memory access patterns to predict future
requests from other processes. This technique reduces the time processes spend waiting
for data to arrive. When computations are regular, this anticipatory communication
overlaps communication and computation, giving good speedups for distributed shared
memory programs over wide-area networks.
Protocol implementations derived by our method are state machines that define protocol behavior. However, the programmer is still left to manage myriad details of the
implementation environment. In our case, the protocol implementation has to be a part
of a virtual memory system that supports distributed shared memory.
In this thesis, we present a software architecture for building object-oriented systems
that have many concurrent operations on groups of objects. The architecture allows the
system to be incrementally extended with new objects and new operations. It smoothly
implements interactions between objects on remote systems. In the course of designing the architecture, we have discovered several design patterns, and a new technique
for constructing state machines incrementally using an object-oriented approach. The
architecture is used to build a virtual memory system. The resulting system is flexible: beginning with simple virtual memory facilities, we extended it with facilities like
distributed shared memory in an orderly manner.
1.1 Contributions
This thesis makes the following contributions:
• A method for designing process coordination protocols based on
  - A family of notations to express protocols at different levels of abstraction.
  - A set of transformations to refine protocols from one level to the next.
  - Application of the method to design memory consistency protocols.
• Distributed shared memory consistency protocols that
  - Improve on the performance of existing protocols.
  - Perform well over both wide-area and local-area networks.
• A software architecture for object systems with concurrent operations on groups of
  objects. The architecture is based on:
  - Object-oriented state machines that facilitate construction of state machines
    by inheritance, composition, and other object-oriented techniques.
  - Design patterns that simplify concurrency control, remote interactions, and
    resource management.
1.2 Thesis Outline
In Chapter 2 we present our method for synthesizing distributed shared memory protocols. We begin with a review of background and related work and identify our contribution. Next, we present the basic theory and discuss the notations we use at
different levels of abstraction. After that we present a set of transformations for synthesizing the protocol implementation from a specification. We then show how to interpret the
implementation as a shared memory or message passing program.
In Chapter 3, we develop our consistency protocols. We present the evolution of consistency protocols, and highlight our approach. Then we explain and formally specify our
protocols and comment on the implementation. We conclude this part with performance
results.
In Chapter 4, we present our new software architecture. We use the design of a
virtual memory as the primary example. First we explain the usual architecture of virtual
memory systems, motivating the basic objects and operations. Then we critique it by
considering the impact of changes, and motivate the new architecture. The architecture
is discussed subsequently.
In Chapter 5, we review the contributions and identify problems for future research.
Chapter 2
A Protocol Design Technique
In this chapter, we present a new technique for designing finite state process coordination
protocols. We begin by presenting the problem and our solution in brief. Then we
examine the background research in detail, and contrast our solution with it. The rest
of the chapter presents the formal details.
2.1 Goal
2.1.1 The Problem
Protocols that describe the behavior of systems with concurrent interacting components
are difficult to design, because such systems exhibit a large variety of behaviors. A human designer may overlook undesirable interactions in the system, leading to errors such
as deadlock. So an automated method for synthesizing such systems is highly desirable. There are two types of approaches for computer-aided protocol design: verification
methods and synthesis methods. Verification methods help debug previously designed
protocols by exhaustive search of state spaces, while synthesis methods start with protocol specifications and translate them to low-level protocol implementations.
We used these methods in our research for designing distributed shared memory
consistency protocols. These protocols describe systems that have a very large number
of states. Therefore, we could only verify simplified versions of the protocols. We also
attempted to use synthesis methods. But the methods that had tool support were designed
for synthesizing peer-to-peer communication protocols, or OSI protocol stacks. Synthesis
by hand, based on specifications with algebraic or logical languages that could describe
multi-party protocols, requires manual proofs of the specification. Such proofs were
practical only for simple versions of the protocol.
Thus, while verification methods are applicable to a wide class of protocols, they
are limited by the need to describe protocols in detail, as well as the limitations of
exhaustive search. On the other hand, synthesis methods provide abstract protocol
description languages, but the methods require manual correctness proofs. Also, the
abstract descriptions can be more difficult to produce than low-level descriptions based
on communicating automata.
This experience suggested the need for a design method that could combine the
desirable attributes of both verification and synthesis methods.
2.1.2 Our Solution
We introduce an approach for dividing the task of protocol design into several steps.
At each step, the protocol is simple enough that exhaustive search is tractable, so that
verification tools can be used to establish correctness. The simplification is achieved
using notations that support abstraction and decomposition.
We introduce a family of notations that are all communicating automata, except
that the communication is expressed at different degrees of abstraction. We chose notations based on communicating automata because communicating automata are a familiar
model used in popular verification tools. In the first step, a designer uses specifications
that express process coordination as abstract predicates that suppress details of communication and control. Whole system verification is done at this step. In the second step,
the designer produces implementations of the predicates in a notation that expresses control but not the details of communication. Here we verify each predicate implementation
separately. The design method requires the implementations to have certain properties
that permit composition so that the composite system does not have to be verified again.
In the third step, the predicate implementations are translated to protocols that express
details of communication media. Again, these translations obey conditions that allow
safe composition. Composing the translations terminates the protocol synthesis. We
expand on these ideas in the following.
Step 1 In our method, the initial design is specified by automata that describe only the
desired coordination between automata, without saying how it is implemented. Thus, the
initial models are extensional, abstract, and relatively simple. The notation we use formalizes a common way to describe process coordination. For example, mutual exclusion
between two processes is often described as follows: "when one process is in its critical
region, the other should not be in its critical region." Here, we divide the execution of
a process into regions, and express coordination as a predicate on the regions of interacting processes. Our first notation describes processes as automata, and coordination
as predicates on their states. The notation suppresses details of how processes control
each other to implement the predicates, as well as details of communication, and leads
to models with small state spaces. We use verification tools to verify deadlock freedom,
liveness, and other system properties.
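As a minimal sketch of this first level, assuming a three-region process (`idle`, `trying`, `critical`) that is not taken from the thesis, the mutual exclusion predicate can be written directly over global states and checked by exhaustive search. The names `PROCESS`, `allowed`, and `explore` are hypothetical, and the search is a plain breadth-first traversal rather than any particular verification tool.

```python
from collections import deque

# Hypothetical process automaton: local state -> possible next local states.
PROCESS = {
    "idle": ["trying"],
    "trying": ["critical"],
    "critical": ["idle"],
}

def allowed(global_state):
    """Coordination predicate: at most one process is in its critical region."""
    return sum(1 for s in global_state if s == "critical") <= 1

def explore(n_procs=2):
    """Search all global states reachable under the predicate.

    Returns the set of reachable states and the list of deadlocked states
    (states in which no process can take a step).
    """
    start = ("idle",) * n_procs
    seen = {start}
    deadlocks = []
    queue = deque([start])
    while queue:
        state = queue.popleft()
        successors = []
        for i, local in enumerate(state):
            for nxt in PROCESS[local]:
                candidate = state[:i] + (nxt,) + state[i + 1:]
                # The predicate constrains which moves are permitted; how the
                # processes enforce it is deliberately left unspecified here.
                if allowed(candidate):
                    successors.append(candidate)
        if not successors:
            deadlocks.append(state)
        for s in successors:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen, deadlocks
```

Calling `explore(2)` visits every global state except the one in which both processes are critical, and reports no deadlocks. Because neither control nor communication is modeled, the state space stays small.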
Step 2 The next step in a design is to show how to implement the predicates. Most
distributed systems support communication mechanisms like message passing that allow
one process to control another process unidirectionally. Bidirectional control is achieved
by some form of request-response interaction. Our next model describes how a process
controls another by unidirectional actions. At this level, we do not model elements like
message queues. For example, consider a predicate over two processes that says: "when
one process enters region x, the other should enter y". This might be used to describe
the process of opening a TCP connection. One implementation might be: "First, the target
waits for the connector request. Then the connector sends its request and waits for the
reply". We formalize such a description with a notation where transitions in one process
may enable or disable transitions in another process. We use verification tools to ensure
that such an implementation correctly implements a predicate.
The original, abstract protocol may have several predicates. We establish conditions
to ensure that implementations of the predicates can be composed without loss of correctness. Thus, the abstract protocol can be translated to the lower-level protocol by
translating each predicate separately.
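The connection example can be sketched as follows, under assumptions that are ours rather than the thesis's: the transition labels (`request`, `accept`, `reply`), the local state names, and the rule that an enabled transition stays enabled until another transition disables it are all hypothetical. The sketch enumerates every reachable configuration and checks the predicate "the connector is established only if the target is established".

```python
from collections import deque, namedtuple

# A transition belongs to one process and may enable or disable
# transitions (by label) in any process. No message queues are modeled.
Transition = namedtuple("Transition", "proc src dst enables disables")

TRANSITIONS = {
    "request": Transition("connector", "closed", "waiting", {"accept"}, set()),
    "accept": Transition("target", "listening", "established", {"reply"}, set()),
    "reply": Transition("connector", "waiting", "established", set(), set()),
}
INITIAL_LOCAL = {"connector": "closed", "target": "listening"}
INITIALLY_ENABLED = frozenset({"request"})

def successors(config):
    """Yield configurations reachable by firing one enabled transition."""
    local_items, enabled = config
    local = dict(local_items)
    for label in sorted(enabled):
        t = TRANSITIONS[label]
        if local[t.proc] != t.src:
            continue  # enabled, but the owning process is not in the source state
        new_local = dict(local)
        new_local[t.proc] = t.dst
        new_enabled = (enabled - t.disables) | t.enables
        yield (tuple(sorted(new_local.items())), frozenset(new_enabled))

def reachable():
    """Exhaustively enumerate reachable (local states, enabled set) pairs."""
    start = (tuple(sorted(INITIAL_LOCAL.items())), INITIALLY_ENABLED)
    seen = {start}
    queue = deque([start])
    while queue:
        config = queue.popleft()
        for nxt in successors(config):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def predicate_holds(config):
    """Connector established implies target established."""
    local = dict(config[0])
    return local["connector"] != "established" or local["target"] == "established"
```

Checking `all(predicate_holds(c) for c in reachable())` plays the role of verifying one predicate implementation in isolation.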
Step 3 In the final step, we use a notation that models communication media. The
idea is that one process may "observe" the current state of another process and use the
information to choose its transitions. An observation is easily implemented in a message
passing system: a process can request the current state of another process, and wait for
the response. Similarly, in a shared (or distributed) memory system, one process may
observe the state of another process by reading shared (distributed) variables.
Observation protocols are used to implement process control: a transition of process P
that requires process Q to be in a certain state is disabled when Q is in a different state.
We use verification tools to ensure that an observation protocol correctly implements
process control, and hence the second-level protocols.
Again, we establish conditions to ensure that implementations can be composed safely.
These conditions carry over from the second-level protocols. Thus, the abstract protocol can be translated to observation-based protocols, and hence to shared memory and
message passing programs.
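A shared-memory reading of observation might look like the sketch below. It assumes each process publishes its current state in a shared table that peers may read. The table `published`, the state names, and the helper functions are hypothetical illustrations, not the thesis's notation.

```python
# Hypothetical shared table: each process writes its current state here,
# standing in for the shared variables that peers may read.
published = {"P": "waiting", "Q": "starting"}

def observe(process):
    """Return the currently published state of `process`."""
    return published[process]

def try_p_transition():
    """P's transition waiting -> running requires Q to be observed in 'ready'.

    Returns True if the transition fired, False if it was disabled.
    """
    if published["P"] == "waiting" and observe("Q") == "ready":
        published["P"] = "running"
        return True
    return False

def q_becomes_ready():
    """Q's own transition starting -> ready, visible to any observer."""
    published["Q"] = "ready"
```

P's transition is disabled until `q_becomes_ready()` has run. In a message passing setting, `observe` would instead send a state request to Q and wait for the response, as described above.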
2.1.3 Summary
Our approach has several advantages. The notations based on communicating automata
are familiar, and allow us to use popular protocol verification tools based on communicating automata. The design approach allows us to simplify protocols, first by abstraction,
and then by decomposition, so that the state space presented to a verifier is smaller. The
abstract predicates on processes can be implemented in various ways, and different implementations can coexist in a synthesized protocol. Also, we can use known algorithms
to implement the predicates, as long as we ensure that the composition conditions hold.
In this thesis, we develop the formal basis for the approach. We have used it to design
the consistency protocols for distributed shared memory, described in Chapter 3.
2.2 Background and Related Work
We describe some of the previous research on protocol design. We then relate our approach to this work.
2.2.1 Verification Systems
Verification systems (also called model checkers) such as SPIN [Hol91], SMV [McM92],
and Murφ [Dil96] are designed to verify the correctness of finite state distributed protocols. Each verification system provides a language for a precise and understandable
mathematical model of the system. For instance, SPIN uses communicating finite state
machines [BZ83]. Another formal language allows the user to specify correctness predicates; the systems mentioned above use variants of temporal logic [MP91]. Temporal
logic allows the user to express notions such as "if process P takes action a, Q will
eventually respond with action b". The systems include algorithms that examine the
complete state space of a system and verify that the state graph satisfies the correctness
criteria [Hol91].
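To make the response property concrete, the sketch below checks it on a single finite trace of events. This is only an approximation: temporal logic is interpreted over all executions of the system, whereas the hypothetical function `responds` examines one recorded run, and the event labels are invented for illustration.

```python
def responds(trace, request, response):
    """Check that every `request` event is followed later by a `response`.

    Works on one finite trace only; a model checker would establish the
    property for every execution in the state graph.
    """
    pending = False
    for event in trace:
        if event == request:
            pending = True
        elif event == response:
            pending = False
    return not pending
```

For example, `responds(["a", "x", "b"], "a", "b")` holds, while a trace that ends with an unanswered `a` does not.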
These systems have two drawbacks. First, the modeling languages are at a fairly low
level (messages in SPIN, shared memory in SMV), so that constructing the models is
tedious and error prone. Second, exhaustive exploration of the system state space can
be intractable: this is the state explosion problem. Recent research has concentrated
on developing techniques for checking correctness without exhaustive analysis, as well as
methods for managing large state spaces in limited memory.
Partial-order methods [WG93] attempt to eliminate states that arise from modeling
concurrency as interleaving. If the state transitions of two processes are independent,
then in an execution of the system, the transitions may be permuted without affecting
correctness of the execution. Therefore, instead of examining all permutations, we can
examine the state space for an arbitrary permutation of independent transitions and still
check correctness. Moreover, dependencies among the state transitions of a finite-state
system can be approximated by examining the source code. Using such dependency
information, partial order methods guide the search over a limited part of the system
state space.
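The redundancy that partial-order methods remove can be illustrated with a small sketch. It assumes two hypothetical processes `P` and `Q` whose transitions update disjoint variables, so every pair of transitions from different processes is independent. The sketch only compares final states; an actual partial-order reduction must also preserve the property being checked along the way.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every interleaving of sequences a and b that keeps each in order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Apply a schedule of (variable, increment) transitions to a fresh state."""
    state = {"x": 0, "y": 0}
    for var, delta in schedule:
        state[var] += delta
    return tuple(sorted(state.items()))

# Hypothetical processes: P touches only x, Q touches only y.
P = [("x", 1), ("x", 2)]
Q = [("y", 10), ("y", 20), ("y", 30)]
```

Here `interleavings(P, Q)` produces ten schedules, yet `run` maps all of them to the same final state, so examining one representative schedule is enough for this system.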
Symbolic model checking [McM92] uses binary decision diagrams (BDDs) to represent
the state space. The symbolic representation allows compact storage of a large state
space. Algorithms that search the state space to verify correctness can be changed to
operate directly on the BDD representation. BDDs work best for digital circuitry that
has many replicated components. Traditional state exploration may outperform BDDs
for distributed protocols [Hu95].
Fair reachability [GH85, GY84, LM95] methods force state transitions of processes
in a distributed system and explore the resulting state space. This space is smaller than
the state space generated when some processes do not take steps. The smaller space is
sufficient to check some properties like deadlocks.
Approaches based on abstraction [Lon93, PD97], abstract interpretation [Lon93], and composition [Lon93]
have been developed to present model checkers with simpler systems to verify.
These approaches use user-defined equivalence relations [Pon95], induction over replicated
components [McM92], symmetry [Ip96], language containment [Kur94], and similar techniques to eliminate irrelevant states in a system. In methods based on abstraction, it
is enough to check the abstract model to ensure that a property that holds in the abstracted system is really true of the actual system. Methods based on composition and
simulation use theorems that show how to decompose system properties of interest when
verifying components or simulations. All these methods require human intervention.
Methods for managing a large number of states in limited memory include techniques
such as supertrace [Hol91] and hash compaction [WL93]. These methods use hash tables
to remember whether a state has been reached in the exhaustive search. The hash table
only stores an approximate description of a state, so that there is a small probability
that one state is mistaken for another. Thus, the exhaustive search omits some states,
and some system errors will not be detected. On the other hand, many more states can
be stored in the same amount of memory, so that approximate search is applicable to
larger systems. State space caching methods [GHP92] use memory as a cache, trading
verification time for memory.
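The trade-off made by such hashing schemes can be sketched as follows. This is a simplified illustration, not the supertrace or hash compaction algorithm as published: the class `CompactedVisitedSet`, its `bits` parameter, and the use of SHA-256 to derive signatures are all assumptions made for the example.

```python
import hashlib

class CompactedVisitedSet:
    """Remember only a short signature of each visited state.

    Two different states with the same signature are indistinguishable,
    so the second one is wrongly treated as already visited.
    """

    def __init__(self, bits=16):
        self.bits = bits
        self.signatures = set()

    def _signature(self, state):
        # Hypothetical choice: truncate a SHA-256 digest to `bits` bits.
        digest = hashlib.sha256(repr(state).encode()).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.bits)

    def add(self, state):
        """Return True if the state looks new, False if its signature was seen."""
        sig = self._signature(state)
        if sig in self.signatures:
            return False
        self.signatures.add(sig)
        return True
```

With `bits=8` there are only 256 possible signatures, so a search over 1000 distinct states must skip most of them. Wider signatures make such omissions rare while still storing far less than the full state.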
The verification systems have a significant drawback: the use of low-level models
first introduces irrelevant system states, and then techniques like partial-order methods
attempt to extract an abstract system description.
2.2.2 High-Level Service Specication
Other research such as path expressions [Cam74] and logic of knowledge [HM90] has
concentrated on high-level notations for describing protocols.
Designers often informally describe relationships between the processes in distributed
systems in terms of what one process "knows" about another process. For instance, in
the description of TCP [Pos81] we find: "An established connection is said to be half-open if one of the TCPs has closed or aborted the connection at its end without the
knowledge of the other, ...". The logic of knowledge formalizes this notion of knowledge
so that programs may include knowledge statements directly without referring to the
method for gaining and losing knowledge. Such knowledge protocols are abstract and
easy to specify [HZ87]. Some results [CM86] hint at ways to implement gain and loss of
knowledge. But so far reducing knowledge specifications to actual programs has proven
difficult [FHMV95].
Path expressions [Cam74] are a well-known and easy to use notation for specifying process coordination. Path expressions are regular expressions that describe the
sequences of process activities in a distributed system. Campbell [Cam76] investigates
several variants of path expressions. For some restrictive types of path expressions, it
can be proven a priori that problems like deadlock do not exist, and there are known
algorithms to translate such path expressions to low-level P and V operations. But more
expressive notations may be difficult to understand and implement [Hol91].
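Since path expressions are regular expressions over activities, a trace can be checked against one with an ordinary regular expression engine. The sketch below assumes a hypothetical file-access path, written here as `(open (read | write)* close)*`. The activity names, the single-letter encoding, and the function `conforms` are illustrative and do not reproduce the notation of [Cam74].

```python
import re

# Hypothetical encoding of activities as single letters.
SYMBOLS = {"open": "o", "read": "r", "write": "w", "close": "c"}

# Path: any number of sessions, each an open, then reads or writes, then a close.
PATH = re.compile(r"(o[rw]*c)*")

def conforms(trace):
    """Return True if the sequence of activities is permitted by the path."""
    encoded = "".join(SYMBOLS[activity] for activity in trace)
    return PATH.fullmatch(encoded) is not None
```

A trace such as `["open", "read", "write", "close"]` conforms, while `["read"]` or a trace that ends inside an unclosed session does not. This only checks a completed trace after the fact; translating the expression into P and V operations, as mentioned above, would enforce the order at run time.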
While verification systems use temporal logic to specify correctness predicates, there
have been attempts to use it to specify systems. Temporal logic has been used as a
programming language [Gab87]. However, descriptions based purely on temporal logic
have proven difficult to understand in practice [Lam94].
2.2.3 Synthesis Methods
Synthesis methods translate a high-level specification to a low-level language like communicating finite state machines or CSP.
Tableau based methods [MW84] translate specications in temporal logic to languages
like CSP or Büchi automata. The synthesis method produces a model for the formula as
an existence proof. Tableaus were developed as proof techniques for mathematical logic.
A tableau is a systematic way of decomposing a logical formula into subformulae until we
reach elementary formulae. The truth of elementary formulae can be easily veried, and
the tableau structure ensures that we verify enough elementary formulae to guarantee
the truth of the original formula. When applied to temporal logic, the tableau can be
interpreted as an automaton [MW84]. The automaton is then regarded as a centralized
synchronizer for all processes that interleaves their actions so that the temporal formulae
hold on the resulting sequence of actions. But such a centralized solution is undesirable in
practice. Also, as noted above, descriptions based purely on temporal logic have proven
unwieldy.
Finite state methods are used in synthesizing communication protocols. They begin
with a description of all desirable interactions in the system to be designed, and decompose them into communicating finite state machines. But these methods are often limited
in various ways and appear to be too inflexible for use in practice [PS91]. The approach
of specifying desirable interactions seems applicable only to small systems [PS91] and
decomposition is a difficult problem [PS91].
In a related method [BZ83], the user starts with a dummy initial state for each process
in the system to be synthesized. The user then specifies message transmissions for each
process, and the synthesis software deduces the corresponding message receptions. The
software traces all possible states where a reception may occur and updates the receiver
state machine. After each update, the system warns the user if there are states without
any messages in transit and none to be transmitted. Such states correspond to deadlock
situations. Conformance to the service specification is not guaranteed by the method,
although verification methods can be used after the synthesis is complete.
Translation methods [KHvB92] translate specifications in notations like LOTOS [BvdLV95]
to message exchanges. The specifications define an ordering of operations, and the translation methods produce state machines that generate the sequences. These are suitable
where service specification can be done as sequences of operations. But specifications are
often done in other styles [VSvSB91].
2.2.4 Our Approach
Our work was inspired by research on the logic of knowledge. This research showed
that notions like "a process knows" were enough to express many interesting protocols
succinctly. The treatment by Chandy and Misra [CM86] reduced the logical operators
to an algebraic form. Path expressions [Cam74] and LOTOS [BvdLV95] were earlier
examples of the use of conjunction and disjunction predicates.
We combined these ideas with the observation that the operators could be regarded as
an abstract form of communication between communicating automata. This combination
leads to succinct specifications that can be checked by verification tools developed for
communicating automata.
The next question was how to describe the implementations of constraints without
modeling the peculiarities of communication media. This would allow us to model control
flow without the extraneous system states introduced by communication media. The
model we use here is similar to LOTOS and path expression operators that permit
specifying orders of execution. The novelty is in showing that the implementations can
be composed in such a way that they do not interfere with one another.
The final model is similar to the usual communicating finite state machines with
single-element queues. The difference is that we communicate the current state rather
than unstructured values. This makes it easy to translate the protocols to either shared
memory or message passing with optimizations.
Our approach gives a design technique that allows designers to simplify protocols
by decomposition and abstraction. Since our development, we have found that the
LOTOSPHERE [BvdLV95] project has informally described the idea of design styles that
mirror our own. They observe that experience shows that early specifications are best
described in a Constraint-oriented style, while later designs are best described in a State-oriented style.
Our design method can be seen as a formalization of this observation. This unexpected
similarity between our development and LOTOS research has strengthened our belief in
the utility of the method.
2.3 The Synthesis Method
In this section, we present the formal details of the synthesis method. We introduce
our models of process and distributed system, and show how we use automata to denote
processes. Then we discuss our three notations for describing protocols. The first notation
represents communication using abstract operations. The next two notations refine
these operators so that they can be implemented using shared memory and message passing
programs. For each notation, we show how to prove that one protocol implements
another. Then we describe some implementations for the abstract communication operators,
and show how the implementations can be expressed using shared memory and
message passing.
2.3.1 Synthesis
Let $Beh$ be a set of desired behaviors, such as a set of sequences of events. Let $L_s$ be a
specification language and $L_i$ an implementation language that specify desired subsets of
$Beh$. Let the meaning of a specification be given by the function $[\![\cdot]\!]_s : L_s \to 2^{Beh}$, while
the meaning of an implementation is given by $[\![\cdot]\!]_i : L_i \to 2^{Beh}$. Then we define the problem of
synthesis as follows.
Definition 1 A synthesis method is a total function $S : L_s \to L_i$ such that given a
specification $\sigma$, $S(\sigma)$ denotes the same behaviors as $\sigma$: $[\![S(\sigma)]\!]_i = [\![\sigma]\!]_s$.
A classic example of synthesis is the construction of a finite state machine that recognizes
a set of ASCII words $\omega$, given a regular expression that specifies $\omega$. Here, $L_s$ is the
language of regular expressions, $L_i$ the description of automata, and $Beh$ is the set of all
ASCII words. The synthesis method inductively translates the regular expression into a
finite state machine.
A protocol is a set of rules that describes how processes in a distributed system interact.
For example, a file transfer protocol is a set of rules followed by processes on two machines
in order to transfer files from one machine to another. A protocol synthesis method takes
a description of the externally visible behavior b for a set of processes, and produces
programs that the processes must execute in order to implement behavior b.
We define processes and distributed systems as sequences of abstract events. The
events represent activities such as memory accesses or message transmission. Protocols
are specified using automata. The behavior of each process is specified with an
automaton, and the joint behavior of a distributed system by the product of these
automata. The state transitions in an automaton that represents a process may depend
on the transitions of automata that represent other processes. The rules that govern this
dependence constitute a model of process communication. Our synthesis method begins
with a high-level specification with an abstract form of communication rules. Through
a series of intermediate steps, the high-level specification is translated to programs that
use shared memory or message passing for communication. In the following, we make
these ideas more precise.
2.3.2 Process and System
Definition 2 A process $P$ is a pair $(E_p, R_p)$ where $E_p$ is a finite set of events and $R_p$ is
a set of runs, a set of infinite sequences over $E_p$.
The events of two processes are disjoint: for every pair of distinct processes $P$ and $Q$, $E_p \cap E_q = \emptyset$.
Definition 3 A distributed system $\mathcal{P}$ is a pair $(E, R)$ where $E$ is $\bigcup_{P \in \mathcal{P}} E_P$, and $R$ a set
of system runs, a set of sequences over $E$ such that for every run $\rho \in R$ and every
process $P$, the projection $\rho|_P$ is a run of $P$, $\rho|_P \in R_P$.
The runs of processes are specied using automata, and the runs of a distributed system
as the product of the automata of the constituent processes.
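The projection used in Definition 3 can be illustrated with a minimal Python sketch. It works on finite prefixes of runs only (the definitions speak of infinite sequences), and the process names, event names, and the helpers `project` and `pairwise_disjoint` are hypothetical, introduced here purely for illustration.

```python
from typing import Dict, Iterable, List, Set


def pairwise_disjoint(event_sets: Iterable[Set[str]]) -> bool:
    """Check the side condition of Definition 2: no event belongs to two processes."""
    sets = list(event_sets)
    return all(
        sets[i].isdisjoint(sets[j])
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
    )


def project(run: Iterable[str], events: Set[str]) -> List[str]:
    """Keep only the events of one process, preserving their order in the run."""
    return [e for e in run if e in events]


# Hypothetical event sets for two processes P and Q (assumed, not from the thesis).
EVENTS: Dict[str, Set[str]] = {
    "P": {"p_work", "p_send"},
    "Q": {"q_work", "q_recv"},
}

# A finite prefix of a system run: an interleaving of the two processes' events.
system_prefix = ["p_work", "q_work", "p_send", "q_recv"]

p_view = project(system_prefix, EVENTS["P"])
q_view = project(system_prefix, EVENTS["Q"])
```

Under these assumptions, a system run is admissible when each such projection is (a prefix of) a run of the corresponding process.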
2.3.3 Automata
Definition 4 A finite state automaton is a tuple of the form $(\Sigma, S, \delta, I, F)$, where $\Sigma$ is
a non-empty finite alphabet, $S$ a nonempty finite set of states, $\delta$ a transition relation,
$\delta \subseteq S \times \Sigma \times S$, $I \subseteq S$ a nonempty set of starting states and $F \subseteq S$ a nonempty set
of final states. Also, all states of $S$ are reachable, i.e., for all $s \in S$, there is a sequence
$s_0, \ldots, s_n$ where $s_n = s$, $s_0 \in I$, and for $0 \le i < n$, $(s_i, a, s_{i+1}) \in \delta$ for some $a \in \Sigma$.
Let $\Gamma$ be the set of transitions $\{(s, s')\}$ such that for some letter $a \in \Sigma$, $(s, a, s') \in \delta$. A
trace $t$ of automaton $A$ on a word $w = a_0 \ldots a_{n-1}$ in $\Sigma^*$ is a sequence of states $s_0, \ldots, s_n$
where $s_0 \in I$ and for every $s_i, s_{i+1}$ in $t$, $(s_i, a_i, s_{i+1}) \in \delta$. Note that a trace can also be
thought of as a sequence of transitions, $(s_0, s_1), (s_1, s_2), \ldots$. States and transitions that
constitute a trace are said to occur in that trace. The automaton $A$ accepts a word $w$ if
the last state $s_n$ is a final state, $s_n \in F$. The set of all words accepted by an automaton
is the language $L$ of the automaton.
We are interested in automata that accept infinite words. Let $w = a_0 a_1 \ldots$ be an
infinite word and $t$ be a trace $s_0, s_1, \ldots$ over $w$. Let the limit of a trace $t$ be the set of
states that appear infinitely often in $t$, $\lim(t) = \{s \mid s = s_i \text{ infinitely often}\}$. Then $A$
accepts $w$ if there is a state $s \in F$ that appears infinitely often in $t$, $\lim(t) \cap F \neq \emptyset$. This
condition, Büchi acceptance, defines Büchi automata. The language of the automaton is
the set of infinite words $L^\omega$ that is accepted by the automaton.
Since we specify distributed systems as products of automata, we use a slightly different
presentation of the acceptance condition called generalized acceptance [GW94].
Let $\mathcal{F} = \{F_1, \ldots, F_k\}$, $F_i \in 2^S$, $k \ge 0$ be a set of sets of accepting states. Then the
automaton $A$ accepts $w$ if for every $F_i$, $\lim(t) \cap F_i \neq \emptyset$. If there is only one $F_i$, then the
condition is the same as Büchi acceptance. Here, $A$ is intended to be the product of
automata $A_1, \ldots, A_k$. The condition ensures that accepted sequences are those where
every automaton goes through its accept state infinitely often. Thus we enforce fairness
by requiring that every automaton must make progress.
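As a small illustration of the generalized acceptance condition, the following Python sketch checks it on an ultimately periodic ("lasso") trace, given as a finite prefix followed by a cycle that repeats forever. For such a trace, lim(t) is exactly the set of states on the cycle. The lasso representation and the function names are assumptions made for this sketch; it checks only the acceptance condition, not whether the trace is a legal trace of any particular automaton.

```python
from typing import Iterable, List, Sequence, Set


def limit_states(prefix: Sequence[str], cycle: Sequence[str]) -> Set[str]:
    """States visited infinitely often by the trace prefix . cycle . cycle . ..."""
    if not cycle:
        raise ValueError("an infinite trace needs a non-empty cycle")
    return set(cycle)


def accepts_generalized(
    prefix: Sequence[str],
    cycle: Sequence[str],
    acceptance_sets: Iterable[Set[str]],
) -> bool:
    """True if lim(t) intersects every acceptance set F_i (vacuously true for k = 0)."""
    lim = limit_states(prefix, cycle)
    return all(lim & set(f_i) for f_i in acceptance_sets)


# Hypothetical trace: s0 once, then s1 s2 repeated forever.
prefix: List[str] = ["s0"]
cycle: List[str] = ["s1", "s2"]

fair = accepts_generalized(prefix, cycle, [{"s1"}, {"s2"}])
unfair = accepts_generalized(prefix, cycle, [{"s1"}, {"s0"}])
```

With a single acceptance set the check reduces to ordinary Büchi acceptance, as noted above.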
Definition 5 A generalized Büchi automaton is a finite state automaton that accepts
infinite words under the generalized acceptance condition.
Henceforth, we will assume that every automaton is equipped with a generalized acceptance condition.
2.3.4 Automata and Processes
We use automata to denote processes. Intuitively, we want either automata states or
automata transitions to represent finite sequences of process events. For example, when
describing mutual exclusion, we refer to sequences of critical and non-critical states,
but when describing serial communication, we might use send and receive transitions.
Technically, we achieve this by defining the alphabet $\Sigma$ to be a set isomorphic to either
the set of states $S$ or the set of transitions $\Gamma$. In the first case, accepted words define
acceptable sequences of states. In the second case, accepted words define acceptable
sequences of transitions.
Correspondence between runs and words is established through a semantic mapping
from letters to process events. Each letter corresponds to a set of finite sequences of
process events. This mapping is inductively extended to map words (sequences over the
alphabet) to runs (sequences over events).
Let $A$ be an automaton and $P$ a process. Let $[\![\cdot]\!]$ be a function, $[\![\cdot]\!] : \Sigma_A \to 2^{E_p^*}$, such
that for every pair of distinct letters $a, b \in \Sigma_A$, $[\![a]\!] \cap [\![b]\!] = \emptyset$. The semantics is defined
as follows:
Definition 6 Automaton $A = (\Sigma, S, \delta, I, F)$ is said to denote process $P = (E, R)$ if
there exists a (nondeterministic) function $[\![\cdot]\!] : \Sigma_A \to 2^{E_p^*}$ extended to the words accepted
by $A$ as follows:
$[\![\epsilon]\!] = \epsilon$, and $[\![aw]\!] = [\![a]\!][\![w]\!]$ for letter $a$ and word $w$.
If $\Sigma$ is isomorphic to $S$, events of $P$ are represented by the states of $A$. If $\Sigma$ is isomorphic
to $\Gamma$, events of $P$ are represented by transitions of $A$.
2.3.5 Protocols
Protocols describe the behavior of distributed systems. A protocol is defined by automata
products with restrictions. In a protocol, individual automata denote process behavior,
the product denotes system behavior, and the restrictions on the product model
communication. We first define the notion of automata products without restriction.
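Definition 7 Given $A = (\Sigma_A, S_A, \delta_A, I_A, F_A)$ with acceptance condition $\mathcal{F}_A$, and
$B = (\Sigma_B, S_B, \delta_B, I_B, F_B)$ with acceptance condition $\mathcal{F}_B$, and disjoint alphabets and states,
the free product $A \times B = (\Sigma, S, \delta, I, F)$ is defined as follows:
$\Sigma = (\Sigma_A \times \Sigma_B) \cup \Sigma_A \cup \Sigma_B$
$S = S_A \times S_B$
$I = I_A \times I_B$
$F = F_A \times F_B$
Given $s = (s_A, s_B)$, $t = (t_A, t_B)$, $s, t \in S$, and $a = (a_A, a_B)$, $a \in \Sigma$, $(s, a, t) \in \delta$ if
$(s_A, a_A, t_A) \in \delta_A$ and $(s_B, a_B, t_B) \in \delta_B$.
Given $s = (s_A, s_B)$, $t = (t_A, s_B)$, $s, t \in S$, and $a = a_A$, $a \in \Sigma$, $(s, a, t) \in \delta$ if
$(s_A, a_A, t_A) \in \delta_A$.
Symmetrically for $B$.
$\mathcal{F} = \bigcup_{F_i \in \mathcal{F}_A} \{F_i \times S_B\} \cup \bigcup_{F_i \in \mathcal{F}_B} \{S_A \times F_i\}$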
With this definition, $A$ or $B$ may have individual or joint transitions in $A \times B$. A state $s_A$
is said to occur in a trace of the product $A \times B$ if it occurs in some product state $(s_A, s_B)$.
Similarly, a transition $(s_A, t_A)$ occurs in a product trace if it occurs either individually
or jointly in a product transition. The acceptance condition for infinite words requires
that a word be accepted by $A \times B$ if both $A$ and $B$ pass infinitely often through their
accepting states.
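The free product construction can be sketched in Python as follows. This is a minimal illustration under assumed conventions: each automaton is a plain dictionary with keys `states`, `init`, `delta` (a set of `(source, letter, target)` triples) and `accept` (a list of acceptance sets), and the two example automata are hypothetical. Joint letters are represented as pairs and individual letters as plain strings, so the three parts of the product alphabet stay distinct.

```python
from itertools import product


def free_product(A, B):
    """Build the unrestricted product of two automata given as dictionaries."""
    delta = set()
    # Joint moves: both components step together on a paired letter.
    for (sa, a, ta) in A["delta"]:
        for (sb, b, tb) in B["delta"]:
            delta.add(((sa, sb), (a, b), (ta, tb)))
    # Individual moves of A: B stays in its current state.
    for (sa, a, ta) in A["delta"]:
        for sb in B["states"]:
            delta.add(((sa, sb), a, (ta, sb)))
    # Individual moves of B: A stays in its current state.
    for (sb, b, tb) in B["delta"]:
        for sa in A["states"]:
            delta.add(((sa, sb), b, (sa, tb)))
    # Generalized acceptance: lift every acceptance set of each component.
    accept = [set(product(f, B["states"])) for f in A["accept"]]
    accept += [set(product(A["states"], f)) for f in B["accept"]]
    return {
        "states": set(product(A["states"], B["states"])),
        "init": set(product(A["init"], B["init"])),
        "delta": delta,
        "accept": accept,
    }


# Two hypothetical two-state cyclic automata with disjoint states and letters.
A = {"states": {"a0", "a1"}, "init": {"a0"},
     "delta": {("a0", "x", "a1"), ("a1", "y", "a0")}, "accept": [{"a0"}]}
B = {"states": {"b0", "b1"}, "init": {"b0"},
     "delta": {("b0", "u", "b1"), ("b1", "v", "b0")}, "accept": [{"b0"}]}

AB = free_product(A, B)
```

A protocol in the sense used below would then be obtained by deleting a set of restricted transitions from `AB["delta"]` and keeping only what remains reachable; that step is not shown here.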
In the free product, A and B execute their state transitions independently. But if A
and B communicate, the product cannot have all possible transitions. This observation
motivates the following definition of a protocol.
Definition 8 A protocol $p$ with automata $A$ and $B$ is the free product $A \times B$ and a
set of restrictions (a subset of all transitions) $C \subseteq \delta$ that describes an automaton such
that $\Sigma_p = \Sigma$, and all the other sets are defined by the states reachable from $I$ via the
transitions in $\delta - C$.
Note that although a protocol is an automaton, we prefer to specify it as a free product
with restrictions. This allows us to reason separately about the structure of component
processes (represented by the free product) and communication (represented by the restriction). Different ways of specifying the automata and $C$ are used at different stages
of the synthesis method.
2.3.6 Protocol Synthesis
Having defined protocols, we can now define protocol synthesis. First we define the
notion of protocols that serve as specification and implementation. This is based on the
standard notion of language substitution [HU79]. A substitution is a mapping from an
alphabet $\Sigma$ to subsets of $\Sigma'^*$. The mapping is used to transform words in a language
$L(\Sigma)$ over $\Sigma$ to words in a language $L(\Sigma')$ over $\Sigma'$. We use substitution to transform a
specification into an implementation.
Let $l \in 2^{\Sigma^*}$ be a set of finite words over some alphabet $\Sigma$. Let $\alpha(l)$ be the
letters of $\Sigma$ used in the words of $l$. The languages $l, m$ over $\Sigma$ are distinct if they use
different letters, $\alpha(l) \cap \alpha(m) = \emptyset$.
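Substitution and the notion of distinct languages can be illustrated with a short Python sketch. It assumes that letters are single characters and words are strings, which is a simplification made only for this example; the function names and the sample substitution are hypothetical.

```python
from itertools import product
from typing import Dict, Set


def substitute(word: str, sub: Dict[str, Set[str]]) -> Set[str]:
    """Replace every letter of `word` by any word from its image, in all combinations."""
    choices = [sorted(sub[letter]) for letter in word]
    return {"".join(parts) for parts in product(*choices)}


def letters(language: Set[str]) -> Set[str]:
    """The letters actually used by the words of a language."""
    return {ch for w in language for ch in w}


def distinct(l: Set[str], m: Set[str]) -> bool:
    """Two languages are distinct if they share no letters."""
    return letters(l).isdisjoint(letters(m))


# Hypothetical substitution: letter 'a' refines to 'x' or 'xy', letter 'b' to 'z'.
SUB = {"a": {"x", "xy"}, "b": {"z"}}

refined = substitute("ab", SUB)
images_distinct = distinct(SUB["a"], SUB["b"])
```

Because the images of `a` and `b` use different letters, each refined word can be traced back to the specification letters it came from, which is the role distinctness plays in the definition that follows.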
Definition 9 A protocol $q$ implements a protocol $p$ (conversely, $p$ specifies $q$) if:
- Every automaton of $p$ corresponds to exactly one automaton of $q$.
- There is a nondeterministic refinement function $\phi : \Sigma_p \to 2^{\Sigma_q^*}$ such that for every
  pair of distinct letters $a, b \in \Sigma_p$, the languages $\phi(a)$ and $\phi(b)$ are distinct, and the
  words accepted by $q$ are just the refined words accepted by $p$ (extending $\phi$ to words
  by induction).
- Let $P$ be an automaton of $p$ and $Q$ be the automaton of $q$ that corresponds to $P$.
  Then there is a nondeterministic function $\phi_a : \Sigma_P \to 2^{\Sigma_Q^*}$ that maps letters of $\Sigma_P$
  to distinct languages, and the words accepted by $Q$ are just the refined accepted
  words of $P$.
This definition captures the ideas that a protocol implementation
- is composed from automata with terminating executions,
- adds detail to the specification, and
- refines every step of the specification in a distinct way.
Note also that the relation between specification and implementation is defined in terms
of alphabets. Since words can describe either sequences of states or sequences of transitions, an implementation may refine either states or transitions.
Finally, notice that the relationship of implementation to specification is defined in
terms of relationships between the components, the automata and their states. Thus, we
can reason about the (infinite) words via straightforward induction. We may now define
protocol synthesis simply as follows:
Definition 10 A protocol synthesis method is a function that, given a specification
protocol, produces its implementation protocol.
By comparing Definition 6 and Definition 9, it is clear that we can always choose denotations
such that implementations that conform to Definition 9 preserve behavior. Indeed,
in practice we refine a protocol several times until the events of the process of interest
have a one-to-one relationship to transitions of the protocol; the final step is an ordinary
shared memory or message passing program.
2.4 Specifying Coordination
Automata in a protocol affect each other's state transitions. The rules that describe the
effects model interprocess communication. We use a variety of rules to specify protocols.
The most extensional, abstract protocol specifications are given by Constraint rules.
Constraint-rule specifications are implemented by Action-rule specifications. Action
rules include more details of communication. In turn, Action-rule specifications are
implemented by Observation-rule specifications. Observation-rule specifications are intensional;
they are sufficiently detailed so that they can be easily translated to shared
memory or message passing programs.
System properties like absence of deadlock and reachability are verified once and for
all at the most abstract level for Constraint-rule specifications. The subsequent syntheses
preserve these properties.
A property $\Pi$ is defined as a set of words over the alphabet of interest [Alp86]. A
word $w$ has a property $\Pi$ if $w \in \Pi$. A language $L_i$ preserves properties of language $L_s$
if for every property $\Pi_s$ of $L_s$, there is a unique property $\Pi_i$ of $L_i$, and if a word $w_s$ has
property $\Pi_s$, then the corresponding word $w_i$ does not have any property disjoint from
$\Pi_i$.
Lemma 1 An implementation preserves properties of its specification.
Proof. From Definition 9, by induction, every word accepted by an implementation
protocol corresponds to exactly one word accepted by the specification protocol. Therefore,
a property $\Pi_s$ of the specification maps to a unique property $\Pi_i$. Furthermore,
if $\Pi'_i$ is a property disjoint from $\Pi_i$, and $w_s \in \Pi_s$ is a word accepted by the specification
with a corresponding word $w_i$ of the implementation, then $w_i \notin \Pi'_i$ (and $w_i \in \Pi_i$). Thus,
the implementation preserves properties.
Action-rule specifications are derived from Constraint-rule specifications by translating
constraint rules to action rules, and Observation-rule specifications are derived from
Action-rule specifications by translating action rules to observation rules. We show that
each translation leads to automata that are implementations of the corresponding specification.
2.4.1 Constraint-Rule Specifications
Constraint-rule specifications express essential coordination among processes. A specification
describes desirable sequences of process behavior as succinctly as possible. Constraint-rule
specifications are extensional: they describe the effects of coordination, but not the
details of how processes implement coordination or properties of communication media.
Constraint-rule specifications are designed so that techniques for specifying and verifying
protocol properties are easily applicable. For the purposes of this thesis, two types of
constraints suffice. One constraint requires that processes synchronize their behavior,
and the other specifies that behaviors be disjoint.
In the following, we define Constraint-rule specifications, and show a simple example,
the dining philosophers. We discuss the advantages and disadvantages of this style of
specification. Then we explain how protocol properties can be checked at this level using
verification algorithms.
2.4.1.1 Definitions
Let $P = (E_p, R_p)$ be a process, and $\mathcal{E}_p$ be a partition of the set of events $E_p$ into regions.
This definition formalizes intuitive notions like the "critical region" used to describe parts
of the run of a process. Define a bijection $[\![\cdot]\!] : \Sigma \to \mathcal{E}_p$. Let $A = (\Sigma, S, \delta, I, F)$, where
$\Sigma = S$ and a word corresponds to the sequence of states of the trace on that word. $A$ is
said to be a region automaton if it denotes $P$ using $[\![\cdot]\!]$ according to Definition 6. Every
state of a region automaton denotes a distinct region of that process. Constraint-rule
specifications are protocols that use region automata.
Let $A_1, \ldots, A_n$ be automata of some protocol $p$. A Constraint is defined to be a
tuple $(s_i, \ldots, s_l) \in S_i \times \ldots \times S_l$ for some set of automata $\{A_i, A_j, \ldots, A_l\}$ of $p$. An
automaton $A_i$ appears in a constraint if some state $s_i$ appears in the constraint. A
constraint $c$ is conjunctive (denoted $(\wedge : s_i, \ldots, s_l)$) if for every state $(h_1, \ldots, h_n)$ of $p$,
if $h_j = s_j$ for some automaton $A_j$, then $h_k = s_k$ for all automata $A_k$ that appear
in $c$. A constraint $c$ is disjunctive (exclusive-or) (denoted $(\vee : s_i, \ldots, s_l)$) if for every
state $(h_1, \ldots, h_n)$ of $p$, if $h_j = s_j$ for some automaton $A_j$, then $h_k \neq s_k$ for all other automata $A_k$ that appear in $c$.
Conjunctive constraints force all states of a constraint to appear simultaneously in a
global state, while disjunctive constraints allow at most one state from a constraint to
appear in any global state.
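The two kinds of constraint can be read as predicates on global states, as in the following Python sketch. The encoding is an assumption made for illustration: a global state is a tuple of component states, and a constraint is a dictionary from the index of an automaton to the constrained state of that automaton.

```python
from typing import Dict, Tuple

Constraint = Dict[int, str]   # automaton index -> constrained state (hypothetical encoding)
GlobalState = Tuple[str, ...]


def satisfies_conjunctive(constraint: Constraint, state: GlobalState) -> bool:
    """If any constrained state is occupied, then all of them must be occupied."""
    hits = [state[i] == s for i, s in constraint.items()]
    return all(hits) or not any(hits)


def satisfies_disjunctive(constraint: Constraint, state: GlobalState) -> bool:
    """If one constrained state is occupied, no other constrained state may be."""
    return sum(state[i] == s for i, s in constraint.items()) <= 1


# Hypothetical constraint over automata 0 and 2, both constrained on a state named "f".
c = {0: "f", 2: "f"}

both = ("f", "t", "f")      # both constrained states occupied
one = ("f", "t", "h")       # only automaton 0 is in its constrained state
neither = ("t", "t", "t")   # no constrained state occupied
```

Under this reading, `both` satisfies the conjunctive form but violates the disjunctive one, `one` does the opposite, and `neither` satisfies both forms.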
Definition 11 A Constraint-rule specification is a set of automata $A_1, \ldots, A_n$ with a
set of constraints $C$ that defines a protocol composed of the automata and the smallest
restriction that satisfies $C$. Each constrained state in an automaton must be preceded
and followed by internal states that do not appear in any constraint.
The internal states act as placeholders that simplify the refinement from Constraint-rule
specifications to Action-rule specifications. Protocol 1 shows how to specify a three-process
dining-philosophers protocol. We denote automata by giving the states and
transitions. The internal states are not shown in the protocols for simplicity. For example,
before $f$, there is the state $n$; after $f$ there is an unseen internal state.
Protocol 1
Automata:
$P_1 : t_1 \to f_1 \to g_1 \to e_1 \to t_1$
$P_2 : t_2 \to g_2 \to h_2 \to e_2 \to t_2$
$P_3 : t_3 \to h_3 \to f_3 \to e_3 \to t_3$
Rules:
$(\vee : f_1, f_3)$
$(\vee : g_1, g_2)$
$(\vee : h_2, h_3)$
In the protocol, $P_i$ are the philosophers, $t_i$ is the region where a philosopher is thinking,
$e_i$ where a philosopher is eating, and $f_i, g_i, h_i$ are regions where philosophers pick up one
fork on either side. The constraints require that at any time, only one philosopher may
pick up a given fork.
2.4.1.2 Advantages and Disadvantages
Constraint-rule specifications are abstract. We specify only the desired effects of communication,
abstracting from details such as memory variables and queues. In addition, we can
use the generalized acceptance condition of Section 2.3.3 to require that the implementation
of disjunctive constraints will be fair. As a result, Constraint-rule specifications
result in small models; in Protocol 1, the total state space is at most $4^3 = 64$ states. Therefore,
it is easy to see that the protocol is incorrect: the three philosophers may each pick up
forks $f$, $g$, and $h$, and block permanently waiting for the other fork. But this model is
not easily implemented in practice.
By comparison, in systems such as SPIN [Hol91], the problem would be modeled
by processes that communicate over message queues. Both forks and philosophers are
modeled by processes. Suppose that the fork processes had three states, fork-here, fork-left,
and fork-right. In addition, they remember the last philosopher who had the fork to
implement fairness. This means each fork has six states, and the system has
$4^3 \times 6^3 = 13824$ states, more than 12000 states for such a tiny system. The advantage of the detailed
model is that if forks indeed represent resources that are accessed by messages, deriving
an implementation is easy.
In our method, we resolve the dilemma between abstraction and implementation by
devising ways to generate implementations from the abstract specifications. Therefore,
it becomes feasible to use abstract specifications.
2.4.1.3 Verifying Correctness
Two main approaches are used to verify the properties of protocols.
Reachability analysis [Wes78] searches the global protocol state space to find states
or sequences of states that violate correctness properties. For example, a deadlock state
is a system state where no process can take a step.
Model checking [CES86] specifies the desired properties of a protocol using an (undesirable)
property automaton. The transitions of the property automaton are described
by predicates that express properties of the state space of the original system. Thus,
the property automaton describes "bad" runs of the system. The protocol violates the
property if there is any word that is accepted by both the protocol automaton and the
property automaton. Thus, we have to detect cycles in the product of the property
automaton and the protocol automaton.
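The cycle detection step can be sketched in Python as follows. This is a simplified illustration, not the algorithm of any particular model checker: it assumes the product of the protocol and property automata has already been built as a successor map, uses a single acceptance set (plain Büchi acceptance), and relies on two naive reachability passes rather than an optimized nested search. The graph and the names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def reachable(graph: Dict[str, List[str]], sources: Iterable[str]) -> Set[str]:
    """All states reachable from `sources` (including the sources themselves)."""
    seen = set(sources)
    stack = list(seen)
    while stack:
        s = stack.pop()
        for t in graph.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def has_accepting_cycle(graph, init, accepting) -> bool:
    """True if some accepting state is reachable from `init` and lies on a cycle."""
    for f in reachable(graph, init) & set(accepting):
        # f lies on a cycle exactly when it can be reached again from its successors.
        if f in reachable(graph, graph.get(f, ())):
            return True
    return False


# Hypothetical product graph: a -> b -> c -> b, plus an unreachable self-loop on d.
PRODUCT = {"a": ["b"], "b": ["c"], "c": ["b"], "d": ["d"]}

violation = has_accepting_cycle(PRODUCT, {"a"}, {"c"})
```

An accepting cycle corresponds to an infinite word accepted by both automata, that is, a "bad" run that the protocol actually exhibits.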
Both approaches are applicable to Constraint-rule specifications. For
example, an exhaustive search of the state space of Protocol 1 quickly reveals states
where no process can take a step.
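Such an exhaustive search might look like the following Python sketch. It rests on assumptions that are not spelled out in the text: internal states are omitted, exactly one philosopher moves at a time (interleaving, no joint transitions), and a move is permitted only if the resulting global state satisfies every disjunctive constraint. The encoding and names are hypothetical.

```python
from collections import deque

# Region cycles of the three philosophers in Protocol 1 (internal states omitted).
CYCLES = {
    0: ["t", "f", "g", "e"],
    1: ["t", "g", "h", "e"],
    2: ["t", "h", "f", "e"],
}

# Disjunctive constraints as sets of (philosopher index, region) pairs.
CONSTRAINTS = [
    {(0, "f"), (2, "f")},
    {(0, "g"), (1, "g")},
    {(1, "h"), (2, "h")},
]


def satisfies(state):
    """At most one constrained region of each constraint may be occupied."""
    return all(sum(state[i] == r for (i, r) in c) <= 1 for c in CONSTRAINTS)


def successors(state):
    """Interleaving semantics (an assumption): one philosopher advances one region."""
    for i, cycle in CYCLES.items():
        nxt = cycle[(cycle.index(state[i]) + 1) % len(cycle)]
        candidate = state[:i] + (nxt,) + state[i + 1:]
        if satisfies(candidate):
            yield candidate


def explore(initial):
    """Breadth-first search returning the reachable states and the deadlocked ones."""
    seen, queue, deadlocks = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        nexts = list(successors(state))
        if not nexts:
            deadlocks.append(state)
        for n in nexts:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen, deadlocks


reachable_states, deadlocks = explore(("t", "t", "t"))
```

Under these assumptions the search reports the single deadlocked state in which the philosophers occupy regions f, g, and h respectively, matching the informal argument given earlier.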
The main problem with these techniques is that exhaustive search can be intractable.
By virtue of their level of abstraction, Constraint-rule specifications have small state
spaces for many protocols of interest, so verification is not difficult.
Constraint-rule specifications can also exploit the research in minimizing state exploration
[WG93]. There has been a great deal of interest in methods for detecting and
exploiting regular patterns in the state space. These methods analyze the models to
determine regularities. When models use variables with assignments, dependency detection
can be tricky. In contrast, dependencies among processes are explicitly given by
Constraint-rule specifications.
2.4.2 Action-Rule Specifications
Action-rule specifications implement Constraint-rule specifications. They are more intensional,
in that they show how processes coordinate their actions to implement constraints.
Rules for coordinating actions are binary, that is, any action in one process may control
at most one action of one other process. This reflects the usual peer-to-peer communication
available in practical systems. For our purposes, we need rules where an action can either
disable or enable other actions, or one out of two possible actions may be chosen.
In the following, we define Action-rule specifications and present a simple example.
Then we explain how to translate a Constraint-rule specification into an Action-rule
specification. Constraints in a Constraint-rule specification can be implemented in many
ways in an Action-rule implementation. So an abstract Constraint-rule specification can
specify several Action-rule implementations.
An action is some behavior in one process that affects the behavior of another process.
For example, a process $P$ may set the value of a shared variable read by process $Q$,
affecting its behavior. Formally, an action is just a transition. We use the word action
to distinguish the transitions of an Action-rule specification from the transitions of
Constraint-rule specifications or Observation-rule specifications.
Definitions Let $P = (E_p, R_p)$ be a process, and $\mathcal{E}_p$ be a partition of the set of events $E_p$
into regions. Define a bijection $[\![\cdot]\!] : \Sigma \to \mathcal{E}_p$. Let $A = (\Sigma, S, \delta, I, F)$, where $\Sigma = \Gamma$, and a
word corresponds to the sequence of transitions of the trace on that word. $A$ is said to be
a branch automaton if it denotes $P$ using $[\![\cdot]\!]$ according to Definition 6. Every transition of
a branch automaton denotes a distinct region of that process. Action-rule specifications
are protocols that use branch automata. A transition of a branch automaton is called an
action.
Let $A_1, \ldots, A_n$ be branch automata of some protocol $p$. An action rule is a pair
of actions $((s_i, t_i), (s_j, t_j))$ from two distinct automata $A_i$ and $A_j$. Actions refer to the
transitions of constituent automata such as $A_i$.
Let $G$ be a state of $p$ that appears in some trace. Then the next state $H$ depends on
the actions executed from $G$. The action that leads from $G$ to $H$ must be enabled in $G$.
An action is disabled or enabled relative to a product state in a trace. Therefore, an action
disabled (enabled) in one product state of a trace may be enabled (disabled) in another
product state that occurs later in the trace. An Action-rule specification is required to
be consistent, so that every action is either enabled or disabled in a product state. The
following rules describe how actions are selected in an Action-rule specification.
An action rule is an enabling rule, $((s_i, t_i) \Rightarrow (s_j, t_j))$, if $(s_i, t_i)$ is the only action taking
$s_i$ to $t_i$, and for every product state $G = (g_1, \ldots, g_n)$ where $g_i = t_i$ and $g_j = s_j$, there is
a transition in the product to a state $H = (h_1, \ldots, h_n)$ with $h_j = t_j$. At every state in a
trace, all actions that are enabled are executed.
An action rule is a disabling rule, $((s_i, t_i) \not\Rightarrow (s_j, t_j))$, if $(s_i, t_i)$ is the only action taking
$s_i$ to $t_i$, and for every product state $G = (g_1, \ldots, g_n)$ where $g_i = t_i$ and $g_j = s_j$, there is no
transition in the product to a state $H = (h_1, \ldots, h_n)$ with $h_j = t_j$. A disabled transition
remains disabled in a trace unless enabled by a subsequent action in that trace.
An action rule is a choice rule, $((s_i, t_i) \vee (s_j, t_j))$, if for every product state
$G = (g_1, \ldots, g_n)$ where $g_i = s_i$ and $g_j = s_j$, there is no transition in the product to state
$H = (h_1, \ldots, h_n)$ with both $h_i = t_i$ and $h_j = t_j$. Choice is assumed to be fair over a
trace.
An action rule is a condition rule, $((s_i, t_i) \stackrel{?}{\Rightarrow} (s_j, t_j))$, if $(s_i, t_i)$ is the only action taking
$s_i$ to $t_i$, $(s_j, t_j)$ is disabled by some action $(u_i, v_i)$, and for every product state
$G = (g_1, \ldots, g_n)$ where $g_i = t_i$ and $g_j = s_j$, there is a transition in the product to state
$H = (h_1, \ldots, h_n)$ with $h_j = t_j$. Intuitively, the transition $(s_j, t_j)$ is enabled by $(s_i, t_i)$
provided $(u_i, v_i)$ has disabled it previously. Otherwise, the transition $(s_j, t_j)$ does not
need enabling.
Definition 12 An Action-rule specification is a set of automata $A_1, \ldots, A_n$ with a set
of action rules $R$ that defines a protocol composed of the automata and the smallest
restriction that satisfies $R$. The specification must be consistent so that at every product
state, an action is either disabled or enabled.
Protocol 2 below presents a simple protocol for one-shot mutual exclusion that works
when there are no cycles in the process. Starting with the state $(s_1, s_2)$, either one of the
transitions $(s_1, t_1)$ or $(s_2, t_2)$ is chosen. The selected transition bars the progress of the
other process. That process waits until it is enabled by another transition of the executing
process. The states $u_1, u_2$ are the critical regions. If $(s_1, t_1)$ is chosen, it disables $(t_2, u_2)$.
$(t_2, u_2)$ is enabled after $P_1$ executes $(u_1, v_1)$.
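Protocol 2
Automata:
$P_1 : s_1 \to t_1 \to u_1 \to v_1$
$P_2 : s_2 \to t_2 \to u_2 \to v_2$
Rules:
$((s_1, t_1) \vee (s_2, t_2))$
$((s_1, t_1) \not\Rightarrow (t_2, u_2))$, $((s_2, t_2) \not\Rightarrow (t_1, u_1))$
$((u_1, v_1) \Rightarrow (t_2, u_2))$, $((u_2, v_2) \Rightarrow (t_1, u_1))$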
In the following section, we apply Definition 9 to show how Action-rule
specifications can implement Constraint-rule specifications.
2.4.2.1 Proving Implementation
Definition 9 relates an implementation to a specification via substitution. Recall that
substitution maps the letters of one language over $\Sigma$ to words over $\Sigma'$, transforming the
words of a language $L(\Sigma)$ to those of $L(\Sigma')$. This transformation over languages can
also be defined as a transformation on automata. Let us first consider the transformation
informally.
Let $(s, a, t)$ be the sole transition of an automaton $A$, where $s$ is the sole initial state,
$t$ the final state and $a$ a letter. Let $\phi$ be a refinement function that maps the letter $a$
to some finite language. Then the automaton $A$ can be transformed to an automaton
$A'$ that recognizes $\phi(a)$ simply by replacing the sole transition by the automaton that
recognizes $\phi(a)$.
Given an automaton with exactly two transitions $(s, a, t)$ and $(t, b, u)$, substitutions
over the letters $a$ and $b$ may be implemented by separately replacing the transitions. If
the replacement automata have exactly one initial and one final state, then we may
compose the replacements by identifying the final state of the first replacement with the
initial state of the second replacement. If they have multiple initial and final states, we can
connect the final states of the first replacement automaton to the initial states of the
second replacement by epsilon transitions. The epsilon transitions can be removed by the usual
determinization algorithms. Thus, the translation involves composing the replacement
automata.
Similarly, an implementation protocol defined using Action-rules is derived from a
specification protocol defined using Constraint-rules by replacing each constraint of a
Constraint-rule specification with Action-rule specifications and composing them. We
require that when considered in isolation, each replacement implements a constraint according
to Definition 9. The action rules that define one replacement do not affect the
rules that define the other replacements. Therefore, one replacement does not interfere
with another replacement when composed. As a result, the overall Constraint-rule
specification is implemented by the Action-rule specification produced by composing
the replacements. These ideas are formalized as follows. First we formally describe the
form of replacements that we will use. We use the term replacement when talking about
the implementations of individual constraints in a Constraint-rule specification, while
using implementation to refer to the entire Action-rule specification that results upon
substituting every constraint with its replacement.
Definition 13 An Action-rule replacement is a protocol that implements Constraint-rule
specifications of the form
Automata:
$P_1 : n_1 \to s_1 \to o_1$
$P_2 : n_2 \to s_2 \to o_2$
$\vdots$
Rules:
Either $(\wedge : s_1, s_2, \ldots)$ or $(\vee : s_1, s_2, \ldots)$
and the states $n$ and $o$ do not appear in any constraint rule; the replacement must have
exactly one initial and one final state, and must remain an implementation if $n$ and $o$ are
identified.
We require that the replacement be an implementation even when $n$ and $o$ are identified
so that cycles in the Constraint-rule specification do not invalidate correctness. Note that
n and o are intended to be internal states that represent what happens before and after
a constraint. Given replacements in this form, we can safely compose the replacements.
The composition is dened as follows.
Definition 14 An Action-rule specification is said to be synthesized from a Constraint-rule
specification given the following construction: Given a set of replacements for every
constraint in a Constraint-rule specification, identify the first and last states of the
replacement for each transition in the Constraint-rule specification for every automaton.
Replace unconstrained states by a single new transition.
With this definition for composing replacements, it is not difficult to argue that the
composition produces an implementation.
Theorem 1 An Action-rule specification A synthesized from a Constraint-rule
specification C is an implementation of C.
Proof. To prove that A implements C (Definition 9), we have to prove for each
automaton, and for the product, that we can define a function that maps each letter in the
alphabet of the specification to a distinct finite language of finite words. In this case, the
alphabet is just the possible product states. So we will show that for each product state
the function may be defined.
First, we note that traces of a replacement remain unchanged even when it is used in
the synthesis. Consider a product state in the synthesized Action-rule implementation
such that in this state, some replacement automaton is in its initial state. Then, since
the action rules affect only the actions defined within the replacement, the transitions of
the replacement are unaffected by transitions implementing constraints on other states.
Therefore, the replacement transitions can always be executed in every reachable state.
Initial states in a Constraint-rule specification do not take part in any constraint.
They are replaced by a single transition each. Therefore, the initial product has an
image.
Each new state is reached by a constraint rule. Each transition therefore executes
transitions of the replacement that implement the constraint. Define a map such that
the replacement transitions corresponding to the n and o states map to internal states.
The transitions that map to the s states define a suitable image for the product state.
Thus, we have the result.
If we have a set of replacements, we can produce an implementation. But deriving a
replacement for conjunctive constraints requires us to address the issue of simultaneity.
Simultaneity A Constraint-rule specification allows joint transitions in a conjunctive
constraint: all the processes "simultaneously" satisfy the conjunctive constraint.
Action-rule specifications have no synchronous transitions; however, they can define order
relations like before and after. Therefore, we interpret simultaneity by requiring that
"simultaneous" transitions appear after their predecessors and before their successors.
Formally, we have
Definition 15 Let a function f be the refinement function for a branch automaton that
maps the states of a conjunctive constraint over processes P_i such that (∧ : s1, s2, ...),
and states n_i and o_i are the predecessor and successor states. Let t_i, u_i, v_i be transitions
in an implementation of the constraint such that f(t_i) ↦ n_i, f(u_i) ↦ s_i, f(v_i) ↦ o_i.
Then if in every trace, u_i follows all occurrences of t_i and precedes all occurrences of v_i
for all processes, then for all i, transitions u_i are said to be simultaneous.
For example, consider the Constraint-rule protocol
Automata:
A1 : n1 → s1 → o1
A2 : n2 → s2 → o2
Rules:
(∧ : s1, s2)
One possible replacement is the protocol:
Automata:
A1 : x1 → y1 → z1 → a1
A2 : x2 → y2 → z2 → a2
Rules:
((x1, y1) ⇒ (y2, z2)), ((x2, y2) ⇒ (y1, z1))
((y1, z1) ⇒ (z2, a2)), ((y2, z2) ⇒ (z1, a1))
We define the map from implementation to specification so that the transitions (y, z)
map to the s states, while the (x, y) and (z, a) transitions map to the n and o states. By
Definition 15, the (y, z) transitions of the processes are simultaneous.
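The claim can be checked mechanically for a single pass through the replacement. The following Python sketch is illustrative only: it assumes that a rule (a ⇒ b) means that transition a must occur before transition b, it ignores cyclic executions, and the names allowed_traces and is_simultaneous are ours. It enumerates every interleaving permitted by the four rules and tests the condition of Definition 15 on each.

```python
from itertools import permutations

# Transitions of the two replacement automata, in program order.
A1 = [("x1", "y1"), ("y1", "z1"), ("z1", "a1")]
A2 = [("x2", "y2"), ("y2", "z2"), ("z2", "a2")]

# Assumed reading of a rule (a => b): transition a must occur before b.
ENABLES = [
    (("x1", "y1"), ("y2", "z2")), (("x2", "y2"), ("y1", "z1")),
    (("y1", "z1"), ("z2", "a2")), (("y2", "z2"), ("z1", "a1")),
]


def allowed_traces():
    """Yield every interleaving that respects program order and the rules."""
    for trace in permutations(A1 + A2):
        pos = {t: i for i, t in enumerate(trace)}
        in_order = all(pos[a] < pos[b]
                       for seq in (A1, A2) for a, b in zip(seq, seq[1:]))
        enabled = all(pos[a] < pos[b] for a, b in ENABLES)
        if in_order and enabled:
            yield list(trace)


def is_simultaneous(trace, before, during, after):
    """True if every `during` transition follows all `before` transitions
    and precedes all `after` transitions in this trace."""
    last_before = max((i for i, t in enumerate(trace) if t in before), default=-1)
    first_after = min((i for i, t in enumerate(trace) if t in after),
                      default=len(trace))
    return all(last_before < i < first_after
               for i, t in enumerate(trace) if t in during)


BEFORE = {A1[0], A2[0]}   # the (x, y) transitions, mapped to n
DURING = {A1[1], A2[1]}   # the (y, z) transitions, mapped to s
AFTER = {A1[2], A2[2]}    # the (z, a) transitions, mapped to o
TRACES = list(allowed_traces())
ALL_SIMULTANEOUS = all(is_simultaneous(t, BEFORE, DURING, AFTER) for t in TRACES)
```

Under these assumptions the rules leave only the interleavings in which both (x, y) transitions precede both (y, z) transitions, which in turn precede both (z, a) transitions.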
In the following section, we consider Observation-rule specifications and show how
they can replace Action-rule specifications.
2.4.3 Observation-Rule Specifications
Observation-rule specifications implement Action-rule specifications. They model process
communication at a detailed level. One process can "observe" the state of another
process, and choose one of several state transitions as its next transition. Action
coordination is implemented via observations.
In the following, we define Observation-rule specifications and present a simple example.
Then we explain how to translate an Action-rule specification into an Observation-rule
specification.
Definitions Observation-rule specifications also use branch automata like Action-rule
specifications.
Let A1, ..., An be (branch) automata of some protocol p.
An observation is a tuple ((s_i, t_i), X_j) of a transition from one automaton (A_i) and
a set of states X_j from another automaton (A_j). A transition (s_i, t_i) is said to be based
on an observation if there is an observation with that transition. Every transition may
be based on at most one observation.
Given two observations ((s_i, t_i), X_j) and ((s_i, u_i), Y_j), X_j ∩ Y_j = ∅. That is, transitions
based on observations are deterministic.
A state s_j is observable if there is an observation ((s_i, t_i), O_j) where s_j ∈ O_j.
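The disjointness condition is easy to check on a concrete rule set. The sketch below is a minimal illustration, assuming an observation is represented as a pair of a (source, target) transition and a Python set of observed states; the function names are hypothetical. The example data are the two observations of P1 that leave n1 in Protocol 3, with primed states written as w1'.

```python
from itertools import combinations


def nondeterministic_pairs(observations):
    """Return pairs of distinct transitions that leave the same source state
    and whose observed sets overlap, violating the disjointness condition."""
    clashes = []
    for (tr_a, seen_a), (tr_b, seen_b) in combinations(observations, 2):
        if tr_a[0] == tr_b[0] and tr_a != tr_b and seen_a & seen_b:
            clashes.append((tr_a, tr_b))
    return clashes


def observable_states(observations):
    """Collect every state that appears in some observed set."""
    result = set()
    for _, seen in observations:
        result |= seen
    return result


# Observations of P1 from state n1 (taken from Protocol 3).
P1_OBS = [
    (("n1", "w1"), {"n2", "w2", "c2"}),
    (("n1", "w1'"), {"n2'", "w2'", "c2'"}),
]
CLASHES = nondeterministic_pairs(P1_OBS)
```

Here the two observed sets are disjoint, so the state of P2 determines which of the two transitions P1 takes.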
The semantics of an observation ((s_i, t_i), X_j) is as follows. Intuitively, automaton A_i
in state s_i observes A_j in one of the states X_j, "remembers it" and later asynchronously
changes state to t_i. The state where A_i remembers the observed state cannot itself be
detected by any other automaton. Formally, we ensure this asynchrony by splitting every
transition into a pair of transitions. Transitions based on an observation implicitly specify
a pair of ordinary transitions, ((s_i, t'_i), (t'_i, t_i)) with a hidden state t'_i. Hidden means that
t'_i is not observable.
Observations determine global transitions as follows. Given an observation ((s_i, t_i), X_j),
in every product state H = (h_1, ..., h_n) where h_i = s_i and h_j ∈ X_j, all successor
states G = (g_1, ..., g_n) have t'_i as their ith component, g_i = t'_i. Further, for every state
F = (f_1, ..., f_n) where f_i = t'_i, for every state E = (e_1, ..., e_n) where (F, E) is a
transition, either e_i = t'_i ...
states. If P1 sees P2 in n2 and goes to w1, and P2 then observes P1 and goes to w2', then P1
observed P2 before P2 observed P1. Therefore P1 "wins", and can execute (w1, c1). Now
suppose that this protocol is executed cyclically; then the next time around P1 will see
P2 in w2' and go to w1', "releasing" P2.
Thus this protocol implements fair choice: only one process can execute its (w, c)
transitions at a time, but the next time round the other process will be given a chance.
Protocol 3
Automata:
P1 : n1 → w1 → c1
   : n1' → w1' → c1'
   : n1 → w1' → w1'
   : n1' → w1 → w1
P2 : n2 → w2 → c2
   : n2' → w2' → c2'
   : n2 → w2' → w2'
   : n2' → w2 → w2
Rules:
((n1, w1), {n2, w2, c2}), ((n1', w1), {n2, w2, c2})
((n1', w1'), {n2', w2', c2'}), ((n1, w1'), {n2', w2', c2'})
((w1, w1), {n2, w2, c2}), ((w1', w1'), {n2', w2', c2'})
((n2, w2), {n1', w1', c1'}), ((n2', w2), {n1', w1', c1'})
((n2', w2'), {n1, w1, c1}), ((n2, w2'), {n1, w1, c1})
((w2, w2), {n1', w1', c1'}), ((w2', w2'), {n1, w1, c1})
2.4.4 Proving Implementation
In proving that an Observation-rule specification implements an Action-rule specification,
we use the same definition (Definition 9) and the same argument as in Theorem 1. For
each Action-rule replacement, we produce an Observation-rule replacement. We show
how to compose Observation-rule replacements. Every replacement uses observations
limited to its own states; therefore, a replacement is unaffected by other replacements
of the Action-rule. Each Observation-rule replacement is assumed to be an implementation
of an Action-rule replacement. Thus, we can define a function as required by
Definition 9, showing that the resulting Observation-rule specification implements the
Action-rule specification, and hence the original Constraint-rule specification.
First we define the form of an Observation-rule replacement.
Definition 17 An Observation-rule replacement is a protocol that implements an Action-rule
replacement. It must have the same number of initial and final states. Transitions
from initial states and transitions to final states must not be based on observations. If
an initial (final) state is substituted by an automaton connected to the replacement only
by the initial (final) transition, and the automaton is assumed to eventually execute the
initial transition, we may substitute the initial (final) state in an observation with the
states of the automaton without affecting correctness.
Just as Action-rule replacements include n and o states that represent possible predecessors
and successors, initial and final transitions are intended to represent the context of a
replacement. Since transitions from initial states and to final states do not involve any
interaction with the other process, all interesting behavior begins after the initial
transitions and ends before the final transition. Therefore, a replacement is not affected by
other replacements; all of their states can be thought of as an undistinguished initial state
or final state. The condition ensures that only transitions internal to the replacement
affect progress.
When composing one Observation-rule replacement with another, as a notational
shorthand, we let the final states of a replacement non-deterministically select the initial
state of a successor replacement. A determinization algorithm ensures fair behavior
through the subset construction [HU79].
Definition 18 An Observation-rule specification is said to be synthesized from a
Constraint-rule specification given the following construction: Given a set of replacements
for every constraint in a Constraint-rule specification, let every final state of a replacement be
joined to the initial states of the successor replacement. Replace unconstrained states
by a single new transition. In all observations of a replacement, change initial and final
states to include all states of all other replacements of the observed process.
Theorem 2 An Observation-rule specification O synthesized from a Constraint-rule
specification C is an implementation of C.
Proof. The proof is similar to that of Theorem 1. We define the map with the help of
the Action-rule maps.
First, we note that traces of a replacement remain unchanged even when it is used
in the synthesis. Consider a product state in the synthesized Observation-rule
implementation such that for some replacement the initial transition is executed. Then, since
the observations affect only the transitions defined within a replacement, execution of
replacements of other constraints has no effect. Therefore, the replacement transitions
can always be executed in every reachable state.
Initial states in a Constraint-rule specification do not take part in any constraint.
They are replaced by a single transition each. Therefore, the initial product has an
image.
Every replacement is an implementation of an Action-rule replacement. So we compose
the maps from Observation-rule to Action-rule replacements with the map from
Action-rule to Constraint-rule.
Thus, we have the result.
2.5 Implementing Constraints, Actions, and Observations
In this section, we describe a few ways to implement various rules. We present proofs
that they are implementations. In practice, such proofs are best done with verification
tools.
2.5.1 Synthesizing Constraint Rules
We first show how constraint rules are implemented using action rules.
2.5.1.1 Binary Conjunctive Constraints
A binary conjunctive constraint is of the form:
Protocol 4
Automata:
P1 : n1 → s1 → o1
P2 : n2 → s2 → o2
Rules:
(∧ : s1, s2)
We want a replacement in the sense of Definition 13. In the following, we describe two
replacements. In Protocol 5, the conjunctive constraint is interpreted as a two-process
rendezvous. In Protocol 6, it is interpreted as a request-response protocol.
Protocol 5
Automata:
P1 : n1 → w1 → b1 → b1' → w1' → n1'
P2 : n2 → w2 → b2 → b2' → w2' → n2'
Rules:
((n1, w1) ⇒ (w2, b2)), ((n2, w2) ⇒ (w1, b1))
((b1', w1') ⇒ (w2', n2')), ((b2', w2') ⇒ (w1', n1'))
In the construction, the (w, b) transition of each process depends on the (n, w) transition
of the other. Similarly, the (b', w') transitions enable the (w', n') transitions. Therefore, the
(b, b') transitions in every trace are "simultaneous" in the sense of Definition 15. So we
have:
Lemma 2 Protocol 5 is a replacement of Protocol 4.
Proof. Define a map f from the transitions of the replacement to the states of the
specification as follows: f((b, b')) ↦ s, f((n, w)) ↦ n, f((w', n')) ↦ o.
In every trace, the (b, b') transitions occur between the (n, w) and (w', n') transitions,
even if n and o are identified.
Therefore, Protocol 5 is an implementation that meets the criteria of Definition 13.
Another interpretation for a binary conjunctive constraint is request-response communication.
Protocol 6
Automata:
P1 : n1 → b1 → b1' → d1 → d1' → o1
P2 : n2 → b2 → b2' → d2 → d2' → o2
Rules:
((b1, b1') ⇒ (b2, b2'))
((d2, d2') ⇒ (d1, d1'))
Here, P2 waits for P1 to enable (b2, b2'). This models the request; the next dependence
between (d2, d2') and (d1, d1') models the response.
Lemma 3 Protocol 6 is a replacement of Protocol 4.
Proof. Define a refinement map f as follows: f((n, b)) ↦ n, f((b, b')) ↦ n, f((d, d')) ↦ o,
f((d', o)) ↦ o, f((b', d)) ↦ s. In every trace, the (b', d) transitions are simultaneous.
Hence the result.
2.5.1.2 Binary Disjunctive Constraints
A binary disjunctive constraint is of the form:
Protocol 7
Automata:
P1 : n1 → s1 → o1
P2 : n2 → s2 → o2
Rules:
(∨ : s1, s2)
We want a replacement in the sense of Definition 13. In the following, we describe
a replacement. In Protocol 8, the disjunctive constraint is interpreted as symmetric
mutual exclusion. One might also implement it using a token-based protocol.
Protocol 8
Automata:
P1 : n1 → b1 → w1 → c1 → e1 → o1
   : b1 → w1' → c1
P2 : n2 → b2 → w2 → c2 → e2 → o2
   : b2 → w2' → c2
Rules:
choice rules:
((b1, w1) ∨ (b2, w2))
((b1, w1') ∨ (b2, w2'))
enable/disable rules for P1:
((b1, w1) ⇏ (b2, w2')), ((b1, w1) ⇏ (w2, c2))
((c1, e1) ⇒⊥ (b2, w2')), ((c1, e1) ⇒⊥ (w2, c2))
((b1, w1') ⇏ (b2, w2)), ((b1, w1') ⇏ (w2', c2))
((c1, e1) ⇒⊥ (b2, w2)), ((c1, e1) ⇒⊥ (w2', c2))
symmetric enable/disable rules for P2:
((b2, w2) ⇏ (b1, w1')), ((b2, w2) ⇏ (w1, c1))
((c2, e2) ⇒⊥ (b1, w1')), ((c2, e2) ⇒⊥ (w1, c1))
((b2, w2') ⇏ (b1, w1)), ((b2, w2') ⇏ (w1', c1))
((c2, e2) ⇒⊥ (b1, w1)), ((c2, e2) ⇒⊥ (w1', c1))
We show that:
Lemma 4 Protocol 8 is a replacement of Protocol 7.
Proof. Define the map such that the (n, b) transitions map to the n states of Protocol 7,
(e, o) to the o states, (w, c) to the s states, and the rest to null.
In the construction, the (b, w) transitions disable the (b, w') transitions of the opposite
process, and vice versa. The (b, w) and (b, w') transitions choose between one another.
Thus, consider the state (b1, b2). By the choice rule, only one of (b1, w1), (b1, w1') or
(b2, w2), (b2, w2') may execute. Suppose (b1, w1) executes. Then it disables (b2, w2') and
(w2, c2). Therefore P1 can safely execute (w1, c1). The disabled transitions are enabled
by (c1, e1), so that P2 can continue. In all four cases, the situation is symmetric.
Therefore, in all traces, the transitions (w, c) will never occur concurrently. This does
not change if there is a cycle connecting the n and o states.
Therefore, Protocol 8 is a replacement.
2.5.1.3 N-process Constraints
One general way to build an N-process constraint replacement from a 2-process constraint
replacement is to use a hierarchical tournament structure. For example, consider
the implementation of a disjunctive constraint between two processes, and suppose we
want to implement a three-way constraint. Then, we let the "winner" of the two-process
protocol compete with the third process by repeating the two-process protocol. Clearly,
the winner of the second round will be the only process that executes the transition
corresponding to the constrained state. Conjunctive constraint implementations can be
composed in a similar way.
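The tournament structure can be sketched in a few lines of Python. This is only an illustration of the composition pattern: the function duel stands in for one run of a two-process disjunctive protocol and is an assumption of the sketch, as is the arbitrary rule that the lower process name wins.

```python
def tournament(contenders, duel):
    """Reduce N contenders to one winner by repeated two-process rounds.

    `duel(a, b)` is a stand-in for one run of the two-process protocol
    and must return whichever of a or b wins.
    """
    survivors = list(contenders)
    while len(survivors) > 1:
        winners = [duel(survivors[i], survivors[i + 1])
                   for i in range(0, len(survivors) - 1, 2)]
        if len(survivors) % 2:
            winners.append(survivors[-1])  # odd one out advances unopposed
        survivors = winners
    return survivors[0]


MATCHES = []


def lowest_name_wins(a, b):
    """Hypothetical arbiter: record the match and let the smaller name win."""
    MATCHES.append((a, b))
    return min(a, b)


WINNER = tournament(["P1", "P2", "P3"], lowest_name_wins)
```

For three processes this reproduces the example in the text: P1 and P2 compete first, and the winner then competes with P3. In general N contenders require N - 1 two-process rounds.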
2.5.2 Synthesizing Action Rules
Action-rule specifications tend to be tedious. We will discuss only one example of an
action rule protocol, the implementation of binary disjunction.
Protocol 9
Automata:
P1 : n1 → t1 → w1 → c1 → o1
   : n1' → t1' → w1' → c1' → o1'
   : t1 → w1' → w1'
   : t1' → w1 → w1
P2 : n2 → t2 → w2 → c2 → o2
   : n2' → t2' → w2' → c2' → o2'
   : t2 → w2' → w2'
   : t2' → w2 → w2
Rules:
observations for P1:
((t1, w1), {n2, w2, t2, c2, o2}), ((w1, c1), {n2, w2', o2}), ((w1, w1), {t2, w2, c2})
((t1', w1'), {n2', w2', t2', c2', o2'}), ((w1', c1'), {n2', w2, o2'}), ((w1', w1'), {t2', w2', c2'})
antisymmetric observations for P2:
((t2, w2), {n1', w1', t1', c1', o1'}), ((w2, c2), {n1', w1, o1'}), ((w2, w2), {t1', w1', c1'})
((t2', w2'), {n1, w1, t1, c1, o1}), ((w2', c2'), {n1, w1', o1}), ((w2', w2'), {t1, w1, c1})
For convenience, Protocol 9 is depicted graphically in Figure 2.1. In the figure, transitions
that show sets of states over the arrow are observation-based; they observe the states of
the other process.
Lemma 5 Protocol 9 implements Protocol 8.
Proof. In Protocol 9 and Figure 2.1, the transitions corresponding to (n, b) and (e, o)
of Protocol 8 have not been depicted to avoid clutter. The other sequences of Protocol 9
transitions map to the transitions of Protocol 8 as follows: f((n, t), (t, w)) ↦ (b, w),
f((w, c)) ↦ (w, c), f((w', c')) ↦ (w, c), f((c, o)) ↦ (c, e), f((c', o')) ↦ (c, e).
To show that the traces of Protocol 9 constitute an implementation, consider the
following argument. Suppose P1 is in n1, and P2 in n2. Both make progress; suppose P1
ends up in w1; then it has seen P2 in either n2 or t2. Then P1 observes P2 again. If P2
is still in n2, either it will stay in n2, or it will make progress and end up in w2' where it
will wait for P1. This argument applies symmetrically to all possible traces. Thus, when
one process executes a (w, c) transition, the other cannot.
Hence we have the result.
Figure 2.1: Disjunctive Constraint Using Observation Rules
2.5.3 Observations via Memory and Messages
We briefly explain how to convert Observation-rule protocols to shared memory and
message programs. Note that the following translations are easily expressed in formal
models of messages [FLP85] or memory [LAA87].
• Messages: States are implemented as local memory bits. A state transition clears
  the bit representing the current state, and sets the bit for the next state.
  Observations may be implemented in two ways over reliable, ordered channels.
  – Pull model: We can use a request-response protocol to implement observations.
    A process sends a request for the current state of another process, and
    the reply is the observation.
  – Push model: Every process sends its current state in a message to all processes
    that observe that process.
• Memory: We code each state as a memory bit in a distributed memory. State
  transitions are implemented by setting the bit for the next state, and clearing it for
  the current state. Observations are implemented as if statements that decide the
  assignment to the next state.
Many optimizations are possible: for example, since observations by one process are often
based on sets of states of a different process, we only need to indicate when state changes
in a process can change observations. Therefore, not every state need be explicitly
represented, and we need only enough bits for distinct observations.
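As a rough illustration of the memory translation, the following Python sketch codes each state as a bit and implements an observation as an if statement over the observed process's current state. It is a single-threaded simulation under simplifying assumptions: it does not model the hidden intermediate state t', and the class and function names are ours. The rules are the n1 and n2 observations of Protocol 3, with primed states written as w1'.

```python
class Automaton:
    """One process whose states are coded as one bit each."""

    def __init__(self, states, initial):
        self.bits = {s: False for s in states}
        self.bits[initial] = True

    def current(self):
        return next(s for s, bit in self.bits.items() if bit)

    def move(self, nxt):
        cur = self.current()
        if nxt != cur:              # a self-loop leaves the bits untouched
            self.bits[nxt] = True   # set the bit for the next state
            self.bits[cur] = False  # clear the bit for the current state


def observe_and_move(me, other, rules):
    """Read the other process's state, then take the matching transition."""
    seen = other.current()
    for (src, dst), observed in rules:
        if me.bits[src] and seen in observed:
            me.move(dst)
            return dst
    return None


p1 = Automaton(["n1", "w1", "c1", "n1'", "w1'", "c1'"], "n1")
p2 = Automaton(["n2", "w2", "c2", "n2'", "w2'", "c2'"], "n2")
P1_RULES = [(("n1", "w1"), {"n2", "w2", "c2"}),
            (("n1", "w1'"), {"n2'", "w2'", "c2'"})]
P2_RULES = [(("n2", "w2"), {"n1'", "w1'", "c1'"}),
            (("n2", "w2'"), {"n1", "w1", "c1"})]

FIRST = observe_and_move(p1, p2, P1_RULES)   # P1 sees P2 in n2
SECOND = observe_and_move(p2, p1, P2_RULES)  # P2 then sees P1 in w1
```

In this run P1 moves to w1 and P2 moves to w2', the situation in which P1 "wins". The pull model over messages would replace the direct read in observe_and_move with a request to the other process and its reply.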
2.6 Summary
We have presented a technique for structuring the design of protocols. The first step
uses automata with abstract coordination operators to produce simple, extensional
descriptions of protocols. The descriptions have small state spaces, as they assume fair
implementation of coordination and hide the details of communication and control. As
a result, whole-system properties can be effectively analyzed at this level by verification
tools. In the second step, we refine the coordination operators. The notation at this
level expresses how processes control one another, but hides details of communication
media. We show how to ensure that these refinements can be composed safely to refine
the original abstract protocol. In the third step, we translate the refined operator
implementations to a notation that reflects the behavior of shared memory or message passing
systems. Again, the resulting protocols are small and can be verified by exhaustive search
tools. The translations are themselves safely composable, completing the implementation
of the original protocol.
We have presented the essential aspects of the method. It was developed simultaneously
with the target application, distributed shared memory consistency protocols,
described in the next chapter. Much remains to be done as regards expressiveness and
tool development before it can be deployed in practice.
Chapter 3
Distributed Shared Memory
Consistency Protocols
In this chapter, we present a new set of consistency protocols for distributed shared
memory (DSM). We first explain the problem we address and present our solution. The
next section discusses related research. We then present our new consistency protocols,
and explain them intuitively. Next we describe how we constructed them using our design
method. We conclude with a presentation of performance results.
3.1 Goal
The goal of this work is to devise consistency protocols for distributed shared memory
that operate efficiently over either local area or wide area networks. Communication over
local area networks is characterized by low latency and low bandwidth, whereas wide area
networks have high latency, but high bandwidth. A protocol that performs well over a
wide area interconnect must be able to utilize high bandwidth to overcome high latency.
This is possible by overlapping computation and communication. Therefore, we must
minimize situations where computation waits on communication.
Research in distributed shared memory systems has concentrated on minimizing data
communication. But in our case, reducing data communication alone is not enough; since
computation should wait for communication as rarely as possible, we also have to reduce
the communication needed for synchronization.
In the following, we briefly review the basic motivation and design of distributed
shared memory consistency protocols. We then explain our approach.
3.1.1 The Problem
A distributed shared memory system simulates shared memory over networked computers.
Distributed shared memory simplifies the task of writing parallel programs for several
reasons:
• It eliminates the distinction between accesses to local memory and accesses to
  remote memory.
• It relieves the burden of managing data since data in the memory is persistent and
  easy to access.
• Since memory can be overwritten, the programmer does not have to explicitly
  distinguish between stale and newer data. Old values are simply overwritten and
  the latest values are easily accessible.
• The shared memory abstraction naturally supports arrays and other familiar data
  structures.
• The use of shared memory for interprocess communication is well understood from
  programming multitasking uniprocessors as well as multiprocessors.
DSM implementations strive to be transparent to the user, so that programs written
for shared memory can be used on distributed shared memory with as few changes as
possible. Shared memory programs have processes that communicate through
shared variables. When these processes execute over machines with distributed shared
memory, the variable accesses are transparently converted to distributed accesses using
virtual memory hardware. When a variable on a memory page is accessed, if the contents
of the shared page exist in the local memory, the memory access is allowed. Otherwise, the
access is intercepted, and pages on remote machines are fetched over the network. Thus,
the local memory is treated as a cache of the hypothetical shared memory.
Consistency and Communication Fetching pages over a network takes time. Therefore,
when possible, copies of pages must be kept in local memory. But if a process writes
to its local copy, the copy may become inconsistent with other copies. DSM
implementations avoid such data inconsistency by coordinating memory accesses according to a
memory consistency protocol.
A fairly simple protocol for avoiding inconsistencies is to allow only one process to
write to a page at a time. If a process P writes to a page, and another process Q either
reads or writes to that page, the two memory accesses conflict. This conflict is resolved
by granting write permission first to one process and then the other. When a process
acquires permission to write, it may acquire a fresh copy of the page and invalidate
the copy held by the previous writer, or update the copy maintained by the previous
writer. In either case, data must be communicated across the network. The challenge in
implementing DSM is to design consistency protocols that minimize communication.
There are many ways to reduce communication. For example, in the single-writer
protocol above, virtual memory hardware forces implementations to detect modifications
to pages rather than bytes or words. Thus, writes by processes executing on different
processors to different offsets within a page appear to conflict. This is called false
sharing. False sharing may be reduced using code rearrangement, choosing small page
sizes, proper data placement, or detecting the actual changes to a page and transmitting
the difference. Other ways to reduce communication include improvements to the
communication infrastructure, using process scheduling to reduce conflicts, and allowing
multiple read-only copies of a page. But the communication is dictated primarily by
the single-writer protocol: writes to a page from processes on different processors will
always require paging across the network. The key to reducing communication is to permit
multiple writers to a page.
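To make false sharing and the page-difference remedy concrete, consider the following Python sketch. It is illustrative only: the 4096-byte page size, the function names, and the use of an unmodified "twin" copy to compute the difference are assumptions of the sketch rather than details taken from a particular DSM implementation.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def page_of(address):
    """Page number containing a byte address."""
    return address // PAGE_SIZE


def falsely_shared(addr_a, addr_b):
    """Distinct addresses on the same page conflict only at page granularity."""
    return addr_a != addr_b and page_of(addr_a) == page_of(addr_b)


def page_diff(twin, page):
    """Offsets and new values of the bytes that differ from the clean twin."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}


def apply_diff(page, diff):
    """Merge a diff produced by another writer into a local copy."""
    for offset, value in diff.items():
        page[offset] = value


# Two writers modify different offsets of their own copies of one page.
twin = bytearray(PAGE_SIZE)
copy_a, copy_b = bytearray(twin), bytearray(twin)
copy_a[8] = 1
copy_b[512] = 2

merged = bytearray(twin)
apply_diff(merged, page_diff(twin, copy_a))
apply_diff(merged, page_diff(twin, copy_b))
```

The two writes touch the same page, so a page-granularity protocol treats them as conflicting, yet their diffs are disjoint and can be merged without losing either update. Only the changed bytes need to cross the network.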
Weak Consistency Weak consistency protocols allow multiple writers to a page at the
same time. Much of the communication in single-writer protocols happens because writes
to a page are interpreted as writes to a common variable. But in practice, true conflicts,
that is, writes to the same variable, never occur, because programmers explicitly resolve
such conflicts using synchronization constructs such as locks or barriers. Therefore, if
a page is accessed by two processes without first getting a lock (or a barrier), we can
assume that there is no conflict. Copies of a page can then be separately updated. On the
other hand, if a page is accessed from within a synchronization construct, then a conflict
is possible. Therefore, writes to a single page must be serialized. Moreover, the page
must be updated prior to synchronization. The performance of DSM implementations
with weak consistency protocols thus depends on how page updates or invalidations are
dictated by the synchronization requirements.
3.1.2 Our Solution
Just as weak consistency models use the presence of synchronization to weaken the
consistency requirements, we suggest that it is possible to take into account program memory
access patterns to reduce synchronization requirements. For example, a barrier
synchronizes processes so that all processes must gather at the barrier before the barrier opens.
When barriers are used for synchronizing phases of a computation, they ensure that all
processes complete computations in one phase before beginning the next phase. But in
many computations, such strict dependence is not necessary. If we keep track of patterns
of memory access during a computation, we can optimistically begin the next phase and
communicate the results of the previous phase to those processes that are likely to need
them. We detect errors, and back up if necessary. If the computation is relatively regular,
we will succeed more often than we fail, and computation and communication of different
processes will overlap.
We use this idea of anticipatory computation and communication to minimize time
wasted waiting for communication. This results in a data-driven model for updating the
results of distributed shared memory. We call this model coordinated memory.
In the following, we explain the ideas in detail, and provide a formal specification.
Performance results show that the model achieves the goals: distributed memory
computations over wide-area, high-bandwidth networks achieve good speedup.
3.2 Background and Related Work
In shared memory multiprocessors, every processor has to access data from the common
shared memory. However, since fetching data from the common memory takes time (i.e.,
has high latency), the data is cached locally with each processor. As a result, the data is
replicated and copies reside in caches that are accessible at high speed (i.e., low latency).
However, whenever a processor modifies cached replicated data, somehow all the copies
must be updated, otherwise some copies end up with stale data. Furthermore, if two
processors write to their caches at the same time, the update procedure must choose
between the values written by each processor. These two issues, (1) how to update
replicas and (2) how to choose between simultaneous values, are collectively referred to
as the cache consistency problem. Particular strategies used to address these issues are
called consistency models, consistency protocols, or simply memory models.
3.2.1 Sequential Consistency
A particularly intuitive memory model is called sequential consistency. Under sequential
consistency, the shared memory behavior of a multiprocessor must lead to results that
are similar to those for parallel processes that are interleaved on a uniprocessor. More
precisely [Lam79],
The result of any execution is the same as if the operations of all the processors
were executed in some sequential order, and the operations of each processor
appear in this sequence in the order specified by its program.
We deduce that for sequential consistency (1) a write must update the cache of every other
processor and all simultaneous reads must occur either before or after that write, and
(2) simultaneous writes by different processors may be interleaved in any order. These
guarantees are met if we ensure that for any location, only one process may write to it
at any given time. Thus, we can have two types of protocols for keeping any memory
location consistent: either allow only a single-writer process and single-reader process
at a time (SWSR) or allow single-writer but multiple-reader processes (SWMR). Note that
sequential consistency imposes no restrictions on simultaneous reads.
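The definition can be exercised on a small example by brute force. The Python sketch below enumerates every interleaving of two short programs that preserves each program's order, runs each interleaving against a single memory, and collects the values read. The two programs, the register names r1 and r2, and the assumption that memory starts at zero are choices made for this illustration.

```python
def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def sc_outcomes(prog1, prog2):
    """Register values reachable under some sequentially consistent order."""
    outcomes = set()
    for order in interleavings(prog1, prog2):
        memory, registers = {}, {}
        for op, location, arg in order:
            if op == "write":
                memory[location] = arg
            else:  # read: arg names the destination register
                registers[arg] = memory.get(location, 0)
        outcomes.add(tuple(registers[r] for r in sorted(registers)))
    return outcomes


# Processor 1 writes x then reads y; processor 2 writes y then reads x.
PROG1 = [("write", "x", 1), ("read", "y", "r1")]
PROG2 = [("write", "y", 1), ("read", "x", "r2")]
OUTCOMES = sc_outcomes(PROG1, PROG2)  # set of (r1, r2) pairs
```

For these two programs the outcome (0, 0) never appears: in any sequential order, at least one write precedes the read of the other processor. A memory system that allowed both reads to return 0 would therefore not be sequentially consistent.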
3.2.1.1 Sequential Consistency Implementation and Performance
To implement sequential consistency on networked machines (for simplicity, assume that
each machine is a uniprocessor so that every process executes on a separate machine),
we must ensure that only one process may write to a given memory location at a time.
Other processes are prevented from writing to that location at the same time. If a
different process writes to that location immediately afterward, that process becomes
the writer, and the permission to access the location is revoked from the first process.
On the other hand, if a process has the permission to write to a location, and another
process wants to read from that location, we have two alternatives: (1) withdraw
the write permission from the first process and grant it to the second, or (2) allow both
processes to only read that location, revoke all access if any process later writes to
that location, and transfer it to the new writer. Thus, we have to control the read and
write permissions to locations, and transport data and permissions between the machines
as required.
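The permission rules just described can be sketched as a small state holder for one location. This is an illustrative model only: the class name, method names, and the transfer counter are assumptions, and it follows alternative (2), in which a writer is downgraded to a reader when another process reads.

```python
class Location:
    """Tracks which processes may read or write one shared location."""

    def __init__(self, owner):
        self.writer = owner      # process holding write permission, or None
        self.readers = set()     # processes holding read-only copies
        self.transfers = 0       # permission or data messages sent so far

    def write(self, p):
        """Grant p exclusive write access, revoking every other copy."""
        if self.writer == p:
            return               # already the writer: no communication
        self.transfers += 1      # data and permission move to p
        self.writer = p
        self.readers = set()

    def read(self, p):
        """Grant p read access; a current writer is downgraded to reader."""
        if p == self.writer or p in self.readers:
            return
        self.transfers += 1
        if self.writer is not None:
            self.readers.add(self.writer)   # alternative (2): both may read
            self.writer = None
        self.readers.add(p)

    def invariant(self):
        """Either a single writer or any number of readers, never both."""
        return self.writer is None or not self.readers
```

A write by the current writer costs nothing in this model, while any change of writer or any new reader costs one transfer; that is the communication the rest of this section tries to reduce.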
In a seminal paper, Li and Hudak [LH89] showed how to employ virtual memory hardware and message passing over the interconnection network to implement sequentially
consistent distributed shared memory. The virtual memory mechanisms of the operating
system controlling the networked machines are extended to control the read and write
permissions, and to revoke access rights at the granularity of virtual memory pages.
Thus, suppose that a process attempts to write to a memory location within a page
that it cannot access. This attempt results in a page-fault, and the fault handler contacts
other machines across the network, locates the page and gets write access as well as the
page data. Similarly, if a process attempts to write to a page that it may only read, there
is a protection-fault, and the fault handler must gain the write permission and revoke
access for all other processes that may have copies.
The implementation must support some directory scheme that allows a process to
locate other writers and readers, and some arbitration scheme to select a writer in case
of multiple requesters. Typically, the directory scheme itself must be distributed for
efficient access and update of directory data.
Now it is easy to see that the sequential consistency model leads to severe performance
problems, and the particular implementation techniques exacerbate them.
Since sequential consistency requires that only one process may write to a memory
location at any time, if two (or more) processes write to a location in succession, the
memory location "bounces" across the network, as the write permission and page data
are exchanged. Thus, a significant number of write operations require data transmission
across the network and add communication overhead to the computation. Note that this
problem is not an artifact of the implementation, since the single-writer requirement is
imposed by the consistency model, and for every write operation by a distinct process,
obtaining write permission requires network transmission. Thus, sequentially consistent
memory inherently requires significant network communication; as a result, sequentially
consistent distributed shared memory must be slow and unscalable.
The page-level granularity of the implementation further exacerbates the "bouncing".
The virtual memory system can detect memory accesses to pages, but cannot differentiate between accesses to different offsets within a page. Therefore, writes by different
processors to different offsets within a page seem to conflict, since they are writes to the
same page. This is called false sharing, because data that is in fact not shared (i.e., data
at different offsets within a page) appears to be shared.
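The effect of granularity on bouncing can be illustrated with a small counting model. This is a sketch under stated assumptions: the 4096-byte page size, the 8-byte word size, the write trace, and the function name are all hypothetical, and each unit of sharing is assumed to have exactly one writer at a time.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def count_transfers(accesses, granularity):
    """Count ownership transfers for a trace of (process, address) writes.

    Each unit of `granularity` bytes has one writer at a time; a write by a
    different process forces the unit to move across the network.
    """
    owner = {}
    transfers = 0
    for process, address in accesses:
        unit = address // granularity
        if owner.get(unit) not in (None, process):
            transfers += 1
        owner[unit] = process
    return transfers

# Two processes alternately update their own, distinct words in page 0.
trace = [("p1", 0), ("p2", 2048)] * 100

page_level = count_transfers(trace, PAGE_SIZE)   # every write but the first bounces
word_level = count_transfers(trace, 8)           # the words are never shared
```

In this trace the two processes never touch the same word, yet at page granularity all but the first of the 200 writes force a transfer, while at word granularity none do.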
3.2.2 Beyond Sequential Consistency
To improve performance, we can reduce communication by code rearrangement [SMC90],
data placement [SMC90], or by choosing small page sizes [FLR+94]. Similarly, we may
improve the communication infrastructure [SMC90] and develop special network communication protocols. Other techniques include scheduling improvements, so that after acquiring a page, requests from other processes are ignored for some time period
in order to let local computation continue. In our earlier research, we [SMC90] (and
others [FP89, DCM+90, MF90]) investigated such issues.
However, such changes can only lead to minor improvements, since the communication
requirement is dictated by the sequential consistency memory model. Since sequential
consistency requires that only one process may write to a page at a time, writes from
processes executing on different processors will always require paging across the network
("bouncing"). Further scalability problems arise since potentially every computer may
have to be visited to locate a page. Thus, a major improvement is possible only if we can
allow multiple writers, and manage to bound the number of machines that have to be interrogated to locate a page. At the same time, the consistency model must be sufficiently
intuitive that programmers can easily reason about their programs and implementors of
the model have simple, understandable implementations. In the following, we describe
weak consistency models that allow multiple writers, and also ameliorate problems with
false sharing. Then we argue that the existing models are complex, and that there is
room for further optimizations; this motivates coordinated memory as a simpler model
that also suggests new optimizations.
3.2.2.1 Weakly Consistent Memory Models
Weakly consistent memory models were invented by computer architects [DSB86a, GLL+90]
in order to minimize memory-to-CPU traffic by allowing optimizations such as out-of-order
memory operations, overlapping memory operations, and lock-free caches. For example, if a processor sequentially issues two load instructions i and j to distinct memory
locations $m_i$ and $m_j$, then j may be issued before i completes, thereby overlapping memory reads and reducing overall time. However, in a multiprocessor, this may mean that
some other processor sees the effect of j before i, resulting in incorrect operation. On
the other hand, if no other processor actually reads $m_i$, then the computation would
still be correct, and at the same time the overlapping loads would allow the program to
execute faster. But the sequential consistency model forbids such instruction shuffling
even though it may improve performance.
To address this problem, weakly consistent models have been defined. The definition
of weakly consistent memories depends on two (interrelated) questions: (1) Since programmers define their own high-level consistency requirements using locks or barriers,
can we design memory models that use this information? (2) Sequential consistency requires
that every process must be able to observe the effects of writes in the same order. Is it
possible to relax this requirement and still reason about correct programs?
The first question leads to models like release consistency [GLL+90, CBZ95], entry
consistency [BZS93] and other hybrid [AF92] consistency models. The hybrid models
distinguish between synchronization operations and ordinary operations, and the expected order of program memory observations is analyzed as a combination of the two.
Provided programs distinguish between synchronization and ordinary operations, these
models lead to memory behaviors that are indistinguishable from sequential consistency.
The second question leads to models like Pipelined RAM [LS88], and causal consistency [AHN+93, JA94]. These models do not require that all processes be affected by
write and read operations, and allow different processes to "see" the effects of operations from different processes in different orders. Using these models, programmers may
have to rewrite programs that were intended for sequentially consistent memory [AHJ91].
Further, it can be expensive to implement these models, since the implementation must
maintain information about the order of memory accesses for every memory access. Fortunately, it is possible to deduce the order information from programming constructs
such as locks and barriers, thus resulting in hybrid models that are very similar to entry
or release consistency.
In the following, we will briefly discuss the details of the four models and point out
the drawbacks, thus motivating coordinated memory.
Release Consistency. Release consistency [GLL+90, CBZ95] is motivated by the observation
that memory locations that are protected by locks are accessed by only one
process at a time. Further, an accessing process must first acquire the lock and will be
blocked until it is released. Thus, the lock holder knows that until it releases the
lock no other process will attempt to access the protected data as any "interleaving" is
forbidden by the lock. Therefore, changes made by a process p to variables protected by
a lock need not become visible to any other process q until after p releases the lock. In
other words, memory should become consistent only upon lock release.
Thus, a release consistent memory implementation can optimize communication in
several ways:
1. It can collect modifications made to several variables u, v, w within a critical section
and broadcast them in a single message if reducing messages is important [CBZ95].
2. It may pipeline the writes to u, v, w if hiding write latency is more important than
reducing messages [GLL+90].
3. It may transmit the changed values only to the process that acquires the lock next
[DKCZ93] instead of broadcasting them to all processes.
4. It may piggyback the write values to the acquiring processor on the lock grant
message [DKCZ93].
5. It may delay transmitting the changes, and instead merely inform the acquiring processor about the changed values. The acquiring processor can request the changes
when necessary [DKCZ93].
6. It can make use of access information, such as (1) if accesses will be read-only,
data can be freely replicated, (2) if processes modify data one at a time, then the
data can be migrated from the lock holder to the next acquiring process. All such
optimizations can be brought into play to minimize communication [CBZ95].
7. Since programmers use synchronization to avoid data races (i.e., simultaneous
writes to one memory location), virtual memory based distributed shared memory
systems can assume that pages modified by multiple processes between an acquire
and the corresponding release are in fact modified at different offsets within a page.
Thus, after the release, the modifications can be safely merged. Therefore, pages
that are simultaneously modified do not "bounce" as in sequentially consistent
distributed shared memory.
Selecting the appropriate optimization depends on the tradeoffs. In true shared memory
multiprocessors [GLL+90], data changes will be pipelined with multiple messages since
hiding write latency is more important than reducing messages. In distributed shared
memory systems [CBZ95], data changes are buffered to reduce messages. Other optimizations such as migrating data can be applied in both cases, provided the required
information is available.
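The merging step in item 7 above can be sketched as follows. This is one possible way to merge concurrent modifications, assumed here for illustration: each writer compares its copy against an unmodified snapshot of the page and ships only the changed bytes. The function names and the tiny 8-byte "page" are hypothetical.

```python
def make_diff(snapshot, page):
    """Record {offset: new_value} for every byte changed since the snapshot."""
    return {i: new for i, (old, new) in enumerate(zip(snapshot, page)) if old != new}

def merge(base, diffs):
    """Apply diffs from several writers to a copy of the base page.

    Assumes the program is free of data races, so no two diffs touch the
    same offset; overlapping offsets are reported as an error.
    """
    merged = bytearray(base)
    seen = set()
    for diff in diffs:
        offsets = set(diff)
        overlap = seen & offsets
        if overlap:
            raise ValueError(f"data race at offsets {sorted(overlap)}")
        seen |= offsets
        for offset, value in diff.items():
            merged[offset] = value
    return bytes(merged)

base = bytes(8)              # page contents at the acquire
copy1 = bytearray(base)
copy1[0] = 7                 # writer 1 modifies offset 0
copy2 = bytearray(base)
copy2[5] = 9                 # writer 2 modifies offset 5
result = merge(base, [make_diff(base, copy1), make_diff(base, copy2)])
```

Both writers keep working on their own copies until the release, so the page does not bounce between them; only the two one-byte diffs need to travel.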
Entry Consistency. A release consistent memory model updates changes to all variables
upon a release operation. With entry consistency, the programmer creates an
explicit association between a lock and the memory locations it protects, and only the
associated memory locations are made consistent upon the next lock acquire. Thus, entry consistency takes a step beyond release consistency in making effective use of the
information that is implicit in a program.
Clearly, entry consistency and release consistency are similar enough that aggressive
optimizations detailed above can be applied to entry consistency to boost performance.
One slight disadvantage of entry consistency is that the programmer must declare the association between the synchronization variables and the data. However, related research
indicates that such modications are not very onerous [JKW95].
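The lock-to-data association can be illustrated with a toy model. This is a sketch only: the class names, the `guarded` parameter, and the dictionary standing in for shared memory are assumptions, and the model ignores mutual exclusion itself in order to focus on which values become visible at an acquire.

```python
class EntryLock:
    """A lock that names the shared variables it protects (hypothetical API)."""

    def __init__(self, guarded):
        self.guarded = set(guarded)
        self.latest = {}          # values published by the last releaser

class Process:
    """One process with its own, possibly stale, view of shared memory."""

    def __init__(self):
        self.local = {}

    def acquire(self, lock):
        # Only the variables bound to this lock are made consistent.
        self.local.update(lock.latest)

    def release(self, lock):
        # Publish the current values of the guarded variables only.
        lock.latest = {v: self.local[v] for v in lock.guarded if v in self.local}

p, q = Process(), Process()
lock = EntryLock(["a"])
p.acquire(lock)
p.local["a"] = 1      # protected by the lock
p.local["b"] = 2      # not associated with the lock
p.release(lock)
q.acquire(lock)       # q now sees a, but nothing is transferred for b
```

Under release consistency the update to b would also have to be propagated at the release; here the declared association lets the system skip it.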
Causal Consistency. Sequential consistency requires that all processes agree on some
global order of memory accesses. Then we say that a read of variable v must return
the value written to v by the "most recent" write. However, as we have seen, enforcing
such a strict order precludes many optimizations that do not affect correctness. So we
seek a weaker consistency model that does not require all write operations to be totally
ordered. Instead, only writes that can affect the behavior of a process should be ordered;
this ordering must allow the writes to be overwritten when they no longer affect process
behavior.
A process p affects another process q when q reads a value written to some variable v
by p. Suppose that as a result of reading v, q writes a value f () to another variable w.
Now if a third process r reads the value f () from w, followed immediately by a read of
v, then r must read the value and not any prior value of v; this restriction defines how
memory locations (variables) are updated. Without some similar restriction, if r were
able to read any previous value of v, then in the degenerate case the architecture need
not update memory at all and always return the initial values; in that case, computation
would be impossible since processes cannot communicate with one another.
We say that a write operation W is causally related to a read operation R either if
R reads the value written by W , or if R reads from some process o that read the value
written by (but has not itself updated) W . A memory model in which a read operation
may return the value written by one of its causally related writes is said to be causally
consistent [DSB86b, AHN+93].
To see that multiple return values are possible, consider an extension of the example
above such that some process s writes to v simultaneously with p, and r first reads
w followed by some variable x written by s. The reads of x and w establish causal
relationships between r and s as well as r and p. Furthermore, since p and s wrote
simultaneously to v, the writes to v cannot be ordered. Thus, from the viewpoint of r, v
has two possible causally recent values (stated another way, under causal memory
simultaneous writes to a variable spawn multiple copies of that variable). Upon reading
v, r must choose one of them as the "actual" value. As a consequence, a single
causally consistent variable allows multiple writes. Further, since a write must affect
only causally related processes, writes need not be propagated to all processes. Thus,
implementing causal consistency potentially requires less communication than sequential
consistency. Interesting example programs for causal memory are presented in [AHJ91].
Notice that there is a problem with the above description: we have said that writes
must be propagated only to causally related processes; however, until the first read
establishes causality, a write need not affect any processes at all. This paradox may
be resolved in several ways: (1) Broadcast the values of writes to all processes with
timestamps that allow readers to select causally recent values. But this implementation
requires too much communication. (2) Initialize memory locations to invalid states
such that readers of these locations must contact possible writers to get initial values.
Later, every read operation (logically) updates all locations with causally recent values
acquired during the read. Both of these implementations have untenable overhead for
determining and transmitting causally recent values.
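Alternative (1) can be sketched with vector timestamps. Vector clocks are a swapped-in technique chosen here for illustration; the text does not say which timestamp scheme the cited systems use. The sketch is also conservative: it treats every write already applied at a process as a cause of that process's later writes, rather than tracking individual reads. All class and method names are assumptions.

```python
class Replica:
    """One process's copy of memory; writes arrive as timestamped broadcasts."""

    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n      # clock[j] = writes by process j applied here
        self.memory = {}
        self.pending = []         # writes received before their causes

    def write(self, var, value):
        """Apply a local write and return the message to broadcast."""
        self.clock[self.pid] += 1
        self.memory[var] = value
        return (self.pid, list(self.clock), var, value)

    def _ready(self, sender, stamp):
        # It is the next write from sender, and none of its causes is missing.
        return stamp[sender] == self.clock[sender] + 1 and all(
            stamp[k] <= self.clock[k] for k in range(len(stamp)) if k != sender)

    def receive(self, message):
        """Buffer a remote write, then apply every write whose causes are applied."""
        self.pending.append(message)
        progress = True
        while progress:
            progress = False
            for m in list(self.pending):
                sender, stamp, var, value = m
                if self._ready(sender, stamp):
                    self.memory[var] = value
                    self.clock[sender] += 1
                    self.pending.remove(m)
                    progress = True

# p writes v; q sees it and writes w; r receives the two writes out of order.
p, q, r = Replica(0, 3), Replica(1, 3), Replica(2, 3)
m1 = p.write("v", 1)
q.receive(m1)
m2 = q.write("w", 2)
r.receive(m2)                 # held back: its cause (the write to v) is missing
held_back = "w" not in r.memory
r.receive(m1)                 # now both writes are applied, v before w
```

This matches the example in the text: r can never observe the new value of w together with a prior value of v. The cost is visible too, since every write is broadcast to every process with a full vector attached.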
However, we can decrease the need for communication considerably by realizing that
the programmer implicitly defines high-level causality relationships by specifying synchronization using locks or barriers [DKCZ93, JA94]. For example, it is obvious that
all write operations immediately before a barrier are causally related to read operations
immediately afterward, since a barrier release operation is causally related to all further computation. Thus, memory operations for synchronization alone can be used to
accurately deduce causality, thereby reducing computational overhead for determining
causally recent values and transmitting updates. Such a "hybrid" causal consistency
model turns out to be very similar to release and entry consistency.
3.2.2.2 Communication in Weak Models
In weak memory models, we assume that conflicting write operations occur guarded by
synchronization constructs. Therefore, consistency has to be guaranteed at the beginning
(or the end) of synchronization. Consequently, communication depends on how often
synchronization constructs are accessed.
When a process must synchronize its computation with other processes, it waits for
its pages to be updated, and for the synchronization operations to complete. During this
time, computation is suspended waiting for communication. Since we are interested in
using distributed shared memory over wide area networks, we want to overlap computation and communication as much as possible. Therefore, we study the implementation
of synchronization constructs.
3.2.3 Synchronization in Distributed Shared Memory
Early distributed shared memory systems [SMC90, LH89, DCM+90] attempted transparent emulation of shared memory multiprocessors. Therefore, synchronization was
implemented by supporting test-and-set like operations within the system. As a result
[SMC90], synchronization operations such as spinlocks led to poor performance since
the page bouncing (Section 3.2) caused by competing spinlocks requires communication
across the network.
Weakly consistent systems [CBZ95] moved away from the strict transparency requirement, supporting weak consistency models as well as multiple implementations of release
consistency. The programmer is expected to annotate shared memory programs to indicate the optimizations desired (e.g., whether some data should be migrated, while other
data requires multiple writers). In a similar vein, the synchronization operations are implemented by message passing libraries that avoid problems like page bouncing.
However, even these implementations require too much communication and significantly impact application performance. In the following, as an example, we examine the
implementations of locks, barriers, and queues and identify the bottlenecks. Later, we
see how these bottlenecks can be eliminated, and other synchronization constructions
may be suggested.
3.2.3.1 Implementing Locks
Locks are used to ensure exclusive access to sets of memory locations. Simple spinlocks
that spin on a location in shared memory exhibit contention. Contention can be reduced
by using queue locks [And90], where processes waiting for the lock enqueue themselves
rather than spin. When the lock holder releases the lock, the next holder in the lock queue
acquires it. In distributed shared memory systems, the queue is distributed [CBZ95] over
all participants. To enqueue itself, a process puts itself on a local queue, and if the lock is
held by a remote machine, the request is forwarded to that machine. If another machine
has grabbed the lock, the request is appropriately forwarded. When the request reaches
a current holder, the request is queued there. When this request reaches the head of the
queue, the requesting node acquires the lock.
In the best case, the lock is held locally. Otherwise, at least one network access is
needed, and the acquisition grant message can return the data. The new holder must
send an acknowledgment to the previous holder to ensure that the lock has definitely been
acquired. In the worst case, the acquisition request may have to "visit" all nodes before
it succeeds in entering the queue. When network access is needed, lock requesters must
wait at least twice the network latency time, plus data transfer time. Implicitly, the
sender must also wait for twice the network time.
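The forwarding behavior described above can be sketched as a chain of hints. This is a simplified model, not the cited implementation: each machine is assumed to remember only which machine it believes holds the lock, and the names `LockNode`, `request`, and `release` are hypothetical. The hop count returned by `request` stands in for network accesses.

```python
from collections import deque

class LockNode:
    """Per-machine lock state; all names here are illustrative."""

    def __init__(self, name, hint=None):
        self.name = name
        self.hint = hint        # machine believed to hold the lock; None = this one
        self.queue = deque()    # requests queued at this machine

def request(nodes, requester):
    """Forward a request along hints to the holder; return network hops used."""
    hops = 0
    current = nodes[requester]
    while current.hint is not None:
        current = nodes[current.hint]
        hops += 1
    if current.name != requester:       # best case: the lock is held locally
        current.queue.append(requester)
    return hops

def release(nodes, holder):
    """Hand the lock, and any remaining waiters, to the head of the queue."""
    old = nodes[holder]
    if not old.queue:
        return None
    new = nodes[old.queue.popleft()]
    new.queue.extend(old.queue)
    old.queue.clear()
    new.hint = None
    old.hint = new.name
    return new.name

# A holds the lock; B believes A holds it; C's hint is stale and points at B.
nodes = {"A": LockNode("A"), "B": LockNode("B", "A"), "C": LockNode("C", "B")}
```

With the stale hint, C's request takes two hops before it is queued at A, which illustrates why a request may have to visit several machines. Each hop adds one network latency to the requester's wait, on top of the grant and data transfer.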
If such lock accesses could be optimized, in the ideal case the requester should have
the lock and data when needed, and only the sender needs to wait. With coordinated
memory, we show how the programmer can help the system to approximate this case.
3.2.3.2 Implementing Barriers
A barrier synchronizes all processes that participate in the barrier. When the barrier is
active, no process can pass the barrier until all processes (participants) have arrived at the
barrier. Barriers are commonly used in programs to separate phases of a computation.
For example, in methods for the iterative solution of linear equations, successive iterations
are separated by barriers. The barrier ensures that all processes complete their updates
before any process can start the new iteration. Thus, every iteration is guaranteed to
use fresh data.
Barriers are typically implemented [CBZ95, BZS93] using a barrier master process
which collects barrier entry requests from barrier participants. In case the request also
contains updates from processes, the master may merge and rebroadcast the merged
results. Since barriers are usually visited by all processes, it would seem that there is
little to be gained by using distributed barriers [HFM88].
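A centralized barrier of this kind can be sketched as below. This is a single-threaded illustration under assumptions: the class name, the `enter` method, and the message counter are hypothetical, and updates are modeled as dictionaries that the master merges.

```python
class BarrierMaster:
    """Collects one entry request per participant, then releases everyone."""

    def __init__(self, participants):
        self.participants = set(participants)
        self.arrived = {}         # participant -> updates carried by its request
        self.messages = 0

    def enter(self, pid, updates):
        """Handle an entry request; return merged updates once all have arrived."""
        self.arrived[pid] = updates
        self.messages += 1                       # request sent to the master
        if set(self.arrived) != self.participants:
            return None                          # pid must wait at the barrier
        merged = {}
        for u in self.arrived.values():
            merged.update(u)
        self.messages += len(self.participants)  # rebroadcast to every participant
        self.arrived = {}
        return merged
```

For n participants each barrier episode costs 2n messages in this model, and every participant waits for a round trip through the master even if it only needs data from one neighbor.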
However, a barrier entails considerable overhead in both communication and waiting
for all processes to arrive. Consider that the purpose of a barrier is often to
ensure that only fresh data from earlier phases is used, not that the processes should
synchronize; synchronization is only a means to this end. Thus, if the programmer
has access to more specific synchronization constructs that can express this requirement,
the needless synchronization overhead is eliminated. The coordinated memory model
suggests how to enrich the programmer's repertoire of synchronization constructs so that
such information is available to the implementation and may be utilized.
3.2.3.3 Implementing Task Queues
A task queue is used in programs where processes do not have static allocations of work.
Instead, each process starts off by processing some part of the problem data, and enqueues
the result after it is done. Other processes can then pick the data for further processing.
Task queues have been implemented as migratory data [CBZ95] in distributed shared
memory systems. When a process needs to access the queue, the queue is migrated over
to that process, where it can add or remove tasks. Since the information in the queue
must be available to all processes so that they can pick up a task as soon as they become
idle, the migratory data transfer is natural. Experiments [CBZ95] show that contention
for the task queue has a significant impact on application speedup.
Normally, a task queue need not guarantee that the holder of a task has exclusive
access to that data. However, if we observe that in some problems (e.g., parallel quicksort) we can also regard queue access as a request for data and access, then we can
distribute the queue implementation without suffering contention. With coordinated
memory, we can give a precise specification of the effect of using synchronization constructs on the memory. Thus, it becomes possible to optimize data transfer by achieving
synchronization as a side effect of data transfer (similar to message passing).
3.2.4 Our Approach
Early sequentially consistent distributed shared memory systems [LH89, DCM+90] attempted transparent emulation of shared memory multiprocessors. Arguing that these
systems inherently [LS88] require more communication than programmers need, weakly
consistent models were introduced to reduce communication requirements. These weak
models utilized information about synchronization given by the programmer to identify
and eliminate unnecessary communication.
Modern systems [JKW95, BZS93, DKCZ93] extend this trend to gain eciency by
requiring programmers or compilers to specify more information about the program to the
runtime system. Even so, for many programs these systems appear easier to program since
the programmer has less data management overhead.
Just as it is argued that sequential consistency is overly consistent, we argue that traditional synchronization constructs such as locks, barriers, etc., are overly strong. They
specify more constraints on process interaction than are necessary. Thus, they require
processes to wait for each other more than strictly necessary. By evaluating the relationship between synchronization constructs and memory consistency, and by aggressively
utilizing known patterns of process interaction as a part of synchronization, we show that
performance of distributed shared memory systems can be further improved. The coordinated memory model shows how to specify the relationship between synchronization,
consistency, and known process interaction patterns. In many cases, this requires few
or no changes to existing shared memory programs, and has considerable performance
benefits.
3.3 Coordinated Memory
In Coordinated Memory, we coordinate memory accesses using information about previous access patterns. We also achieve synchronization as a side effect of data flow.
To motivate the solution, we again consider the barrier example and show how to
optimize it further. Analyzing the optimization leads to the formal definition of the
coordinated memory model.
3.3.1 Adaptive Barriers
When a barrier is used to separate the phases in an iterative computation, changes (footnote 1) made
to the memory by each process are forwarded to the barrier master. The barrier master
consolidates the changes and sends them to the participants who need the data. But
whatever the optimizations used to transfer data, to go through a barrier, processes have
to synchronize with the barrier master.
Now suppose that the data distribution is known beforehand, and also we know
which processes update which data. As a concrete example, consider the implementation
of finite differencing using an iterative method involving three processes. Assuming the
data is partitioned as in Figure 3.1, after each iteration, process pairs $\langle 1, 2 \rangle$ and $\langle 2, 3 \rangle$
must exchange the pages on the boundaries that contain modified data. Normally,
a barrier would be used to insert a boundary to tell the distributed shared memory
implementation that the data should be updated. However, it is possible to ensure that
stale data is not used without explicit barrier synchronization.
Faking a Barrier. To the same effect as a barrier, we change the processes so that
after completing an iteration, each process transmits the boundary data areas (indicated
in Figure 3.1) to the process that will use it in the next iteration. For example, consider
processes 1 and 2. When 1 sends the data to 2 and receives an acknowledgment, 1 knows
that 2 has received the new data. Similarly, 2 sends the data to 1 and learns that 1 has
the data. Both 1 and 2 know that neither will proceed until it gets the data, therefore
when both have sent the data and received the acknowledgments, they know that neither
will use stale data. Thus, the explicit barrier is unnecessary; the data transmission also
guarantees synchronization.
[Figure 3.1: Data Distribution for an Iterative Linear Equation Solver. The data is divided into three blocks, "Data for Process 1", "Data for Process 2", and "Data for Process 3", separated by "Boundary values needed by 1 & 2" and "Boundary values needed by 2 & 3".]
Footnote 1: Or simply change lists instead of the actual contents. The changes might be retrieved lazily.
However, waiting for an acknowledgment is expensive on a high-latency network.
Thus, further speedup is possible if we can eliminate the wait. Interestingly, it is possible
to eliminate explicit acknowledgments since both pairs of processes must transmit data to
one another. Thus, the data transmission can also be used as acknowledgment. Hence,
in our example, on a reliable network, process 1 can transmit the data to 2 and vice
versa. Both processes know that the other will not start the next phase until it receives
the updated data, so once again the barrier is not necessary.
But most networks are not reliable. Reliable transmission is usually achieved using
acknowledgments, so in a "reliable" network, again latency will affect the computation.
It turns out that we can eliminate acknowledgments most of the time even without a
reliable network. The idea is that each process transmits its data to the other, and also
keeps a local version of the data. For example, suppose that process 1 makes a local
copy of the boundary data and transmits it to 2, and begins its computation for the next
phase (and symmetrically for 2). If that message does not make it, 2 will explicitly ask
1 for the data; 1 guarantees that until it knows that 2 no longer needs the data it will
keep its local version. Now if the message gets to 2, then 2 can continue its computation
and when it transmits the results before beginning the next phase, 1 will learn that 2
indeed received the data. Thus, by keeping at most two local versions of the data, the
need for the barrier has been eliminated. By assuming that networks are mostly reliable,
this optimistic strategy will be able to avoid explicit synchronization for most iterations.
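The optimistic strategy can be sketched with a lockstep simulation of two neighbors. This is an illustration under assumptions, not the thesis implementation: message loss is modeled by a fixed set of (sender, phase) pairs, a lost message is recovered by an explicit request that the sender answers from its retained copy, and the names `Peer` and `exchange` are hypothetical.

```python
class Peer:
    """One process that retains boundary data until the neighbor has moved on."""

    def __init__(self, name):
        self.name = name
        self.phase = 0
        self.kept = {}        # phase -> boundary data retained for resending
        self.inbox = {}       # phase -> boundary data received from the neighbor
        self.max_kept = 0     # largest number of versions retained at once

    def finish_phase(self):
        """Produce boundary data for the current phase and retain a copy."""
        data = f"{self.name}:{self.phase}"
        self.kept[self.phase] = data
        self.max_kept = max(self.max_kept, len(self.kept))
        return self.phase, data

    def deliver(self, phase, data):
        """Receive neighbor data for `phase`; it doubles as an acknowledgment."""
        self.inbox[phase] = data
        # The neighbor finished `phase`, so it must have had our data for
        # every earlier phase; those retained versions can be dropped.
        for old in [p for p in self.kept if p < phase]:
            del self.kept[old]

def exchange(a, b, lost):
    """Run one phase boundary; return how many explicit requests were needed."""
    requests = 0
    sent = {p.name: p.finish_phase() for p in (a, b)}
    for sender, receiver in ((a, b), (b, a)):
        phase, data = sent[sender.name]
        if (sender.name, phase) in lost:
            requests += 1                 # receiver asks explicitly
            data = sender.kept[phase]     # sender answers from its kept copy
        receiver.deliver(phase, data)
    a.phase += 1
    b.phase += 1
    return requests

a, b = Peer("1"), Peer("2")
total_requests = sum(exchange(a, b, lost={("1", 1)}) for _ in range(4))
```

In this run only the one dropped message triggers an explicit request; every other phase boundary is crossed with data messages alone, and neither peer ever retains more than two versions.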
The transformations described above can be implemented by altering the program
either manually or using compiler techniques. However, we can show that explicit data
manipulation within the program is unnecessary; it is possible to implement an adaptive
barrier that optimizes itself at run time.
Adaptive Barriers. In an adaptive barrier implementation, we assume that the data
distribution and the shared boundary data do not change from iteration to iteration.
We begin with a conventional implementation, where processes contact a barrier master
in order to enter a barrier. The barrier master collects the data requests and after
combining them takes further action.
With the static distribution assumption, suppose the barrier is hit after the first
iteration and every process communicates required boundary data locations to the barrier
master. Now, the barrier master collates the data and informs all processes about the
future requirements of all other processes. The processes acknowledge the information, and
then the barrier master allows the processes to proceed. After that, the processes need
never again communicate with the barrier master, since they all know one another's data
requirements. The implicit or data-driven synchronization can be employed for all future
iterations. Thus, we get the efficient barrier without any change to the programs.
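The collation step performed by the barrier master on the first iteration can be sketched as follows. The function name, the representation of boundary locations as strings, and the example sets for the three processes of Figure 3.1 are assumptions made for illustration.

```python
def collate(writes, needs):
    """Barrier master step: derive, for each process, what to send to whom.

    writes[p] is the set of boundary locations p updates in an iteration;
    needs[p] is the set of locations p reads in the next iteration.
    """
    plan = {p: {} for p in writes}
    for sender, written in writes.items():
        for receiver, needed in needs.items():
            shared = written & needed
            if receiver != sender and shared:
                plan[sender][receiver] = shared
    return plan

# Hypothetical boundary locations for the three-process partition.
writes = {1: {"b12_upper"}, 2: {"b12_lower", "b23_upper"}, 3: {"b23_lower"}}
needs = {1: {"b12_lower"}, 2: {"b12_upper", "b23_lower"}, 3: {"b23_upper"}}
plan = collate(writes, needs)
```

Once every process holds its entry of `plan`, it can send its boundary data directly to the listed neighbors at the end of each later iteration, so the master is contacted only once.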
For situations where we cannot guarantee that processes share the same memory
locations from iteration to iteration, the adaptive approach can be used, but the tradeoffs
must be carefully evaluated. For example, if we know that each process will require data
66
receiver must be able to store the data and hand it over to the next acquirer. Thus,
depending on the accuracy of the statically expected communication pattern, we can
avoid communication and make the data available to the receiver when needed. If the
static pattern is violated, the implementation will still be correct at the cost of extra
data management and message exchange.
Similarly, if we use task queues where the next task request can be anticipated, then
the enqueueing process can send the task directly to the desired processor.
This technique allows us to leverage known patterns to enhance performance. With
increasing research in identifying such patterns, we can apply them to distributed shared
memory programs without modifying user programs.
3.4 Designing Consistency Protocols
In this section, we show how to use the design method of Chapter 2 to guide the development of protocols for distributed shared memory. In practice, we developed both the
protocols and the design method concurrently. Experience gained in one area guided the
research in the other area.
The method can be used in two ways. In one, we start top down and refine the
desired specification. We can use the Constraint-rule specifications at the high-level for
very simple models, or also model details. Just as we can show that one Action-rule
specification implements another, we can also show that one constraint rule specification
implements another. In the other way, we can model a new implementation at a low
level, for example, using action rules, and attempt to verify that it implements some
desired specification. In the following, we give Constraint-rule example specifications for
several distributed shared memory protocols, and suggest possible implementations. We
also show an example of Action-rule specifications used to justify the adaptive barrier.
3.4.1 Consistency Specification
First consider very high-level specifications for ordinary, sequentially consistent distributed shared memory. Protocol 10 specifies a single writer protocol for two processes.
Protocol 10
Automata:
Rules:
$P_1 : n_1 \rightarrow w_1 \rightarrow n_1$
$P_2 : n_2 \rightarrow w_2 \rightarrow n_2$
$(\vee : w_1, w_2)$
In this protocol, the n states represent the regions where the process has no page, while
the w states represent those where the process has a page and is a writer. The constraint tells us that
only one process may be a writer at a time.
We might interpret this as a mutual exclusion protocol, and implement it using token
rings or any other suitable distributed mutual exclusion algorithm.
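As a minimal sketch, the single-writer property of Protocol 10 can be checked by exhaustive search over the product of the two automata. The sketch assumes that the disjunctive constraint forbids any global state in which both processes occupy a w state. The function names and the string encoding of states are hypothetical and are not part of the notation of Chapter 2.

```python
from collections import deque

# Each process cycles n -> w -> n, as in the rules of Protocol 10.
NEXT_STATE = {"n": "w", "w": "n"}


def allowed(state):
    """Assumed reading of the disjunctive constraint: at most one writer."""
    return state.count("w") <= 1


def reachable_states(start=("n", "n")):
    """Breadth-first search over global states that satisfy the constraint."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for i, local in enumerate(state):
            # Advance process i by one local transition.
            successor = state[:i] + (NEXT_STATE[local],) + state[i + 1:]
            if allowed(successor) and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen
```

Under this reading, the search visits only the three global states in which at most one process is a writer, which is the mutual exclusion interpretation given above.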
Next consider the specification of a multiple-reader protocol.
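Protocol 11
Automata:
Rules:
$P_1 : n_1 \leftrightarrow w_1 \leftrightarrow r_1 \leftrightarrow n_1$
$P_2 : n_2 \leftrightarrow w_2 \leftrightarrow r_2 \leftrightarrow n_2$
$(\lor : w_1, w_2)$
$(\land : r_1, r_2)$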
Here, we have added the state r to represent readers. In this specification, we would
actually like to say that any process can be in either n or r, but at most one process can
be in w. But the specification does not permit multi-state state constraints as yet. So
we get a somewhat clumsy model, but it helps in developing the more general model.
In this specification, we can interpret the conjunctive constraint as a group membership protocol, rather than as a barrier. This interpretation would allow us to naturally
model the change from reader to writer as a group leader election.
These two models are very high level. They tell us that the initial specification makes
sense, and that the introduction of reader copies is harmless.
We can also use the abstract specifications to consider a few more details.
One approach commonly used in distributed shared memory implementations assigns a home
site to each page. All data about page copies (group members) is maintained there. This
is a statically determined membership protocol. In another approach, the current writer
serves as the coordinator that grants requests for read copies and write copies. One may
rely on randomizing factors like network delays and the interference of other computations
to ensure fairness. We can model these possibilities in the specification.
Consider a model with the static home page.
Protocol 12
Automata:
Rules:
$N : n_N \to a_N \to g_N \to w_N$
$H : n_H \to a_H \to f_H \to g_H \to n_H$
$\quad : n_H \to a'_H \to f_H \to w_H$
$W : w_W \to f_W \to n_W$
$(\lor : a_N, a_H, a'_H)$
$(\land : g_N, g_H)$
$(\land : f_H, f_G)$
In this case, we have shown parts of the state machines of three processes. N is a process
that requests write access; it is originally in state n with no page. H is the home process
from which N requests the page. W currently has write access. In state a, N asks for
write access, H fetches the page from W (state f) and grants it to N in state g. The
disjunctive constraint between the ask states a ensures that only one ask succeeds; here
the disjunction is interpreted as arbitration by the home site. The fetch and grant (f
and g) constraints are simple request-response exchanges. The path for H through $a'$
illustrates the flow when H itself requests write access.
This specification illuminates the basic exchanges between the automata to implement
the single-writer protocol.
Just as we showed that an Action-rule specification can be an implementation of a
Constraint-rule specification, we can also show that one Constraint-rule specification is
an implementation of another. The methods are the same: establish maps that relate
the higher-level specification to the lower-level specification, and show correspondences
between the traces. Here, we can do such an analysis for the entire system, because we
have still managed to abstract communication. It is easy to show that Protocol 12 is an
implementation of Protocol 10.
3.4.2 Adaptive Barrier
So far, we have seen examples of using Constraint-rule specifications. For analyzing the
correctness of the adaptive barrier, we use a notation that directly expresses dependencies
by the enabling and disabling transitions. Protocol 13 expresses the dependencies shown
in Figure 3.1.
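Protocol 13
Automata:
Rules:
$P_1 : n_1 \to w_1 \to b_1 \to b'_1 \to w'_1 \to n'_1$
$P_2 : n_2 \to w_2 \to b_2 \to b'_2 \to w'_2 \to n'_2$
$P_3 : n_3 \to w_3 \to b_3 \to b'_3 \to w'_3 \to n'_3$
$((n_1, w_1) \Rightarrow (w_2, b_2)),\ ((n_3, w_3) \Rightarrow (w_2, b_2)),$
$((n_2, w_2) \Rightarrow (w_1, b_1)),\ ((n_2, w_2) \Rightarrow (w_3, b_3))$
$((b'_1, w'_1) \Rightarrow (w'_2, n'_2)),\ ((b'_3, w'_3) \Rightarrow (w'_2, n'_2))$
$((b'_2, w'_2) \Rightarrow (w'_1, n'_1)),\ ((b'_2, w'_2) \Rightarrow (w'_3, n'_3))$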
In this protocol, process P2 can enter the barrier only when processes P1 and P3 enter
the barrier, while P1 and P3 depend only on process P2. In a regular barrier, all three
processes would depend on one another. Analyzing the traces of the protocol shows that the
barrier transitions $(b, b')$ are not simultaneous in the sense that the transitions all occur
after their predecessors and successors. Rather, the reduced dependency allows P1 (or
P3) to proceed to the next iteration of the barrier without waiting for P3 (P1) from the
previous transition. But the dependence on P2 ensures that only the second iteration can
be started this way, not the third. Thus, what we have implemented is a barrier in which
computations in two phases can be merged. We satisfy specifications where P2 proceeds
when P1 and P3 are done, while the dependency between P1 and P3 is derived implicitly
through their dependence on P2.
More formally, this is expressed by specifications like Protocol 14, where P2 synchronizes separately with P1 and P3 through the constraints between the regions g and h. In
a usual barrier, all three processes would synchronize through a conjunctive constraint
on a common region.
Protocol 14
Automata:
Rules:
$P_1 : h_1 \to h'_1$
$P_2 : g_2 \to h_2 \to g'_2 \to h'_2$
$P_3 : g_3 \to g'_3$
$(\land : h_1, h_2),\ (\land : h'_1, h'_2)$
$(\land : g_1, g_2),\ (\land : g'_1, g'_2)$
In this case, the reinterpretation is harmless, so we can accept the implementation.
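The effect of the reduced dependencies can be explored with a small exhaustive search. The following is only a sketch under stated assumptions: each process alternately reaches and leaves a numbered barrier, and a process may leave barrier k once every process it depends on has reached barrier k. The counters, the dependency tables and the bound ROUNDS are hypothetical simplifications and do not reproduce the Action-rule notation.

```python
from collections import deque

# Hypothetical encoding: indices 0, 1, 2 stand for P1, P2, P3.
ADAPTIVE = {0: (1,), 1: (0, 2), 2: (1,)}      # P1 and P3 wait only for P2
REGULAR = {0: (1, 2), 1: (0, 2), 2: (0, 1)}   # every process waits for all others
ROUNDS = 3                                     # bound on the number of barriers


def successors(state, deps):
    """Yield the global states reachable in one step."""
    arrived, passed = state
    for i in range(3):
        if arrived[i] == passed[i]:
            if arrived[i] < ROUNDS:            # finish computing, reach the barrier
                bumped = arrived[:i] + (arrived[i] + 1,) + arrived[i + 1:]
                yield (bumped, passed)
        elif all(arrived[j] >= arrived[i] for j in deps[i]):
            bumped = passed[:i] + (passed[i] + 1,) + passed[i + 1:]
            yield (arrived, bumped)            # leave the barrier


def reachable(deps):
    """Breadth-first search from the state where nobody has reached a barrier."""
    start = ((0, 0, 0), (0, 0, 0))
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in successors(state, deps):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def max_lead(deps):
    """Largest (barriers left by P1) minus (barriers reached by P3)."""
    return max(passed[0] - arrived[2] for arrived, passed in reachable(deps))
```

In this model, max_lead(ADAPTIVE) is 1 and max_lead(REGULAR) is 0: with the reduced dependencies P1 can start its second iteration before P3 has reached the first barrier, but it cannot start a third, which matches the informal argument above.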
3.4.3 Summary
In this section, we saw how to model the consistency protocols. But the models also show
that the method has many limitations. We have developed only the basis; the method
lacks support for constraints involving multiple states in the same process, and ways to specify
constraints for a variable number of processes. Thus, when induction is necessary, we
need a manual check. At present, this is a common failing of specification methods that
use exhaustive search for verification. Future research should reveal solutions.
3.5 Implementation and Performance
This section presents experimental performance results. The experiments are designed
to explore the effects of synchronization-related communication, and to see whether
programs execute more quickly with adaptive coordinators.
Coordinated memory was developed as a part of the XUNET/BLANCA project which
explores research issues in wide-area ATM gigabit networks. Therefore, the main problem for implementing distributed shared memory over a wide area network is that of
latency. However, ample bandwidth is available, so we sought a distributed memory
model that would allow bulk-data transfer and optimistic communication to compensate
for the latency bottleneck. Release consistency and entry consistency are two models that
allow bulk-data transfer, and can be extended for optimistic communication. However,
we soon discovered that the synchronization requirements became the bottleneck. Adaptive
coordinators were developed to relieve this bottleneck. Thus, the
genesis of coordinated memory is partly a result of the unique experimental platform.
3.5.1 Experimental Platform
Our testbed was a four-node cluster of SGI Onyx workstations, one at the University of
Wisconsin, another at NCSA at Illinois, and two in the Computer Science Department
at Illinois, connected through ATM switches developed at AT&T. We normalized the
latency to 74ms roundtrip on all four nodes using the communication latency between the
NCSA and Wisconsin workstations to calibrate the behavior. The ATM interconnection
was available through locally developed HXA HiPPI to Xunet adaptors that connected
TCP sockets (see footnote 2) to the underlying network. The available bandwidth for 4KB buffers with
a 1280 KB TCP window averaged around 103 Mb/s, with a peak of about 140.8 Mb/s.
With 64KB buffers, a peak of 189 Mb/s was observed. While the underlying network
is capable of a raw bandwidth of 600 Mb/s, the interconnect with the HiPPI adaptor
and TCP overhead considerably affect performance. For comparison, the experiments
were also executed on a local-area Ethernet.
Coordinated memory is implemented as an application program using standard Unix
facilities. The implementation has two major components: the library of adaptive coordinators that are implemented using message passing, and the virtual memory manipulation
system that traps non-coordinator accesses to ensure consistency.
3.5.2 Applications
We selected three applications to evaluate coordinated memory. The first application,
matrix-multiply, models the trivial case of no coordination. The program multiplies
two 400 × 400 integer matrices and puts the result in a third matrix. We split the
computation equally between the four processes. Each process independently computes
the result. This application serves as a canonical example of an embarrassingly parallel
application, where coordinated memory allows unrestricted replication of data.
The sequential program runs for 32 seconds. Figure 3.2 shows the speedup for four
cases: with Xunet and Ethernet, with and without pre-replication of the matrices. The
Xunet without replication is slower than the others, but the application is compute
bound, and the differences are not apparent with only four processors.
Footnote 2: UDP communication turned out to be very unreliable.
[Figure 3.2: Matrix Multiplication. Speedup versus number of processes for Xunet, Xunet (adaptive), Ethernet, and Ethernet (adaptive).]
[Figure 3.3: Successive Over Relaxation. Speedup versus number of processes for Xunet, Xunet (adaptive), Ethernet, and Ethernet (adaptive).]
The second application, SOR (Successive over relaxation) is an iterative method of
solving partial differential equations (PDE). The program models the discretized area
over which PDE is solved as a matrix. During each iteration, each matrix element
is updated by averaging the values of its four perpendicular neighbors. The program
divides the matrix into rows, and each process computes the averages on a row. Only
the boundary elements of each row are shared with other processes. We computed 50
iterations for a 1024 × 3072 matrix. After each iteration, the processes synchronize on a
barrier. In our case, we experimented with normal and adaptive barriers to explore the
performance impact of adaptive barriers.
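The update rule and the row-wise division of work can be illustrated with a short sequential sketch. It assumes that all new values in a sweep are computed from the values of the previous sweep and that boundary elements stay fixed; the text does not say whether the original program updates in place. The helper names relax and row_bands are hypothetical.

```python
def relax(grid):
    """Return a new grid in which every interior element is the average
    of its four perpendicular neighbours; boundary elements are kept."""
    rows, cols = len(grid), len(grid[0])
    result = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            result[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                            + grid[i][j - 1] + grid[i][j + 1]) / 4.0
    return result


def row_bands(rows, processes):
    """Split row indices 0..rows-1 into contiguous bands, one per process."""
    base, extra = divmod(rows, processes)
    bands, start = [], 0
    for p in range(processes):
        size = base + (1 if p < extra else 0)
        bands.append(range(start, start + size))
        start += size
    return bands
```

With such a division, a process needs from its neighbours only the rows adjacent to its own band, which is why only boundary elements are shared and why a barrier after each sweep is the natural synchronization point.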
The sequential program completes in 119 seconds. Figure 3.3 shows that with the
Ethernet, a local-area low-latency network, the presence or absence of a barrier does not
have much effect. Some impact is apparent with four processes. However, it can be clearly
seen that for a high-latency network, the difference between the versions with a normal
barrier and with an adaptive barrier is remarkable. The version with the adaptive barrier
exhibits nearly the same speedup as with a local area network, because the processes do
not wait for one another. They optimistically send the changes to the intended recipient
and continue their computation. This overlaps communication and computation, and
the network latency has little effect; it is also compensated by the bandwidth. With an
adaptive barrier, the speedup is limited only by load imbalance.
[Figure 3.4: Quicksort. Speedup versus number of processes for Xunet, Xunet (adaptive), Ethernet, and Ethernet (adaptive).]
The final application, quicksort, was chosen as an example of an application that
exhibits low speedups with release consistency [CBZ95] due to contention over the queue.
The quicksort uses an explicit distributed queue to coordinate data transfers, which can
even be anticipated in some cases (Section 3.3.1). The program partitions an unsorted
list of 512K integers into sublists. Small sublists are locally sorted with bubblesort, while
larger ones are enqueued into a workqueue. Whenever possible, the enqueueing process
explicitly delegates the task; otherwise, a process that has completed its task must dequeue
a task from the workqueue. Coordinated memory allows this optimization without affecting
user programs, as discussed earlier (Section 3.3.1).
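The structure of the program can be sketched sequentially as follows. This is a single-process illustration only: the workqueue is an ordinary deque, there is no delegation to other processes, and the cutoff THRESHOLD below which sublists are bubble-sorted is a hypothetical value that the text does not give.

```python
from collections import deque

THRESHOLD = 8  # hypothetical cutoff for "small" sublists


def bubble_sort(items):
    """Return a sorted copy of a small list."""
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        for k in range(end):
            if items[k] > items[k + 1]:
                items[k], items[k + 1] = items[k + 1], items[k]
    return items


def queue_quicksort(data):
    """Sort by partitioning; sublists are tasks taken from a workqueue."""
    data = list(data)
    work = deque([(0, len(data))])          # tasks are half-open index ranges
    while work:
        lo, hi = work.popleft()             # dequeue the next task
        if hi - lo <= THRESHOLD:
            data[lo:hi] = bubble_sort(data[lo:hi])
            continue
        pivot = data[(lo + hi) // 2]
        left = [x for x in data[lo:hi] if x < pivot]
        middle = [x for x in data[lo:hi] if x == pivot]
        right = [x for x in data[lo:hi] if x > pivot]
        data[lo:hi] = left + middle + right
        work.append((lo, lo + len(left)))    # enqueue both sublists as new tasks
        work.append((hi - len(right), hi))
    return data
```

In the distributed version described above, the two append calls are the points at which an enqueueing process could hand a sublist directly to an idle process instead of placing it in the shared queue.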
Figure 3.4 shows the speedup for quicksort (for a sequential time of 74 seconds).
With the adaptive queue, the quicksort program exhibits behavior similar to SOR; its
speedup is limited only by load imbalance. With the explicit distributed queue, there
is little communication once the tasks are farmed out, and processes rarely wait for
work. However, without the adaptive queue, speedup is severely limited, especially for
the high-latency case.
3.6 Summary
The preceding sections presented a technique for overlapping computation and communication by minimizing the contention and waiting period for synchronization.
The adaptive synchronization structures combine communication with synchronization. In addition, they allow optimistic communication, so that processes avoid blocking for
one another. Thus, they boost application performance even over hostile environments
such as wide area networks.
Notice that in our applications, with adaptive coordination, performance becomes
limited by load imbalance. Further, the virtual memory driven communication implies
that data scattered in local memory must of necessity be communicated sequentially. On
the other hand, we have already observed that known patterns of communication can
be used to guess future data requirements. Such patterns could be conceivably used for
guessing load requirements as well as anticipating future communication.
Chapter 4
A Software Architecture
In this chapter, we present a new software architecture used to build the Choices virtual
memory and distributed shared memory system. The architecture is useful in any application that involves concurrent operations over groups of objects, including objects on
remote machines. The architecture permits incremental extensions that add new objects
and new operations. It uses object-oriented state machines to program the operations,
permitting incremental extension of state-machine driven logic. The architecture also
includes techniques for resource management.
4.1 Goal
We motivate the architecture with virtual memory as the primary example. We show
that current software architectures can lead to change-resistant systems, and suggest
ways to factor the objects so that incremental changes become easy.
4.1.1 The Problem
Let us briefly recapitulate the basic concepts of traditional virtual memory systems.
Computer architectures provide hardware that permits memory addresses issued by the
processor to be late-bound to addresses actually used to access physical memory. The
hardware maintains a per-process table (or a cache) that maps virtual addresses to physical addresses to facilitate the late-binding. This allows operating systems to implement
per-process virtual address spaces that are far larger than available physical memory.
The physical memory is used to cache the contents of the virtual memory that actually
reside on the secondary disk storage. For various reasons [Tan92], the caching system
manipulates chunks of data called pages rather than the smallest addressable unit of
physical memory. Normally, a virtual memory address transparently maps to a physical
address and memory is accessed. But if a virtual memory access requires a page not
available in physical memory, then a page fault is said to occur. The fault is handled
by virtual memory management software that accesses secondary storage to locate the
desired memory contents and moves them to physical memory. Periodically, relatively unused pages (selected according to a paging policy) from the physical memory are paged
out to disk by a process called the pageout daemon. The pageout daemon must also
remove the virtual to physical address binding when a page is removed from physical
memory. The virtual memory subsystem of an operating system thus interacts with user
level processes, the file system, pageout daemons, the process system, and the hardware.
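The mechanism recapitulated above can be sketched as a toy pager. The class TinyPager and its fields are hypothetical. For brevity the sketch pages out on demand when no frame is free and evicts the oldest binding, whereas the text describes a periodic pageout daemon driven by a paging policy.

```python
class TinyPager:
    """Toy demand pager: a page table caching disk pages in a few frames."""

    def __init__(self, disk, frames):
        self.disk = disk                  # virtual page number -> contents
        self.free = list(range(frames))   # unused physical frame numbers
        self.table = {}                   # virtual page -> physical frame
        self.memory = {}                  # physical frame -> contents
        self.faults = 0

    def read(self, vpage):
        if vpage not in self.table:       # no binding: a page fault occurs
            self.page_fault(vpage)
        return self.memory[self.table[vpage]]

    def page_fault(self, vpage):
        self.faults += 1
        if not self.free:
            self.page_out()
        frame = self.free.pop()
        self.memory[frame] = self.disk[vpage]   # fetch from secondary storage
        self.table[vpage] = frame

    def page_out(self):
        # Assumed policy: evict the oldest binding (dicts keep insertion order).
        victim = next(iter(self.table))
        frame = self.table.pop(victim)    # remove the virtual-to-physical binding
        self.disk[victim] = self.memory.pop(frame)
        self.free.append(frame)
```

For example, with two frames and three pages, reading pages 0, 1 and 2 in turn causes three faults and removes the binding for page 0, so a later read of page 0 faults again.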
Modern virtual memory systems further complicate this state of affairs. They support
shared memory between processes, copy-on-write sharing, user-level page management,
distributed shared memory and other facilities. If processes share virtual memory, then
the page-fault system has to contend with simultaneous page faults from multiple processes. The pageout daemon has to manipulate the virtual to physical address maps for
several processes. If a page is shared copy-on-write, then the page-fault system must
copy the page for a write access, but otherwise repair faults as usual. If a page is a part of
distributed shared memory, then page-fault handling may require interaction with other
machines across the network. User-level page management requires the paging system to
make upcalls from kernel level to user level. Thus, page-fault handling becomes complex,
and therefore programming a fault handler is tedious and error-prone. We argue that
current approaches to virtual memory design result in systems that are hard to
understand and change.
4.1.2 Our Solution
In the following, we present a software architecture that simplifies the construction of
such modern virtual memory systems. Our architecture factors out common structure
and behavior for virtual memory management systems. The factorization makes it possible to add data structures and logic for new facilities incrementally, starting from a
simple traditional virtual memory system. The design disentangles concurrent interactions between the virtual memory, process, file and networking systems. The resulting
system can be either embedded inside the operating system, or used as an external paging
facility.
4.2 Background and Related Work
We first explain the architecture used for current virtual memory systems [KN93, RJY+87,
Rus91]. We begin by analyzing the requirements of virtual memory systems, and suggest
the basic objects for the system. Then we discuss how other parts of an operating system
interact with the virtual memory system. The basic objects together with the interactions
reveal the overall structure of the system.
Next we study how the structure changes when new capabilities like copy-on-write and
distributed shared memory are added to the system. We argue that the usual framework
structure [Rus91, Lim95] results in a virtual memory system that is hard to change. The
arguments motivate an architecture that refactors the framework and reduces spurious
dependencies, making it easier to change.
4.2.1 Basic Objects
The basic requirements of the virtual memory system are derived by looking at the primary clients, the user-level processes. A user process issues virtual addresses in its address
space, and the virtual memory system translates them to physical addresses, retrieving
memory contents from disk when necessary. A user process may also manipulate regions
(ranges of addresses) of the address space. For example, a process may map files (or
parts thereof) to regions with read, write, or execute permissions. These requirements
suggest the following basic objects:
Domains that represent the address space,
MemoryObjects that represent datasets like files,
MemoryObjectViews that represent parts of memory objects,
MemoryObjectCaches that maintain the physical pages that hold the contents of
MemoryObjects,
and AddressTranslation that manages virtual memory hardware.
The other clients are daemon processes such as networking drivers and pageout daemons. These processes operate on physical pages, mapping or unmapping them to virtual
addresses in different address spaces. They may also use physical pages that belong to
the kernel. This suggests the need for a Store object that regulates the distribution of
physical pages between various memory object caches, the kernel, device drivers, and
pages that are unused.
4.2.2 Interactions
The file system, the process system, and the networking system also interact with the virtual
memory system. The file system supports operations on MemoryObjects that convey data to or
retrieve data from the disk. The virtual memory system uses the file system to repair page
faults. The file system uses the virtual memory system to access physical pages used in
file caches. The process system interacts with the virtual memory system to manipulate
AddressTranslation when scheduling and descheduling processes. The networking system
interacts with the virtual memory system to make its pages accessible to the device
driver.
4.2.3 Operations
Given the basic objects, let us consider the structure of typical virtual memory operations. The code for the operations is distributed throughout the objects, creating a
framework [JR91].
A page fault is detected at the user level. Given the process and the virtual address,
the handler invokes a pageFault method on the Domain associated with that process.
The pageFault determines the memory object, the associated physical page (allocating
one from the Store if necessary), and issues a read request on the memory object.
A pageout request is generated when the Store runs low on available pages. The
pageout daemon visits several MemoryObjectCaches, invoking their pageOut method to
release a few physical pages. The data in the physical pages is written out to the disk using
the write method of the MemoryObject cached by a MemoryObjectCache. The virtual to
physical address maps referring to that page are also altered. These maps, implemented
as AddressTranslation objects, are associated with address spaces (Domain). So the pageout
daemon must either visit all Domain objects and operate on all AddressTranslations, or
maintain a reverse map.
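A reverse map of the kind mentioned above might look as follows. This is a minimal sketch: the class ReverseMap, its method names and the use of plain strings and integers for domains, addresses and pages are assumptions made for illustration, and do not describe the data structures of any particular system.

```python
from collections import defaultdict


class ReverseMap:
    """Hypothetical physical-to-virtual map kept beside the forward maps."""

    def __init__(self):
        self.forward = {}                  # (domain, vaddr) -> physical page
        self.reverse = defaultdict(set)    # physical page -> {(domain, vaddr)}

    def map(self, domain, vaddr, page):
        key = (domain, vaddr)
        old = self.forward.get(key)
        if old is not None:                # drop a stale reverse entry first
            self.reverse[old].discard(key)
        self.forward[key] = page
        self.reverse[page].add(key)

    def page_out(self, page):
        """Remove every binding to `page` and return the bindings removed."""
        bindings = self.reverse.pop(page, set())
        for key in bindings:
            del self.forward[key]
        return bindings
```

With the reverse map, a pageout touches only the bindings that refer to the page being removed, instead of scanning the translations of every Domain.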
4.3 Why the New Architecture
The preceding discussion portrays the structure of conventional virtual memory implementations with basic facilities.
We now consider how the virtual memory system changes when new capabilities are
added. We analyze the difficulties, identifying parts that must be refactored.
4.3.1 Examples
Consider adding support for shared memory to the above system. Shared memory is
implemented by allowing multiple Domains to map a single MemoryObject, perhaps with
multiple MemoryObjectViews. The addition of shared memory means that multiple virtual
addresses from different domains may map to the same physical page. The pageout
daemon now requires a physical address to virtual address map that references multiple
Domains or AddressTranslations. Moreover, a pageout operation may conflict with multiple
pagein requests, and multiple pagein requests may conflict with one another. The policy
that selects physical pages during pageout may also be altered to favor non-shared pages.
This requires alteration to the synchronization code.
Next consider adding support for interprocess communication via virtual memory manipulation. For example, to transmit data between processes P and Q with different address
spaces, the physical page mapped to P's address space can be mapped to the address
space of Q. The transmission may have different semantics: for example, the mapping
for P might be removed after transmission, or the physical page may be mapped with a
copy-on-write permission. Should P or Q write to the data, the physical page is copied,
so that the process that has not modified its copy has the original data. The communication is implemented by determining the Domain associated with the source process and
the physical page associated with the source address. The Domain of the destination
process is modified to map the destination address, and the AddressTranslation is modified to associate it with the physical page. Like the addition of shared memory, adding
copy-on-write (especially if copies of copies are allowed) requires changes to the maps. It
also requires changes to the synchronization code, since the implementation may simultaneously operate on multiple Domains and AddressTranslations. Unlike the addition of
shared memory, adding copy-on-write requires changes to the pagefault repair code. A
virtual page may now have an extra state, copy-on-write, in addition to the usual write,
read, execute.
Addition of interprocess communication creates additional changes. For example,
both the file system and the networking system can use virtual memory manipulations
to convey data between user-level processes and the device drivers. In many cases, this
may be faster than copying data between physical pages.
Finally, let us briefly consider adding support for distributed shared memory. Virtual
pages in distributed shared memory have additional states such as has-distributed-copies or
exists-on-remote machines. Pagefault repair may require the retrieval of page contents
across the network. The system must also implement complex page consistency protocols.
This requires extensive changes to the pagefault routines.
The changes discussed so far may be classified as follows:
Changes to objects that associate information: for example, shared memory requires changes to physical-to-virtual address maps maintained in MemoryObjectCaches.
New objects that add new associations: for example, data structures that remember
pages that are copy-on-write copies.
Changes to synchronization code. The usual implementation strategy is to add
semaphores to the methods of objects. For example, MemoryObjectCache uses
a semaphore to resolve conflicts between pagefaults and pageouts. Deadlock is
avoided by ordering the semaphores for various objects in a hierarchy, and ensuring
that different virtual memory operations visit various objects in an ascending order [Tan92]. When new operations are added, we have to devise suitable semaphore
hierarchies.
Changes to pagefault and pageout procedures, as virtual pages acquire new states,
and handling pagefaults and pageouts becomes more involved.
New interactions between the virtual memory system and the rest of the operating system. These arise when new capabilities of the virtual memory system are
exploited in the rest of the operating system.
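The hierarchy-based strategy for avoiding deadlock, mentioned in the list above, can be sketched as follows. The sketch uses Python locks as stand-ins for semaphores. The class RankedLock, the numeric ranks and the helper names are hypothetical; the text does not specify the actual hierarchy.

```python
import threading


class RankedLock:
    """A lock with a fixed position in a hypothetical lock hierarchy."""

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self.lock = threading.Lock()


def acquire_ascending(locks):
    """Acquire all locks in ascending rank order and return that order."""
    ordered = sorted(locks, key=lambda item: item.rank)
    for item in ordered:
        item.lock.acquire()
    return ordered


def release_all(ordered):
    """Release in the reverse of the acquisition order."""
    for item in reversed(ordered):
        item.lock.release()


def run_operation(locks, log, label):
    """Run one operation while holding every lock it needs."""
    held = acquire_ascending(locks)
    try:
        log.append(label)
    finally:
        release_all(held)
```

Because every operation acquires its locks in the same global order, two operations that name the same objects in opposite orders cannot each hold a lock that the other is waiting for. The cost noted above remains: every new operation must be fitted into the hierarchy.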
4.3.2 Why Change is not Easy
Applying these changes is tedious in the usual framework structure where objects not
only hide implementation details, but also distribute control flow. For example, pagefault
processing consists of a set of method calls beginning with a call by the user process on
a Domain. Domain locates the appropriate MemoryObject and invokes pageFault, which in turn
locates the MemoryObjectCache; the MemoryObjectCache fetches a physical page from Store
and fills the page by invoking MemoryObject::read. In this structure, logically pagefault
processing may simply be understood as the method Domain::pageFault. While this is
logically elegant and attractive, it makes changes more difficult. Adding new objects or
changing existing associations becomes difficult, as assumptions about the organization
permeate the method calls and the selection of method call parameters, and the call chain gets
harder to understand. Using inheritance to redefine objects exacerbates the difficulty;
the call processing now gets distributed over the inheritance hierarchy in addition to
the objects. Embedding synchronization further complicates matters: the hierarchy of
semaphores used to resolve deadlocks becomes implicit in the structure of the call chain.
Furthermore, many such call chains originating at different objects appear in the system. For instance, the pageout daemon visits MemoryObjectCaches to remove pages. But
removing pages requires operations on AddressTranslations to alter the map and MemoryObject to move data to disk. This creates a call chain that visits objects in a different
order.
Interactions between the virtual memory system and file system multiply the difficulties. For example, when a page is removed from a MemoryObjectCache during pageout,
the method MemoryObjectCache::write is invoked. That method invokes the disk driver
to move the data to disk. In turn, the disk driver uses interprocess communication via
virtual memory manipulations that are implemented by the MemoryObjectCache.
Such intermingling of data structures, processing, synchronization and inheritance
makes the virtual memory system very fragile, and changes can be hazardous. We need
a framework structure that separates these aspects, so that it becomes easier to understand and
change.
4.4 What Needs to be Redesigned
We have identified the intermingling of various aspects of the virtual memory system as the
culprit that makes the software brittle.
In the following, we show how to separate these aspects. Then we discuss how the design can be reflected in the program by explicitly representing design features as objects.
The refactoring and reification make it easy to understand the ramifications of adding
new virtual memory features.
4.4.1 Data Structures and Synchronization
First consider data structures and synchronization.
In every operation, given some parameter such as the pair of virtual-address-and-process, the operation decides upon the objects to be visited, collects related
information such as the physical page associated with the virtual address and the file
in which the page contents are stored, and operates on this information. The
objects in the virtual memory system are primarily tables that implement the
associations. For instance, a Domain maps virtual addresses to MemoryObjects, an
AddressTranslation maps virtual addresses to physical addresses, and so on. Changes
to virtual memory either add new maps or change existing maps.
The operations may be invoked concurrently, and many conflict with one another.
The conflict is apparent only when all of the information necessary for an operation is collected. For example, a pagefault operation may conflict with a pageout
operation. That a pagefault conflicts with a pageout is known only when invocations of
MemoryObjectCache::pageFault and MemoryObjectCache::pageOut identify the same
physical page. Adding new capabilities does not change this nature of conflict detection.
Conflicts between operations are resolved by allowing one operation to proceed
while others wait. An operation that proceeds gains exclusive rights to alter the
data structures. Again, this aspect does not vary when new capabilities are added.
The changes to the data are few and predictable: the classic example is that after
pagefault, a virtual page has an associated physical page, and after pageout, there
is no physical page. The result of an operation depends on the current state. For
example, a pagefault operation may allocate and fill a physical page if necessary,
but if the physical page is present, it need only add a virtual address to physical
address mapping to the AddressTranslation. Detailed analysis shows that the logic of
the operations can be easily programmed as a state machine. When new operations
are added, the state machine must be extended.
4.4.2 Interactions
Next consider interactions with other subsystems:
Consider interactions where the virtual memory system makes file or networking
requests as in pageout operations. During such a request, the other system may
invoke virtual memory operations on the same physical page that is part of pageout,
leading to an apparent conflict. Such conflicts due to call cycles must be avoided.
Consider interactions initiated by other systems, such as virtual memory manipulations during interprocess communication. These interactions can be treated as
normal virtual memory operations.
Other interactions are implicit. For example, physical pages are dynamically distributed among many entities in the operating system: file system, process system, user allocated memory, device drivers and so on. When new pages are needed
elsewhere, pages allocated to one entity must be deallocated. The most appropriate
page to be deallocated, and the disposal of its contents, depend on the entity, so
we need to distribute the responsibility for deallocation.
Interactions across machines in distributed shared memory. Most interactions between virtual memory and other subsystems are simply implemented by designing
the appropriate interfaces, because the language compiler takes care of the rest.
Distributed shared memory is different, because here virtual memory systems interact across different machines.
4.4.3 A Solution
We make the aspects discussed above explicit. First consider the call chain and synchronization.
The basic objects of the virtual memory like Domain and MemoryObjectCache implement methods to query and alter table entries.
The call chain for every operation is reified into an Operation object that invokes
the various table methods to implement the operation. For example, pagefault is
implemented with an OpPageFault class.
All the information required to implement an operation is gathered into Parameter
objects. For example, the execution of pagefault begins with the virtual address
and gathers the relevant MemoryObject, MemoryObjectCache, PhysicallyAddressableUnit (an object representing the physical page) and so on. These parameters are
explicitly gathered in ParamPageFault objects.
Every invocation of an operation generates an instance of a Parameter object. These
objects are enqueued and used to detect conflicts between operations with explicit
Conflict classes.
Since only one of several conflicting operations proceeds, the instance of Parameter
for that operation also serves as a token that grants permission to change various
tables as required by the operation.
The next aspect to be made explicit is the logic of the operations.
The states of virtual pages are explicitly represented as state objects. Different
types of state objects encode different states, and the methods of a state object correspond to different operations. For example, a virtual page may be in
two states, PhysicalPage and NoPage. The methods PhysicalPage::pageFault and
NoPage::pageFault implement pagefault handling. If there is no page, NoPage::pageFault
will allocate a physical page, change AddressTranslation and update the hardware,
whereas PhysicalPage::pageFault will simply update the hardware.
The interactions are made explicit as follows:
Interactions between the systems are made explicit by defining Interaction classes
whose methods define the interactions. These are similar in spirit to Operation
classes.
The advantages of this structure may be summarized as follows:
Since call chains are explicit in the Operation objects, changes to old operations
are made by defining new classes rather than changing methods of individual basic
objects as in the traditional design.
Explicit Parameter objects help in precisely defining methods that implement conflict detection and resolution, rather than implicitly encoding it in semaphore hierarchies. Changes that add new conflicts or change the resolution of old conflicts
can be explicitly programmed.
The use of parameter classes as permission tokens greatly simplifies concurrency
control.
New states and changes to the logic of operations can be explicitly described via
Object-Oriented State Machines as described below.
The preceding discussion gives an overview of the unique aspects of our software architecture. In the following, we describe the architecture in greater detail, highlighting design
decisions as design patterns.
4.5 Architecture of the Virtual Memory System
We present the architecture as a series of patterns that are used to solve design problems.
We start from the point of view of users of the virtual memory system. We show
how virtual memory functionality may be exported to users and to other parts of the
operating system. Next we show how to organize the internals of the system by reifying
operations as objects. The design of concurrency control code follows. This completes
the basic aspects of the design.
Three other aspects are taken up afterward. The first is the implementation of virtual
memory operations using object-oriented state machines. This allows us to smoothly
add complex logic for operations with features like copy-on-write and distributed shared
memory. Then we present architectural features for adding interactions with remote
virtual memory systems, and discuss some design issues for resource management.
4.5.1 Exporting Functionality
The first design question is how to export virtual memory functionality to user-level
processes and other subsystems like the file system. The design can be tricky, because
the virtual memory and other subsystems may use one another's services recursively. We
describe the design in two steps.
First consider exporting virtual memory services without the recursive aspect.
Context : The virtual memory system provides services to many entities like user-level processes, file system, process system, external pagers and so on. The virtual
memory system services are implemented by different objects within the system.
Problem : Although the virtual memory services are implemented by different objects within the system, it is vital that other entities do not depend on the internal
structure. If other entities encode knowledge about the internal virtual memory
structure, changing the structure can be difficult. Also, different entities may use
different services provided by the system, and may need to know the internals of
the system to different degrees.
Solution : For each interacting entity, define an Interactor class that describes the
services provided by the virtual memory system. For example, VMInterface is an
Interactor that provides methods like VMInterface::pageFault.
Consequences : The Interactor classes define entry points to the virtual memory system,
and make the dependencies between virtual memory and other systems explicit. This
allows us to change the internal structure of the system without impacting other
subsystems. New services can be provided by extending the Interactors by inheritance. The degree of exposure of the details of the virtual memory system can be
controlled by designing the proper interface. The design is also conceptually elegant, in
that the Interactors define a notion of a single virtual memory subsystem.
A drawback of the design is that there are many interfaces. The programmer must
ensure that service definitions are identical in different interfaces, and that there
are not unmanageably many variations.
Notes : By itself, this is the Facade [GHJV93] pattern. But as we see below, we
need a variation.
Next, we look at the impact of recursive relationships between virtual memory and other
subsystems.
Context : We have to implement virtual memory services that use services from
other subsystems. In turn, the requested services may recursively use virtual memory services. For example, the virtual memory system may request file system
services during pageout, and in turn the file system requests virtual memory manipulations for the disk driver.
Problem : Although the division of the operating system into a set of interacting
subsystems is convenient, it partitions the code for operating system functions
among the entities. Recursive relationships can arise between the facilities provided
by different subsystems. This can make it difficult to understand, change and
optimize the overall system. For example, any changes to the virtual memory
system must guarantee that the file system can safely use the virtual memory
system even when the use is reentrant.
Solution : For each interacting subsystem, define an Interactor class that describes
the services provided by the virtual memory system, and the services requested by
the system. If the two types of services share results or parameters during some
operation, implement the appropriate checks to validate the relationship.
For example, MemoryObjectCache serves as an Interactor between the virtual memory system and the file system. When the virtual memory system invokes pageOut,
it uses MemoryObject::write provided by the file system. MemoryObject::write recursively invokes MemoryObjectCache::pageFault to map physical pages for disk output.
The recursive call is detected by the MemoryObjectCache, so that it does not interfere with the changes made to MemoryObjectCache as part of pageOut processing
that precedes the MemoryObject::write call.
Consequences : The Interactor reduces coupling between the interacting systems.
Explicitly validating that the virtual memory and file system services may invoke
one another recursively prevents changes to either the file system or the virtual
memory system from violating assumptions. It localizes the assumptions that would
otherwise be implicit in the code.
An Interactor may also serve as a convenient point to cache the results of virtual
memory services.
A drawback is that the explicit validation may be difficult to implement, especially if the interacting entities and interactions proliferate. On the other hand, the
proliferation may be an indication that redesign is necessary.
Notes : An Interactor class combines aspects of the Mediator [GHJV93] and
Facade [GHJV93]. If caching is implemented, it may have elements of Memento [GHJV93].
4.5.2 Organizing the Internals
The next design issue is the internal structure of the virtual memory system. We argued
previously that distributing the behavior for virtual memory operations among the basic
objects creates difficulties. The solution is to create Operation and Parameter objects
that use basic objects. The design decisions involved in their design are explained below.
Lastly, we show how Interactors use Operation and Parameter objects to actually implement
the services.
4.5.2.1 Designing Operations
First consider the design issues for Operation objects.
Context : Virtual memory operations such as pagefault involve interactions between
many objects.
Problem : The behavior of virtual memory operations is distributed among many
objects, so the associations between objects influence the behavior code. Adding
new facilities to the virtual memory system can change the existing associations,
and add new behavior. Changing the associations can change existing behavior
as a side effect. When adding new behavior, it can be difficult to decide how to
distribute it among the objects. Moreover, it is tedious and error prone to change
many objects in a system for every new operation.
Solution : Collect the behavior in an Operation object that coordinates the basic
objects. The basic objects need only implement object associations, state inquiry
and state alteration functions.
For instance, in the original virtual memory design, pagefault processing was distributed among methods of Domain, MemoryObject, MemoryObjectCache and so on.
In our architecture, there is an OpPageFault class whose pageFault method manipulates
Domain and other objects during pagefault processing.
Consequences : Explicit Operation objects centralize operation implementations,
making them easier to understand. If associations between objects are changed,
local changes can be made in the Operation objects. New operations are easily
added without changing the basic objects.
On the other hand, the centralized, monolithic behavior can become complex. Then
we need other ways to reduce the complexity.
Notes : In the virtual memory system, Operations indeed become complex. Object-oriented state machines [SC95] were invented to simplify the operations. Thus, we
can successfully use mediators.
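The Operation pattern above can be illustrated with a minimal C++ sketch. It is not the code of the actual system: only the class names OpPageFault, MemoryObjectCache and AddressTranslation come from the text, while the type PageFrame and the methods findFrame, allocateFrame, addMapping and isMapped are hypothetical. The basic objects only query and alter their own tables, and the whole call chain lives in the Operation object.

```cpp
#include <cassert>
#include <map>

using VirtualAddress = unsigned long;
using PageFrame = int;               // hypothetical stand-in for PhysicallyAddressableUnit
const PageFrame kNoFrame = -1;

// Basic objects only query and alter their own tables (method names are assumptions).
class MemoryObjectCache {
public:
    PageFrame findFrame(VirtualAddress page) const {
        auto it = frames_.find(page);
        return it == frames_.end() ? kNoFrame : it->second;
    }
    PageFrame allocateFrame(VirtualAddress page) {
        PageFrame f = nextFree_++;
        frames_[page] = f;
        return f;
    }
private:
    std::map<VirtualAddress, PageFrame> frames_;
    PageFrame nextFree_ = 0;
};

class AddressTranslation {
public:
    void addMapping(VirtualAddress page, PageFrame f) { table_[page] = f; }
    bool isMapped(VirtualAddress page) const { return table_.count(page) != 0; }
private:
    std::map<VirtualAddress, PageFrame> table_;
};

// The Operation object holds the whole call chain for a pagefault.
class OpPageFault {
public:
    // Returns true if a new physical page had to be allocated.
    bool pageFault(VirtualAddress page, MemoryObjectCache& cache,
                   AddressTranslation& translation) const {
        PageFrame f = cache.findFrame(page);
        bool allocated = (f == kNoFrame);
        if (allocated) f = cache.allocateFrame(page);
        translation.addMapping(page, f);   // always (re)install the mapping
        return allocated;
    }
};
```

In this sketch, adding a new operation means adding a new class next to OpPageFault; MemoryObjectCache and AddressTranslation stay unchanged.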
4.5.2.2 Data Management
Next, consider the issues for Parameter objects.
Context : Operations such as pagefault have many parameters such as the virtual
address range, physical pages, process, address translation, memory access permissions, memory objects and so on. The parameters have to be communicated to
other subsystems like the file system, and are useful in detecting conflicts among
operations. Different operations require different parameters.
Problem : When Operation classes invoke methods of basic objects, different methods
need different parameters. Similarly, services provided by other subsystems and
procedures that detect conflicts among concurrent operations all need different
parameters. When the operations change, the parameters change. Therefore, managing the parameters as parameters of method calls is tedious.
Solution : Package all interesting parameters into a Parameter object, and define
update and inquiry methods as well as methods that implement conflict detection.
In our design, there is a VMParameter object that gathers all parameters for operations like pagefaults. When we added distributed shared memory, additional
parameters were added by deriving a DSMParameters class.
Consequences : Parameter objects reduce the large collection of parameters into a
single object that is easier to manage. This makes the definition of services more
uniform and makes it easier to add new parameters for new virtual memory facilities.
It also centralizes operations like conflict detection and makes them explicit in the code.
Parameter objects can become complex if there are too many parameters. If a
parameter object is used as the sole input to a method, it simplifies the interface but
hides details like parameter types and distinctions between readonly and writable
parameters.
Notes : Parameter objects also help in solving synchronization problems and streamlining the interaction of virtual memory and other subsystems.
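A Parameter object with an explicit conflict test might look like the following C++ sketch. Only the names VMParameter and DSMParameters come from the text. The fields, the method conflictsWith, and the conflict rule itself (overlapping pages of the same memory object, with at least one writer) are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fields and methods; only the class names appear in the text.
struct VMParameter {
    int         memoryObjectId;  // memory object the operation touches
    std::size_t firstPage;       // first page of the affected range
    std::size_t pageCount;       // number of pages in the range
    bool        writes;          // does the operation modify the pages?

    VMParameter(int obj, std::size_t first, std::size_t count, bool w)
        : memoryObjectId(obj), firstPage(first), pageCount(count), writes(w) {}
    virtual ~VMParameter() = default;

    // Assumed rule: two operations conflict if they touch overlapping pages
    // of the same memory object and at least one of them writes.
    virtual bool conflictsWith(const VMParameter& other) const {
        if (memoryObjectId != other.memoryObjectId) return false;
        bool overlap = firstPage < other.firstPage + other.pageCount &&
                       other.firstPage < firstPage + pageCount;
        return overlap && (writes || other.writes);
    }
};

// Distributed shared memory adds parameters by derivation.
struct DSMParameters : VMParameter {
    int ownerNode;  // hypothetical: node that currently owns the page
    DSMParameters(int obj, std::size_t first, std::size_t count, bool w, int owner)
        : VMParameter(obj, first, count, w), ownerNode(owner) {}
};
```

Because the test is a method of the Parameter object, a change in the conflict rule is made in one place, or overridden in a derived class, rather than being spread over semaphore hierarchies.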
Finally, we show how requests from the clients of the virtual memory system use
the Operation and Parameter objects.
Context : Interactors like VMInterface define methods like VMInterface::pageFault
that are used by user-level processes. The functions are actually implemented by Operation
classes like OpPageFault that use Parameter objects like VMParameters. Interactors
and Operations carry no state, and may have single instances for the whole system.
Problem : Interactors should be able to invoke different types of Operation classes
and create Parameter objects. Hardcoding the details directly in the methods of Interactors means that we would have to replicate the method for all interactors. Also,
it becomes difficult to change details like how to allocate memory for the parameter objects. We also need to create instances of Interactor classes like VMInterface
without resorting to global variables to store the instance.
Solution : The construction process for invoking an operation is similar for all methods defined by interactors: locate the corresponding Operation class, instantiate the
appropriate Parameter object with the parameters provided by the interacting entity, and pass the Parameter to the Operation.
Define an abstract Factory class that encodes this procedure as the makeProduct
method, and define a concrete class for each variation.
When a class has a single global instance, let the class manage the single instance, providing methods like makeInstance, getInstance, and destroyInstance. This
applies to Interactors, Operations and Factories.
Consequences : Factory classes organize details like memory management involved
in creating objects. But if the details change for some products, it may become
tedious to extend the factory and its subclasses.
Notes : These are the standard patterns Abstract Factory [GHJV93] and Singleton [GHJV93] applied to virtual memory.
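The combination of the two patterns can be sketched in C++ as follows. The names Factory, makeProduct, makeInstance, getInstance, destroyInstance, OpPageFault and VMParameter come from the text. The Operation base class with its run method, the Invocation product type, PageFaultFactory, and the placeholder body of run are assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>

struct VMParameter {                       // hypothetical field
    unsigned long virtualAddress;
};

class Operation {
public:
    virtual ~Operation() = default;
    virtual int run(VMParameter& p) = 0;   // hypothetical entry point
};

// A stateless operation whose single instance is managed by the class itself.
class OpPageFault : public Operation {
public:
    static OpPageFault* makeInstance() {
        if (!instance_) instance_ = new OpPageFault;
        return instance_;
    }
    static OpPageFault* getInstance() { return instance_; }
    static void destroyInstance() { delete instance_; instance_ = nullptr; }
    // Placeholder behavior: report the page number of the faulting address.
    int run(VMParameter& p) override {
        return static_cast<int>(p.virtualAddress / 4096);
    }
private:
    OpPageFault() = default;
    static OpPageFault* instance_;
};
OpPageFault* OpPageFault::instance_ = nullptr;

// Product of the factory: an operation paired with its parameters.
struct Invocation {
    Operation* op;
    std::unique_ptr<VMParameter> param;
    int execute() { return op->run(*param); }
};

class Factory {
public:
    virtual ~Factory() = default;
    virtual Invocation makeProduct(unsigned long virtualAddress) = 0;
};

// One concrete factory per variation; memory allocation details live here.
class PageFaultFactory : public Factory {
public:
    Invocation makeProduct(unsigned long virtualAddress) override {
        Invocation inv;
        inv.op = OpPageFault::makeInstance();
        inv.param.reset(new VMParameter{virtualAddress});
        return inv;
    }
};
```

An Interactor method such as VMInterface::pageFault would then only call makeProduct and execute, so a change in how Parameter objects are allocated is confined to the factory.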
The design presented so far shows how to organize the virtual memory interactions and
the implementation of virtual memory operations.
4.5.3 Concurrency Control
Our next design goal is to clarify the design of concurrency control.
Context : Virtual memory operations query and modify the state of many objects.
When two operations need to modify the state of the same object, we have to
serialize the modifications. The usual way is to associate semaphores with
the basic objects and ensure that the operations interact with the objects in a
hierarchical fashion so that there are no cycles leading to deadlock.
Problem : Some operations may not visit the objects in the same order. For example, pagein begins by visiting Domains, while pageout begins by visiting MemoryObjectCache. Some operations may recursively visit objects more than once,
creating cycles. For example, pageout invokes MemoryObjectCache::pageOut, which
invokes MemoryObject::write, and in turn the disk driver invokes MemoryObjectCache::pageIn (for the kernel). Other operations, like interprocess communication,
may visit multiple Domains and AddressTranslations.
Solution : The use of Operation objects makes the order of object invocation explicit.
Examine all Operation classes and divide them into categories of operations, such
that operations from one category may lead cycles with operations from another
category. Identify operations that may lead to recursive cyclic visits to objects.
Use semaphores to serialize operations across categories. For example, interprocess
communication operations are serialized among one another before they are allowed
to conict with other operations.
Use Parameter objects as tokens to detect recursion. For example, MemoryObject::write passes along a VMParameter object that represents the pageout operation, and also stores it with the MemoryObjectCache that originates the write call.
When the disk driver invokes MemoryObjectCache::pageIn, it passes the same parameter object as a token. The MemoryObjectCache can thus identify the recursion
and avoid deadlock.
The remaining operations operate on objects in a hierarchical fashion. They can
simply use semaphores associated with the objects.
Consequences : The use of Operation and Parameter objects makes the concurrency
control explicit. When new features are added, analyzing the effect of the new
features becomes easier.
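The use of a Parameter object as a token for detecting recursion can be sketched in C++ as below. The names MemoryObjectCache, VMParameter and pageIn come from the text. The methods beginPageOut and endPageOut, the Result enumeration, and the single-owner bookkeeping are assumptions; a real implementation would block on a semaphore where this sketch returns MustWait.

```cpp
#include <cassert>

struct VMParameter { int id; };   // hypothetical content; the object identity is the token

class MemoryObjectCache {
public:
    enum class Result { Proceed, Recursive, MustWait };

    // A pageout stores its Parameter object before calling MemoryObject::write.
    void beginPageOut(const VMParameter* token) { owner_ = token; }
    void endPageOut() { owner_ = nullptr; }

    // Called, for example, by the disk driver with the token it was handed.
    Result pageIn(const VMParameter* token) const {
        if (owner_ == nullptr) return Result::Proceed;    // no pageout in progress
        if (owner_ == token)   return Result::Recursive;  // same operation re-entering
        return Result::MustWait;                          // a different operation
    }
private:
    const VMParameter* owner_ = nullptr;
};
```

The recursive call is recognized by comparing tokens, so the cache can let it through without acquiring a lock that the same operation already holds.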
So far, we have discussed the design of interfaces for interactions between virtual memory and other subsystems, the internal design of virtual memory using Operations and
Parameters, and the design of concurrency control. The remaining aspects of the design
are:
How to use state machines to simplify the design of Operation objects.
Interactions between virtual memory systems on different machines for implementing distributed shared memory.
Programming the dynamic distribution of pages.
4.5.4 Operations Using Object-Oriented State Machines
In this section, we show how to program the code for virtual memory operations in our
architecture. We begin with a basic design pattern for programming state machines. We
then present implementation techniques that make it possible to extend state machines
by inheritance. Subclassing and Composition techniques for state machines are described.
These methods are used to extend a basic virtual memory system to add copy-on-write
and distributed virtual memory.
4.5.4.1 Basic State Machines
Let us begin with the state machines.
Context : The behavior of virtual memory operations can be defined as a change in
the state of the virtual memory data structures. For example, pagefault changes
the state of a virtual page from Unmapped (no mapped physical page) to Mapped
(physical page is mapped to virtual page), while pageout changes it from Mapped
to Unmapped.
Problem : Usually, the state of an object is maintained as values of its instance
variables. If the behavior during a method call depends on the current state, then
the method is programmed using if or case statements. The state is implicit in
the variables, and the transitions are implicit in the variable assignment. Such a
monolithic organization is difficult to understand. If a new state is added, several
cases and methods must be updated together, complicating code maintenance.
Alternatively, an explicit state table can be used. The uniform format makes transitions explicit, but the logic for selecting transitions and actions is still implicitly
programmed as tests and assignments on state variables.
Solution : Represent the state directly using state objects, one object for each state.
The behavior of the actual object is implemented as methods of the state objects.
The object maintains a pointer to the current state object and delegates methods
to that object. Methods of the state objects return the next state.
For example, in the original design, MemoryObjectCache maintains tables of pages
and their current state, and implements pageFault and pageOut using conditional
statements. In our design, these methods are delegated to Mapped and Unmapped
objects, and only pointers to these objects are maintained in the MemoryObjectCache.
Consequences : The states and transitions are explicit, and the appropriate transition
is selected by examining a single variable. The changes to variables that define
the state are grouped within the methods of state objects. The organization simplifies the task of adding states or making other changes. In addition, such changes
do not affect the delegating object. For instance, adding new states VMPageReadOnly and
VMPageWritable instead of VMHasPage will change the methods of VMNoPage, but
not affect MemoryObjectCache.
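The basic pattern can be sketched in C++ as follows. The state names Mapped and Unmapped and the methods pageFault and pageOut come from the text. The base class PageState and the delegating class VirtualPage are hypothetical, and the actions (allocating a page, updating the hardware) are omitted so that only the transitions remain.

```cpp
#include <cassert>

// Every method of a state object returns the next state.
class PageState {
public:
    virtual ~PageState() = default;
    virtual PageState* pageFault() = 0;
    virtual PageState* pageOut() = 0;
};

class Mapped : public PageState {
public:
    static Mapped* Instance() { static Mapped s; return &s; }
    PageState* pageFault() override { return this; }  // already mapped: a loop
    PageState* pageOut() override;                    // defined after Unmapped
};

class Unmapped : public PageState {
public:
    static Unmapped* Instance() { static Unmapped s; return &s; }
    PageState* pageFault() override { return Mapped::Instance(); }
    PageState* pageOut() override { return this; }    // nothing to page out
};

PageState* Mapped::pageOut() { return Unmapped::Instance(); }

// The delegating object keeps only a pointer to the current state object.
class VirtualPage {
public:
    void pageFault() { state_ = state_->pageFault(); }
    void pageOut()   { state_ = state_->pageOut(); }
    const PageState* state() const { return state_; }
private:
    PageState* state_ = Unmapped::Instance();
};
```

VirtualPage contains no conditional statements on the page state; adding a state changes the state classes but not the delegating object.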
Representing state using objects (the State [GHJV93] pattern) simplifies Operation
classes. When new states are added, or existing states change, we implement the changes
by creating new state classes and suitably altering the methods for existing state objects.
For example, consider the state machine in Figure 4.1 for simple virtual memory, and
the state machine in Figure 4.2 that implements copy on write. In Figure 4.1, a virtual
memory page may be mapped into physical memory so that it is accessible. The page
may be unmapped to store it on backing store and release physical memory for other use.
(For simplicity, the state machine diagrams do not show loops, i.e., transitions that do not change state.)
[Figure 4.1: Page States in a Simple Virtual Memory System. States: Mapped, Unmapped. Transitions: pageAccess, pageOut.]
[Figure 4.2: Page States for Copy-On-Write. States: RMapped, WMapped, RUnmapped, WUnmapped. Transitions: pageRead, pageWrite, pageOut, makeCopy.]
The state machine in Figure 4.2 supports copy-on-write. COW allows data created by
one process to be shared with a different process without requiring the data to be copied.
Instead, the physical pages on which the data resides are shared between processes until
the processes modify them. Data is "copied" by mapping the associated physical page
into the virtual address space of the target process with read-only access. However,
upon a write access, the "copied" data is duplicated by copying the page to a new
physical page and changing the read-only access to write access. The two figures show
several similarities: for example, states RMapped and WMapped are similar to Mapped, and
methods pageRead and pageWrite are similar to pageFault. The two figures have differences
corresponding to the additional behavior: for example, makeCopy is added and pageWrite
causes transitions from RMapped to WMapped.
4.5.4.2 Derived State Machines
One way to implement the copy-on-write state machine is to copy the code of the original
state machine and alter it as necessary. However, this makes code maintenance difficult.
A better alternative is to express the relationship directly, by considering the copy-on-write machine to be a subclass of the original machine. For example, can both
instances of pageOut in Figure 4.2 be defined by inheriting the pageOut method? If so,
we could program the copy-on-write machine as follows:
Derive pageRead and pageWrite from pageFault , and
Program new methods like makeCopy .
One solution is as follows:
Context : We have implemented the virtual memory state machines using the state
pattern. We want to derive both WMapped::pageOut and RMapped::pageOut by inheriting from the method Mapped::pageOut.
Problem : The pageOut method is programmed as follows:
Mapped::pageOut(Page * p)
{
    p->flushMMU(); p->writeToDisk();
    return Unmapped::Instance();   // next state is named directly
}
We might derive a class WMapped from the class Mapped, hoping to reuse Mapped::pageOut. But now there is a problem: where Mapped::pageOut returns
Unmapped::Instance, WMapped::pageOut must return WUnmapped::Instance. Although behavior in the two states is similar, the state transitions differ for the COW
machine. If we attempt to reuse Mapped::pageOut by redefining Unmapped::Instance
to return WUnmapped::Instance, we find that RMapped::pageOut cannot reuse Mapped::pageOut,
as it must return RUnmapped::Instance.
Solution : We use indirection to resolve the problem. Return the next state indirectly through a table of states called StateMap. That is, pageOut is programmed
as follows:
Mapped::pageOut(Page * p)
{
    p->flushMMU(); p->writeToDisk();
    return map->Unmapped();   // next state is looked up through the StateMap
}
Now, both WMapped and RMapped are derived from class Mapped, but the map
variable is initialized differently in the two classes. The map in WMapped returns
WUnmapped for the invocation map->Unmapped(). In RMapped, it returns RUnmapped instead.
The general principle is that the StateMap together with the implicit virtual function table (VTable [Str91]) for each state object, expresses the relationships between
state transitions of the base and derived machines. Class WMapped is derived from
Mapped, while its map is initialized to return WMapped and WUnmapped. Thus,
state transitions from Mapped to Unmapped in the base machine map to transitions
from WMapped to WUnmapped.
Consequences : New state machines can be derived from old state machines in a
systematic way. Actions from base state machines can be reused in the derived
machine. But initializing the StateMaps is tedious. In our system, we solve this
problem by defining a small language to express the relationship between base and
derived machines.
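The StateMap indirection can be sketched in compilable C++ as follows. The class names Mapped, WMapped, RMapped, WUnmapped, RUnmapped and StateMap, and the calls map->Unmapped() and pageOut, follow the text. The base class State, the base state class Unmapped with its pageAccess method, the fields of StateMap, and the CowMachine wiring struct are assumptions, and the sketch omits the page actions as well as pageWrite and makeCopy.

```cpp
#include <cassert>

class State;

// Table of states: base-machine methods name their successors through it.
struct StateMap {
    State* mappedState = nullptr;
    State* unmappedState = nullptr;
    State* Mapped() const { return mappedState; }
    State* Unmapped() const { return unmappedState; }
};

class State {
public:
    explicit State(const StateMap* m) : map(m) {}
    virtual ~State() = default;
    virtual State* pageOut() { return this; }     // default: loop
    virtual State* pageAccess() { return this; }  // default: loop
protected:
    const StateMap* map;
};

// Base machine: transitions are written once, in terms of the map.
class Mapped : public State {
public:
    explicit Mapped(const StateMap* m) : State(m) {}
    State* pageOut() override { return map->Unmapped(); }
};

class Unmapped : public State {
public:
    explicit Unmapped(const StateMap* m) : State(m) {}
    State* pageAccess() override { return map->Mapped(); }
};

// Derived (copy-on-write) states inherit the base methods unchanged.
class WMapped : public Mapped {
public: explicit WMapped(const StateMap* m) : Mapped(m) {}
};
class RMapped : public Mapped {
public: explicit RMapped(const StateMap* m) : Mapped(m) {}
};
class WUnmapped : public Unmapped {
public: explicit WUnmapped(const StateMap* m) : Unmapped(m) {}
};
class RUnmapped : public Unmapped {
public: explicit RUnmapped(const StateMap* m) : Unmapped(m) {}
};

// Wiring: the two families share code but use differently initialized maps.
struct CowMachine {
    StateMap wMap, rMap;
    WMapped wMapped{&wMap};
    WUnmapped wUnmapped{&wMap};
    RMapped rMapped{&rMap};
    RUnmapped rUnmapped{&rMap};
    CowMachine() {
        wMap.mappedState = &wMapped;  wMap.unmappedState = &wUnmapped;
        rMap.mappedState = &rMapped;  rMap.unmappedState = &rUnmapped;
    }
    CowMachine(const CowMachine&) = delete;  // the maps hold pointers into this object
};
```

The constructor of CowMachine is the kind of tedious initialization that the text proposes to generate from a small language.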
The technique of using StateMaps can be easily extended to implement composition and
delegation between state machines. In the virtual memory system, we use composition
extensively to implement distributed shared memory consistency protocols. We briefly
present an example, and comment on the implementation. Other features of object-oriented state machines are presented in [SC95].
[Figure 4.3: DSM-VM State Machine. Labels: DMapped, DUnmapped, Remote, Null; pageAccess, pageOut, remAccess.]
[Figure 4.4: DSM-NET State Machine. Labels: Quiescent, Send, Fetch; getPage, herePage, ackPage.]
[Figure 4.5: DSM Composite State Machine. Labels: MappedQ, UnmappedQ, RemoteQ, FetchN, SendN; pageAccess, pageOut, remAccess, herePage, ackPage.]
4.5.4.3 Composing State Machines
State machines are composed to combine behaviors dened in component machines.
We demonstrate composition by constructing a distributed shared memory (DSM) protocol machine (Figure 4.5) out of a virtual memory machine (Figure 4.3) and a networking
machine (Figure 4.4). DSM [SMC90] provides the illusion of a global shared address
space over networks of workstations, whose local memories are used as "caches" of the
global address space. The caches have to be kept consistent: a simple approach allows
only one machine to access a shared page at a time. If another machine attempts to
access that page, its virtual memory hardware intercepts the access, and its fault handler
fetches the page from the current page-owner. Thus, behavior for DSM has VM and networking aspects. We define VM and networking behavior using separate state machines,
and compose them to get a DSM machine.
DSM-VM Machine: In a DSM system, pages may be either DMapped, DUnmapped or
Remote (Figure 4.3). The DMapped and DUnmapped states are inherited from the
simple VM machine (Figure 4.1). State Remote represents a page on some remote
machine, and remAccess defines actions for pages accessed by a remote machine.
The transitions are defined as though no networking were necessary. (The special
state Null ignores all VM actions; it is used in composition. We always create Null
and Error states for every state machine.)
DSM-NET Machine: The networking machine (Figure 4.4) implements a trivial protocol
that sends a page to a remote machine, or gets one from a remote machine.
It handles details like fragmentation and sequence numbering.
DSM Machine: In the composite DSM Machine (Figure 4.5), suffix Q indicates that
in the composite state, the DSM-NET state is Quiescent, and suffix N indicates that
the VM state is Null. We implement transitions to and from RemoteQ using the
networking machine. MappedQ and UnmappedQ inherit behavior from the DSM-VM
machine.
Methods of the composite machine reuse behavior defined in methods of component
machines by initializing StateMaps so that the states returned from component methods
are actually composite states.
For example, in the DSM-NET machine, Quiescent::herePage returns the NET state
Send, but when invoked from the composite state MappedQ, it returns the composite
state SendN. In turn, ackPage when invoked from SendN, returns RemoteQ instead of
Quiescent. RemoteQ later gets used as a VM state.
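The herePage and ackPage example can be sketched in C++ under simplifying assumptions. The state names Quiescent, Send, MappedQ, SendN and RemoteQ come from the text. The NetMap table, the State base class, and the Machines wiring struct are hypothetical, and the composite states here reuse only the NET component (the VM component and the actual message handling are left out).

```cpp
#include <cassert>

struct State {
    virtual ~State() = default;
    virtual State* herePage() { return this; }  // default: ignore the event
    virtual State* ackPage()  { return this; }
};

// StateMap for the NET actions: successors are named indirectly.
struct NetMap {
    State* send = nullptr;
    State* quiescent = nullptr;
};

struct Quiescent : State {
    explicit Quiescent(const NetMap* m) : map(m) {}
    State* herePage() override { return map->send; }
    const NetMap* map;
};

struct Send : State {
    explicit Send(const NetMap* m) : map(m) {}
    State* ackPage() override { return map->quiescent; }
    const NetMap* map;
};

// Composite states reuse the NET methods; only their maps differ.
struct MappedQ : Quiescent { explicit MappedQ(const NetMap* m) : Quiescent(m) {} };
struct SendN   : Send      { explicit SendN(const NetMap* m) : Send(m) {} };
struct RemoteQ : State {};

struct Machines {
    NetMap netMap, dsmMap;
    Quiescent quiescent{&netMap};
    Send send{&netMap};
    MappedQ mappedQ{&dsmMap};
    SendN sendN{&dsmMap};
    RemoteQ remoteQ;
    Machines() {
        // Stand-alone NET machine: Quiescent -> Send -> Quiescent.
        netMap.send = &send;   netMap.quiescent = &quiescent;
        // Composite machine: MappedQ -> SendN -> RemoteQ.
        dsmMap.send = &sendN;  dsmMap.quiescent = &remoteQ;
    }
    Machines(const Machines&) = delete;  // the maps hold pointers into this object
};
```

The same herePage and ackPage code runs in both machines; the map decides whether the result is a NET state or a composite state.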
4.5.5 Implementing Remote Interactions
The logic of consistency protocols for distributed shared memory is implemented using
state machines. But in addition to the logic, the virtual memory system has to interact
with different machines.
Consider the simple consistency protocol depicted in Figure 4.5. When there is a
pagefault, the pagefault operation visits objects like Domain, MemoryObject, MemoryObjectCache, and eventually invokes pageFault on some state object. If the page resides on
a remote machine, Remote::pageFault is invoked. It must determine the remote machine
that has the page (and is in state DMapped), contact it with the identifiers for the virtual
page (i.e., memory object, offset within the object), retrieve the data, and update the
local virtual memory data structures. It must suspend processing when waiting for the
data, and proceed after the page is retrieved across the network.
By virtue of the internal structure of our system, all the necessary information to
locate the remote page, retrieve data, and operate on the local data structures is contained
in the Parameter object associated with the operation. Therefore, we can implement the
remote interaction by transmitting the parameter object to the remote machine. We
ensure that the reply also contains the parameter object, together with the contents
of the page, so that we can simply continue processing when the reply arrives. Thus,
remote interactions fit in smoothly with our basic structure. The key design decisions
are examined below in greater detail.
4.5.5.1 Continuations
First we show how to use continuations for efficient remote interactions.
Context : Consider the usual way of implementing the execution of pagefaults in a
distributed shared memory system. When the user process faults, a thread in the
kernel begins pagefault processing by executing the methods of the OpPageFault.
Eventually, the operation requires page contents that are on a remote system. The
thread initiates a remote request and blocks waiting for the reply. On the remote
side, some server thread picks up the request. That thread must retrieve the page
contents, either from memory or from disk if the page has been paged out. There
may be other operations demanded by the memory consistency protocol.
Problem : An operating system that supports networking, distributed file systems, and
distributed shared memory performs considerable concurrent processing and may
require many threads. But threads are operating system resources managed by the
kernel. They contain the execution stack, data for schedulers, and tie up slots in
kernel data structures. Therefore threads are too expensive to be left waiting for
activities to complete.
Furthermore, suppose there are concurrent pagefaults such that pagefault processing can be "batched" together (for instance, faults on adjacent pages from different
processes). If a thread is dedicated to each operation, the threads for the two operations will block separately; batching could be implemented in some ad hoc fashion
at some network layer. The issue is that the system has knowledge about memory
operations (as opposed to generic threads) so that operations can be intelligently
scheduled in ways different from generic thread scheduling. To exploit this knowledge, we should dissociate threads from operations.
Solution : Instead of threads, use Parameter objects to implement continuations. A
continuation [AS96] at any point of execution describes how to continue the processing. It is an object that contains all the necessary data and a pointer to the code
that continues the operation.
In our architecture, Parameter objects contain all the data necessary for a virtual
memory operation. We add one more parameter to use it as a continuation.
Consider an operation like pagefault that starts on one machine, waits until a
remote reply is received, and continues processing. We divide the operation into
two methods, one for use prior to page fetching, another for use after the page is
received. The Parameter object for the operation has a variable that points to the
second method. When the first method is completed, processing can be continued
given just the Parameter object.
When a thread T that executes the operation completes the first method, it adds
the Parameter object to a queue, schedules a remote request, and may then pick up any
other task. When enough replies are received, the networking driver (or a thread
dedicated to reply processing) will schedule a thread U to pick up where T left off.
A similar scheme can be used on the remote machine to process incoming requests.
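The division of a pagefault into two methods can be illustrated with a minimal sketch. It is not the Choices implementation: the class names, the `pending` dictionary (standing in for the queue of parked Parameter objects), and the `network` queue are all assumptions made for illustration.

```python
from collections import deque


class Parameter:
    """Hypothetical Parameter object: operation data plus a continuation slot."""

    def __init__(self, page_number):
        self.page_number = page_number
        self.page_contents = None   # filled in when the reply arrives
        self.continuation = None    # method that resumes the operation


class PageFault:
    """A pagefault operation split into two methods around the remote fetch."""

    def __init__(self):
        self.resident = {}          # page number -> contents

    def before_fetch(self, param, pending, network):
        # First method: record how to continue, park the Parameter, send request.
        param.continuation = self.after_fetch
        pending[param.page_number] = param
        network.append(param.page_number)
        # The calling thread returns here and is free to pick up other work.

    def after_fetch(self, param):
        # Second method: runs on whichever thread handles the reply.
        self.resident[param.page_number] = param.page_contents


def deliver_reply(pending, page_number, contents):
    """Reply handler: find the parked Parameter and resume the operation."""
    param = pending.pop(page_number)
    param.page_contents = contents
    param.continuation(param)


pending = {}          # Parameter objects waiting for replies
network = deque()     # stand-in for the outgoing request queue
fault = PageFault()
fault.before_fetch(Parameter(7), pending, network)
fault.before_fetch(Parameter(8), pending, network)
# Replies may arrive in any order; no thread was blocked in between.
deliver_reply(pending, 8, b"page-8")
deliver_reply(pending, 7, b"page-7")
```

Because all state needed to resume lives in the Parameter object, the thread that runs `after_fetch` need not be the one that ran `before_fetch`.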
Consequences : Continuations do not consume slots in kernel structures, so they
are cheap to create and destroy. As fewer threads are used, scheduling and context
switching overhead is reduced. Based on the data in the Parameter objects, operations
can be batched and redundant operations eliminated.
A drawback is that all operations have to be divided into artificial methods, unless
the implementation language supports the creation and use of continuations.
Next we consider message demultiplexing.
4.5.5.2 Active Messages
Efficient message demultiplexing is achieved by using Parameter objects as active messages [vECGS92].
Context : A message arrives through the network drivers as a chunk of uninterpreted
data. The receiver has to interpret the data and take any action requested. In
the case of distributed shared memory, the messages are typed, and the address space and
memory object identifiers are included. The types indicate the desired action,
and the object identifiers indicate the data structures that are affected. When the
remote request is complete, the originator gets a reply. The reply also contains
identifiers that allow it to be matched with the request.
Problem : The usual way of interpreting a message uses some form of table lookup
to get at the action requested by the message and the data structures to be affected.
Table lookups can be expensive, especially because the tables are often protected
by semaphores that serialize concurrent accesses.
Solution : We can avoid interpretation by directly embedding pointers to methods and objects in the message. For example, instead of having to interpret message
types and decide the action, the program counter for the action code can reside in
the message. If the networked computers are of the same type, and the action is
described by kernel code that always resides at the same address, then the receiving thread can immediately jump to that address. If the action and data structure
addresses differ, they can be determined at some prior time (e.g., as part of setup).
Thus, Parameter objects that contain the continuation information can also serve
as messages.
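A minimal sketch of the idea follows, simulated within a single process. Python object references stand in for the code and data addresses that a real active message would carry; the `ActiveMessage` and `MemoryObject` names are hypothetical and are not taken from Choices.

```python
class ActiveMessage:
    """Hypothetical message that carries its handler and target directly."""

    def __init__(self, handler, target, payload):
        self.handler = handler    # stands in for the address of the action code
        self.target = target      # stands in for the data structure address
        self.payload = payload


class MemoryObject:
    def __init__(self):
        self.pages = {}

    def install_page(self, payload):
        number, contents = payload
        self.pages[number] = contents


def receive(message):
    # No type decoding and no table lookup: jump straight to the handler.
    message.handler(message.target, message.payload)


# "Setup": the sender learns the receiver-side handler and object once.
remote_object = MemoryObject()
handler = MemoryObject.install_page

receive(ActiveMessage(handler, remote_object, (3, b"data")))
```

The receiver does no lookup at all; the cost of resolving the handler and the object is paid once, at setup time.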
A message received from the network is usually in a data format resulting from the
serialization of data into a sequence of bytes. If the machine formats are different
from the network format, the data has to be interpreted. We can reduce the
interpretation overhead by wrapping the uninterpreted data in an object that has
the same interface as a Parameter, but interprets the raw data on demand, when
an Operation queries or sets parameters.
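The on-demand wrapper might look like the following sketch. The wire layout (two big-endian 32-bit integers), the field names, and the `get`/`set` interface are assumptions chosen for illustration, since the actual Parameter interface is not shown here.

```python
import struct


class Parameter:
    """Hypothetical in-memory Parameter with already-decoded fields."""

    def __init__(self, address_space, page_number):
        self._fields = {"address_space": address_space,
                        "page_number": page_number}

    def get(self, name):
        return self._fields[name]

    def set(self, name, value):
        self._fields[name] = value


class RawParameter:
    """Same interface as Parameter, but decodes network bytes on demand."""

    # Assumed wire layout: two big-endian unsigned 32-bit integers.
    _OFFSETS = {"address_space": 0, "page_number": 4}

    def __init__(self, raw):
        self._raw = raw
        self._decoded = {}        # fields decoded or overwritten so far

    def get(self, name):
        if name not in self._decoded:
            offset = self._OFFSETS[name]
            (value,) = struct.unpack_from(">I", self._raw, offset)
            self._decoded[name] = value
        return self._decoded[name]

    def set(self, name, value):
        self._decoded[name] = value


def describe(param):
    # An Operation that works with either kind of parameter object.
    return "space %d page %d" % (param.get("address_space"),
                                 param.get("page_number"))


wire = struct.pack(">II", 2, 40)
lazy = RawParameter(wire)
```

An Operation that reads only one field pays for decoding only that field; fields it never queries are never interpreted.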
Consequences : When many messages arrive at the network device driver, it has to
determine the recipient of each. Active messages speed up this demultiplexing.
4.5.6 Dynamic Page Distribution
In the conventional architecture, operating system pages are distributed among various
subsystems and MemoryObjectCaches. Some pages are permanently allocated: for instance, pages for kernel data structures. Other subsystems such as the process system
or the file system may request and return pages dynamically. But most pages are dynamically allocated by MemoryObjectCaches from a Store that manages physical pages.
A pageout daemon watches the Store to detect excess allocation and periodically visits
MemoryObjectCaches to preempt pages. MemoryObjectCaches define policies for selecting
the least desirable pages, which are given up when requested by the pageout daemon. But page
preemption can be expensive, because it may conflict with pagein. Similarly, dynamic
page allocation can be expensive in subsystems like network drivers, where it is undesirable for
the driver to pause instead of delivering traffic. Resource management overhead can be
reduced by spreading it over regular computations. The Resource Exchanger [SC96]
is used in our architecture to make resource management more efficient.
Context : There are dynamically allocated resources like pages used by different
allocators in an operating system. The particular resource used by an allocator is
not important; only the quantity matters. The resources can be preempted from
an allocator if necessary. We want to avoid unfair distribution of resources among
allocators; at the same time, some allocators may have a greater claim than others.
The resources should be distributed according to need.
Problem : A common approach to managing preemptable dynamic resources is to
run a daemon process that reclaims them from the allocators. However, this means
that an allocator that needs a resource may have to wait while preemption is occurring. Also, preemption may cause allocators to suspend operations until preemption
is completed. Such pauses are often unacceptable.
Solution : We interleave allocation and deallocation of resources, so that the allocator is not drained of resources unless absolutely necessary.
For example, consider a network driver that uses pages to receive messages. After
a message is received, it must hand the page over to some server for processing: for
instance, assembling fragments. While the page is being used, if the driver's page
pool drains of pages, it may need to allocate new pages, wasting time needed for
communication. We can avoid this if instead of giving up the page to the server,
the driver exchanges a page with the server. Thus, allocation and deallocation are
interleaved. This means that the server needs to preallocate at least one page.
Multiple servers may interact with the driver in this manner. Servers maintain
their own pools of pages ready to exchange with the driver. If a server expects
bursty traffic, it preallocates pages. The number of pages given to a server depends
on its credit with the memory system. The credit may be preassigned (for example,
a video server would have greater credit than an audio server), or it may vary. If a
server runs out of credit and buffers, then the driver drops that server's packets, throttling
resource hogs.
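The exchange between the driver and its servers can be sketched as follows. This is an illustration under simplifying assumptions, not the Resource Exchanger of [SC96]: pages are plain strings, credit is modeled as the number of preallocated spare pages, and the `Driver` and `Server` classes are hypothetical.

```python
class Server:
    """Hypothetical server holding spare pages to exchange with the driver."""

    def __init__(self, name, credit):
        self.name = name
        self.spare = ["%s-spare-%d" % (name, i) for i in range(credit)]
        self.inbox = []

    def exchange(self, page, data):
        # Take the full page and hand back an empty one, if any is left.
        if not self.spare:
            return None
        self.inbox.append((page, data))
        return self.spare.pop()

    def finish_one(self):
        # Processing a page makes it available for a later exchange.
        page, _data = self.inbox.pop(0)
        self.spare.append(page)


class Driver:
    def __init__(self, pages):
        self.pool = list(pages)
        self.dropped = 0

    def receive(self, server, data):
        page = self.pool.pop()
        replacement = server.exchange(page, data)
        if replacement is None:
            # Server is out of credit and buffers: drop and reuse the page.
            self.dropped += 1
            self.pool.append(page)
        else:
            self.pool.append(replacement)


driver = Driver(["p0", "p1"])
video = Server("video", credit=3)
audio = Server("audio", credit=1)
for packet in range(3):
    driver.receive(video, packet)
    driver.receive(audio, packet)
```

In this run the driver's pool never shrinks, the video server absorbs all three packets, and the audio server, with a credit of one page, has its second and third packets dropped.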
MemoryObjectCaches also use a similar scheme. Every cache has a credit for a
number of pages. During pagefault processing, the cache picks pages that can be
returned if it is approaching the credit limit. If page contents need not be saved to
backing store, it may reuse the page internally; otherwise, the page will be returned
to the Store. Any deficit in pages is then made up by allocating from the Store.
Consequences : Resource exchange reduces the need for resource preemption. It
also reduces the time an allocator may have to wait for a resource. But resource exchange requires enough resources that every allocator has at
least one resource to exchange. Otherwise, we must fall back on a standard preemption
scheme.
4.6 Summary
This chapter described a new architecture for building virtual memory systems. The
architecture has the following primary attributes.
It separates data structures from operations over data structures. Basic objects
of a virtual memory system implement various types of tables. Typical operations
manipulate the objects in groups. Expressing these manipulations in a centralized
manner makes it easier to understand them. By the same token, it becomes easier
to evaluate the effects of new operations and data structures.
It reifies operations and operation parameters into objects. This makes the operations and their effects explicit. Also, the reified objects can be used as continuations to program remote interactions without overuse of threads. Parameter
objects also serve as active messages for fast remote interactions. The operation
objects make the order of basic object invocation explicit, so that it is easier to
verify the correctness of concurrency control. Moreover, parameter objects can be
used as concurrency control tokens.
It uses object-oriented state machines to program the logic of the operations. Therefore, code for new operations can be incrementally added by inheritance and composition, dramatically improving code reuse. Structuring operations as state machines
also makes the logic easier to understand.
It improves resource management by spreading resource allocation and deallocation
over regular computations. As a result, the need for preemption is reduced.
Experience with the Choices system has demonstrated the worth of the architecture. We began with a virtual memory system without copy-on-write support and with only rudimentary distributed shared memory. The system with the new architecture, which adds copy-on-write support
and different consistency protocols, was 30% smaller without loss of performance.
Chapter 5
Conclusion
This thesis is concerned with the development of a theoretical basis and a practical architecture for building distributed systems. Our example has been the development of
distributed shared memory protocols. We have developed a new protocol design method,
novel distributed shared memory protocols, and a flexible architecture for object-oriented
systems that support concurrent operations on groups of objects, and interact with remote systems. The architecture has been used to implement a virtual memory system
that supports distributed shared memory.
5.1 Summary
In Chapter 2, we presented a method for synthesizing process coordination protocols.
Our method shows how to structure the design trajectory for protocols. We begin with
high-level protocols that use abstract communication operators. These protocols are easy
to analyze, so system-wide verification is conducted at this level. The next step is to develop implementations of the abstract communication operators. These implementations
are developed in a notation that hides the details of communication media, but allows
the designer to express how a process can control the execution of another process. We
presented conditions that these operator implementations must obey so that they can be
composed to implement the full protocol. Because of these conditions, the protocol implementation is guaranteed to replicate the behaviors specified in the original specification.
The last step is to translate the second-level implementations into a formalism that can
be easily implemented using shared memory or message passing programs. The form
of the third-level implementations ensures that their composition also implements the
original specifications correctly.
At each level, the protocols and subprotocols we encounter have small state spaces.
The original specification is succinct due to its abstractness, while the successive steps
look only at parts of the original protocol. As a result, verification tools that use exhaustive search are effective in validating the protocols.
The implementations of the abstract communication operators constitute a standard
library that can be used in future protocol designs.
Thus, we have developed the basis for an effective method for synthesizing protocols.
In Chapter 3, we developed consistency protocols that implement an efficient distributed shared memory for computers connected with wide-area interconnects. We
showed that communication related to synchronization makes it difficult to use distributed shared memory when the communication latency is high. This is because processes contend with one another for access to synchronization data structures. We can
reduce this contention if we can anticipate requests by processes for the data computed
within a synchronization construct. The performance results showed that this approach
results in good performance over wide area networks.
We also showed how our design method can help guide the development of the protocols by analyzing protocols at various levels of detail.
In Chapter 4, we presented a software architecture used to develop a virtual memory
system that supports our distributed shared memory protocols. But the architecture is
considerably more general, in that it can be applied wherever object-oriented systems
involve concurrent operations over groups of objects. We showed how such systems
can be designed so that the objects, operations, and synchronization aspects can be
separated. This separation means that adding new objects and new operations is easier,
because the relationships between objects and the concurrency control code are explicit.
We demonstrated how the operations can be extended smoothly using continuations so
that they can affect objects on remote machines. Another feature of the architecture
is the use of object-oriented state machines that allow complex, state-based logic to be
structured to increase reuse and permit systematic extensions.
In summation, this research has resulted in improvements in protocol design and implementation techniques. We have also shown that distributed shared memory can be
useful over wide-area networks.
5.2 Future Research
In this thesis, we have barely begun the development of the protocol synthesis method.
Future research is needed to extend the power of our specification language, and experience
is needed to determine the evolution of our notations. Verification tools have to be
adapted so that our models can be analyzed. A library of standard replacements also
needs to be developed. Another direction is to adapt our approach to notations like
LOTOS.
Distributed shared memory consistency protocols might prove to be useful for maintaining consistency over Web documents and other distributed data. A standard library
for distributed shared memory can be developed, much like the MPI and PVM message
passing libraries.
We believe that the software architecture we have proposed is useful for applications
such as workflow. The principles developed for the architecture, like object-oriented state
machines, the systematic use of first-class representations for operations, and the use of
continuations can be applied to improve operating system design.
Bibliography
[AF92]
Hagit Attiya and Roy Friedman. A correctness condition for high-performance multiprocessors. In Proceedings of the 24th ACM Symposium
on the Theory of Computing, pages 679-690, 1992.
[AHJ91]
Mustaque Ahamad, Phillip W. Hutto, and Ranjit John. Implementing and
programming causal distributed shared memory. In Proceedings of the 11th
International Conference on Distributed Computing Systems, pages 274-281,
May 1991.
[AHN+93] Mustaque Ahamad, Phillip W. Hutto, Gil Neiger, James E. Burns, and
Prince Kohli. Causal memory: Definitions, implementation and programming. Technical Report GIT-CC-93/55, Georgia Institute of Technology,
1993.
[Alp86]
Bowen Alpern. Proving Temporal Properties of Concurrent Programs: A
Non-Temporal Approach. PhD thesis, Cornell University, February 1986.
[And90]
Thomas E. Anderson. The performance of spin-lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed
Systems, 1(1):6-16, January 1990.
[AS96]
Harold Abelson and Gerald Jay Sussman. Structure and Interpretation of
Computer Programs. M.I.T. Press, Cambridge, Mass, 1996.
[BvdLV95] Tommaso Bolognesi, Jeroen van de Lagemaat, and Chris Vissers. LOTOSphere: software development with LOTOS. Kluwer Academic Publishers,
1995.
[BZ83]
Daniel Brand and Pitro Zafiropulo. On communicating finite-state machines.
Journal of the ACM, 30(2):323-342, April 1983.
[BZS93]
Brian N. Bershad, Matthew Zekauskas, and Wayne A. Sawdon. The midway
distributed shared memory system. In IEEE Computer Society International
Conference, pages 528-537, 1993.
[Cam74]
R.H. Campbell. The Specification of process synchronization by Path-Expressions. In Lecture Notes in Computer Science, pages 89-102, 1974.
[Cam76]
Roy Harold Campbell. Path Expressions: A technique for specifying process
synchronization. PhD thesis, University of Newcastle Upon Tyne, August
1976.
[CBZ95]
John B. Carter, John K. Bennett, and Willy Zwaenepoel. Techniques for
reducing consistency-related communication in distributed shared memory
systems. ACM Transactions on Computer Systems, 1995. To appear.
[CES86]
E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of
finite-state concurrent systems using temporal logic specifications. ACM
Transactions on Programming Languages and Systems, 8(2):244-263, April
1986.
[CM86]
K. M. Chandy and J. Misra. How processes learn. Distributed Computing,
1:40-52, 1986.
[DCM+90] Partha Dasgupta, R. C. Chen, S. Menon, M. Pearson, R. Ananthnarayanan,
M. Ahamad, R. Leblanc, W. Applebe, J. M. Bernabeu-Auban, P. W. Hutto,
M. Y. A. Khalidi, and C. J. Wilenkloh. The design and implementation
of the Clouds distributed operating system. Computing Systems Journal,
Winter 1990.
[Dil96]
David L. Dill. The Murφ verification system. In 8th International Conference
on Computer Aided Verification, pages 390-393, July/August 1996.
[DKCZ93] Sandhya Dwarkadas, Pete Keleher, Alan L. Cox, and Willy Zwaenepoel.
Evaluation of release consistent software distributed shared memory on
emerging network technology. In Proceedings of the 20th International Symposium on Computer Architecture, 1993.
[DSB86a] Michael Dubois, Christoph Scheurich, and Faye Briggs. Memory access dependencies in shared-memory multiprocessors. In International Symposium
on Computer Architecture, pages 434-442, May 1986.
[DSB86b] Michael Dubois, Christoph Scheurich, and Faye Briggs. Memory access dependencies in shared-memory multiprocessors. In International Symposium
on Computer Architecture, pages 434-442, May 1986.
[FHMV95] Ronald Fagin, Joseph Halpern, Yoram Moses, and Moshe Vardi. Knowledge-based programs. In Proceedings of the 14th ACM Symposium on Principles
of Distributed Computing, pages 129-143. Association for Computing Machinery, ACM Press, 1995.
[FLP85]
M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed
consensus with one faulty process. Journal of the ACM, 32(2):374-382,
April 1985.
[FLR+94] Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, Iannis Schoinas,
Mark D. Hill, James R. Larus, Anne Rogers, and David A. Wood.
Application-specific protocols for user-level shared memory. In Supercomputing '94, 1994.
[FP89]
Brett Fleisch and Gerald Popek. Mirage: A coherent distributed shared
memory design. In ACM Symposium on Operating System Principles, pages 211-223, 1989.
[Gab87]
Dov Gabbay. Modal and temporal logic programming. In Antony Galton,
editor, Temporal Logics and Their Applications, chapter 6, pages 197-237.
Academic Press, New York, 1987.
[GH85]
M. G. Gouda and J. Y. Han. Protocol validation by fair progress state
exploration. Computer Networks and ISDN Systems, 9:353-361, 1985.
[GHJV93] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
patterns: Abstraction and reuse of object-oriented design. In Proceedings
of the European Conference on Object-Oriented Programming, number 707
in Lecture Notes in Computer Science, pages 406-431. Springer-Verlag, New
York, 1993.
[GHP92]
P. Godefroid, G.J. Holzmann, and D. Pirottin. State space caching revisited.
In Proc. 4th Computer Aided Verification Workshop, Montreal, Canada, June
1992. Also in: Formal Methods in System Design, Kluwer, Nov. 1995, 1-15.
[GLL+90] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared memory
multiprocessors. In Proceedings of the 17th International Symposium on
Computer Architecture, 1990.
[GW94]
Patrice Godefroid and Pierre Wolper. A partial approach to model checking.
Information and Computation, 110(2):305-326, May 1994.
[GY84]
M. G. Gouda and Y. T. Yu. Protocol validation by maximal progress exploration. IEEE Transactions on Communications, COM-32(1):94-97, 1984.
[HFM88]
D. Hensgen, R. Finkel, and Udi Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, January 1988.
[HM90]
Joseph Y. Halpern and Yoram Moses. Knowledge and common knowledge in
a distributed environment. Journal of the ACM, 37(3):549-587, July 1990.
Also in Proceedings of the 4th ACM Symposium on Principles of Distributed
Computing (1984).
[Hol91]
Gerard J. Holzmann. Design and validation of computer protocols. Prentice
Hall, Englewood Cliffs, New Jersey, 1991.
[HU79]
John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company,
Reading, Massachusetts, 1979.
[Hu95]
Alan John Hu. Techniques for Efficient Formal Verification Using Binary
Decision Diagrams. PhD thesis, Stanford University, December 1995.
[HZ87]
Joseph Y. Halpern and Lenore D. Zuck. A little knowledge goes a long
way: Simple knowledge-based derivations and correctness proofs for a family
of protocols. In ACM Symposium on Principles of Distributed Computing,
pages 269-280. ACM, 1987.
[Ip96]
Chung-Wah Norris Ip. State Reduction Methods for Automatic Formal Verification. PhD thesis, Stanford University, December 1996.
[JA94]
Ranjit John and Mustaque Ahamad. Evaluation of causal distributed shared
memory for data-race-free programs. Technical Report GIT-CC-94/34, Georgia Institute of Technology, 1994.
[JKW95]
Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL: High-performance all-software distributed shared memory. In ACM Symposium
on Operating System Principles, 1995.
[JR91]
Ralph E. Johnson and Vincent F. Russo. Reusing object-oriented designs. Technical Report UIUCDCS-91-1696, University of Illinois at Urbana-Champaign, May 1991.
[KHvB92] Christian Kant, Teruo Higashino, and Gregor von Bochmann. Deriving protocol specifications from service specification written in LOTOS. Technical
Report 805, Université de Montréal, January 1992.
[KN93]
Yousef A. Khalidi and Michael N. Nelson. The Spring virtual memory system.
Technical Report TR-93-9, Sun Microsystems, February 1993.
[Kur94]
Robert P. Kurshan. Computer-aided verification of coordinating processes:
the automata-theoretic approach. Princeton University Press, 1994.
[LAA87]
M. C. Loui and H. H. Abu-Amara. Memory requirements for agreement
among unreliable asynchronous processes. Advances in Computing Research,
4:163-183, 1987.
[Lam79]
Leslie Lamport. How to make a multiprocessor computer that correctly
executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
[Lam94]
Leslie Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872-923, May 1994.
[LH89]
Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November
1989.
[Lim95]
Swee Boon Lim. Adaptive Caching in a Distributed File System. PhD thesis,
University of Illinois at Urbana-Champaign, 1995.
[LM95]
Hong Liu and Raymond E. Miller. Generalized fair reachability analysis
for cyclic protocols with nondeterminism and internal transitions. Technical Report UMCP-CSD:CS-TR-3422, University of Maryland, College Park,
February 1995.
[Lon93]
David E. Long. Model Checking, Abstraction and Compositional Verification.
PhD thesis, Carnegie Mellon University, July 1993.
[LS88]
Richard J. Lipton and Jonathan S. Sandberg. PRAM: A scalable shared
memory. Technical Report CS-TR-180-88, Princeton University, 1988.
[McM92]
Ken McMillan. Symbolic Model Checking: An Approach to the State Explosion Problem. PhD thesis, Carnegie Mellon University, 1992.
[MF90]
Ronald G. Minnich and David J. Farber. Reducing host load, network load
and latency in a distributed shared memory. In International Conference on
Distributed Computing Systems, 1990.
[MP91]
Zohar Manna and Amir Pnueli. The Temporal Logic of Reactive and Concurrent Systems, volume 1. Specification. Springer-Verlag, New York, 1991.
[MW84]
Zohar Manna and Pierre Wolper. Synthesis of communicating processes from
temporal logic specifications. ACM Transactions on Programming Languages
and Systems, 6(1):68-93, January 1984.
[PD97]
Fong Pong and Michel Dubois. Verification techniques for cache coherence
protocols. ACM Computing Surveys, 29(1):82-126, March 1997.
[Pon95]
Fong Pong. Symbolic State Model: A New Approach for the Verification of
Cache Coherence Protocols. PhD thesis, University of Southern California,
1995.
[Pos81]
Jon Postel. Transmission control protocol. Internet RFC 793, Sep 1981.
[PS91]
Robert L. Probert and Kassim Saleh. Synthesis of communication protocols:
Survey and assessment. IEEE Transactions on Computers, 40(4):468-476,
April 1991.
[RJY+87] Richard Rashid, Avadis Tevanian Jr., Michael Young, David Golub, Robert
Baron, David Black, William Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessors and multiprocessor architectures. In Proceedings of the 2nd International Conference
on Architectural Support for Programming Languages and Operating Systems,
pages 31-39, 1987.
[Rus91]
Vincent Frank Russo. An Object-Oriented Operating System. PhD thesis,
University of Illinois at Urbana-Champaign, 1991.
[SC95]
Aamod Sane and Roy H. Campbell. Object-oriented state machines: Subclassing, composition, delegation and genericity. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA'95), pages 17-32, October 1995.
[SC96]
Aamod Sane and Roy Campbell. Resource exchanger: A behavioral pattern
for low overhead concurrent resource management. In Pattern Languages
of Program Design. Addison-Wesley Publishing Company, Reading, Massachusetts, 1996. (To appear).
[SMC90]
Aamod Sane, Ken MacGregor, and Roy Campbell. Distributed virtual memory consistency protocols: Design and performance. In Second IEEE workshop on Experimental Distributed Systems, 1990.
[Str91]
Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley Publishing Company, Reading, Massachusetts, 2nd edition, 1991.
[Tan92]
Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
[vECGS92] Thorsten von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser.
Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer
Architecture, May 1992.
[VSvSB91] Chris A. Vissers, Giuseppe Scollo, Marten van Sinderen, and Ed Brinksma.
Specification styles in distributed systems design and verification. Theoretical
Computer Science, 89(1):179-206, October 1991.
[Wes78]
Colin H. West. General technique for communications protocol validation.
IBM Journal of Research and Development, 22(3):393-404, 1978.
[WG93]
P. Wolper and P. Godefroid. Partial-order methods for temporal verification.
In Proc. CONCUR '93, volume 715 of Lecture Notes in Computer Science,
pages 233-246, Hildesheim, August 1993. Springer-Verlag.
[WL93]
Pierre Wolper and Denis Leroy. Reliable hashing without collision detection.
In 5th International Conference on Computer Aided Verification, number
697 in Lecture Notes in Computer Science, June 1993.