Merging Semantics for Conflict Updates in Geo-Distributed File Systems
Vinh Tao
Scality and INRIA-LIP6
Marc Shapiro
INRIA-LIP6
Introduction Existing approaches to synchronization in geo-distributed file systems fall into two groups: operation-based, such as IceCube [2], and state-based, such as DFS-R [1].
Neither group solves the merging problem correctly. Operation-based approaches do not guarantee that all updates are preserved, because combining operations from all sites is generally complex. They usually require global synchronization, which is not practical. State-based approaches usually lack a full specification of file systems. For example, they do not handle hard links to files. This incorrect model may lead to anomalous behaviours.
Vianney Rancurel
Scality
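Our Merging Semantics Element Preservation The state of an updated replica of an element is always preserved in a merge:

\[
\begin{cases}
\mathrm{merge}(r_A, r_B) = r' : r_A \in r',\ r_B \in r' \\
r_x \neq r_y \iff \mathrm{merge}(r_x, r_x^*) \neq \mathrm{merge}(r_y, r_y^*)
\end{cases}
\]

Relationship Preservation The mappings and the hierarchy relations are preserved by the merge function:

\[
\begin{cases}
r_x \prec r_y \Rightarrow \mathrm{merge}(r_x, r_x^*) \prec \mathrm{merge}(r_y, r_y^*) \\
r_v \to r_i \Rightarrow \mathrm{merge}(r_v, r_v^*) \to \mathrm{merge}(r_i, r_i^*)
\end{cases}
\]

where $r_A$ and $r_B$ are a pair of diverged replicas of an element on sites A and B, respectively; $r_x$, $r_y$, $r_v$, and $r_i$ are replicas, on the same site, of the elements $x$, $y$, $v$, and $i$; $\prec$ denotes the hierarchy relation and $\to$ the mapping relation.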
Formally, these semantics are known in the field of
order theory as order-preserving and order-reflecting,
which are the requirements to embed a poset into another.
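As an illustration of what order-preserving and order-reflecting mean together, the following minimal sketch checks whether a function between two small posets is an order embedding. The toy poset (paths ordered by "is an ancestor of or equal to") and the names `is_order_embedding`, `relocate`, and `collapse` are assumptions made for this example and are not part of the system described here.

```python
from itertools import product


def is_order_embedding(elements, leq_src, leq_dst, f):
    """True if x <= y in the source holds exactly when f(x) <= f(y) in the target."""
    for x, y in product(elements, repeat=2):
        # "=>" is order-preserving, "<=" is order-reflecting; both must hold.
        if leq_src(x, y) != leq_dst(f(x), f(y)):
            return False
    return True


def is_ancestor_or_equal(p, q):
    """Toy order on paths (tuples of names): p <= q iff p is a prefix of q."""
    return q[:len(p)] == p


paths = [(), ("foo",), ("foo", "file"), ("bar",)]


def relocate(p):
    """Hypothetical map that keeps every path intact under a new directory."""
    return ("merged",) + p


def collapse(p):
    """Hypothetical map that sends every path to the root."""
    return ()


EMBEDS = is_order_embedding(paths, is_ancestor_or_equal, is_ancestor_or_equal, relocate)
COLLAPSES = is_order_embedding(paths, is_ancestor_or_equal, is_ancestor_or_equal, collapse)
```

Here `relocate` is an embedding, while `collapse` preserves the order but does not reflect it: unrelated paths such as `("foo",)` and `("bar",)` become comparable after the map.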
Implementation With CRDT CRDT, which stands for Conflict-free Replicated Data Type [3], is a set of specifications for data types that support eventual consistency.
Implementations of the system should follow the CRDT specifications; that is, the merging policies must be idempotent, commutative, and associative w.r.t. the implementation's definition of the partial order ≤.
We define the partial order ≤ based on the timestamps of the states, i.e., if $s_t$ and $s_{t+1}$ are the states of foo before and after an update or a merge, then $s_t \le s_{t+1}$. We assign timestamps in our system using version vectors, a frequently used technique.
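A minimal sketch of a version-vector partial order follows, assuming one counter per site stored in a dictionary. The function names (`vv_leq`, `vv_concurrent`, `vv_update`, `vv_merge`) are hypothetical; the sketch shows only the general technique, not the actual implementation.

```python
def vv_leq(a, b):
    """a <= b iff every site counter in a is at most the matching counter in b."""
    return all(count <= b.get(site, 0) for site, count in a.items())


def vv_concurrent(a, b):
    """Two states are concurrent when neither is <= the other."""
    return not vv_leq(a, b) and not vv_leq(b, a)


def vv_update(vv, site):
    """Return the timestamp after a local update on `site` (input is not mutated)."""
    new = dict(vv)
    new[site] = new.get(site, 0) + 1
    return new


def vv_merge(a, b):
    """Timestamp of a merged state: element-wise maximum of both vectors."""
    return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}


s0 = {}                   # initial state of foo
sA = vv_update(s0, "A")   # update on site A
sB = vv_update(s0, "B")   # concurrent update on site B
merged = vv_merge(sA, sB)
```

In this example `s0 ≤ sA` and `s0 ≤ sB`, the states `sA` and `sB` are concurrent, and the merged timestamp is above both, matching the requirement that a state only advances upward after an update or a merge.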
Our Approach We model a file system as a partially ordered set (poset) that includes (1) a namespace, which represents the hierarchical structure of the file system, and (2) the mapping between the namespace and data.
The namespace is a directed rooted tree whose edges point away from the root. Vertices are ordered by a hierarchy relation, denoted here by ≺, which is a one-to-many relationship. The mapping between the namespace and data, denoted by →, is a separate mapping relation between vertices and data objects (named inodes).
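The model can be pictured with a small data structure. This is a sketch under stated assumptions: vertices are represented as path tuples, the root as the empty tuple, and the class `FileSystem` with its methods is hypothetical. It keeps the hierarchy relation and the mapping relation separate, so two vertices can map to the same inode (a hard link).

```python
ROOT = ()  # the root vertex, represented as the empty path (assumption)


class FileSystem:
    """Hypothetical model: a namespace tree plus a vertex-to-inode mapping."""

    def __init__(self):
        self.parent = {ROOT: None}  # hierarchy relation: child vertex -> parent vertex
        self.inode_of = {}          # mapping relation: vertex -> inode identifier

    def add_vertex(self, parent, name, inode):
        """Add a vertex named `name` under `parent` and map it to `inode`."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent vertex: {parent!r}")
        vertex = parent + (name,)
        self.parent[vertex] = parent
        self.inode_of[vertex] = inode
        return vertex

    def children(self, vertex):
        """One-to-many direction of the hierarchy relation."""
        return sorted(v for v, p in self.parent.items() if p == vertex)

    def links(self, inode):
        """All vertices mapped to `inode`; more than one means a hard link."""
        return sorted(v for v, i in self.inode_of.items() if i == inode)


fs = FileSystem()
foo = fs.add_vertex(ROOT, "foo", inode="i1")
bar = fs.add_vertex(ROOT, "bar", inode="i2")
fs.add_vertex(foo, "file", inode="i4")
fs.add_vertex(bar, "file", inode="i4")  # second name for the same inode
```

Keeping the two relations in separate tables is what lets `links("i4")` return both `foo/file` and `bar/file`, a case that a model without hard links cannot express.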
Conflict cases We classify conflicts as explicit and implicit. Explicit conflicts are caused by concurrent updates that target the same elements, such as vertices, inodes, or mappings. Based on our system model, we divide explicit conflicts into (1) state conflict: an element is concurrently deleted and modified by users on different sites; (2) data conflict: users on different sites concurrently write to the same file; (3) naming conflict: vertices with the same path exist; (4) mapping conflict: users on different sites map different names to a directory inode.
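The four explicit conflict types can be sketched as a small classifier over a pair of updates. The `Update` record, its field names, and the operation labels are assumptions made for this example; the sketch also assumes the two updates are already known to be concurrent, and it treats any non-delete operation as a modification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    site: str             # site that issued the update
    op: str               # "delete", "write", "create", or "map"
    target: str           # path for delete/write/create; inode id for "map"
    name: str = ""        # name being mapped (only used by "map")
    is_dir: bool = False  # whether a mapped inode is a directory


def classify_explicit(u, v):
    """Classify two concurrent updates; None means no explicit conflict."""
    if u.site == v.site or u.target != v.target:
        return None  # same site, or the updates target different elements
    ops = {u.op, v.op}
    if ops == {"map"}:
        # Different names mapped to the same directory inode.
        if u.is_dir and v.is_dir and u.name != v.name:
            return "mapping conflict"
        return None
    if "delete" in ops and len(ops) == 2:
        return "state conflict"   # deleted on one site, modified on the other
    if ops == {"write"}:
        return "data conflict"    # concurrent writes to the same file
    if ops == {"create"}:
        return "naming conflict"  # two vertices with the same path
    return None
```

For instance, a delete of `/foo/file` on site A combined with a write to `/foo/file` on site B is reported as a state conflict, while two writes to that path are reported as a data conflict.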
Implicit conflicts are caused by updates that target different elements but produce anomalous results when merged. The common pattern is that updates targeting different elements interfere with each other's paths. These updates could be broken down into explicit conflicts.
References
[1] Bjørner, N. Models and software model checking of a distributed file replication system. Formal methods and hybrid real-time systems (2007), 1–23.
[2] Kermarrec, A.-M., et al. The IceCube approach to the reconciliation of divergent replicas. In PODC (2001), ACM.
[3] Shapiro, M., et al. A comprehensive study of convergent and commutative replicated data types.
Merging Semantics for Conflict Updates in Geo-Distributed File Systems
Vinh Tao (1,2), Marc Shapiro (2), Vianney Rancurel (1)
(1) Scality www.scality.com, (2) INRIA-LIP6 www.lip6.fr
Introduction
- Existing synchronization systems for geo-distributed file systems:
  - operation-based: do not preserve updates
  - state-based: do not fully model file systems
Research:
- A full model of file systems.
- Correct merging semantics.
Our Merging Policies
[Figure 5, panels (a) to (i): Merging Policies For Explicit Conflicts; Merging Policies For Implicit Conflicts]
Figure 5: Merging policies for conflicts. We only display the final outcome on each site.
System Model
- File systems: a partially ordered set of hierarchy and mapping relations.
Figure 1: System model, with (a) the namespace as a directed rooted tree, (b) the full system model, and (c) the simplified system model. Legend: directory, file, inode, hierarchy relation, mapping relation.
- Conflict Cases
  - Explicit conflicts are caused by updates that target the same elements.
  - Implicit conflicts are caused by updates that target different elements but whose merged result would be anomalous.
Implementation With CRDT
- CRDT, which stands for Conflict-free Replicated Data Type, is a set of specifications for data types that support eventual consistency.
  - The state of each replica advances upward after modifications w.r.t. a partial order ≤. Formally, if $s_i$ and $s_j$ are the states of a replica before and after an update, then $s_i \le s_j$.
  - The merge function computes the Least Upper Bound (LUB) of these replicas w.r.t. ≤. The LUB of a pair of states $s_i$ and $s_j$ under the partial order ≤ is defined as

\[
s = \mathrm{LUB}(s_i, s_j) :
\begin{cases}
s_i \le s,\ s_j \le s \\
\nexists\, s' < s : s_i \le s',\ s_j \le s'
\end{cases}
\tag{3}
\]

  - By definition, the LUB function is idempotent, commutative, and associative:

\[
\begin{cases}
\text{idempotent} & \mathrm{LUB}(s, s) = s \\
\text{commutative} & \mathrm{LUB}(s_i, s_j) = \mathrm{LUB}(s_j, s_i) \\
\text{associative} & \mathrm{LUB}(s_i, \mathrm{LUB}(s_j, s_k)) = \mathrm{LUB}(\mathrm{LUB}(s_i, s_j), s_k)
\end{cases}
\tag{4}
\]
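The three LUB properties can be checked mechanically on a small example. The sketch below assumes the simplest state-based data type, a grow-only set ordered by inclusion, for which the LUB is set union; this choice and the name `lub` are illustrative assumptions, not the data type used by the system.

```python
from itertools import product


def lub(si, sj):
    """LUB of two grow-only set states under the subset order: their union."""
    return si | sj


# A few sample states, each a set of element names.
states = [frozenset(s) for s in ([], ["foo"], ["bar"], ["foo", "bar"], ["foo", "qux"])]

idempotent = all(lub(s, s) == s for s in states)
commutative = all(lub(a, b) == lub(b, a) for a, b in product(states, repeat=2))
associative = all(
    lub(a, lub(b, c)) == lub(lub(a, b), c) for a, b, c in product(states, repeat=3)
)
# The result is an upper bound of both inputs (<= on frozensets is the subset test).
upper_bound = all(a <= lub(a, b) and b <= lub(a, b) for a, b in product(states, repeat=2))
```

Because these properties hold, replicas can apply merges in any order, and any number of times, and still converge to the same state.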
Figure 2: Conflict cases, with (a) to (f) as examples of explicit conflicts and (g) and (h) as examples of implicit conflicts.
- Implementations of the system should follow the CRDT specifications; that is, the merging policies must be idempotent, commutative, and associative w.r.t. the implementation's definition of the partial order ≤.
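Our Merging Semantics
- Element Preservation: The state of an updated replica of an element is always preserved in a pairwise merge.

\[
\begin{cases}
\mathrm{merge}(r_A, r_B) = r' : r_A \in r',\ r_B \in r' \\
r_x \neq r_y \iff \mathrm{merge}(r_x, r_x^*) \neq \mathrm{merge}(r_y, r_y^*)
\end{cases}
\tag{1}
\]

- Relationship Preservation: The mappings and the hierarchy relations are preserved by the merge function.

\[
\begin{cases}
r_x \prec r_y \Rightarrow \mathrm{merge}(r_x, r_x^*) \prec \mathrm{merge}(r_y, r_y^*) \\
r_v \to r_i \Rightarrow \mathrm{merge}(r_v, r_v^*) \to \mathrm{merge}(r_i, r_i^*)
\end{cases}
\tag{2}
\]

  Here $\prec$ denotes the hierarchy relation and $\to$ the mapping relation.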
Figure 3: Examples of merging diverged replicas of a file system with the element preservation semantics.
Figure 4: Examples of merging diverged replicas of a file system with the relationship preservation semantics.
Formally, these semantics (Equations 1 and 2) are known in the field of order theory as order-preserving and order-reflecting, which are the requirements to embed a poset into another.
- We define the partial order ≤ based on the timestamps of the states, i.e., if $s_t$ and $s_{t+1}$ are the states of foo before and after an update or a merge, then $s_t \le s_{t+1}$. We assign timestamps in our system using version vectors, a frequently used technique.
Evaluation
- We compare the features of our implementation with those of commercial synchronization services such as Dropbox, Google Drive, and Microsoft OneDrive.
Table 1: Evaluation of our merging semantics against commercial systems. Abbreviations: Db for Dropbox, GD for Google Drive, and OD for Microsoft OneDrive.

Feature/Support           Db   GD        OD   GeoFS
Preserve Updates          ✓    ×         ✓    ✓
Preserve Structure        ×    ×         ×    ✓
Hardlink                  ×    ×         ×    ✓
Same name dir./files      ✓    dvg. (a)  ✓    ✓
Write || Write            ✓    lww (b)   ✓    ✓
Explicit Delete || Edit   ✓    d.w. (c)  ✓    ✓
Implicit Delete || Edit   ✓    d.w.      ✓    ✓
Cycles                    ✓    arb. (d)  ×    ✓

(a) Diverge: elements are preserved, but replicas' structures diverged.
(b) Last-Writer-Wins: the write with the last timestamp wins over the others.
(c) Delete-Wins: the element, if deleted on any site, is deleted after merging.
(d) Arbitrary: the directories in the cycles are placed at root after merge.
Conclusions
- Our implementation outperforms the existing systems in feature completeness.
first.last@{scality.com,lip6.fr,acm.org}