Merging Semantics for Conflict Updates in Geo-Distributed File Systems

Vinh Tao (Scality and INRIA-LIP6), Marc Shapiro (INRIA-LIP6), Vianney Rancurel (Scality)

Introduction

Existing approaches to synchronization in geo-distributed file systems fall into two groups: operation-based, such as IceCube [2], and state-based, such as DFS-R [1]. Neither group solves the merging problem correctly. Operation-based approaches do not guarantee that all updates are preserved, because combining operations from all sites is generally complex; they usually require global synchronization, which is not practical. State-based approaches usually lack a full specification of file systems; for example, they do not work with hard links to files. This incorrect model may lead to anomalous behaviours.

Our Approach

We model a file system as a partially ordered set (poset) that includes (1) a namespace, which represents the hierarchical structure of the file system, and (2) the mapping between the namespace and data. The namespace is a directed rooted tree whose edges point away from the root. Any two vertices are ordered by a hierarchy relation, denoted here by $\prec$, which is a one-to-many relationship. The mapping between the namespace and data, denoted by $\to$, is a separate mapping relation between vertices and data objects (named inodes).

Conflict cases

We classify conflicts as explicit or implicit. Explicit conflicts are caused by concurrent updates that target the same elements, such as vertices, inodes, or mappings. Based on our system model, we divide explicit conflicts into (1) state conflict: an element is concurrently deleted and modified by users on different sites; (2) data conflict: users on different sites concurrently write to the same file; (3) naming conflict: vertices with the same path exist; and (4) mapping conflict: users on different sites map different names to a directory inode. Implicit conflicts are caused by updates that target different elements but produce anomalous results when merged. The common pattern is that updates targeting different elements interfere with each other's paths. These updates could be broken down into explicit conflicts.

Our Merging Semantics

Element Preservation. The state of an updated replica is always preserved in a merge.

$$
\begin{cases}
\mathrm{merge}(r_A, r_B) = r' : r_A \in r',\ r_B \in r' \\
r_x \neq r_y \iff \mathrm{merge}(r_x, r_x^*) \neq \mathrm{merge}(r_y, r_y^*)
\end{cases}
$$

Relationship Preservation. The mappings and the hierarchy relations are preserved by the merge function.

$$
\begin{cases}
r_x \prec r_y \implies \mathrm{merge}(r_x, r_x^*) \prec \mathrm{merge}(r_y, r_y^*) \\
r_v \to r_i \implies \mathrm{merge}(r_v, r_v^*) \to \mathrm{merge}(r_i, r_i^*)
\end{cases}
$$

where $r_A$ and $r_B$ are a pair of diverged replicas of an element on sites A and B, respectively, and $\{r_x, r_y, r_v, r_i\}$ are replicas, on the same site, of $\{x, y, v, i\}$. Formally, these semantics are known in order theory as order-preserving and order-reflecting, which are the requirements for embedding one poset into another.

Implementation With CRDT

CRDT, which stands for Conflict-free Replicated Data Type [3], is a set of specifications for data types that support eventual consistency. Implementations of the system should follow the CRDT specifications; that is, the merging policies must be idempotent, commutative, and associative with respect to the implementation's definition of the partial order $\le$. We define the partial order $\le$ based on the timestamps of the states: if $s_t$ and $s_{t+1}$ are the states of an element foo before and after an update or a merge, then $s_t \le s_{t+1}$. We use version vectors, a frequently used technique, to assign timestamps in our system.

References

[1] Bjørner, N. Models and software model checking of a distributed file replication system. Formal Methods and Hybrid Real-Time Systems (2007), 1–23.
[2] Kermarrec, A.-M., et al. The IceCube approach to the reconciliation of divergent replicas. In PODC (2001), ACM.
[3] Shapiro, M., et al. A comprehensive study of convergent and commutative replicated data types.

Merging Semantics for Conflict Updates in Geo-Distributed File Systems

Vinh Tao (1,2), Marc Shapiro (2), Vianney Rancurel (1)
(1) Scality, www.scality.com; (2) INRIA-LIP6, www.lip6.fr

Introduction

- Existing synchronization systems for geo-distributed file systems:
  - operation-based: do not preserve updates
  - state-based: do not fully model file systems
- Research:
  - A full model of the file system.
  - Correct merging semantics.

Our Merging Policies

Figure 5: Merging policies for conflicts, in two panel groups: Merging Policies For Explicit Conflicts and Merging Policies For Implicit Conflicts. We only display the final outcome on each site.

System Model

- File systems: a partially ordered set of hierarchy and mapping relations.

Figure 1: System model, with (a) the namespace as a rooted directed tree, (b) the full system model, and (c) the simplified system model. Legend: directory, file, inode, hierarchy relation, mapping relation.

- Conflict cases:
  - Explicit conflicts are caused by updates that target the same elements.
  - Implicit conflicts are caused by updates that target different elements but whose merged result would be anomalous.
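To make the two conflict classes above concrete, the following is a minimal sketch that tells them apart for a pair of concurrent updates. It rests on assumptions that are not part of the poster: the `Update` record, the `classify` function, and the operation labels are hypothetical, and an element is identified by its path only, so conflicts on inodes or mappings are not modelled.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Update:
    """One update issued on one site (hypothetical record, not from the poster)."""
    site: str   # e.g. "A" or "B"
    op: str     # illustrative label such as "delete", "write", "move"
    path: str   # absolute path of the vertex the update targets


def classify(u: Update, v: Update) -> str:
    """Classify two concurrent updates issued on different sites.

    Simplifying assumption: an element is identified by its path only,
    so conflicts on inodes or mappings are not modelled here.
    """
    p, q = PurePosixPath(u.path), PurePosixPath(v.path)
    if p == q:
        # Both updates target the same element.
        return "explicit"
    if p in q.parents or q in p.parents:
        # Different elements, but one lies on the path of the other.
        return "implicit"
    return "none"


delete_dir = Update("A", "delete", "/foo")
edit_file = Update("B", "write", "/foo/file")
print(classify(delete_dir, edit_file))  # implicit
```

Deleting a directory on one site while a file below it is edited on another corresponds to the "Implicit Delete || Edit" row of the evaluation table; two writes to the same path would be reported as explicit.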
Figure 2: Conflict cases, in two panel groups (Explicit Conflicts and Implicit Conflicts), with (a), (b), (c), (d), (e), and (f) as examples of explicit conflicts and (g) and (h) as examples of implicit conflicts.

Our Merging Semantics

- Element Preservation: The state of an updated replica of an element is always preserved in a pairwise merge.

$$
\begin{cases}
\mathrm{merge}(r_A, r_B) = r' : r_A \in r',\ r_B \in r' \\
r_x \neq r_y \iff \mathrm{merge}(r_x, r_x^*) \neq \mathrm{merge}(r_y, r_y^*)
\end{cases}
\tag{1}
$$

- Relationship Preservation: The mappings and the hierarchy relations are preserved by the merge function.

$$
\begin{cases}
r_x \prec r_y \implies \mathrm{merge}(r_x, r_x^*) \prec \mathrm{merge}(r_y, r_y^*) \\
r_v \to r_i \implies \mathrm{merge}(r_v, r_v^*) \to \mathrm{merge}(r_i, r_i^*)
\end{cases}
\tag{2}
$$

Here $\prec$ stands for the hierarchy relation and $\to$ for the mapping relation.

Formally, these semantics (Equations 1 and 2) are known in order theory as order-preserving and order-reflecting, which are the requirements for embedding one poset into another.

Figure 3: Examples of merging diverged replicas of a file system with the element preservation semantics.

Figure 4: Examples of merging diverged replicas of a file system with the relationship preservation semantics.

Implementation With CRDT

- CRDT, which stands for Conflict-free Replicated Data Type, is a set of specifications for data types that support eventual consistency.
  - The state of each replica advances upward after modifications w.r.t. a partial order $\le$. Formally, if $s_i$ and $s_j$ are the states of a replica before and after an update, then $s_i \le s_j$.
  - The merge function computes the Least Upper Bound (LUB) of these replicas w.r.t. $\le$. The LUB of a pair of states $s_i$ and $s_j$ under the partial order $\le$ is defined as

$$
s = \mathrm{LUB}(s_i, s_j) :
\begin{cases}
s_i \le s,\ s_j \le s \\
\nexists\, s' \le s,\ s' \neq s : s_i \le s',\ s_j \le s'
\end{cases}
\tag{3}
$$

    By definition, the LUB function is idempotent, commutative, and associative:

$$
\begin{aligned}
\text{idempotent:} &\quad \mathrm{LUB}(s, s) = s \\
\text{commutative:} &\quad \mathrm{LUB}(s_i, s_j) = \mathrm{LUB}(s_j, s_i) \\
\text{associative:} &\quad \mathrm{LUB}(s_i, \mathrm{LUB}(s_j, s_k)) = \mathrm{LUB}(\mathrm{LUB}(s_i, s_j), s_k)
\end{aligned}
\tag{4}
$$

- Implementations of the system should follow the CRDT specifications; that is, the merging policies must be idempotent, commutative, and associative with respect to the implementation's definition of the partial order $\le$.
- We define the partial order $\le$ based on the timestamps of the states: if $s_t$ and $s_{t+1}$ are the states of an element foo before and after an update or a merge, then $s_t \le s_{t+1}$. We use version vectors, a frequently used technique, to assign timestamps in our system.

Evaluation

- We compare the features of our implementation with those of commercial synchronization services such as Dropbox, Google Drive, and Microsoft OneDrive.

Table 1: Evaluation of our merging semantics with commercial systems. Abbreviations: Db for Dropbox, GD for Google Drive, and OD for Microsoft OneDrive.

Feature/Support            Db    GD         OD    GeoFS
Preserve Updates           ✓     ×          ✓     ✓
Preserve Structure         ×     ×          ×     ✓
Hardlink                   ×     ×          ×     ✓
Same name dir./files       ✓     dvg. (a)   ✓     ✓
Write || Write             ✓     lww (b)    ✓     ✓
Explicit Delete || Edit    ✓     d.w. (c)   ✓     ✓
Implicit Delete || Edit    ✓     d.w.       ✓     ✓
Cycles                     ✓     arb. (d)   ×     ✓

Keys
a Diverge: elements are preserved, but the replicas' structures diverge.
b Last-Writer-Wins: the write with the latest timestamp wins over the others.
c Delete-Wins: the element, if deleted on any site, is deleted after merging.
d Arbitrary: the directories in the cycles are placed at the root after the merge.

Conclusions

- Our implementation outperforms the existing systems in feature completeness.

first.last@{scality.com,lip6.fr,acm.org}
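The version-vector ordering and LUB-based merge described under Implementation With CRDT can be illustrated with a small sketch. The poster does not show its implementation, so everything below is an assumption: the names `leq`, `concurrent`, and `lub` are hypothetical, a version vector is represented as a plain dictionary from site id to counter, and the pointwise maximum is used as the textbook least upper bound for version vectors.

```python
from typing import Dict

# Assumed representation: site id -> number of updates seen from that site.
VersionVector = Dict[str, int]


def leq(a: VersionVector, b: VersionVector) -> bool:
    """a <= b iff every entry of a is at most the matching entry of b."""
    return all(count <= b.get(site, 0) for site, count in a.items())


def concurrent(a: VersionVector, b: VersionVector) -> bool:
    """Two states have diverged when neither is ordered before the other."""
    return not leq(a, b) and not leq(b, a)


def lub(a: VersionVector, b: VersionVector) -> VersionVector:
    """Pointwise maximum: the least upper bound of a and b under leq."""
    return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}


s_a = {"A": 2, "B": 0}  # state on site A after two local updates
s_b = {"A": 1, "B": 3}  # state on site B: one update from A, three local
merged = lub(s_a, s_b)

print(concurrent(s_a, s_b))                     # True
print(merged == {"A": 2, "B": 3})               # True
print(leq(s_a, merged) and leq(s_b, merged))    # True
```

Because the pointwise maximum is idempotent, commutative, and associative, replicas that merge in any order and any number of times arrive at the same timestamp, which is the property Equation 4 asks of the merge function.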