A Cover Sheet

Please replace this page with cover sheet.
CISE Cross-Cutting Programs: FY 2010
Data Intensive Computing
Program Solicitation # 09-558
Title: DC:Small:Collaborative Research:DARE: Declarative and Scalable Recovery
PI: Joseph M. Hellerstein
Professor
University of California, Berkeley
E-Mail: [email protected]
Co-PI: Andrea C. Arpaci-Dusseau
Professor
University of Wisconsin, Madison
E-Mail: [email protected]
B Project Summary
The field of computing is changing. In the past, improving computational performance of a single machine was
the key to improving application performance. Today, with the advent of scalable computing, improving application
performance can be as simple as adding more machines; systems are now capable of scaling and redistributing data
and jobs automatically. But a new challenge has arisen: with many thousands of devices, failure is not a rarity but rather
a commonplace occurrence. As failure becomes the norm, when data is lost or availability is reduced, we should no
longer blame the failure, but rather the inability to recover from the failure. Thus, we believe a key aspect of system
design that must be scrutinized more than ever before is recovery.
To address the challenges of large-scale recovery, in this proposal, we describe the Berkeley-Wisconsin Declarative
and Scalable Recovery (DARE) Project, as part of our vision of building systems that “DARE to fail.” To reach this
vision, we will proceed in three major directions. The first is offline recovery testing. As thousands of servers produce
“millions of opportunities” for component failures each day [79], our focus is to inject multiple failures, including the
rare combinations, such that recovery is extensively exercised. The second is online recovery monitoring. “Surprising”
failures take place in deployment [79], and hence more recovery problems often appear during actual deployment.
Thus, systems should have the ability to monitor recovery online and alert system administrators when recovery bugs
are observed. Finally, as we expect system builders to learn from failures over time and refine recovery continuously
(“try again, fail again, fail better” [87]), we believe system builders need new approaches to design and robustly
implement recovery. Thus, we plan to design an executable recovery specification, a declarative language for writing
recovery specifications that can be translated into executable protocols.
A.1 Intellectual Merit
Intellectual merit and importance: The DARE project will advance the state of knowledge in large-scale storage
systems in three fundamental ways. First, by introducing many (ideally all) possible failures, we will understand the
deficiencies of today’s systems and develop the next state of the art in large-scale recovery. Second, we will explore
the design space of a new paradigm: online failure scheduling and recovery monitoring. Finally, we will demonstrate
the utility of declarative languages for specifying and implementing various aspects of recovery management.
Qualifications: We believe we are well positioned to make progress on this demanding problem, having assembled
different sets of expertise. Professor Joseph Hellerstein is a leader in the application of data management concepts
to system design, and this project will leverage his expertise in declarative programming [23, 54, 58, 108, 109, 111],
distributed monitoring [92, 93, 136], and scalable architectures [19, 55, 72, 94, 110, 142, 157]. Professor Andrea
Arpaci-Dusseau is an expert in file and storage systems. This project will leverage her expertise in storage system
reliability [33, 35, 76, 77, 78, 104, 133, 145] and high-performance storage clusters [25, 27, 39, 40, 75, 156].
Organization and access to resources: From an organizational viewpoint, our goal is to perform “low-cost, high-impact” research. Hence, the bulk of funding requested within this proposal is found in human costs; we will leverage
donations from industry for much of the infrastructure.
A.2 Broader Impacts
Advancing discovery while promoting teaching, training, and learning: In general, we work to give students hands-on
training with cutting-edge systems technology. We also plan to incorporate our research directly into undergrad and
grad courses (as we have done in the past), and develop the next generation of engineers who are critical to the future of science and engineering in our country.
Enhancing infrastructure for research and disseminating results: We plan to disseminate the results of our research in
three ways: through the classic medium of publication, which in the past has impacted the design and implementation
of various storage systems including the EMC Centera [75], NetApp filers [104], and Yahoo cloud services [59];
through the development of numerous software artifacts, which we have shared with the open source community, parts
of which have been adopted, for example, into next generation Linux file systems [133] and MySQL [147]; and finally,
through our work with various industry partners to help shape their next generation storage systems.
Benefits to society: We believe the DARE project aligns with directions set by federal agencies; a recent HEC FSIO workshop declared “research in reliability at scale” and “scalable file system architectures” to be topics that are very important and in great need of research [36]. We also believe that our project will benefit society at large; in the
near future, users will store all of their data on the Internet (emails, work documents, generations of family photos and
videos, etc.). As John Sutter of CNN put it: “This is not just data. It’s my life. And I would be sick if I lost it” [149].
Unfortunately, data loss still happens in reality [113, 118, 121]. Through the DARE project, we will build the next
generation large-scale storage systems that will meet the performance and reliability demands of current society.
Keywords: scalable recovery, declarative recovery, parallel file systems, cloud computing, testing, online monitoring.
C Table of Contents
(This page will be automatically generated)
D Project Description
D.1 Introduction
Three characteristics dominate today’s large-scale computing systems. The first is the prevalence of large storage clusters. Storage clusters at the scale of hundreds or thousands of commodity machines are increasingly being deployed.
At companies like Amazon, Google, Yahoo, and others, thousands of nodes are managed as a single system [3, 38, 73].
This first characteristic has empowered the second one: Big Data. A person on average produces a terabyte of digital
data annually [112], a scientific project could capture hundreds of gigabytes of data per day [141, 152], and an Internet
company could store multiple petabytes of web data [120, 130]. This second characteristic attracts the third: large
jobs. Web-content analysis has become popular [57, 130], scientists now run large numbers of complicated queries [85],
and it is becoming typical to run tens of thousands of jobs on a set of files [20, 51].
Nevertheless, as large clusters have brought many benefits, they also bring a new challenge: a growing number
and frequency of failures that must be managed [18, 42, 79, 87]. Bits, sectors, disks, machines, racks, and many other
components fail. With millions of servers and hundreds of data centers, there are “millions of opportunities” for these
components to fail [79]. Failing to deal with failures will directly impact the reliability and availability of data and
jobs. Unfortunately, we still hear data-loss stories even recently. For example, in March 2009, Facebook lost millions
of photos due to simultaneous disk failures that “should” rarely happen at the same time [118] (but it happened); in July
2009, a large bank was fined a record total of £3 million after losing data on thousands of its customers [121]; more
recently, in October 2009, T-Mobile Sidekick, which uses Microsoft’s cloud service, also lost its customer data [113].
These incidents show that existing large-scale storage systems are still fragile in the face of failures.
We believe a key aspect of system design that must be scrutinized more than ever before is recovery. As failure
becomes the norm, when data is lost or availability is reduced, we should no longer blame the failure, but rather the
inability to recover from the failure. Although recovery principles have been highlighted before [127], they were proposed for much smaller settings [42]. At a larger scale, we believe recovery has more responsibilities: it should
anticipate not only all individual failures but also rare combinations of failures [79]; it must be efficient and scale to a
large number of failures; it must also consider rack-awareness [43] and geographic locations [61]. In short, as James
Hamilton, the Vice President and Distinguished Engineer of the Amazon Web Services, suggested: “a rigorously
specified, scalable form [of recovery] is very much needed” [42]. This leaves many pressing challenges for large-scale system designers: What should recovery look like for scalable clusters? What are the possible combinations of
failures that the system should anticipate? In what ways should recovery be formally specified? How should recovery
specifications be checked?
To address the challenges of large-scale recovery, in this proposal, we describe the Berkeley-Wisconsin Declarative
and Scalable Recovery (DARE) Project wherein we want to (1) seek the fundamental problems of recovery in today’s
scalable world of computing, (2) improve the reliability, performance, and scalability of existing large-scale recovery,
and (3) explore formally grounded languages to empower rigorous specification of recovery properties and behaviors. Our vision is to build systems that “DARE to fail”: systems that deliberately fail themselves, exercise recovery
routinely, and enable easy and correct deployment of new recovery policies.
There are three major thrusts in this project, which form what we call the DARE iterative lifecycle, as depicted in Figure 1. First, we begin our work with offline recovery testing of large-scale file systems. Large-scale recovery is inherently complex as it has to deal with multiple failures, including the rare combinations, against which recovery is rarely tested. Furthermore, correct recovery decisions must be made based on many metrics such as system load, priority, location, cost, and many more. Thus, by testing recovery extensively and learning through the results, we can sketch out the fundamental principles of recovery in the context of large-scale file systems.

Figure 1: The DARE iterative lifecycle: offline testing, online monitoring, and executable specification revolving around the system.

To complement offline testing, our second thrust is online recovery monitoring. In actual deployment, recovery is faced with more scenarios that might not have been covered in offline testing. Thus, interesting problems appear when the system is deployed in a large cluster of hundreds or thousands of machines [79, 87]. In fact, we have observed that
system builders learn new issues from real-world deployment [81]. Therefore, we believe there is a need for an online
recovery monitoring framework with the ability to monitor recovery in action and alert system administrators when recovery bugs
are observed.
Finally, in our third thrust, we advocate executable recovery specification. We believe that system builders can
benefit greatly by using declarative languages to specify recovery policies in a manner that is both formally checkable
and also executable in the field. We expect system builders to learn from failures over time and refine recovery
continuously (“try again, fail again, fail better” [87]). Declarative recovery specifications can allow system builders
to analyze the correctness and scope of their code, and easily prototype different recovery strategies by focusing
on the high-level invariants maintained by their recovery protocols. With executable specification, recovery code
becomes a form of explicit documentation, and this documentation is also precisely the running code that enforces the
specification. Finally, a rich tradition in “shared-nothing” parallelization of declarative languages suggests that this
language design will promote scalable recovery.
We plan to do all of the above to two classes of widely used large-scale file systems: Internet service (“cloud”) file
systems [6, 9, 14, 15, 16, 43, 69] and parallel file systems [8, 44, 50, 155]. For this project, we will focus on one file
system in each class: the Yahoo Hadoop File System (HDFS) [43] and the Sun Lustre Parallel File System [44]. We
note that the existence of these open-source systems brings great opportunities for researchers to analyze the design
space of large-scale recovery. In return, our contributions will also be directly useful to these communities.
As illustrated in Figure 1, we see the three phases of the DARE lifecycle as an iterative, pay-as-you-go approach.
That is, we believe the lessons learned from each phase will benefit the other phases in parallel. Thus, our plan is to
rapidly prototype and improve all of them hand-in-hand. We anticipate specific major contributions in our thrusts:
•Offline Recovery Testing: To emulate failures, we will automatically insert failure points using recent advances
in Aspect-Oriented Programming [56]. For example, we have used AspectJ [5] to insert hundreds of fault-injection
hooks around all I/O-related system calls without intruding on the HDFS base code. Since our goal is to exercise various
combinations of failures, we will develop methodologies to intelligently sample the large test space. We also plan
to explore the idea of declarative testing to explicitly specify which important fault risk scenarios to prioritize. We
expect to uncover reliability, performance, and scalability bugs. We will submit this aspect-oriented test harness and
our findings to the HDFS and Lustre communities.
•Online Recovery Monitoring: We plan to extend existing scalable monitoring tools [59, 116, 134] to monitor
detailed recovery actions. To infer the high-level states of the system being monitored, these tools depend on log
messages generated by the system. Thus, to infer high-level recovery actions, one challenge that we will deal with
is log provenance (i.e., we need to know which log messages are in the context of recovery, and furthermore, due to
which specific failures). We will also develop declarative analyses to help system administrators easily declare what
they intend to monitor and analyze.
•Executable Recovery Specification: Hellerstein’s group has a great deal of experience in the design and use of
declarative languages like Overlog to build full-function distributed systems [23, 107]. With this experience and the
active learning done in the first two phases above, we will develop a domain-specific language for recovery specifications, which will then be directly translated into executable protocols. With executable specifications, we also expect
to be able to specify recovery performance and scalability goals, and generate the code that meets the goals.
In the remainder of this proposal, we present the extended motivation for DARE. We then give an overview of the
DARE project, and present the details of each component of our research. We continue by describing our research
plan and the educational impact of our work. We conclude by presenting related work and our prior funded efforts.
D.2 Extended Motivation
At the heart of large-scale computing are the large-scale storage systems capable of managing hundreds or thousands
of machines. Millions of users fetch data from these systems, and computations are “pushed” to the data inside the
systems. Thus, they must maintain high data-availability; when failures occur, data recovery must be correct and
efficient. In this section, we estimate how often data recovery takes place in large-scale storage settings. Then, we
present our initial study of large-scale recovery from real-world deployment. Finally, we present extended motivation
for the three thrusts of the DARE project.
D.2.1 How Often Does Data Recovery Take Place?
Data recovery can be triggered due to whole-disk failure. A study of 100,000 disks over a period of five years by
Schroeder et al. showed that the average percentage of disks failing per year is around 3% [138]; they also found a
set of disks from a manufacturer had a 13.5% failure rate. Google also released a similar rate: 8.6% [129]. All these
failure rates suggest that a 1-PB cluster might recover between 30 and 135 TB of data yearly (90 to 400 GB daily).
Disks can also fail partially; a disk might be working but some blocks may not be accessible (e.g., due to latent
sector errors [96, 97]), or the blocks are still accessible but have been corrupted (e.g., due to hardware and software
bugs [52, 67, 106, 148, 158]). Bairavasundaram et al. found that a total of 3.45% of 1.53 million disks in production
systems developed latent sector errors [32]; some “bad” disk drives had more than 1000 errors. In the same population,
they also found that 400,000 blocks were corrupt [33]. If an inaccessible or corrupt block stores metadata information,
the big data sitting behind the metadata can become unreachable, and thus orders of magnitude more data needs to be
regenerated than what was actually lost.
Disk failures are not the only triggers of recovery; as disks are placed underneath layers of software (e.g., OS,
device drivers, device firmware), an error in one layer can make the system crash or hang, and hence data becomes
unavailable although the disks are working fine [71, 139]. Amazon has seen this in practice; a firmware bug “nuked”
a whole server [87]. Other hardware failures, such as network and power-cord failures, can also bring down data availability. As human reaction is slow [87], some machines could be offline and unattended for a while, and hence unable
to send “I’m alive” heartbeat messages for a long period of time. In this case, in some architectures such as HDFS,
the master server will treat these machines as dead and regenerate the data stored in these unattended machines. In
summary, as failure is a commonplace occurrence, system builders should not disregard the veracity of Murphy’s Law:
“anything that can go wrong will go wrong.”
D.2.2 What Are The Issues of Large-Scale Recovery?
To understand the fundamental problems of large-scale recovery, we have performed an initial study of real-world
issues faced from the deployment of a large-scale file system, the Hadoop File System (HDFS) [43]. We picked HDFS
as it has been widely deployed in over 80 medium to large organizations including Amazon, Yahoo, and Facebook, in
the form of 4- to 4000-node clusters [3]. HDFS is a complex piece of code (25 KLOC) and serves as a foundation
for supporting Hadoop MapReduce jobs; it keeps three replicas of data, and computations are typically pushed to the
data. We have studied almost 600 issues reported by the HDFS developers [81] and found that 44 are about recovery.
Below, we summarize some of our interesting findings.
•Too-aggressive recovery: HDFS developers have observed a whole-cluster crash (600 nodes) caused by simultaneous failures of only three nodes [83]. This happened when the cluster was too aggressive in regenerating the lost
copies from the three dead nodes, such that some heartbeats from the healthy nodes were “buried” in this busy recovery.
As the heartbeats from some healthy nodes could not reach the master node, these healthy nodes were also considered
dead, which then caused the cluster to regenerate more data as it saw more dead nodes. From this point on, the cluster
was constantly regenerating more data and losing more nodes and finally became dysfunctional.
•Too-slow recovery: There was a case where recovery was very slow because the replication procedure was coupled with the wrong heartbeat [84], and hence recovery ran too slowly even when resources such as network bandwidth were free. As a result, data recovery of one failed node took three hours, although it should have taken only one hour.
•Coarse-grained recovery: A “tiny” failure can make the whole cluster unavailable if recovery is not fine-grained. As
an example, HDFS refused to start even if there was only one bad file entry, making the whole cluster unusable [82].
Since we believe that there are more problems that have not been reported, we have also manually injected some hand-picked data failures into HDFS and uncovered additional findings, as listed below. Again, these findings point
to the fact that recovery is often under-specified and hence must be rigorously tested and monitored.
•Expensive recovery: We found that many important pieces of metadata are not replicated, such that the loss of one piece of metadata results in an expensive recovery. One example is the non-replicated local directory block, which potentially stores references to thousands of large files; a loss of a 4-KB directory block will result in hundreds of GB being regenerated. Another example is the non-replicated version file kept in each machine. If the version file of a node gets corrupted or becomes inaccessible, GBs or TBs of data in the node must be regenerated, although the node is perfectly healthy.

Figure 2: Four restarted jobs due to a late reporting of data failure (timeline of jobs on Node 1 and Node 2 from 0 to 600 s).

•Late recovery: We also have identified a late recovery due to a delayed reporting of data failure, as illustrated in Figure 2. When a job runs on a bad copy of a file (in Node 1), HDFS forgets to directly report the bad copy to the master server. The bad copy is only reported by a background scan that runs at a very slow rate (e.g., hourly). As a result, jobs wastefully run on the bad copy (dashed lines) only to find themselves restarted on the other good copies (solid lines).
D.2.3 Why Offline Recovery Testing?
Recovery should definitely be reliable. However, in the scalable world of computing, recovery must also scale in three
dimensions: number of failures, size of data, and number of jobs dependent on the data. To achieve these, recovery
must be fast, efficient, and conscious of the jobs running on the system. If it is not fast, recovering from a large number
of failures will not scale in time. If it is not efficient (e.g., recovery generates much more data than what was lost),
recovering a big data loss will not scale in size. If it is not conscious of the scale of the jobs running on each piece of data, a
large number of jobs will experience a performance degradation (e.g., if a loss of a popular file is not prioritized).
Evaluating whether existing recovery strategies are robust and scalable is a hard problem, especially when recovery
protocols are often under-documented. As a result, component interactions are hard to understand and system builders
tend to unearth important design issues “late in the game” [42]. Thus, we plan to apply an extensive fault-injection
technique that enables us to insert many possible failures, including the rare combinations, and thus exercise many
recovery paths. Unlike previous work on testing, we also plan to explore the idea of declarative testing to intelligently
sample important fault risk scenarios from the large test space; for example, we might wish to run a testing specification such as “Combinations=1 & Workload=STARTUP” (i.e., insert all possible single failures only in the reboot
process).
Hypothesis 1: Unlike small-scale recovery, large-scale recovery must be rigorously tested against various
combinations of failures, including the rare ones.
D.2.4 Why Online Recovery Monitoring?
The bug reports we examined for HDFS reflect that system builders learn new issues from real-world deployment;
interesting problems appear when the system is deployed in a big cluster of hundreds or thousands of machines. This
is because new solutions tend to be tested only in small settings.
Therefore, we envision that failures should be deliberately scheduled during runtime, but furthermore, we believe
there is a need for an online framework that monitors recovery actions live and reasons about their correctness and scalability. More specifically, we expect system builders and administrators to ask high-level analysis questions such as: “How many bytes are transferred when a large number of disks fail? How are jobs affected when a replica
is missing? How long does it take to recover a missing replica when the system is 90% busy?” Thus, we plan to build an online
monitoring framework where system builders or administrators can declare what they intend to monitor and analyze
(e.g., in the form of declarative queries).
Hypothesis 2: More recovery problems will be uncovered if failures are deliberately injected during
actual deployment, and if the corresponding recovery reactions are monitored and analyzed online.
D.2.5 Why Executable Recovery Specification?
Testing and analysis alone cannot ensure high-quality, robust, and scalable storage systems; the resulting insight must
be turned into action. However, fixing recovery subsystems has proven hard for several reasons. First, recovery
code is often written in low-level system languages (e.g., C++ or Java) in which recovery specification is not easy to
express. As a pragmatic result, recovery is often under-specified; the larger the set of failure scenarios, the harder it is to
formally check whether all scenarios have been handled properly. Second, low-level system languages tend to make
recovery code scattered [76, 78, 133]; a simple change must be added in many places. Third, although many failure
scenarios have been considered in initial design, it is inevitable that “surprising” failures take place in deployment [79];
as developers unearth new issues, ad-hoc patches are applied, and over time the code becomes unnecessarily complex.
Finally, the design space of large-scale recovery is actually vast, involving many metrics and components. Yet, it is
common that early releases do not cover the entire design space.
Ultimately, we believe recovery code must evolve from low-level system languages (e.g., C++ or Java) into higher-level specifications. Declarative language approaches are attractive in this regard. With recovery specified declaratively, one can formally check its correctness, add new specifications, and more easily evaluate performance-reliability-scalability tradeoffs of different approaches. We note that the culture of large-scale recovery programming is to start
with “simple and nearly stupid” strategies [79]. This is because additional optimizations tend to introduce more complexity but do not always bring orders-of-magnitude improvements. In contrast, we believe declarativity will enable
developers to explore a broad spectrum of recovery strategies (even complex ones). Moreover, with declarativity, we
believe we can parallelize recovery tasks explicitly, and hence radically improve recovery performance and scalability.
In addition to specifying recovery declaratively, our goal is also to translate the specifications into executable
protocols such that system developers do not have to write the code twice. Recently, we have used a declarative
data-centric programming language to rapidly build cloud services (e.g., MapReduce, HDFS) [23]. We will use this
experience to design a domain specific language for writing executable recovery specifications.
Hypothesis 3: A declarative language for recovery specifications will lead to more reliable and manageable large-scale storage systems.
D.2.6 Summary: Declarativity and Iterative Lifecycle
One central theme of DARE is the use of declarative languages. Having presented the various usages of declarativity
in our DARE lifecycle, we highlight again its four important benefits that greatly fit the context of large-scale recovery.
First, declarative languages enable natural expression of system invariants without regard to how or whether they are
maintained. For example, in the analysis phase, system administrators can declaratively express “healthy” and “faulty”
scenarios without the need to know the detailed implementation of the system. Second, declarative expressions can be
directly executable. Thus, not only can recovery be written down as a form of documentation, but the documentation
also becomes precisely the running code that enforces the specification. This corresponds to a direct specification of
the distributed systems notion of system “Safety” (as in “Safety and Liveness”). Third, declarative languages tend
to be data-centric. This fits well for storage systems in which the bulk of the invariants is all about ensuring correct,
consistent representations of data via data placement, movement, replication, regeneration, and so on. Finally, work
in the database community has shown that declarative languages can naturally be scaled out via data parallelism. This
feature fits well for large-scale recovery where recovery must be scalable.
Another central theme of DARE is the iterative, pay-as-you-go nature of its three phases. In the testing and analysis
phases, we will improve the recovery of existing code bases (i.e., HDFS and Lustre) and submit our refinements to the
corresponding communities. At the same time, we will start formulating fundamental principles of large-scale recovery
and write them down as executable specifications. This process, in effect, forms a kind of N-version programming [29].
That is, the executable specification becomes an evolving, formal document of the base code properties, and hence can
be tested for compliance with the base code as both evolve. Iteratively, as the specification becomes more powerful, it
can guide and improve the efficiency of the testing phase. For example, by leveraging the logical constraints defined in
the specification, we can exclude impossible or uncorrelated combinations of failures during the offline testing phase.
Thus, we believe the DARE lifecycle is a powerful software-engineering approach to recovery-oriented enrichment of
existing systems.
D.3 The Berkeley-Wisconsin DARE Project
The main goal of the DARE project is to build robust and scalable recovery as part of the next generation large-scale
storage systems. To get there, recovery must become a first-class component in today’s large-scale computing, and
hence must be tested and analyzed rigorously. Furthermore, as recovery is inherently complex in large-scale settings,
we believe declarative approaches must be explored in this context.
We believe the collaboration we have assembled is well positioned to make progress on this demanding problem,
as we draw on expertise in storage system reliability [33, 35, 76, 77, 78, 104, 133, 145], high-performance storage
clusters [25, 27, 39, 40, 75, 156], distributed monitoring [92, 93, 136], declarative programming [23, 54, 58, 108, 109,
111], and scalable architectures [19, 55, 72, 94, 110, 142, 157]. By having students each work under multiple advisors
with different expertise, we will build in a structural enforcement of the interdisciplinary nature of this proposal; hence,
students will learn techniques and knowledge from a much broader range of topics, and be able to solve the problems
in question. We now discuss the challenges of each component in DARE, along with our concrete plans. We begin
with offline recovery testing, progress to online recovery monitoring, and finally conclude the research portion of this
proposal with executable recovery specification.
D.4 Offline Recovery Testing
In this section we describe our approach to offline recovery testing. Our goal is to extensively test recovery against
various combinations of failures such that recovery bugs can be discovered “early in the game.” We begin by describing our fault model and our technique for inserting fault-injection hooks. We then discuss the challenges of covering the failure sample space. Finally, we describe our intention to explore the idea of declarative testing specifications to
better drive the testing process.
D.4.1 Fault Model and Failure Points
Users store data. Jobs run on data. Reboot needs data. Thus, our initial focus is on the storage fault model. Our
fault model will range from coarse-grained faults (e.g., machine failure, whole-disk failure) to fine-grained ones (e.g.,
latent sector error, byte corruption). We note that recovery of coarse-grained failures has been examined rigorously in
the literature and in practice. For example, a job running on a failed machine is simply restarted [62]; a lost file is simply
regenerated from the surviving replicas. However, fine-grained fault models are often overlooked. For example,
the Amazon S3 storage system had an outage for much of a day when a corrupted value “slipped” into its gossip-based subsystem [42]. Thus, as Hamilton suggested, we should mimic security threat modeling, which considers each
possible security threat and implements adequate mitigation [79]. We believe the same approach should be applied in
designing fault resiliency and recovery.
After defining the storage fault model, the first challenge is to decide how to emulate the faults in the model. We
decided to add a fault-injection framework as part of the system under test (i.e., a white-box framework). We believe this integration is necessary to enable a large number of analyses; having internal knowledge of the system will allow us to test many components of the system. We will start by placing failure points around system calls that interact with local storage (e.g., read, write, stat). To do this without making the original code look “ugly”, we plan to use recent advances in aspect-oriented programming (AOP) [56]; the HDFS developers have successfully used AspectJ [5], the AOP framework for Java, as a tool to inject failures [154]. With AspectJ, we have easily identified around 500 I/O-related failure points within HDFS. We plan to do the same for the Lustre file system, but with AspectC [4]. As we test
the HDFS base code, we will also test our declarative version of HDFS (BOOM-FS) hand-in-hand [23]. BOOM-FS
is comprised of the Overlog declarative language and its Java runtime support, Java Overlog Library (JOL). Thus, we
will also use AspectJ to insert failure points in JOL. BOOM-FS and JOL will be explained more in Section D.6.1.
Task 1: Add a white-box fault-injection framework to each system we plan to test (i.e., HDFS, Lustre, and
BOOM-FS) with aspect-oriented technology.
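For concreteness, the sketch below illustrates the kind of fault-injection hook we have in mind, written in AspectJ's annotation style (plain Java syntax). The pointcut, the FailureSchedule helper, and its policy are illustrative placeholders rather than our actual harness; the real framework will wrap the specific I/O-related calls identified above and will be driven by the coverage scheduler described next.

import java.io.IOException;
import java.util.Collections;
import java.util.Set;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Sketch of a white-box fault-injection aspect: it wraps low-level stream
// writes and, when the (hypothetical) failure schedule arms this failure
// point, throws an IOException instead of performing the real I/O.
@Aspect
public class IoFaultInjector {

    private final FailureSchedule schedule = FailureSchedule.current();

    @Around("call(* java.io.OutputStream+.write(..)) && !within(IoFaultInjector)")
    public Object maybeFailWrite(ProceedingJoinPoint jp) throws Throwable {
        String failurePoint = jp.getSignature().toShortString();
        if (schedule.shouldFail(failurePoint)) {
            // Emulate a failed disk write at this failure point.
            throw new IOException("injected fault at " + failurePoint);
        }
        return jp.proceed();  // otherwise perform the original I/O call
    }

    // Hypothetical schedule; a real one would be supplied by the test driver.
    static final class FailureSchedule {
        private final Set<String> armedPoints;
        private FailureSchedule(Set<String> armedPoints) { this.armedPoints = armedPoints; }
        static FailureSchedule current() { return new FailureSchedule(Collections.emptySet()); }
        boolean shouldFail(String failurePoint) { return armedPoints.contains(failurePoint); }
    }
}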
D.4.2 Coverage Scheduling
Once failure points are in place, the next challenge revolves around the three dimensions of failure scheduling policies:
what failures to inject, when and where to inject them. We translate these challenges into the problems of failure,
sequence, and workload coverage.
Failure coverage requires us to ideally inject all possible failure scenarios, ranging from individual to combinations
of failures. Without careful techniques, the sample space (i.e., the set of all possible failures) can be very big. For
example, even in the case of a single failure, a value can be corrupted to any arbitrary value, which will take a long time
to exhaust. A worse scenario is to exhaust all combinations of multiple failures. To reduce the coverage space without
reducing the coverage quality, we plan to adopt two methods. The first is the type-aware fault injection technique that
we have used successfully in the past [31, 34, 77, 131, 133]. The second is combinatorial design from the field of bio-computation, which suggests that if we cover all combinations of two possible failures (versus all combinations of all possible failures), we might cover almost all important cases [143]. After dealing with two-failure scenarios, we will explore further techniques to cover larger numbers of simultaneous failures.
Sequence coverage presents the idea that different sequences of failures could trigger different reactions. In other
words, it is not enough to say “inject X and Y”; rather, we must be able to say “inject X, then Y” and vice versa. This observation came from one of our initial experiments where a sequence of two failures results in a data loss while the opposite sequence is not dangerous. For example, in HDFS, if a copy of a metadata file could not be written (the first failure) merely because of a transient failure, this copy is considered stale; all future updates will not be reflected in this file. If the update-time of the file is then corrupted (the second failure) such that it becomes more
recent than those of the other good copies, HDFS will read this stale file during the next reboot, which implies that all
updates after the first failure will be lost. The reverse scenario is safe because when the write fails, HDFS also updates
the update-time. Thus, sequence coverage is important, but it explodes the sample space further, and hence we plan to
look for solutions for this coverage too.
Finally, workload coverage requires us to exercise failures in different parts of the code; the same failure can be
handled in different segments of the code. Thus, we must also cover all possible workloads (e.g., start-up, data transfer,
numerous client operations, background check, and even during recovery itself). To do this, we will leverage existing
programming language techniques in directed testing [70].
Task 2: Develop techniques to ensure that the injected failures achieve high-quality failure, sequence, and workload coverage.
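As a sketch of what Task 2 could build on, the following Java fragment enumerates ordered two-failure test cases across workloads, covering all pairs of (failure type, failure point) injections rather than all arbitrary-size combinations; the specific type, point, and workload names are made-up placeholders, not our final fault model.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: enumerate two-failure test cases. Covering all ordered pairs
// keeps the test space tractable while still providing failure, sequence
// (because (a,b) and (b,a) remain distinct), and workload coverage.
public class PairwiseFailurePlanner {

    record Injection(String failureType, String failurePoint) {}
    record TestCase(String workload, Injection first, Injection second) {}

    public static List<TestCase> plan(List<String> workloads,
                                      List<String> failureTypes,
                                      List<String> failurePoints) {
        List<Injection> injections = new ArrayList<>();
        for (String type : failureTypes)
            for (String point : failurePoints)
                injections.add(new Injection(type, point));

        List<TestCase> cases = new ArrayList<>();
        for (String workload : workloads)
            for (Injection a : injections)
                for (Injection b : injections)
                    if (!a.equals(b))  // ordered pairs: sequence matters
                        cases.add(new TestCase(workload, a, b));
        return cases;
    }

    public static void main(String[] args) {
        List<TestCase> cases = plan(
            Arrays.asList("STARTUP", "WRITE_PIPELINE"),
            Arrays.asList("CRASH", "CORRUPTION", "TRANSIENT_WRITE_FAILURE"),
            Arrays.asList("metadata.write", "heartbeat.send"));
        System.out.println(cases.size() + " two-failure test cases");
    }
}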
D.4.3 Declarative Testing Specification
We note that testing is often performed in a manner that is oblivious to the source it is run against. For example, in
the field of fault injection, a range of different failures are commonly inserted into a system with little or no sense as
to whether a small or large portion of the failure handling code is being exercised [102]. We therefore believe there
is an opportunity to explore the feasibility of writing declarative testing specification to better drive testing. Such
“specification-aided” testing exploits domain specific knowledge (e.g., program logic) to help developers specify the
high-level test objectives. For example, typically it is hard to verify how many failures a system can survive before
going down. In this case, we ideally want to run a test specification such as “Combinations=2 & Server != DOWN” (i.e., insert any possible two failures and verify that the system is never down). Or, in the case of systems with crash-only recovery [48], the reboot phase must be thoroughly tested. Thus, we might wish to run a testing specification
such as “Combinations=1 & Workload=STARTUP” (i.e., insert all possible single failures in the reboot process).
We also note that declarative testing specification advocates experiment repeatability, something that bug fixers highly
appreciate.
Task 3: Explore how program logic (available from the source-code or program specification) can be
used to better drive testing.
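To illustrate how a declarative testing specification might drive the harness, the sketch below interprets a specification such as "Combinations=1 & Workload=STARTUP" as a simple conjunctive filter over planned test cases. The grammar and the admits() method are deliberately minimal stand-ins for the richer specification language we intend to design; assertion clauses such as "Server != DOWN" would additionally be checked against system state after each run and are not handled here.

import java.util.HashMap;
import java.util.Map;

// Sketch: interpret a declarative testing specification as a key/value
// filter over planned test cases.
public class TestSpec {

    private final Map<String, String> clauses = new HashMap<>();

    public TestSpec(String spec) {
        for (String clause : spec.split("&")) {
            String[] kv = clause.trim().split("=", 2);
            clauses.put(kv[0].trim(), kv[1].trim());
        }
    }

    // True if a candidate test case matches every clause in the specification.
    public boolean admits(int combinations, String workload) {
        String c = clauses.get("Combinations");
        String w = clauses.get("Workload");
        return (c == null || Integer.parseInt(c) == combinations)
            && (w == null || w.equals(workload));
    }

    public static void main(String[] args) {
        TestSpec spec = new TestSpec("Combinations=1 & Workload=STARTUP");
        System.out.println(spec.admits(1, "STARTUP"));   // true
        System.out.println(spec.admits(2, "STARTUP"));   // false
    }
}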
D.5 Online Declarative Monitoring
To complement offline testing, our next goal is to analyze recovery when failures occur in actual deployment. We
note that many novel analyses have been proposed to pinpoint root-causes of failures in large-scale settings [37, 95,
122, 124, 153, 161]. What we plan to do is fundamentally different: in our case, failures are deliberately injected, and
recovery is the target of the analysis. In other words, our focus is not about finding out what causes failures to happen,
but rather why the system cannot recover from failures. Therefore, a different kind of framework is needed than what
has been proposed before. Below, we describe the three components of our online monitoring framework (as depicted
in Figure 3): online failure scheduling, recovery monitoring, and declarative analysis.
Figure 3: Architecture for Online Declarative Analysis. A failure scheduler injects failures into the system; recovery monitoring collects context-tagged log messages (e.g., "X fails", "Regenerating File Y") together with pre- and post-recovery state; declarative analyses then issue queries over this information.
D.5.1 Online Failure Scheduling
We note that there has recently been significant interest in industry in having failures deliberately injected during actual runs [79, 87, 126], rather than waiting for failures to happen. We call this new trend online failure scheduling. Unfortunately, to the best of our knowledge, there has been no work that lays out the possible designs in this space. Thus, we plan to explore this paradigm as part of our online monitoring. We believe this piece of the project will foster a new breed of
research that will directly influence existing practices.
To begin with, we will use the same fault-injection framework that we will build for our offline testing, as described
in the previous section. In addition, we will explore the fundamental differences between offline and online fault injection. We have identified three challenges that we need to deal with. First, performance is important in online
testing. Thus, every time the system hits a failure point, the fault-injection decision making must be done rapidly
(e.g., locally, without consulting a remote scheduler). This requires us to build distributed failure schedulers that run on each machine but communicate with each other; such a feature is typically not required in offline testing [162].
Second, timing is important; offline testing can direct the program execution to any desired state, but in deployment
settings, such freedom might be more restricted. For example, to test the start-up protocol, we cannot obliviously
reboot the system. Finally, reliability is important; we cannot inject a failure and lose customers’ data, especially if
there is no guarantee that recovery is reliable [135]. On the other hand, the fact that the failure is scheduled presents an
opportunity for the system to prepare for the failure (e.g., by backing up the data that might be affected by the injected
failure). A safe implication is that a buggy recovery can be caught without actual consequences (e.g., real data loss),
and hence the system “dares to fail”. One big challenge here is to design an efficient preparation stage (i.e., identifying which data should be backed up).
Task 4: Design and implement distributed failure schedulers that run on each machine, but still coordinate failure timings efficiently.
Task 5: Develop techniques for efficient preparation for failures (i.e., identify which data are potentially affected by to-be-injected failures).
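A minimal sketch of the decision split that Tasks 4 and 5 call for is shown below: the injection decision at a failure point consults only a locally cached plan, while a background thread refreshes the plan from a cluster-wide coordinator and asks it to back up potentially affected data before any failure is armed. The FailureCoordinatorClient interface and the refresh period are hypothetical placeholders.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: a per-node failure scheduler for online fault injection.
public class LocalFailureScheduler {

    // Failure points armed for this node; read on the I/O fast path.
    private final Set<String> armedPoints = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService refresher =
            Executors.newSingleThreadScheduledExecutor();

    public LocalFailureScheduler(FailureCoordinatorClient coordinator, String nodeId) {
        refresher.scheduleAtFixedRate(() -> {
            Set<String> plan = coordinator.fetchPlanFor(nodeId);  // off the fast path
            coordinator.requestBackupFor(plan);                   // prepare before failing
            armedPoints.clear();
            armedPoints.addAll(plan);
        }, 0, 5, TimeUnit.SECONDS);
    }

    // Called from instrumented failure points; purely local, no remote round trip.
    public boolean shouldFail(String failurePoint) {
        return armedPoints.remove(failurePoint);  // fire each armed point at most once
    }

    // Hypothetical client for the cluster-wide coordinator.
    public interface FailureCoordinatorClient {
        Set<String> fetchPlanFor(String nodeId);
        void requestBackupFor(Set<String> failurePoints);
    }
}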
D.5.2 Recovery Monitoring
After failures are injected online, our next goal is to detect bad recovery actions. This requires a recovery monitoring
framework that can gather sufficient information during recovery and construct the complex recovery behaviors of
the system under analysis. Administrators can then plug in their analyses on this monitoring framework. Below, we
describe the challenges of such specialized monitoring. We note that this framework can also be used for our offline
testing phase.
•Global health: Correct recovery decisions must be made based on many metrics such as system load, priority,
location, cost, and many more. We assume a large-scale system has a monitoring infrastructure that captures the
general health of the system such as CPU usage, disk usage, network usage, and number of jobs from log messages [2,
37, 134, 153, 161]. This general information will be useful to help us understand the global condition when recovery
takes place. We do not plan to reinvent the wheel; instead, we plan to extend existing infrastructure with recovery
monitoring in mind.
•Log provenance: We find that existing monitoring tools are not sufficient for our purpose; they depend on general
log messages that do not capture important component interactions during recovery. What is fundamentally missing
is the contextual information of each message. Other researchers have raised the same concern. For example, Oliner
and Stearley studied 1 billion log messages from five of the world’s most powerful supercomputers, and their first
insight was that current logs do not contain sufficient information, which then impedes important diagnosis [123].
Without context, it is hard to correlate one event (e.g., “failure X occurred”) with another event (e.g., “Z bytes were
transferred”). As a result, tedious techniques such as filtering [75], timing correlation [22, 165], perturbation [30, 45,
115], and source-code analysis [161] are needed.
We believe context is absent because monitoring was not a first-class entity in the past, while today, monitoring
is almost a must for large-scale systems [79]. We note that contextual information already exists in the code (e.g., in
the form of states), but it has not percolated down to the logging layer. We also believe that context is important to
classify log messages and hence will enable richer and more focused analysis. Thus, we plan to develop techniques
to establish concrete provenance of log messages (“lint for logs”). Fortunately, today’s systems have leveraged more
sophisticated logging services beyond “printf”. For example, HDFS employs Apache log4j logging services [2].
We will explore how context can be incorporated into existing logging services, and hence provide transparent log provenance. If context transparency is impossible, we still believe that adding context to log messages would be a
one-time burden that will benefit future analysis.
Task 6: Establish concrete provenance of log messages, and explore if existing logging services can
incorporate log provenance transparently.
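As one possible starting point for Task 6, the sketch below tags every log message emitted during a recovery action with the identity of the failure being recovered from, using log4j's mapped diagnostic context (MDC); a pattern layout containing %X{recoveryOf} would then carry this provenance into each log line. The recoverFrom and regenerateReplica methods are hypothetical stand-ins for real HDFS recovery code, not its actual API.

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

// Sketch: attaching recovery context to log messages so that later analysis
// can tell which messages were emitted while recovering from which failure.
public class RecoveryLogging {

    private static final Logger LOG = Logger.getLogger(RecoveryLogging.class);

    public void recoverFrom(String failureId, String blockId) {
        MDC.put("recoveryOf", failureId);      // everything logged below carries this context
        try {
            LOG.info("Regenerating block " + blockId);
            regenerateReplica(blockId);        // hypothetical recovery action
            LOG.info("Finished regenerating block " + blockId);
        } finally {
            MDC.remove("recoveryOf");
        }
    }

    private void regenerateReplica(String blockId) {
        // Placeholder for the actual replica regeneration logic.
    }
}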
D.5.3 Declarative Analysis
Once log messages carry rich contextual information, we can continuously record them in a "recovery database".
We plan to adopt existing strategies that use an intermediate database to transform log messages into well-formatted
structures [116, 134, 161]. To catch bad recovery actions, the next step is to run a set of online analyses on the recovery
database, specifically by writing them as declarative queries. This approach is motivated by our initial success in using
declarative queries to find state inconsistencies in the context of local file systems. We first describe this initial success
to illustrate how we will express declarative analysis.
•Initial work (Declarative Local Fsck): We have built a declarative file system checker named SQCK [77] to express the high-level intent of the local file system consistency checker (fsck) in the form of declarative queries (with SQL). This was accomplished by loading file system states into a database on which the consistency-check queries run. With such declarativity, we have rewritten the core logic of a popular fsck (150 consistency checks) from 16 thousand lines of C code into 1100 lines of SQL statements. Figure 4 shows an example of a check that we transformed from C into a declarative query; one can see that the high-level intent of the declarative check is more clearly specified. This initial experience provides a major foundation for our online declarative analysis.

Figure 4: The C version vs. the declarative version of a consistency check.

C version:
firstBlk = sb->sFirstDataBlk;
lastBlk = firstBlk + blksPerGrp;
for (i = 0, gd = fs->grpDesc; i < fs->grpDescCnt; i++, gd++) {
    if (i == fs->grpDescCnt - 1)
        lastBlk = sb->sBlksCnt;
    if ((gd->bgBlkBmap < firstBlk) || (gd->bgBlkBmap >= lastBlk)) {
        px.blk = gd->bgBlkBmap;
        if (fixProblem(PR0BBNOTGRP, ...))
            gd->bgBlkBmap = 0; } }

Declarative version:
SELECT *
FROM   GroupDescTable G
WHERE  G.blkBitmap NOT BETWEEN G.start AND G.end

•Declarative Analysis: With the recovery database, we provide administrators the ability to extract the recovery protocols (i.e., a state-centric view of recovery as query processing and updates over system state). For example, to identify whether recovery of X is expensive
or not, one can query the statistics (e.g., bytes transferred) during the
context of recovering X. Thus, the challenge is to design the format of the recovery database and establish a set of
queries that infer various aspects of recovery such as timing, scheduling, priority, load-balancing, data placement, and
many more.
More importantly, our goal here is to enable system administrators to plug in their specifications for both healthy
and faulty scenarios. For example, after recovery, the system should enter a stable state again. Thus, we can express
a specification that will alert the administrator if the healthy scenario is not met. One could also write a specification
that tracks a late recovery (as described in Section D.2.2) in a query such as “Alert if the same exact error is found
more than once.” To catch a chain reaction of failures, one can plug in a rule such as “Alert if the number of under-replicated files is not shrinking over some period of time.” This set of expected invariants is by no means complete.
In fact, our goal is to enrich them as we learn more about bad recovery behaviors from the offline testing phase. Thus,
another challenge is to find out how to ease this process of going from manual analysis to useful online specifications.
Task 7: Design a proper structure for the recovery database, establish a set of analyses that infer various aspects of recovery, and define a set of specifications that express healthy and faulty recovery scenarios.
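To make Task 7 concrete, the sketch below runs one such specification ("alert if the number of under-replicated files is not shrinking") as a SQL query over a hypothetical recovery database reachable through JDBC; the table name, schema, and JDBC URL are illustrative assumptions rather than a committed design, and the main method merely stands in an in-memory database to exercise the check.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: one declarative analysis over a hypothetical recovery database.
// The table under_replicated(ts, num_files) is assumed to be populated from
// context-tagged recovery log messages.
public class RecoveryAnalyses {

    private static final String WINDOW_COUNTS =
        "SELECT num_files FROM under_replicated WHERE ts > ? ORDER BY ts";

    // Alert condition: the latest count in the window is no smaller than the earliest.
    public static boolean underReplicationNotShrinking(Connection db, long windowStart)
            throws SQLException {
        try (PreparedStatement st = db.prepareStatement(WINDOW_COUNTS)) {
            st.setLong(1, windowStart);
            try (ResultSet rs = st.executeQuery()) {
                long first = -1, last = -1;
                while (rs.next()) {
                    long c = rs.getLong("num_files");
                    if (first < 0) first = c;
                    last = c;
                }
                return first >= 0 && last >= first;
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Assumes an embedded JDBC driver such as HSQLDB on the classpath.
        try (Connection db = DriverManager.getConnection("jdbc:hsqldb:mem:recovery");
             Statement st = db.createStatement()) {
            st.execute("CREATE TABLE under_replicated (ts BIGINT, num_files BIGINT)");
            st.execute("INSERT INTO under_replicated VALUES (1, 120), (2, 140)");
            if (underReplicationNotShrinking(db, 0)) {
                System.out.println("ALERT: under-replicated files are not shrinking");
            }
        }
    }
}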
D.6 Executable Recovery Specification
Although the first two phases of DARE will uncover recovery problems, developers still need to fix them in the same
code base. Yet, developers often introduce new bugs as they fix the old bugs, especially when the system is complex
and written in low-level system languages [164]. Thus, we believe recovery code must evolve from low-level system
languages (e.g., C++ or Java) into declarative specifications. In this section, we describe our plan to design a declarative
language for writing executable recovery specifications. To date, we know of only one work on declarative recovery,
and that is in the context of sensor networks [74]. In the context of large-scale file systems, we believe this will be
a challenging task, yet doable; we are encouraged by our initial success with rewriting HDFS declaratively [23]. We
begin by describing this initial success, followed by our detailed design and evaluation plans.
D.6.1 Initial Work: Data-Centric Programming
After reviewing some of the initial datacenter infrastructure efforts in the literature [46, 62, 63, 69], it seemed to us
that most of the non-trivial logic involves managing various forms of asynchronously-updated state (e.g., sessions,
protocols, storage). This suggests that the Overlog language used for declarative networking [109] would be well-suited to these tasks and could significantly ease development. With this observation, we have successfully used the
Overlog declarative language [109] to rewrite the HDFS metadata management and communication protocols [23].
Our declarative implementation (BOOM-FS) is ten times shorter than the original HDFS implementation: less than
500 lines of Overlog with 1500 lines of additional Java code; the Java code is part of our Java Overlog Library (JOL)
which functions as the Overlog runtime support. Furthermore, to show how easily we can add complex distributed functionality, we have added the Paxos algorithm [68] to BOOM-FS in 400 lines of Overlog rules; with Paxos,
BOOM-FS can manage more than one master server with strong consistency.
To give a flavor of how protocol specifications are written in Overlog, we give a more detailed description of our BOOM-FS implementation. The first step of our HDFS rewrite was to represent file system metadata as a collection of relations. For example, we have an hbChunk relation which carries information about the file chunks stored by each data node in the system. The master server updates this relation as new heartbeats arrive; if the master server does not receive a heartbeat from a data node within a configurable amount of time, it assumes that the data node has failed and removes the corresponding rows from this table. After defining all relations in the system, we can write rules for each command in the HDFS metadata protocol. For example, Figure 5 shows a set of rules (run on the master server) that return the set of data nodes that hold a given chunk. These rules work on the information provided in the hbChunk relation. Due to space constraints we are not able to describe the rule semantics. However, our main point is that metadata protocols can be specified concisely with Overlog rules.

Figure 5: A sample of Overlog rules.

// The set of nodes holding each chunk
computeChunkLocs(ChunkId, set<NodeAddr>) :-
    hbChunk(NodeAddr, ChunkId, _);

// If chunk exists => return set of nodes
response(@Src, RequestId, true, NodeSet) :-
    request(@Master, RequestId, Src, 'ChunkLocations', ChunkId),
    computeChunkLocs(ChunkId, NodeSet);

// Chunk does not exist => return failure
response(@Src, RequestId, false, null) :-
    request(@Master, RequestId, Src, 'ChunkLocations', ChunkId),
    notin hbChunk(_, ChunkId, _);
D.6.2 Evaluating Declarativity
To design a declarative language for recovery, we plan to explore whether existing declarative languages such as
Overlog can naturally express recovery. More concretely, we have defined four challenges that a proper declarative
language for recovery should meet:
•Checkability: We believe recovery is often under-specified. Yet, a small error in recovery could lead to data loss or degraded performance and availability. Declarativity is a way to turn recovery into a formal system specification. Thus,
the chosen declarative language must be able to support formal verification on top of it. We also want to note that
formal verification will greatly complement testing; as testing is often considered expensive and unlikely to cover
everything, our hope is to have formal checks cover what testing does not and vice versa.
•Flexibility: Large-scale recovery management is complex; many parameters of different components play a role in
making recovery decisions. As an illustration, Table 1 lists some of the metrics commonly used in making recovery
decisions. Even though the list is not complete, there are already many metrics to consider. Thus, the possibilities for recovery decisions are nuanced. For example, we might want to prioritize data recovery for first-class customers [100]; on the other hand, migrating recovery over time, especially during periods of surge load, is considered good provisioning [80]. Recovery must also be adaptive. For example, if a cluster at one geographic location is not reachable, the
system might consider temporarily duplicating the unreachable data. Often, what we have in existing systems are
static policies. Therefore, our final solution should enable system builders to investigate different schemes, exploring
the large design space without being constrained by what is available today.
Metrics and possible policies based on the metrics:
#Replicas: Prioritize recovery if the number of available replicas is small
Popularity: Prioritize recovery of popular files
Access: Prioritize recovery of files that are being used by jobs
#Jobs: Try not to impact foreground workload
Foreground request: Try to piggyback recovery with foreground request
Free space: Try to replicate files to machines with more free space
Heterogeneous hardware: Try to recover important files to fast machines
Utilization: Try to pick less utilized machines to perform recovery
Rack awareness: Ensure that different copies are in different racks
Geographic location: Ensure that one copy is available in a different geographic location
Financial cost: If multiple recovery options are available, choose the cheapest one
Network bandwidth: Throttle recovery if bandwidth is scarce

Table 1: Recovery metrics and policies.
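To make the flexibility argument concrete, the sketch below shows how a few of the metrics in Table 1 might be folded into a single recovery-priority score. The field names and weights are purely illustrative; they are exactly the kind of policy we want system builders to be able to express, and revise, declaratively rather than hard-code.

// Sketch: scoring a pending recovery task using a few metrics from Table 1.
public class RecoveryPriority {

    public static double score(int survivingReplicas,
                               int jobsUsingFile,
                               double filePopularity,
                               boolean rackDiversityViolated) {
        double score = 0.0;
        score += 10.0 / Math.max(1, survivingReplicas);   // fewer replicas => more urgent
        score += 2.0 * jobsUsingFile;                     // files in use by jobs come first
        score += filePopularity;                          // popular files come next
        if (rackDiversityViolated) score += 5.0;          // restore the rack-awareness invariant
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score(1, 3, 0.8, true));   // a hot, barely-replicated file
        System.out.println(score(3, 0, 0.1, false));  // a cold, well-replicated file
    }
}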
•Scalability: We believe a declarative language suitable for large-scale systems must be able to express parallelism
explicitly. We notice that although the power of elastic computing is available, some forms of recovery have not evolved, and hence do not exploit the available compute power in the system. An example is the classic file system checker
(fsck) whose job is to cross-check the consistency of all metadata in the system and then repair all broken invariants [12,
88, 117, 159]. A great many performance optimizations have been introduced to fsck [41, 89, 90, 128], but only locally. Hence, a complete check still often takes hours or days [90], and hence runs very rarely. Thus, the chosen
declarative approach should enable easy parallelization of recovery tasks. As a start, in our prior work on declarative
fsck [77], we made concrete suggestions on how to run consistency checks in parallel (e.g., via traditional data-parallel
methods such as MapReduce and/or database-style queries).
•Integration: We note that some recent work has also introduced solutions for building flexible data management (e.g., with modularity [114]). However, such work typically omits the recovery part. Thus, previous approaches, although novel, are not holistic approaches for building a complete storage system, as sooner or later recovery code needs to be added [77]. Thus, our goal is to design a holistic approach where the chosen declarative language covers data management as well as recovery; extending Overlog would be a great beginning, as we have used Overlog to build the former.
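As a toy illustration of the data-parallel consistency checks mentioned under Scalability (and only an illustration: the inode layout below is an assumption made for exposition, not HDFS’s or SQCK’s actual metadata schema), a classic fsck invariant such as “no block is claimed by two inodes” maps naturally onto a map/group/reduce pipeline:

```python
# Toy sketch of a data-parallel consistency check: detect blocks claimed by
# more than one inode. The inode dictionaries are an illustrative assumption,
# not a real file system's metadata format.
from collections import defaultdict

def map_phase(inode):
    """Emit (block_id, inode_id) for every block pointer in an inode."""
    for block_id in inode["blocks"]:
        yield block_id, inode["inode_id"]

def reduce_phase(block_id, owners):
    """Report a broken invariant: a block referenced by multiple inodes."""
    if len(set(owners)) > 1:
        return block_id, sorted(set(owners))
    return None

def check(inodes):
    # In a real deployment, the map and reduce calls would run as parallel
    # tasks (e.g., a MapReduce job); here they run sequentially for clarity.
    groups = defaultdict(list)
    for inode in inodes:
        for block_id, owner in map_phase(inode):
            groups[block_id].append(owner)
    violations = []
    for block_id, owners in groups.items():
        result = reduce_phase(block_id, owners)
        if result is not None:
            violations.append(result)
    return violations

if __name__ == "__main__":
    inodes = [
        {"inode_id": 1, "blocks": ["b1", "b2"]},
        {"inode_id": 2, "blocks": ["b2", "b3"]},   # "b2" is doubly claimed
    ]
    print(check(inodes))   # -> [('b2', [1, 2])]
```

Because each check is expressed as a grouping plus a per-group predicate, adding machines (or expressing the same check as a database-style query) speeds up the whole pass rather than a single local step.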
D.6.3 Declarative Recovery Design
We believe that coming up with a good design requires careful exploration of possible alternatives. Thus, with the four axes of evaluation in mind, there are several directions that we will explore. First, we will investigate whether Overlog is well-suited to expressing recovery protocols. With Overlog, metadata operations are treated as query processing and updates over system state; recovery could take the same state-centric view (a small sketch appears below), and hence this direction has a high chance of success. Second, we will explore AOP for declarative languages (in our case, we will begin with AOP for Overlog). This direction comes from our observation that if recovery rules pervade the main operational rules, the latter could become unnecessarily complex even if both are written declaratively. Thus, if declarative recovery can be written in an AOP style, we can provide a formal and executable “recovery plane” for real systems. Finally, if we find that Overlog is not suitable for our goal, we will investigate other declarative techniques, rather than being constrained by what we currently have.
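To make the state-centric view concrete, the following sketch approximates, in Python rather than in Overlog itself, the kind of rule we have in mind; the relation names and the replication threshold are illustrative assumptions.

```python
# Sketch of "recovery as a derived relation", approximating a declarative rule
# of the form: action(B, 'replicate', D) :- replica(B, N), N < TARGET, D = TARGET - N.
# Relation names and the threshold are assumptions for illustration only.

TARGET = 3

def replica_counts(replica_facts):
    """replica_facts: iterable of (block_id, datanode_id) tuples."""
    counts = {}
    for block_id, _node in replica_facts:
        counts[block_id] = counts.get(block_id, 0) + 1
    return counts

def derive_actions(replica_facts):
    """Derive one ('replicate', block_id, deficit) tuple per under-replicated block."""
    return [("replicate", block_id, TARGET - n)
            for block_id, n in replica_counts(replica_facts).items()
            if n < TARGET]

if __name__ == "__main__":
    facts = [("b1", "dn1"), ("b1", "dn2"), ("b1", "dn3"),
             ("b2", "dn1")]                        # "b2" has lost two replicas
    print(derive_actions(facts))                   # -> [('replicate', 'b2', 2)]
```

Written this way, recovery is a query over the same metadata state that the rest of the system manipulates, which is precisely what would allow an AOP-style recovery plane to be woven in and out of the main operational rules.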
We also note that recovery is constrained by how the data and metadata are managed; a policy that recovers lost data from a surviving replica cannot exist if the data is not replicated in the first place, and a failover policy for a failed metadata server is impossible if only one metadata server exists. This implies that we also need to re-evaluate how far existing declarative languages such as Overlog enable flexibility in managing data and metadata.
Task 8: Prototype recovery specifications for HDFS in Overlog, and then evaluate the prototype based
on the four axes of evaluation above. Based on this experience, develop guidelines for a more appropriate
domain-specific language for recovery.
Task 9: Evaluate the reliability of our recovery specifications using the first two phases of DARE (hence,
the iterative nature of the DARE lifecycle).
D.7 Putting It All Together
From our initial study of existing large-scale recovery (Section D.2.2), we learn two important lessons. First, early designs usually do not cover all possible failures; unpredictable failures occur in real deployments, and thus system builders learn from previous failures. Second, even as new failures are handled, the corresponding recovery strategies must be designed carefully; a careless recovery could easily bring down availability, overuse resources, and affect a large number of applications. Therefore, we believe the three components of DARE form a novel framework for improving recovery: the offline testing phase unearths recovery bugs early, the online monitoring phase catches more problems found in deployment, and the executable specification enables system builders to try out different recovery strategies. Because we see these three phases as an iterative, pay-as-you-go approach, our plan is to rapidly prototype the three phases and improve them hand-in-hand. As mentioned throughout the previous sections, we believe the lessons learned from each phase will benefit the other phases in parallel.
D.8 Related Work
In this section, we highlight the distinctions between our project and previous major work on recovery and scalability.
•Recovery: Recovery-Oriented Computing (ROC) was proposed eight years ago [127]. This powerful paradigm has convinced the community that errors are inevitable, unavailability is costly, and thus failure management should be revisited. Since then, a long line of efficient recovery techniques has been proposed. For example, Nooks ensures that a driver failure does not cause a whole-OS failure [150, 151]; micro-reboot advocates that a system should be compartmentalized to prevent full reboots [49]; more recently, CuriOS prevents failure propagation by isolating each service [60]. Today, as systems scale, recovery should be revisited again at a larger scale [42].
•Scalable Recovery: Recently, some researchers have started to investigate the scalability of job recovery. For example, Ko et al. found that a failed MapReduce task could lead to a cascaded re-execution of all preceding tasks [101]. Arnold and Miller pointed out a similar issue in the context of tree-based overlay computation [24]: a node failure could lead to a re-execution of all computations in the sub-tree. Our project falls along the same line, but with a different emphasis: scalable data recovery.
•Data Recovery: Fast data recovery was first introduced in the context of RAID around 16 years ago [91]. Recently, RAID recovery has been evaluated at a larger scale [160]. However, there is a shift from disk-level RAID to distributed file-level RAID [159] or simple replication across a cluster of machines [43, 69]. We believe that recovery should be revisited in this new context. Fast data recovery during restart (log-based recovery) has also been considered important. For example, the ARIES recovery method exploits multi-processors by making redo/undo processing parallel [119]. More recently, Sears and Brewer extended the idea with a new technique, segment-based recovery, which can run across large-scale distributed systems [140]. In our project, in addition to log-based recovery, we will revisit the scalability of all recovery tasks of the systems we analyze.
•Pinpointing problems: There is a growing body of work attempting to pinpoint problems in large-scale settings: Priya Narasimhan’s group at CMU built black-box analysis [124], white-box analysis [153], and automated online fingerpointing tools [37] for monitoring MapReduce jobs; Oliner et al. developed unsupervised algorithms to analyze raw logs and automatically alert administrators [122]; Xu et al. employ machine learning techniques to detect anomalies in logs [161]. We will extend this existing work to capture recovery problems.
•Declarative Recovery: Keeton et al. pointed out the complexity of recovery in a multi-tier storage system [98, 99, 100]; recovery choices are myriad (e.g., failover, reprovision, failback, restore, mirror). To help administrators, they introduced novel recovery graphs that process input constraints (e.g., tolerated loss, cost) and produce a solution. However, they only provide a high-level framework and stop short of providing a new language. To date, we know of only one work on declarative recovery, and it is in the context of sensor networks: as sensor applications often deal with failures, many recovery concerns become tangled with the main program, making it hard to understand and evolve [74]. The same lesson should be applied to large-scale file systems.
•Parallel workloads: Today, all kinds of workloads are made scalable: B-Tree data structures coordinate themselves across machines [21]; tree-learning algorithms now run on multiple machines [125]; all subcomponents of a database (query planning, query optimization, etc.) have been parallelized [66, 157]; even monitoring tools are structured as scalable jobs [134, 161]. This trend forces us to revisit the scalability of all file system subcomponents; we will begin with the recovery component.
•Bug finding: Many approaches have been introduced to find bugs in storage system code [47, 163, 164]. We note that the type of failure analysis we are advocating is not simply a search for “bugs”; rather, we also seek to understand the recovery policy of the system under analysis at a high level. In the past, we have shown that high-level analyses unearth problems not found by bug-finding tools [31, 34, 133].
D.9 Research and Management Plan
In this section, we first discuss the global research plan we will pursue, and then discuss the division of labor across
research assistants.
D.9.1 Global Research Plan
We believe strongly in an iterative approach to systems building via analysis, design, implementation, evaluation, and
dissemination. We believe that this cycle is crucial, allowing system components to evolve as we better understand the
relevant issues. We discuss these in turn.
Year 1
UC Berkeley (Postdoc1 for 1 year + RA1 for 2 years):
• Prototype DARE phase 2 (online monitoring) for HDFS: deal with log provenance, establish a list of online analyses.
• Prototype DARE phase 3 (executable specification) for HDFS: bootstrap spec-writing in Overlog into BOOM-FS.
• From RA2’s testing results, refine the two phases above.
UW Madison (RA2 for 3 years):
• Prototype DARE phase 1 (offline testing) for HDFS: insert failure points with AOP, auto-generate combinations of failures.
• Document deficiencies of existing recovery (for Postdoc1).
• Improve test performance with distributed failure schedulers.
Year 2
UC Berkeley:
• (RA1 will continue Postdoc1’s work)
• Evaluate our recovery domain-specific language (DSL) (verifiability, flexibility, performance, parallelism, robustness).
• Use this experience to develop more appropriate DSLs and design patterns for recovery.
UW Madison:
• Apply the three phases of DARE to Lustre.
• Develop techniques to improve failure coverage and intelligent sampling of test scenarios.
• Design online failure scheduling, and integrate this into Postdoc1’s monitoring framework.
Year 3
• Revisit the whole process again, identify, and tackle new interesting research challenges.
• Extend DARE to three other areas of cloud-related research: structured stores such as memcached [13] and Hadoop HBase (BigTable) [11], job management such as Hadoop MapReduce [10], and resource management such as Eucalyptus [7].
Table 2: Preliminary Task Breakdown.
•“By hand” analysis: Detailed protocols of existing large-scale file systems are not well documented. Thus, the initial stage is to analyze the source code “by hand”.
•Design: As we gain exposure to the details of existing systems, we can start sketching recovery specifications declaratively, and also envision new testing and monitoring tools. Every piece of software we plan to build has a serious design phase.
•Implementation: One of our group mantras is “learning through doing.” Thus, all of our research centers on implementation as the final goal; because this project is focused on finding real problems in real systems, we do not anticipate using simulation in any substantial manner.
•Evaluation: At all stages throughout the project, we will evaluate various components of DARE via controlled
micro-benchmarks. Measurements of system behavior guide our work, informing us which of our ideas have merit,
and which have unanticipated flaws.
•Dissemination: Our goal is to disseminate our findings quickly. Thus, one metric of success is to share useful findings, tools, and prototypes with the HDFS and Lustre communities at the end of each research year. We also believe that, beyond file systems, the vision of DARE should extend to other areas as well (e.g., databases, job/resource management). Hence, another metric of success is to achieve that in the final year (see Year 3 in Table 2).
D.9.2 Division of Labor
We request funding for one postdoctoral scholar at UC Berkeley for one year (Postdoc1), one graduate student researcher (GSR) at UC Berkeley for two years (RA1), and another GSR at UW Madison for three years (RA2). Table 2 presents a rough division of labor across the different components of the project. We emphasize that the bulk of the funding requested within this proposal goes to human costs; we have pre-existing access to the large-scale computing infrastructure described below.
We believe our Berkeley-Wisconsin collaboration will generate a fruitful project; PIs Hellerstein and Arpaci-Dusseau have co-authored significant papers before [25, 26, 28]. Furthermore, Postdoc1 and RA2 have been working together on some of the first-year items, and will continue to do so as described in Table 2. Exchanges of ideas and decision making will continue via phone/video conferencing and several visits.
D.10 Facilities for Data-Intensive Computing Research
As mentioned before, one of the goals of this project is to test recovery at scales beyond those considered in prior research. We believe that as systems grow in size, recovery must be tested at the same scale. Thus, we plan to use two kinds of facilities (medium and large scale). The first is our internal cluster of 32 Linux machines, each with a 2.2 GHz processor, 1 GB of memory, and 1 TB of disk. As we have exclusive access to this cluster at any time, we will test our prototypes on it before deploying them on a larger cluster.
For larger evaluations, we plan to use the Yahoo M45 [17] cluster and/or Amazon EC2 [1]. We have been granted access to the Yahoo M45 cluster, which has approximately 4,000 processor cores and 1.5 PB of disk. Hellerstein’s group has an intensive pre-existing collaboration with Yahoo Research, and an accompanying letter of support from their management documents both that relationship and their interest in supporting our work. In addition, we also plan to use Amazon EC2, which allows users to rent computers on which to run their own applications; researchers are able to rent 200 nodes cheaply [161]. This fits well with scalability-related evaluations, and thus we allocate a modest budget ($1000/year) for using Amazon EC2.
Our goal is to get the most results at the least cost. Thus, we will explore which of the large-cluster options will give us the most benefit. We also want to acknowledge that human processes should not interfere with large-scale, online evaluation (i.e., everything should be automated). This frames our approach in this proposal. For example, our fault-injection framework is integrated as part of the system under test, and hence failures can be scheduled automatically; we also plan to gather runtime information in as much detail as possible (e.g., in a recovery database, sketched below) so that we can analyze our findings offline.
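As a small sketch of what we mean by a recovery database (the schema below is a placeholder assumption, not a committed design), each injected failure and each observed recovery action could be appended to a simple store and queried offline:

```python
# Minimal sketch of a "recovery database": record injected failures and
# observed recovery events for offline analysis. The schema is a placeholder
# assumption, not a committed design.
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    ts      REAL,   -- wall-clock timestamp
    run_id  TEXT,   -- which experiment or test run
    kind    TEXT,   -- 'failure_injected' or 'recovery_observed'
    target  TEXT,   -- e.g., the datanode or block affected
    detail  TEXT    -- free-form description
);
"""

def open_db(path="recovery.db"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record(conn, run_id, kind, target, detail=""):
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                 (time.time(), run_id, kind, target, detail))
    conn.commit()

def recovery_latency(conn, run_id, target):
    """Offline query: time from first injected failure to first observed recovery."""
    rows = conn.execute(
        "SELECT kind, MIN(ts) FROM events WHERE run_id=? AND target=? GROUP BY kind",
        (run_id, target)).fetchall()
    times = dict(rows)
    if "failure_injected" in times and "recovery_observed" in times:
        return times["recovery_observed"] - times["failure_injected"]
    return None
```

Keeping this record outside the system under test lets us re-run offline analyses (e.g., recovery latency per failure combination) long after an experiment has finished.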
D.11 Broader Impact and Education Plan
In this section we first describe the impact of this project on undergraduates; we strongly believe that these students are critical to the future of science and engineering in our country, and thus we have been working to engage as many of them as possible in various ways. We then close with the broader impact of this project.
•Teaching impact: In the past, Arpaci-Dusseau has incorporated modern distributed file systems such as the Google File System into her undergraduate Operating Systems curriculum, and Hellerstein has incorporated MapReduce into his undergraduate Database Systems curriculum. As a result, the students were deeply excited to understand the science behind the latest technology. Furthermore, unlike in the past, today’s distributed file systems are surprisingly easy to install; a first-timer can install HDFS on multiple machines in an undergraduate computer lab in less than ten minutes. This ease encourages us to use HDFS for future course projects.
•Training impact: As large-scale reliability research is fairly new, we believe we will find and fix many vulnerabilities. We plan to select simple but fundamental fixes that can be repeated by undergraduate students. Thus, rather than building “toy” systems in all of their course projects, they can experience fixing a real system. We believe this is good practice before they join industry. Furthermore, we also realize that students typically implement a single design predefined by the instructor. Although this is a great starting point, we feel that students tend to take for granted the fundamental decisions behind the design. Thus, we also plan to assign a project where students use a high-level language (probably ours) to rapidly implement more than one design and explore performance-reliability tradeoffs.
•Undergraduate independent study: Due to the significant number of large-scale file systems that we wish to analyze, the DARE project is particularly well suited to allowing undergraduate students to gain exposure to true systems research. More specifically, we will pair undergraduates with graduate students in our groups so that exchanges of knowledge are possible. This gives them a chance to work on cutting-edge systems research, and hence makes them better prepared for industry or national labs. At both UW Madison and UC Berkeley, our efforts along this direction have been quite successful; undergraduate students have published research papers, interned at national labs, and gone on to graduate school and prominent positions in academia and industry.
•Community outreach and dissemination: One of our priorities is to generate novel research results and publish
them in the leading scholarly venues. In the past, we have focused upon systems and databases venues such as
SOSP, OSDI, FAST, USENIX, SIGMOD, VLDB, ICDE, and related conferences, and plan to continue along those
lines here; we feel this is one of the best avenues to expose our ideas to our academic and industrial peers. We are
also committed to developing software that is released as open source and well documented. In the past we
have worked with Linux developers to move code into the Linux source base. For example, some of our work in
improving Linux file systems has been adopted into the main kernel, specifically in the Linux ext4 file system [133],
and many bugs and design flaws we have found have already been fixed by developers [78]. We have also worked
extensively in recent years with Yahoo, the main locus of development behind Hadoop, HDFS, and PIG. Our recent
open-source extensions to those systems [59] have generated significant interest in the open source community, and
we are working aggressively to get them incorporated into the open-source codebase. In the past, we have also made
major contributions to open source systems like PostgreSQL and PostGIS.
•Technology transfer: We place significant value on technology transfer; some of our past work has led to direct
industrial impact. For example, the ideas behind the D-GRAID storage system have been adopted by EMC in their
Centera product line [145]. Further, we also worked with NetApp to transfer some of our earlier fault injection technology into their product development and testing cycle [133]; this technology transfer led directly to follow-on research
in which we found design flaws in existing commercial RAID systems, which has spurred some companies to fix
the problems we found [104]. In recent years we have been involved in the scale-out of both core database infrastructure [157] and the use of parallel databases for online advertising via rich statistical methods at unprecedented scales [57]. Earlier research has been transferred to practice in database products from IBM [105], Oracle [146], and
Informix [86, 103].
•Benefits to society: We believe the DARE project falls in the same direction set by federal agencies; a recent HEC FSIO workshop declared “research in reliability at scale” and “scalable file system architectures” as topics that are very important and greatly in need of research [36]. Moreover, our analysis of and potential improvements to parallel file systems will benefit national scientists who use these systems on a daily basis. We also believe that our project will benefit society at large; in the near future, users will store all of their data (emails, work documents, generations of family photos and videos, etc.) in the “cloud”. Through the DARE project, we will build the next generation of large-scale Internet and parallel file systems that will meet the performance and reliability demands of today’s society.
D.12 Prior Results
PI Joseph M. Hellerstein has been involved in a number of NSF awards. Three are currently active, but entering
their final year: MUNDO: Managing Uncertainty in Networks with Declarative Overlays (NGNI-Medium, 09/08
- 08/10, $303,872), SCAN: Statistical Collaborative Analysis of Networks (NeTS-NBD, 0722077, 01/08 - 12/10,
$249,000), and Dynamic Meta-Compilation in Networked Information Systems (III-COR 0713661, 09/07 - 08/10,
$450,000). The remainder are completed: Adaptive Dataflow: Eddies, SteMs and FLuX (0208588, 08/02 - 07/05,
$299,998), Query Processing in Structured Peer-to-Peer Networks (0209108, 08/02 - 07/05, $179,827), Robust Large
Scale Distributed System (ITR 5710001486, 10/02 - 09/07, $2,840,869), Data on the Deep Web: Queries, Trawls,
Policies and Countermeasures (ITR 0205647, 10/02 - 09/07, $1,675,000), and Mining the Deep Web for Economic
Data (SGER 0207603, 01/02 - 06/03, $98,954).
The three current awards all share at their core the design and use of declarative languages and runtimes, in
different contexts. These languages and runtimes form a foundation for the declarative programming aspects of the
work we propose here, and have supported many of the recent results cited earlier [23, 53, 58, 59] as well as a
variety of work that is out of the scope of this proposal. The emphasis on scalable storage recovery in this proposal
is quite different from the prior grants, which focused on Internet security (SCAN), distributed machine learning
(MUNDO), and the internals of compilers for declarative languages (III-COR 713661). Prior awards have led to a
long list of research results that have been recognized with awards. A series of students have been trained with this
support, including graduate students who have gone on to distinguished careers in academia (MIT, Penn, Maryland)
and industrial research (IBM, Microsoft, HP), and undergraduates who have continued on into graduate school at
Berkeley, Stanford and elsewhere.
Co-PI Andrea C. Arpaci-Dusseau has been involved with a number of NSF-related efforts. The active ones are:
HaRD: The Wisconsin Hierarchically-Redundant, Decoupled Storage Project (HEC, 09/09 - 08/10, $221,381), Skeptical Systems (CSR-DMSS-SM, 08/08 - 08/11, $420,000), and WASP: Wisconsin Arrays in Software Project (CPA-CSA, 07/08 - 08/10, $217,978). The completed ones are: PASS: Formal Failure Analysis for Storage Systems (HEC,
09/06 - 08/09, $951,044), SAFE: Semantic Failure Analysis and Management (CSR–PDOS, 09/05 - 08/08, $700,000),
WISE: Wisconsin Semantic Disks (ITR-0325267, 08/03 - 08/07, $600,000), Robust Data-intensive Cluster Programming (CAREER proposal, CCR-0092840, 09/01 - 08/06, $250,000), WiND: Robust Adaptive Network-Attached Storage (CCR-0098274, 09/01 - 08/04, $310,000), and Wisconsin DOVE (NGS-0103670, 09/01 - 08/04, $596,740).
The most relevant awards are the PASS, SAFE, and WiND projects. Through the PASS and SAFE projects, we
have gained expertise on the topic of storage reliability [31, 33, 34, 35, 64, 65, 75, 76, 77, 78, 104, 131, 132, 133,
137, 144, 145, 147]. We began by analyzing how commodity file systems react to a range of more realistic disk
failures [131, 133]. We have shown that commodity file system failure policies (such as those in Linux ext3, ReiserFS,
and IBM JFS) are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial
disk failures. We also have extended our analysis techniques to distributed storage systems [75], RAID systems [65],
commercial file systems [33, 34], and virtual memory systems [31]. Most recently, we have employed formal methods
(model checking and static analysis) to find flaws in file system code [78, 137] and RAID designs [104]. Finally, we
were able to propose novel designs for building more reliable file systems; one design received a best paper award [35],
and the other two appeared in top conferences [76, 77].
Although most of our recent work has focused on the analysis of reliability in storage systems, our earlier efforts in I/O focused heavily on performance [39, 40, 156]. We have on more than one occasion held the world record in external (disk-to-disk) sorting [25]. We will thus bring this older expertise to bear on the DARE project.
References
[1] Amazon EC2. http://aws.amazon.com/ec2.
[2] Apache Logging Services Project. http://logging.apache.org/log4j/.
[3] Applications and organizations using Hadoop/HDFS. http://wiki.apache.org/hadoop/PoweredBy.
[4] AspectC. www.aspectc.org.
[5] AspectJ. www.eclipse.org/aspectj.
[6] CloudStore / KosmosFS. http://kosmosfs.sourceforge.net.
[7] Eucalyptus. http://www.eucalyptus.com.
[8] GFS/GFS2. http://sources.redhat.com/cluster/gfs.
[9] GlusterFS. www.gluster.org.
[10] Hadoop MapReduce. http://hadoop.apache.org/mapreduce.
[11] HBase. http://hadoop.apache.org/hbase.
[12] HDFS User Guide. http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html.
[13] Memcached. http://memcached.org.
[14] MogileFS. www.danga.com/mogilefs.
[15] Tahoe-LAFS. http://allmydata.org/trac/tahoe.
[16] XtreemFS. www.xtreemfs.org.
[17] Yahoo M45. http://research.yahoo.com/node/1884.
[18] Daniel Abadi. Data Management in the Cloud: Limitations and Opportunities. IEEE Data Engineering Bulletin, 32(1):3–12,
March 2009.
[19] Daniel J. Abadi, Michael J. Cafarella, Joseph M. Hellerstein, Donald Kossmann, Samuel Madden, and Philip A. Bernstein.
How Best to Build Web-Scale Data Managers? A Panel Discussion. PVLDB, 2(2), 2009.
[20] Parag Agrawal, Daniel Kifer, and Christopher Olston. Scheduling Shared Scans of Large Data Files. In Proceedings of the
34th International Conference on Very Large Data Bases (VLDB ’08), Auckland, New Zealand, July 2008.
[21] Marcos Aguilera, Wojciech Golab, and Mehul Shah. A Practical Scalable Distributed B-Tree. In Proceedings of the 34th
International Conference on Very Large Data Bases (VLDB ’08), Auckland, New Zealand, July 2008.
[22] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance
Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating Systems
Principles (SOSP ’03), Bolton Landing, New York, October 2003.
[23] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell C Sears. BOOM: DataCentric Programming in the Datacenter. UC. Berkeley Technical Report No. UCB/EECS-2009-113, 2009.
[24] Dorian C. Arnold and Barton P. Miller. State Compensation: A Scalable Failure Recovery Model for Tree-based Overlay
Networks. UW Technical Report, 2009.
[25] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and Dave Patterson. HighPerformance Sorting on Networks of Workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on
Management of Data (SIGMOD ’97), Tucson, Arizona, May 1997.
[26] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and Dave Patterson. Searching for the Sorting Record: Experiences in Tuning NOW-Sort. In The 1998 Symposium on Parallel and Distributed Tools
(SPDT ’98), Welches, Oregon, August 1998.
[27] Remzi H. Arpaci-Dusseau. Run-Time Adaptation in River. ACM Transactions on Computer Systems (TOCS), 21(1):36–86,
February 2003.
[28] Remzi H. Arpaci-Dusseau, Andrea C. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and Dave Patterson. The
Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs. In Proceedings of the 4th
International Symposium on High Performance Computer Architecture (HPCA-4), Las Vegas, Nevada, February 1998.
[29] Algirdas A. Avižienis. The Methodology of N-Version Programming. In Michael R. Lyu, editor, Software Fault Tolerance,
chapter 2. John Wiley & Sons Ltd., 1995.
[30] Saurabh Bagchi, Gautam Kar, and Joseph L. Hellerstein. Dependency Analysis in Distributed Systems Using Fault Injection.
In 12th International Workshop on Distributed Systems, Nancy, France, October 2001.
[31] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Dependability Analysis of Virtual Memory Systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’06),
Philadelphia, Pennsylvania, June 2006.
[32] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. An Analysis of Latent Sector Errors
in Disk Drives. In Proceedings of the 2007 ACM SIGMETRICS Conference on Measurement and Modeling of Computer
Systems (SIGMETRICS ’07), San Diego, California, June 2007.
[33] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. ArpaciDusseau. An Analysis of Data Corruption in the Storage Stack. In Proceedings of the 6th USENIX Symposium on File and
Storage Technologies (FAST ’08), pages 223–238, San Jose, California, February 2008.
[34] Lakshmi N. Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and
Michael M. Swift. Systematically Benchmarking the Effects of Disk Pointer Corruption. In Proceedings of the International
Conference on Dependable Systems and Networks (DSN ’08), Anchorage, Alaska, June 2008.
[35] Lakshmi N. Bairavasundaram, Swaminathan Sundararaman, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
Tolerating File-System Mistakes with EnvyFS. In Proceedings of the USENIX Annual Technical Conference (USENIX ’09),
San Diego, California, June 2009.
[36] Marti Bancroft, John Bent, Evan Felix, Gary Grider, James Nunez, Steve Poole, Rob Ross, Ellen Salmon, and Lee Ward.
High End Computing Interagency Working Group (HECIWG) HEC File Systems and I/O 2008 Roadmaps. http://
institutes.lanl.gov/hec-fsio/docs/HEC-FSIO-FY08-Gaps RoadMap.pdf.
[37] Keith Bare, Michael P. Kasick, Soila Kavulya, Eugene Marinelli, Xinghao Pan, Jiaqi Tan, Rajeev Gandhi, and Priya
Narasimhan. ASDF: Automated and Online Fingerpointing for Hadoop. CMU PDL Technical Report CMU-PDL-08-104,
2008.
[38] Luiz Barroso, Jeffrey Dean, and Urs Hoelzle. Web search for a planet: The Google cluster architecture. IEEE Micro,
23(2):22–28, 2003.
[39] John Bent, Doug Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit Control in a
Batch-Aware Distributed File System. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI ’04), pages 365–378, San Francisco, California, March 2004.
[40] John Bent, Venkateshwaran Venkataramani, Nick Leroy, Alain Roy, Joseph Stanley, Andrea C. Arpaci-Dusseau, Remzi H.
Arpaci-Dusseau, and Miron Livny. Flexibility, Manageability, and Performance in a Grid Storage Appliance. In Proceedings
of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 11), pages 3–12, Edinburgh,
Scotland, July 2002.
[41] Eric J. Bina and Perry A. Emrath. A Faster fsck for BSD Unix. In Proceedings of the USENIX Winter Technical Conference
(USENIX Winter ’89), San Diego, California, January 1989.
[42] Ken Birman, Gregory Chockler, and Robbert van Renesse. Towards a Cloud Computing Research Agenda. ACM SIGACT
News, 40(2):68–80, June 2009.
[43] Dhruba Borthakur. HDFS Architecture. http://hadoop.apache.org/common/docs/current/hdfs_design.html.
[44] Peter J. Braam and Michael J. Callahan. Lustre: A SAN File System for Linux. www.lustre.org/docs/luswhite.pdf, 2005.
[45] A. Brown, G. Kar, and A. Keller. An Active Approach to Characterizing Dynamic Dependencies for Problem Determination
in a Distributed Environment. In The 7th IFIP/IEEE International Symposium on Integrated Network Management, May
2001.
[46] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), Seattle, Washington, November 2006.
[47] Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. KLEE: Unassisted and Automatic Generation of High-Coverage Tests
for Complex Systems Programs. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation
(OSDI ’08), San Diego, California, December 2008.
[48] George Candea and Armando Fox. Crash-Only Software. In The Ninth Workshop on Hot Topics in Operating Systems
(HotOS IX), Lihue, Hawaii, May 2003.
[49] George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Microreboot – A Technique for
Cheap Recovery. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04),
pages 31–44, San Francisco, California, December 2004.
[50] Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for linux clusters. In
Atlanta Linux Showcase (ALS ’00), Atlanta, Georgia, October 2000.
[51] Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: Easy
and Efficient Parallel Processing of Massive Data Sets. In Proceedings of the 34th International Conference on Very Large
Data Bases (VLDB ’08), Auckland, New Zealand, July 2008.
[52] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An Empirical Study of Operating System
Errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pages 73–88, Banff,
Canada, October 2001.
[53] David Chu, Joseph M. Hellerstein, and Tsung te Lai. Optimizing declarative sensornets. In Proceedings of the 6th International Conference on Embedded Networked Sensor Systems (SenSys ’08), Raleigh, NC, November 2008.
[54] David Chu, Lucian Popa, Arsalan Tavakoli, Joseph M. Hellerstein, Philip Levis, Scott Shenker, and Ion Stoica. The design
and implementation of a declarative sensor network system. In Proceedings of the 5th International Conference on Embedded
Networked Sensor Systems (SenSys ’07), Sydney, Australia, November 2007.
[55] Brent N. Chun, Joseph M. Hellerstein, Ryan Huebsch, Shawn R. Jeffery, Boon Thau Loo, Sam Mardanbeigi, Timothy
Roscoe, Sean C. Rhea, Scott Shenker, and Ion Stoica. Querying at Internet-Scale. In Proceedings of the ACM SIGMOD
International Conference on Management of Data (SIGMOD ’04), Paris, France, June 2004.
[56] Yvonne Coady, Gregor Kiczales, Mike Feeley, and Greg Smolyn. Using AspectC to Improve the Modularity of Path-Specific
Customization in Operating System Code. In Proceedings of the Joint European Software Engineering Conference (ESEC)
and 9th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE-9), September 2001.
[57] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. MAD Skills: New Analysis Practices
for Big Data. PVLDB, 2(2):1481–1492, 2009.
[58] Tyson Condie, David Chu, Joseph M. Hellerstein, and Petros Maniatis. Evita raced: metacompilation for declarative networks. PVLDB, 1(1), 2008.
[59] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell C Sears. MapReduce
Online. UC. Berkeley Technical Report No. UCB/EECS-2009-136, 2009.
[60] Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, and Roy H. Campbell. CuriOS: Improving Reliability through
Operating System Structure. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI
’08), San Diego, California, December 2008.
[61] Jeff Dean. Panel: Desired Properties in a Storage System (For building large-scale, geographically-distributed services). In
Workshop on Hot Topics in Storage and File Systems (HotStorage ’09), Big Sky, Montana, October 2009.
[62] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the
6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 137–150, San Francisco, California,
December 2004.
[63] Giuseppe Decandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami
Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s Highly Available Key-Value Store. In Proceedings
of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), Stevenson, Washington, October 2007.
[64] Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Journal-guided Resynchronization for
Software RAID. In Proceedings of the 4th USENIX Symposium on File and Storage Technologies (FAST ’05), pages 87–
100, San Francisco, California, December 2005.
[65] Timothy E. Denehy, John Bent, Florentina I. Popovici, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Deconstructing Storage Arrays. In Proceedings of the 11th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS XI), pages 59–71, Boston, Massachusetts, October 2004.
[66] David J. DeWitt and Jim Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun.
ACM, 35(6):85–98, 1992.
[67] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as Deviant Behavior: A General
Approach to Inferring Errors in Systems Code. In Proceedings of the 18th ACM Symposium on Operating Systems Principles
(SOSP ’01), pages 57–72, Banff, Canada, October 2001.
[68] Eli Gafni and Leslie Lamport. Disk Paxos. In International Symposium on Distributed Computing (DISC ’00), Toledo,
Spain, October 2000.
[69] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the 19th ACM
Symposium on Operating Systems Principles (SOSP ’03), pages 29–43, Bolton Landing, New York, October 2003.
[70] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: Directed Automated Random Testing. In Proceedings of the
ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI ’05), Chicago, Illinois,
June 2005.
[71] Jim Gray and Catharine Van Ingen. Empirical Measurements of Disk Failure Rates and Error Rates. Microsoft Research
Technical Report MSR-TR-2005-96, December 2005.
[72] Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler. Scalable and Distributed Data Structures for
Internet Service Construction. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation
(OSDI ’00), San Diego, California, October 2000.
[73] Robert L. Grossman and Yunhong Gu. On the Varieties of Clouds for Data Intensive Computing. IEEE Data Engineering
Bulletin, 32(1):44–50, March 2009.
[74] Ramakrishna Gummadi, Nupur Kothari, Todd D. Millstein, and Ramesh Govindan. Declarative failure recovery for sensor
networks. In Proceedings of the 6th International Conference on Aspect-Oriented Software Development (AOSD ’07),
Vancouver, Canada, March 2007.
[75] Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Jiri Schindler. Deconstructing Commodity Storage Clusters. In Proceedings of the 32nd Annual International Symposium on Computer Architecture
(ISCA ’05), pages 60–73, Madison, Wisconsin, June 2005.
[76] Haryadi S. Gunawi, Vijayan Prabhakaran, Swetha Krishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
Improving File System Reliability with I/O Shepherding. In Proceedings of the 21st ACM Symposium on Operating Systems
Principles (SOSP ’07), pages 283–296, Stevenson, Washington, October 2007.
[77] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A Declarative
File System Checker. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI ’08),
San Diego, California, December 2008.
[78] Haryadi S. Gunawi, Cindy Rubio-Gonzalez, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Ben Liblit. EIO:
Error Handling is Occasionally Correct. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies
(FAST ’08), pages 207–222, San Jose, California, February 2008.
[79] James Hamilton. On Designing and Deploying Internet-Scale Services. In Proceedings of the 21st Large Installation System
Administration Conference (LISA ’07), Dallas, Texas, November 2007.
[80] James Hamilton. Where Does the Power Go in High-Scale Data Centers? Keynote at USENIX 2009, June 2009.
[81] HDFS JIRA. http://issues.apache.org/jira/browse/HDFS.
[82] HDFS JIRA. A bad entry in namenode state when starting up. https://issues.apache.org/jira/browse/
HDFS-384.
[83] HDFS JIRA. Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes. http://issues.
apache.org/jira/browse/HADOOP-572.
[84] HDFS JIRA. Replication should be decoupled from heartbeat. http://issues.apache.org/jira/browse/
HDFS-150.
[85] Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Wei Lin, Bing Su, Hongyi Wang, and Lidong Zhou. Wave Computing
in the Cloud. In the 12th Workshop on Hot Topics in Operating Systems (HotOS XII), Monte Verita, Switzerland, May 2009.
[86] Joseph M. Hellerstein, Ron Avnur, and Vijayshankar Raman. Informix under CONTROL: Online Query Processing. Data
Min. Knowl. Discov., 4(4):281–314, 2000.
[87] Alyssa Henry. Cloud Storage FUD: Failure and Uncertainty and Durability. In Proceedings of the 7th USENIX Symposium
on File and Storage Technologies (FAST ’09), San Francisco, California, February 2009.
[88] Val Henson. The Many Faces of fsck. http://lwn.net/Articles/248180/, September 2007.
[89] Val Henson, Zach Brown, Theodore Ts’o, and Arjan van de Ven. Reducing fsck time for ext2 file systems. In Ottawa Linux
Symposium (OLS ’06), Ottawa, Canada, July 2006.
[90] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system
reliability and repair. In IEEE 2nd Workshop on Hot Topics in System Dependability (HotDep ’06), Seattle, Washington,
November 2006.
[91] Mark Holland, Garth A. Gibson, and Daniel P. Siewiorek. Fast, on-line failure recovery in redundant disk arrays. In Proceedings of the 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), pages 421–433, Toulouse, France,
June 1993.
[92] Ling Huang, Minos N. Garofalakis, Joseph M. Hellerstein, Anthony D. Joseph, and Nina Taft. Toward sophisticated detection
with distributed triggers. In Proceedings of the 2nd Annual ACM Workshop on Mining Network Data (MineNet ’06), Pisa,
Italy, September 2006.
[93] Ling Huang, XuanLong Nguyen, Minos N. Garofalakis, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph,
and Nina Taft. Communication-Efficient Online Detection of Network-Wide Anomalies. In The 26th IEEE International
Conference on Computer Communications (INFOCOM ’07), Anchorage, Alaska, May 2007.
[94] Ryan Huebsch, Brent N. Chun, Joseph M. Hellerstein, Boon Thau Loo, Petros Maniatis, Timothy Roscoe, Scott Shenker,
Ion Stoica, and Aydan R. Yumerefendi. The Architecture of PIER: an Internet-Scale Query Processor. 2005.
[95] Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis
in enterprise networks. In Proceedings of SIGCOMM ’09, Barcelona, Spain, August 2009.
[96] Hannu H. Kari. Latent Sector Faults and Reliability of Disk Arrays. PhD thesis, Helsinki University of Technology, September 1997.
[97] Hannu H. Kari, H. Saikkonen, and F. Lombardi. Detection of Defective Media in Disks. In The IEEE International Workshop
on Defect and Fault Tolerance in VLSI Systems, pages 49–55, Venice, Italy, October 1993.
[98] Kimberly Keeton, Dirk Beyer, Ernesto Brau, Arif Merchant, Cipriano Santos, and Alex Zhang. On the Road to Recovery:
Restoring Data after Disasters. In Proceedings of the EuroSys Conference (EuroSys ’06), Leuven, Belgium, April 2006.
[99] Kimberly Keeton and Arif Merchant. A Framework for Evaluating Storage System Dependability. In Proceedings of the
International Conference on Dependable Systems and Networks (DSN ’04), Florence, Italy, June 2004.
[100] Kimberly Keeton, Cipriano Santos, Dirk Beyer, Jeffrey Chase, and John Wilkes. Designing for disasters. In Proceedings of
the 3rd USENIX Symposium on File and Storage Technologies (FAST ’04), San Francisco, California, April 2004.
[101] Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta. On Availability of Intermediate Data in Cloud Computations.
In the 12th Workshop on Hot Topics in Operating Systems (HotOS XII), Monte Verita, Switzerland, May 2009.
[102] Phil Koopman. What’s Wrong with Fault Injection as a Dependability Benchmark? In Workshop on Dependability Benchmarking (in conjunction with DSN-2002), Washington DC, July 2002.
[103] Marcel Kornacker. High-Performance Extensible Indexing. In Proceedings of the 25th International Conference on Very
Large Databases (VLDB ’99), San Francisco, CA, September 1999.
[104] Andrew Krioukov, Lakshmi N. Bairavasundaram, Garth R. Goodson, Kiran Srinivasan, Randy Thelen, Andrea C. ArpaciDusseau, and Remzi H. Arpaci-Dusseau. Parity Lost and Parity Regained. In Proceedings of the 6th USENIX Symposium
on File and Storage Technologies (FAST ’08), pages 127–141, San Jose, California, February 2008.
[105] T. Y. Cliff Leung, Hamid Pirahesh, Joseph M. Hellerstein, and Praveen Seshadri. Query rewrite optimization rules in IBM
DB2 universal database. Readings in Database Systems (3rd ed.), pages 153–168, 1998.
[106] Xin Li, Michael C. Huang, and Kai Shen. An Empirical Study of Memory Hardware Errors in A Server Farm. In The 3rd
Workshop on Hot Topics in System Dependability (HotDep ’07), Edinburgh, UK, June 2007.
[107] Boon Thau Loo, Tyson Condie, Minos Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking. Commun. ACM, 52(11):87–95, 2009.
[108] Boon Thau Loo, Tyson Condie, Minos N. Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking: language, execution and optimization. In Proceedings of
the ACM SIGMOD International Conference on Management of Data (SIGMOD ’06), Chicago, Illinois, June 2006.
[109] Boon Thau Loo, Tyson Condie, Joseph M. Hellerstein, Petros Maniatis, Timothy Roscoe, and Ion Stoica. Implementing
Declarative Overlays. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), Brighton,
United Kingdom, October 2005.
[110] Boon Thau Loo, Joseph M. Hellerstein, Ryan Huebsch, Scott Shenker, and Ion Stoica. Enhancing p2p file-sharing with an
internet-scale query processor. In Proceedings of the 30th International Conference on Very Large Databases (VLDB ’04),
Toronto, Canada, September 2004.
[111] Boon Thau Loo, Joseph M. Hellerstein, Ion Stoica, and Raghu Ramakrishnan. Declarative routing: extensible routing with
declarative queries. In Proceedings of the ACM SIGCOMM 2005 Conference on Applications, Technologies, Architectures,
and Protocols for Computer Communications (SIGCOMM ’05), Philadelphia, Pennsylvania, August 2005.
[112] Peter Lyman and Hal R. Varian. How Much Information? 2003. www2.sims.berkeley.edu/research/projects/how-much-info-2003.
[113] Om Malik. When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data. http://gigaom.com.
[114] Mike Mammarella, Shant Hovsepian, and Eddie Kohler. Modular data storage with Anvil. In Proceedings of the 22nd ACM
Symposium on Operating Systems Principles (SOSP ’09), Big Sky, Montana, October 2009.
[115] Richard P. Martin, Amin M. Vahdat, David E. Culler, and Thomas E. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the 24th Annual International Symposium on Computer
Architecture (ISCA ’97), pages 85–97, Denver, Colorado, May 1997.
[116] Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Parallel Computing, 30(7), July 2004.
[117] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. Fsck - The UNIX File System Check
Program. Unix System Manager’s Manual - 4.3 BSD Virtual VAX-11 Version, April 1986.
[118] Lucas Mearian. Facebook temporarily loses more than 10% of photos in hard drive failure. www.computerworld.com.
[119] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter M. Schwarz. ARIES: A Transaction Recovery
Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database
Syst., 17(1):94–162, 1992.
[120] Curt Monash. eBay’s Two Enormous Data Warehouses. DBMS2 (weblog), April 2009. http://www.dbms2.com/
2009/04/30/ebays-two-enormous-data-warehouses.
[121] John Oates. Bank fined 3 million pounds sterling for data loss, still not taking it seriously. www.theregister.co.uk/2009/07/22/fsa_hsbc_data_loss.
[122] Adam J. Oliner, Alex Aiken, and Jon Stearley. Alert Detection in System Logs. In Proceedings of the International
Conference on Data Mining (ICDM ’08), Pisa, Italy, December 2008.
[123] Adam J. Oliner and Jon Stearley. What supercomputers say: A study of five system logs. In Proceedings of the International
Conference on Dependable Systems and Networks (DSN ’07), Edinburgh, UK, June 2007.
[124] Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Ganesha: Black-box Fault Diagnosis for
MapReduce Environments. In the 2nd Workshop on Hot Topics in Measurement and Modeling of Computer Systems (HotMetrics ’09), Seattle, Washington, June 2009.
[125] Biswanath Panda, Joshua Herbach, Sugato Basu, and Roberto Bayardo. PLANET: Massively Parallel Learning of Tree
Ensembles with MapReduce. In Proceedings of the 35th International Conference on Very Large Data Bases (VLDB ’09),
Lyon, France, August 2009.
[126] Shankar Pasupathy. Personal Communication from Shankar Pasupathy of NetApp, 2009.
[127] David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez, Armando Fox,
Emre Kiciman, Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah
Treuhaft. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Technical Report
CSD-02-1175, U.C. Berkeley, March 2002.
[128] J. Kent Peacock, Ashvin Kamaraju, and Sanjay Agrawal. Fast Consistency Checking for the Solaris File System. In
Proceedings of the USENIX Annual Technical Conference (USENIX ’98), pages 77–89, New Orleans, Louisiana, June 1998.
[129] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andre Barroso. Failure Trends in a Large Disk Drive Population. In
Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST ’07), pages 17–28, San Jose, California,
February 2007.
[130] Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard. DryadInc: Reusing Work in Large-scale Computations. In Workshop
on Hot Topics in Cloud Computing (HotCloud ’09), San Diego, California, June 2009.
[131] Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Model-Based Failure Analysis of Journaling File Systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’05), pages
802–811, Yokohama, Japan, June 2005.
[132] Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Model-Based Failure Analysis of Journaling File Systems. To appear in IEEE Transactions on Dependable and Secure Computing (TDSC), 2006.
[133] Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and
Remzi H. Arpaci-Dusseau. IRON File Systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), pages 206–220, Brighton, United Kingdom, October 2005.
[134] Ari Rabkin, Andy Konwinski, Mac Yang, Jerome Boulon, Runping Qi, and Eric Yang. Chukwa: a large-scale monitoring
system. In Cloud Computing and Its Applications (CCA ’08), Chicago, IL, October 2008.
[135] American Data Recovery. Data loss statistics. http://www.californiadatarecovery.com/content/adr
loss stat.html.
[136] Frederick Reiss and Joseph M. Hellerstein. Declarative network monitoring with an underprovisioned query processor. In
Proceedings of the 22nd International Conference on Data Engineering (ICDE ’06), Atlanta, GA, April 2006.
[137] Cindy Rubio-Gonzalez, Haryadi S. Gunawi, Ben Liblit, Remzi H. Arpaci-Dusseau, and Andrea C. Arpaci-Dusseau. Error
Propagation Analysis for File Systems. In Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language
Design and Implementation (PLDI ’09), Dublin, Ireland, June 2009.
[138] Bianca Schroeder and Garth Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to
you? In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST ’07), pages 1–16, San Jose,
California, February 2007.
[139] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce Baumgart, Wayne Flagg, Catherine van Ingen, Kobus Joste, Mark
Manasse, and Mehul Shah. Disk failure investigations at the internet archive. In NASA/IEEE Conference on Mass Storage
Systems and Technologies (MSST) Work in Progress Session, 2006.
[140] Russell Sears and Eric A. Brewer. Segment-based recovery: Write ahead logging revisited. PVLDB, 2(1):490–501, 2009.
[141] Simon CW See. Data Intensive Computing. In Sun Preservation and Archiving Special Interest Group (PASIG ’09), San
Francisco, California, October 2009.
[142] Mehul A. Shah, Joseph M. Hellerstein, and Eric A. Brewer. Highly-Available, Fault-Tolerant, Parallel Dataflows. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’04), Paris, France, June 2004.
[143] Dennis Shasha. Biocomputational Puzzles: Data, Algorithms, and Visualization. In Invited Talk at the 11th International
Conference on Extending Database Technology (EDBT ’08), Nantes, France, March 2008.
[144] Muthian Sivathanu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Somesh Jha. A Logic of File Systems.
In Proceedings of the 4th USENIX Symposium on File and Storage Technologies (FAST ’05), pages 1–15, San Francisco,
California, December 2005.
[145] Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Improving Storage
System Availability with D-GRAID. In Proceedings of the 3rd USENIX Symposium on File and Storage Technologies (FAST
’04), pages 15–30, San Francisco, California, April 2004.
[146] Michael Stonebraker and Joseph M. Hellerstein. Content integration for e-business. In Proceedings of the 2001 ACM
SIGMOD International Conference on Management of Data (SIGMOD ’01), Santa Barbara, California, May 2001.
[147] Sriram Subramanian, Yupu Zhang, Rajiv Vaidyanathan, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, Remzi H. ArpaciDusseau, and Jeffrey F. Naughton. Impact of Disk Corruption on Open-Source DBMS. In Proceedings of the 26th International Conference on Data Engineering (ICDE ’10), Long Beach, California, March 2010.
[148] Rajesh Sundaram. The Private Lives of Disk Drives. www.netapp.com/go/techontap/matl/sample/0206tot
resiliency.html, February 2006.
[149] John D. Sutter. A trip into the secret, online ’cloud’. www.cnn.com/2009/TECH/11/04/cloud.computing.
hunt/index.html.
[150] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Reliability of Commodity Operating Systems. In
Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), Bolton Landing, New York, October
2003.
[151] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Recovering device drivers. In Proceedings of the 6th Symposium
on Operating Systems Design and Implementation (OSDI ’04), pages 1–16, San Francisco, California, December 2004.
[152] Alexander Szalay and Jim Gray. 2020 Computing: Science in an exponential world. Nature, (440):413–414, March 2006.
[153] Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. SALSA: Analyzing Logs as StAte Machines.
In the 1st Workshop on the Analysis of System Logs (WASL ’08), San Diego, CA, September 2008.
[154] Hadoop Team. Fault Injection framework: How to use it, test using artificial faults, and develop new faults. http://
issues.apache.org.
[155] PVFS2 Development Team. PVFS Developer’s Guide. www.pvfs.org.
[156] Doug Thain, John Bent, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Pipeline and Batch
Sharing in Grid Workloads. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed
Computing (HPDC 12), pages 152–161, Seattle, Washington, June 2003.
[157] Florian M. Waas and Joseph M. Hellerstein. Parallelizing Extensible Query Optimizers. In Proceedings of the 2009 ACM
SIGMOD International Conference on Management of Data (SIGMOD ’09), Providence, Rhode Island, June 2009.
[158] Glenn Weinberg. The Solaris Dynamic File System. http://members.visi.net/~thedave/sun/DynFS.pdf,
2004.
[159] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. Scalable
Performance of the Panasas Parallel File System. In Proceedings of the 6th USENIX Symposium on File and Storage
Technologies (FAST ’08), San Jose, California, February 2008.
[160] Qin Xin, Ethan L. Miller, and Thomas Schwarz. Evaluation of distributed recovery in large-scale storage systems. In
Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC 13), Honolulu,
Hawaii, June 2004.
[161] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. Detecting Large-Scale System Problems by
Mining Console Logs. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP ’09), Big Sky,
Montana, October 2009.
[162] Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang,
and Lidong Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In Proceedings of the 6th
Symposium on Networked Systems Design and Implementation (NSDI ’09), Boston, Massachusetts, April 2009.
[163] Junfeng Yang, Can Sar, and Dawson Engler. EXPLODE: A Lightweight, General System for Finding Serious Storage
System Errors. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), Seattle,
Washington, November 2006.
[164] Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using Model Checking to Find Serious File System
Errors. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), San Francisco,
California, December 2004.
[165] Shaula Alexander Yemini, Shmuel Kliger, Eyal Mozes, Yechiam Yemini, and David Ohsie. High Speed and Robust Event
Correlation. IEEE Communications, 34(5):82–90, May 1996.
E Supplement: Postdoctoral Mentoring Plan
This proposal includes budget in the first year to support Dr. Haryadi Gunawi, a postdoctoral researcher in Hellerstein’s group at Berkeley. Gunawi did his Ph.D. at UW-Madison under the co-direction of co-PI Andrea Arpaci-Dusseau and her frequent collaborator Remzi Arpaci-Dusseau. Gunawi serves as a key technical bridge between the research agendas of the two PIs, and will naturally interact with and be mentored by both. Arpaci-Dusseau mentored Gunawi through his Ph.D., and the two have an established working relationship that evolved during that period. Gunawi will sit at Berkeley with Hellerstein’s group for the duration of his postdoctoral appointment.
The intent of the postdoctoral mentoring in this project is to give Gunawi the experience of defining and leading a research project, while also learning to collaborate with peers, senior researchers, industrial partners, and students at the graduate and undergraduate levels. As a first step in this process, Gunawi has already taken a lead role in defining and writing this proposal with his long-time mentor Arpaci-Dusseau and his newer collaborator Hellerstein. The proposal-writing process established a tone that we intend to maintain throughout the postdoc: Gunawi led the identification of research themes and drafted his ideas in text, Arpaci-Dusseau and Hellerstein commented and injected their ideas into Gunawi’s context, and Gunawi himself was responsible for synthesizing the mix of ideas into a cogent whole, both conceptually and textually.
At Berkeley, Gunawi will share office space with Hellerstein’s graduate students, a few steps from Hellerstein’s office. He will attend weekly meetings of Hellerstein’s research group, which include graduate students, fellow professors from other areas of computer science (e.g., Programming Languages and Machine Learning), and collaborators from industry (e.g., a weekly visiting collaborator from Yahoo Research). Gunawi will be asked to present occasional lectures in courses on both Operating Systems and Database Systems, at the graduate and undergraduate levels. He will, of course, be expected to lead research and publication efforts aggressively, as he has already done during his Ph.D. years (when he won the departmental research excellence award at UW-Madison).
The PIs believe that one of the key challenges for a very successful recent Ph.D. like Gunawi is learning to thoughtfully extend his recently acquired research prowess to enhance and leverage contributions from more junior (graduate and undergraduate students) and more senior (faculty and industrial) collaborators. To that end, the PIs intend to facilitate opportunities for Gunawi to work relatively independently with students and senior collaborators, but also to engage with him in reflection on the process of those collaborations, along with tips on mechanisms for sharing and subdividing research tasks, stepping back to let more junior researchers find their way through problems, and translating concepts across research areas.
Because Gunawi has had the benefit of mentorship from Arpaci-Dusseau for many years in graduate school, the plan is to shift his main point of contact to Hellerstein and provide him with a new experience at Berkeley. Structurally, the plan is for Hellerstein, Arpaci-Dusseau, and Gunawi to meet as a group every two weeks via teleconference, and for Hellerstein and Gunawi to meet weekly. Hellerstein has a strong track record of mentoring Ph.D. students (11 completed to date, 7 underway) into successful careers in academic research and teaching positions, and in industrial research and development.