The design and implementation of a log

The Design and Implementation
Log-Structured
File System
MENDEL
ROSENBLUM
University
of California
and JOHN
of a
K. OUSTERHOUT
at Berkeley
This paper presents a new technique for disk storage management
called a log-structured
file
system, A log-structured
file system writes all modifications
to disk sequentially
in a log-like
structure, thereby speeding up both file writing and crash recovery. The log is the only structure
on disk; it contains indexing information
so that files can be read back from the log efficiently.
In order to maintain large free areas on disk for fast writing, we divide the log into segments and
from heavily fragmented
segments. We
use a segment cleaner to compress the live information
present a series of simulations
that demonstrate the efficiency of a simple cleaning policy based
on cost and benefit. We have implemented
a prototype logstructured
file system called Sprite
LFS; it outperforms
current Unix file systems by an order of magnitude
for small-file
writes
while matching
or exceeding Unix performance
for reads and large writes. Even when the
overhead for cleaning is included, Sprite LFS can use 70% of the disk bandwidth
for writing,
whereas Unix file systems typically
can use only 5 – 10%.
Systems]: Storage Management–aUocCategories and Subject Descriptors: D 4.2 [Operating
strategies;
secondary
storage; D 4.3 [Operating
Systems]: File Systems Manation / deallocation
organization,
du-ectory
structures,
access methods;
D.4. 5 [Operating
Systems]:
agement—file
measurements,
simReliability—checkpoint
/restart; D.4.8 [Operating
Systems]: Performance–
ulation,
operatzon
anatysis;
H. 2.2 [Database
Management]:
Physical Design— recouery and
restart;
H. 3 2 [Information
Systems]: Information
Storage —file organization
General
Terms: Algorithms,
Design,
Measurement,
Performance
Additional
Key Words and Phrases: Disk storage management,
organization,
file system performance, high write performance,
fast crash recovery, file system
logging, log-structured,
Unix
1. INTRODUCTION
Over
the
last
access times
decade
have
CPU
only
the future
and it will
To lessen
the
speeds
improved
cause more
impact
of this
have
slowly.
This
and more
problem,
increased
dramatically
trend
is likely
applications
we have
devised
while
disk
to continue
to become
a new
in
disk-bound.
disk
storage
The work described here was supported m part by the National Science Foundation under grant
CCR-8900029, and in part by the National
Aeronautics
and Space Administration
and the
Defense Advanced Research Prolects Agency under contract NAG2-591.
Authors’
addresses: J. Ousterhoust,
Computer
Science Division,
Electrical
Engineering
and
Computer Sciences, University
of California,
Berkeley, CA 94720; M. Rosenblum,
Computer
Science Department,
Stanford University,
Stanford, CA 94305.
Permission to copy without fee all or part of this material is granted provided that the copies are
not made or distributed
for direct commercial advantage, the ACM copyright notice and the title
of the publication
and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery.
To copy otherwise, or to republish, requires a fee and/or
specific permission.
@ 1992 ACM 0734-2071/92/0200-0026
$01.50
ACM Transactions
on Computer
Systems,
Vol
10,
No.
1, February
1992,
Pages
26-52
The Design and Implementation
management
technique
an order of magnitude
Log-structured
cached
in
caches
more
disk
traffic
writes
This
file
main
all
called a log-structured
more efficiently
than
systems
memory
and more
will
approach
and
are
based
that
effective
become
new
of a Log-Structured
write
read
by writes.
examine
consistency
the most
The notion
incorporated
recovery
[8, 9]. However,
the permanent
storage
structure
For
but
permanently
file
large
segments
other
systems
contrast,
system
the
log.
by eliminating
file
system
system
extents
use the log only
al-
need only
a log-structured
files
can
to operate
of free
for temporary
is in a traditional
is no other
so that
systems.
structure
file
system
on disk.
be read
back
efficiently,
space available
design
based
random-access
stores
The log con-
with
efficiency
it must
ensure
that
for writing
new
data.
of a log-structured
on large extents
file
called
to explore
different
cleaning
policies
and discovered
a
algorithm
based on cost and benefit:
it segregates
older,
slowly changing
data from
differently
during
cleaning.
We have
file
called
where
a segment
cleaner
process continually
regenerates
empty
by compressing
the live data from heavily
fragmented
segments.
We used a simulator
simple but effective
more
them
As a result,
also permits
much faster
must scan the entire disk
a log-structured
This is the most difficult
challenge
in the
system.
In this paper we present
a solution
segments,
are
the
and a number
of recent file systems have
structure
to speed up writes
and crash
in the log: there
a log-structured
are always
In
files
make
of the log.
these
on disk.
[18].
structure
dramatically
home for information
tains
indexing
information
comparable
to current
file
there
a crash,
portion
of logging
is not new,
a log as an auxiliary
storage;
data
after
recent
that
sizes will
requests
27
uses disks
A log-structured
most all seeks. The sequential
nature
of the log
crash recovery:
current
Unix file systems typically
to restore
assumption
in a sequential
performance
which
systems.
file
memory
at satisfying
to disk
increases
on the
.
system,
current
increasing
dominated
information
file
File System
constructed
a prototype
younger
rapidly
changing
log-structured
file
data
system
and
called
treats
Sprite
LFS, which is now in production
use as part of the Sprite network
operating
system [17]. Benchmark
programs
demonstrate
that the raw writing
speed of
Sprite LFS is more than an order of magnitude
greater than that of Unix for
small
files.
Even for other workloads,
such as those including
reads and
large-file
accesses, Sprite LFS is at least as fast as Unix in all cases but one
(files read sequentially
after being written
randomly).
long-term
overhead
for cleaning
in the production
LFS permits
about 65-75%
of a disk’s raw bandwidth
We also measured
the
system.
Overall,
Sprite
to be used for writing
new data (the rest is used for cleaning).
For comparison,
Unix systems can
only utilize
5-1090 of a disk’s raw bandwidth
for writing
new data; the rest of
the time is spent seeking.
The remainder
of this paper
reviews
the issues in designing
is organized
file systems
into
six sections.
Section
2
for computers
of the 1990s.
Section
3 discusses the design alternatives
for a log-structured
file system
and derives the structure
of Sprite LFS, with particular
focus on the cleaning
mechanism.
Section 4 describes
the crash recovery
system for Sprite
LFS.
ACM
Transactions
on Computer
Systems,
Vol.
10, No.
1, February
1992.
28
.
Section
M. Rosenblum and J. K. Ousterhout
5 evaluates
measurements
file
systems,
2. DESIGN
Sprite
LFS
of cleaning
and Section
using
overhead.
benchmark
Section
programs
6 compares
and
Sprite
long-term
LFS
to other
7 concludes.
FOR FILE SYSTEMS
OF THE 1990’s
File system
design
is governed
by two general
forces: technology,
provides
a set of basic building
blocks, and workload,
which determines
of operations
technology
system
Sprite
that
design.
LFS
be carried
that
are
and shows
how
out
efficiently.
underway
It also describes
the workloads
2.1
must
changes
file
This
describes
the workloads
current
and technology
and
that
systems
section
their
summarizes
impact
influenced
which
a set
the
are ill-equipped
on file
design
of
to deal with
changes.
Technology
Three
components
of technology
are particularly
significant
for
file
system
design: processors,
disks, and main memory.
Processors
are significant
because their speed is increasing
at a nearly exponential
rate, and the improvements seem likely
to continue
through
much of the 1990s. This puts pressure
on all the other elements
of the computer
system to speed up as well,
the system doesn’t become unbalanced.
Disk technology
is also improving
rapidly,
but the improvements
been
primarily
There
are
access time.
improvement
in the
two
Although
is much
can be improved
disks [19], but
determined
tion
causes
application
areas
components
of cost and
of disk
both
slower
capacity
rather
performance:
motions
that
a sequence
of small
is not likely
to experience
disk
are hard
transfers
much
performance.
separated
and
the rate of
bandwidth
and parallel-head
access time
(it
to improve).
speedup
have
bandwidth
of these factors
are improving,
than for CPU speed. Disk transfer
substantially
with the use of disk arrays
no major
improvements
seem likely
for
by mechanical
than
transfer
so that
is
If an applica-
by seeks,
over the next
then
the
ten years,
even with faster processors.
The third component
of technology
is main memory,
which is increasing
in
size at an exponential
rate. Modern
file systems cache recently
used file data
in main memory,
and larger main memories
make larger file caches possible.
This
has two effects
on file
system
behavior.
First,
larger
file
caches alter
the
workload
presented
to the disk by absorbing
a greater
fraction
of the read
requests
[2, 18]. Most write requests
must eventually
be reflected
on disk for
safety,
so disk traffic
(and disk performance)
will
become more and more
dominated
by writes.
The second impact
of large file caches is that
they can serve as write
buffers
where
large
numbers
of modified
blocks
can be collected
before
writing
any of them to disk. Buffering
may make it possible
to write
the
blocks more efficiently,
for example by writing
them all in a single sequential
transfer
with only one seek. Of course, write-buffering
has the disadvantage
of increasing
the amount
of data lost during
a crash. For this paper we will
assume that crashes are infrequent
and that it is acceptable
to lose a few
ACM
TransactIons
on
Computer
Systems,
Vol.
10,
No
1, February
1992
The Design and Implementation
of work
seconds or minutes
crash recovery,
nonvolatile
2.2
of a Log-Structured
.
29
in each crash; for applications
that require
RAM may be used for the write buffer.
better
Workloads
Several
tions.
different
file
One of the
system
most
is found
efficiently
neering
applications
in
workloads
difficult
found
dominated
for file
not
such
files
system
sequentially
out
focus on the efficiency
2.3
LFS
Problems
Current
them
file
spread
work
well
systems
information
separate
disk,
It takes
name.
files.
the file’s
as those
but
for ensuring
performance
and leave
accesses.
for large
from
two
the
files
that
tends
it to hardware
Fortunately,
as well
general
and
disk
to be
design-
the techniques
as small
problems
workloads
a way
in
Furthermore,
contents,
at least
exist
so 1/0
accesses,
technologies
around
different
from
such
problems,
and memory
subsystems
rather than the
a log-structured
file system we decided to
ones.
five
the
as is the
separate
that
causes
attributes
1/0s,
make
it hard
1990s.
First,
for
they
too many
small
system (Unix FFS) [121 is
on disk, but it physically
directory
disk
that
of the
accesses. For example,
the Berkeley
Unix fast file
out each file sequentially
quite effective
at laying
separates
files,
File Systems
suffer
the
to large
of techniques
for large-file
with Existing
to cope with
to handle
to file system
and blocks of
also pose interesting
on
of small-file
bandwidth
used in Sprite
accesses
A number
limited
by the bandwidth
of the 1/0
file allocation
policies.
In designing
to improve
applica-
designs
dominated
by updates
to locate the attributes
environments,
software.
laid
are
computer
in
system
file sizes of only a few kilobytes
[2, 10, 18, 231.
small random
disk 1/0s, and the creation
and
by sequential
supercomputing
in
for file
office and engineering
environments.
Office and engitend to be dominated
by accesses to small files; several
deletion
times for such files are often
“ metadata”
(the data structures
used
the file).
Workloads
common
are
workloads
studies have measured
mean
in
Small files usually
result
ers
File System
(” inode”)
entry
for
a
containing
each preceded
file
the
are
file’s
by a seek, to
FFS: two different
accesses to the file’s attributes
create a new file in Unix
data, and the directory’s
plus one access each for the file’s data, the directory’s
small files in such a system, less than 5% of the
is used for new data; the rest of the time
is spent
attributes.
When writing
disk’s potential
bandwidth
seeking.
The second
problem
synchronously:
the
than
continuing
even
though
metadata
while
Unix
the
FFS
structures
chronously.
nated
with
by
application’s
application
For
the
current
application
write
such
file
as
with
data
many
synchronous
metadata
to
from
for
CPUS.
ACM Transactions
the
they
to
the
background.
inodes
files,
the
disk
also
and
example
file
system
written
writes
the
write
For
traffic
make
defeat
to
rather
are
Synchronous
disk
They
tend
complete,
asynchronously,
small
of
that
write
and
writes.
that
faster
in
is
the
blocks
directories
performance
to benefit
systems
wait
is handled
writes
workloads
file
must
it
synis
domi-
couple
hard
potential
the
for
use
the
of
on Computer Systems, VO1 10, No 1, February 1992
30
M. Rosenblum and J. K. Ousterhout
.
the file cache as a write
[21]
to
have
buffer.
Unfortunately,
network
introduced
additional
synchronous
This
simplified
crash
exist.
has
behavior
recovery,
but
file
systems
like
NFS
where
it
not
used
it
has
did
reduced
write
performance.
Throughout
FFS)
this
paper
as an example
tured
file
mented
The
systems.
in the
The
most
3. LOG-STRUCTURED
The
fundamental
performance
and
then
data
all
the
used
not
system
it
Unix
(Unix
it to log-struc-
because
unique
to
is
well
operating
to Unix
disk
the
docu-
systems.
FFS
and
can
writes
sequential
file
For
that
file
can
utilize
the
a single
operation
and
system
traditional
in
in
write
system.
file
of
transfers
sequentially
in the
is to improve
changes
directories
log-structured
random
system
system
to disk
blocks,
manage
a
file
of file
written
index
to
files,
synchronous
chronous
changes
attributes,
small
are
a sequence
information
information
many
section
used
file
compare
popular
of a log-structured
The
blocks,
is
several
fast
and
systems.
buffering
writing
operation.
design
in
Unix
design
FILE SYSTEMS
idea
by
FFS
this
file
Berkeley
system
used
in
other
the
file
and
presented
in
use
Unix
literature
problems
be found
we
of current
write
file
cache
disk
write
includes
almost
all
workloads
file
the
that
other
contain
converts
the
many
small
systems
into
large
asyn-
nearly
100~o
of the
raw
disk
bandwidth.
Although
two
the
key
logging
this
the
free
for
writing
space
The
used
Sprite
by
in
LFS
detail
LFS.
to
Our
goal
cal
to
called
an
sions,
more
indirect
In
ten
Once
the
each
a file,
ACM Transactions
available
data
data
log;
manage
is the
on-disk
the
on
or exceed
the
read
basic
each
the
inode
also
of which
a file’s
inode
topic
of
structures
structures
Systems,
Vol.
there
ten
the
the
been
found,
in
Sprite
LFS
location
10,
yields
No
the
performance
contains
calculation
sequential
is not
are
by
1, February
Sprite
LFS
disk
1992
are
given
identi-
structure
owner,
file;
permisfor
files
of one
of more
number
Unix
To
to permit
addresses
addresses
are
Sprite
FFS.
log
of the
disk
the
in
a data
(type,
blocks
on disk;
the
in the
exists
and
scans
case
of Unix
structures
has
is at a fixed
that
attributes
first
contains
is identical
inode
this
used
file
file’s
of the
each
a simple
index
structures
for
addresses
file
Computer
outputs
FFS;
the
suggest
log,
contains
disk
blocks,
FFS
for
might
the
The
blocks,
to read
Unix
always
it
the
to
paper.
from
LFS
Unix
which
the
blocks.
number
Sprite
in
inode,
than
indirect
to match
used
larger
are
are
of the
from
is how
issue;
of the
problems;
of the
information
goal,
etc. ) plus
required
above
the
sections
space
difficult
solve
retrievals.
those
of free
issue
there
benefits
information
second
more
“log-structured”
was
this
random-access
extents
a much
is simple,
potential
and Reading
retrieve
accomplish
The
a summary
later
the
to retrieve
I contains
to
in
system
achieve
below.
large
is
file
to
is how
3.1
This
Table
the term
required
issue
so that
data.
File Location
Although
first
on disk
3.2-3.6.
be resolved
of Section
new
Sections
of a log-structured
must
subject
discussed
idea
that
approach.
is the
3.1
basic
issues
data
of disk
or
or
1/0s
FFS.
the
address
identifying
of the
file’s
The Design and Implementation
Table I.
Summary
of the Major
Data structure
Incde
Inmfe map
of a Log-Structured
Data Structures
File System
Purpose
Segmentusage!able
Superblock
Checkpoint region
Dmxtory changelog
31
Stored on Disk by Sprite LFS.
Locates blocks of file, holds protection bits, modify time, etc.
Locates posiuon of mode in log, holds time of fast accessPIUS
version number.
Locates blocks of large files.
Identifies contents of segment (file number and offset for each
block).
stomalastwrite timefor
Counts live bytes still left in segmen!s,
data in segments.
Hotds static configuration information such as numlxr of segments and segmentsize.
Locates blocks of inode map and segmentusagetable, identifies
last checkpoint in log.
Records directory opxations to maintain consistency of referencecounts in inodes.
Indirect block
Segmentsummary
.
Location
Log
Log
Section
Log
Log
3.1
3.2
Log
3.6
Fixcd
None
Fixed
4.1
Log
4.2
3.1
3.1
For each data structure the table indicates the purpose served by the data structure in Sprite
LFS. The table also indicates whether the data structure is stored in the log or at a fixed position
on disk and where in the paper the data structure is discussed in detail. Inodes, indirect blocks,
and superblocks are similar. to the Unix FFS data structures with the same names. Note that
Sprite LFS contains neither a bitmap or a free list.
inode.
In contrast,
written
to the
maintain
the
for a file,
inode.
Sprite
log.
current
the inode
The inode
checkpoint
blocks.
doesn’t
LFS
map must
on each
cached
in
into
main
are
memory:
Given
that
the
the
an
the disk
locations
to
lookups
are
to
map
number
address
of the
to the log; a fixed
of all
enough
map
they
inode
identifying
are written
compact
inode
positions;
called
to determine
blocks
identifies
maps
at fixed
structure
inode.
be indexed
disk
inode
inodes
a data
of each
map is divided
region
place
uses
location
Fortunately,
portions
LFS
Sprite
the
inode
keep
rarely
the
map
active
require
disk
accesses.
Figure
FFS
1 shows
after
layouts
creating
have
duces a much
the
3.2
The most
ment
difficult
layouts
new
files
logical
compact
is much
Free Space
disk
two
same
more
of Sprite LFS
just as good.
the
structure,
arrangement.
better
Management:
design
that
would
in different
than
Unix
the
occur
in Sprite
LFS
directories.
Although
log-structured
file
As a result,
FFS,
while
the write
its read
and Unix
the
system
two
pro-
performance
performance
is
Segments
issue
for log-structured
of free space. The goal is to maintain
file
large
systems
is the manage-
free extents
for writing
new
data. Initially
all the free space is in a single extent on disk, but by the time
the log reaches the end of the disk the free space will have been fragmented
into many
small
overwritten.
From this point
extents
corresponding
on, the file
system
to
the
files
has two choices:
that
were
threading
deleted
or
and copying.
These are illustrated
in Figure
2. The first alternative
is to leave the live
data in place and thread
the log through
the free extents.
Unfortunately,
threading
will cause the free space to become severely
fragmented,
so that
large contiguous
writes won’t be possible and a log-structured
file system will
ACM Transactions
on Computer Systems, Vol. 10, No. 1, February 1992.
32
.
M, Rosenblum and J. K. Ousterhout
dirl
dir2
filel
Lag +
Disk
Sprite
filel
.*
&w
~:
file2
‘1
~
~:
,,
,:::
~=
.....
:;I
,,,,
+
LFS
‘?
dirl
file2
...
;:
!!J
~
dir2
&
Disk
Unix FFS
Fig. 1. A comparison between Sprite LFS and Unix FFS. This example shows the modified disk
blocks written by Sprite LFS and Unix FFS when creating two single-block files named dirl /filel
and dlr2 /flle2. Each system must write new data blocks and inodes for file 1 and flle2, plus new
data blocks and inodes for the containing
directories.
Unix FFS requires ten nonsequential
writes for the new information
(the inodes for the new files are each written
twice to ease
recovery from crashes), while Sprite LFS performs the operations in a single large write. The
same number of disk accesses will be required to read the files in the two systems. Sprite LFS
also writes out new inode map blocks to record the new inode locations
Block
Copy and Compact
Threaded log
Key:
Old data block
New data block
Prcwously deleted
la
Old log end
New log end
Old log end
New log end
II
II
Fig, 2. Possible free space management
solutions for log-structured
file systems, In a log-structured file system, free space for the log can be generated either by copying the old blocks or by
threading
the log around the old blocks. The left side of the figure shows the threaded log
approach where the log skips over the active blocks and overwrites blocks of files that have been
Pointers between the blocks of the log are mamtained
so that the log can
deleted or overwritten.
be followed during crash recovery The right side of the figure shows the copying scheme where
log space is generated by reading the section of disk after the end of the log and rewriting
the
active blocks of that section along with the new data into the newly generated space.
be no faster than traditional
file systems.
The
live data out of the log in order to leave large
second alternative
is to copy
free extents
for writing.
For
this paper we will assume that the live data is written
back in a compacted
form at the head of the log; it could also be moved to another
log-structured
file system to form a hierarchy
of logs, or it could be moved to some totally
different
file system
or archive.
The disadvantage
of copying
is its cost,
particularly
for long-lived
files; in the simplest
case where the log works
circularly
across the disk and live data is copied back into the log, all of the
long-lived
files will have to be copied in every pass of the log across the disk.
Sprite
LFS uses a combination
of threading
and copying.
The disk is
segments.
Any given segment
is
divided
into large fixed-size
extents
called
always written
sequentially
from its beginning
to its end, and all live data
must
be copied
out of a segment
before
the segment
can be rewritten.
However,
the log is threaded
on a segment-by-segment
basis; if the system
ACM
Transactions
on
Computer
Systems,
Vol.
10,
No.
1, February
1992,
The Design and Implementation
can
collect
long-lived
data
skipped
over
segment
size is chosen
whole
so that
segment
segment.
This
bandwidth
cessed. Sprite
data
large
into
doesn’t
enough
greater
allows
of the
together
the
is much
of a Log-Structured
that
than
segments
to be copied
the transfer
operations
regardless
LFS currently
those
.
33
can be
repeatedly.
time
to read
The
or write
the cost of a seek to the beginning
whole-segment
disk,
segments,
have
File System
of the
to run
order
uses segment
at
in which
sizes of either
a
of the
nearly
the
segments
full
are
512 kilobytes
ac-
or one
megabyte.
3.3
Segment
Cleaning
Mechanism
cleaning.
The process of copying live data out of a segment is called segment
In Sprite LFS it is a simple three-step
process: read a number
of segments
into
memory,
number
that
identify
of clean
were
read
additional
As part
the live
segments.
are marked
cleaning.
of segment
each segment
possible
file
the live
operation
and they
it must
so that
the
and write
this
as clean,
cleaning
are live,
to identify
data,
After
they
back
to a smaller
the
segments
can be used for new data
be possible
to identify
can be written
to which
data
is complete,
which
out again.
each block
belongs
blocks
It must
and the
or for
of
also be
position
of
the block within
the file; this information
is needed in order to update the
file’s inode to point to the new location
of the block. Sprite LFS solves both of
these problems
by writing
The summary
segment;
the
for
file
multiple
fill
the
blocks
a segment
identifies
example,
for
number
and
segment
summary
segment.
crash
each
block
in the
blocks
file
recovery
data
number
blocks
file
impose
summary
as part
block
each piece of information
the
more
writes
cache
little
block
for
when
(Partial-segment
buffered
summary
during
block
the
block.
(see Section
during
4) as well
block
Segments
the
in the
contains
can
one log write
when
is insufficient
overhead
is written
summary
than
occur
of each segment.
that
contain
is needed
number
to
of dirty
to fill
a segment).
Segment
writing,
and they
are useful
as during
cleaning.
Sprite LFS also uses the segment
summary
information
to distinguish
live
blocks from those that
have been overwritten
or deleted.
Once a block’s
identity
is known,
its liveness can be determined
by checking
the file’s inode
or indirect
block
to see if the
block.
It it does, then
Sprite
LFS
the
inode
ever the file
bined
with
optimizes
map
entry
is deleted
the
inode
the
appropriate
block
is live;
if it doesn’t,
block
this
check
for each file;
or truncated
number
form
slightly
by
the version
pointer
then
still
refers
the
block
keeping
a version
number
is incremented
to length
zero, The version
a unique
identifier
(uid)
to this
is dead.
number
number
for the
in
whencom-
contents
of the file.
The segment
summary
block records
this uid for each block
in the segment;
if the uid of a block does not match the uid currently
stored
in the inode map when the segment
is cleaned,
the block can be discarded
immediately
without
examining
the file’s inode.
This approach
to cleaning
means that there is no free-block
list or bitmap
in Sprite.
In addition
to saving memory
and disk space, the elimination
of
these
data
structures
also simplifies
ACM
Transactions
crash
recovery.
on Computer
Systems,
If these
Vol.
data
10, No.
structures
1, February
1992.
34
M. Rosenblum and J, K. Ousterhout
●
existed,
additional
restore
consistency
3.4
Segment
basic
(1) When
only
many
(2) How
(3) Which
should
ment.
but
live
files
call this
approach
Sprite
to the
fourth
LFS
the
file
system.
segments
We
use
is the
The
remainder
and how
including
of
overhead
time
called
the
the
and
data
overhead.
mum
A
time
data
For
a log-structured
latency
are
total
number
of
those
bytes
ACM Transactions
One
by
output
seglast
were
segments;
we
first
two
of the
number
the
of
a few tens of segthe number
of clean
of bytes
that
disk
means
that
for
for
moved
represent
with
which
of
its
and
one-tenth
new
no
there
the
as
a
cleaning
with
would
no
seek
mean
that
is no cleaning
of the
data;
cost
written,
expressed
were
it
write
data
bandwidth
is perfect:
only
is
there
full
The
of new
cost
if
bandwidth
disk’s
rest
maxi-
of the
disk
or cleaning.
large
and
to and
analysis
policies.
byte
write
writing
writing
new
per
of 1.0
latency,
system
both
at
cost
they
of a log-structured
our
cleaning
The
written
third
the
experience
data.
required
full
used
In contrast,
performance
is busy
be
be
rotational
file
negligible
the
the
in our
3 discusses
overheads.
10
values.
live
disk
A write
of
the
the
would
actually
seeks,
the
at the
cost
is
in
out?
they
new
when
important:
to compare
cost
could
be written
write
is spent
into
addressed
threshold
Section
of
that
latency.
bandwidth
time
the
segments
critically
to group
cleaning
the
could
are
for example
a single
value (typically
at a time until
determine
of time
time
or rotational
new
by
at
that
ones
written
reads,
into
offers
cleaned
choice.
are
age together
cleaning
of the
are
write
amount
all
multiple
they
blocks
the
of segments
that
to clean
a term
or
cleaning
is the
best
of future
together
a threshold
choice
factors
average
for
segments
choice
when
not methodically
starts
decisions
primary
are
are
at night,
another
threshold
value
(typically
50– 100 clean segof Sprite
LFS does not seem to be very
performance
exact
policy
be
age sort.
segments
surpasses
ments).
The overall
sensitive
Segment
to be the
locality
of similar
clean segments
drops below
It cleans a few tens
ments).
and
not
grouped
the
is to sort
blocks
so far we have
policies.
and
must
choices
or only
more
the
An obvious
directory
same
possibility
and group
above
a time?
at
out
be
to enhance
modified
In our work
turns
blocks
issues
possible
Some
on disk;
be cleaned?
this
in the
Another
clean
to rearrange.
is to try
grouping
it
opportunities
the
possibility
structures
exhausted.
should
should
policy
at low priority,
data
segments
(4) How
execute?
to reorganize
fragmented,
most
cleaner
is nearly
segments
once, the more
to the
four
above,
in background
space
opportunity
an
described
segment
the
disk
to log changes
Policies
continuously
when
be needed
crashes.
mechanism
should
to run
it
after
Cleaning
the
Given
addressed:
would
code
from
data.
segments,
for
seek
cleaning,
the
This
on Computer Systems, Vol. 10, No 1, February
and
so the
disk
divided
cost
is
1992
rotational
write
by the
determined
cost
is
number
by
the
The Design and Implementation
utilization
the
(the
steady
segment
and
fraction
state,
of new
writes
of data
the
data
out
written.
and
O <
u <
space
new
data.
Thus
write
cost
live)
must
in the
of live
1). This
one
it reads
data
N*(1
–
bytes
read
and
that
clean
.
are
for
in their
u is the
utilization
u) segments
35
cleaned.
segment
N segments
(where
creates
total
File System
segments
generate
To do this,
N* u segments
segments
for
still
cleaner
of a Log-Structured
In
every
entirety
of the
of contiguous
free
written
=
read
new
data
written
segs
+ write
live
+ write
new
—
new
N+
—
—
data
written
N*u+N*(l–u)
N*(l-zL)
In
the
above
must
faster
(we
to
read
specific
write
can
form
write
(see
of less
the
current
is
overall
Unix
in
segments
utilized
to
Even
that
to be less
utilized
for
the
trade-off
at
storage
if
algorithm
cleaned.
Variations
others,
these
FFS
for
Section
disk
5.1
request
cleaned
file
a
for
sort -
[24]
must
system
than
or
a
have
a
to outper-
.5 to outperform
will
discussed
data;
have
it
the
the
usage
cleaner
lower
above
is just
in file
and
are
is
not
fraction
will
can
the
of live
cause
choose
utilization
disk
space.
will
have
fewer
are
usable
reduced
space
but
than
so
is
FFS
some
the
the
least
overall
only
to operate
ACM
not
10’%o is kept
to
9070
free
of
to
disk
by
in use,
resulting
in
a
cost-performance
can
utilization
Such
unique
allows
a
performance
capacity
be improved
of the
blocks,
provide
performance.
is
remaining
disk
can
less
live
systems
if
system
With
higher
byte;
utilization
Unix
The
file
is underutilized,
per
file
of the
cleaned
space
cost
files.
of a log-structured
utilization
example,
by
Unix
bandwidth,
bandwidth
be less
utilization
live
Log-structured
and
For
occupied
disk
costs
performance
tems.
cost.
a high
must
no live
reference,
and
of the
be
low
disk.
that
write
writes,
has
is 1.0.
8 in
a log-structured
the
than
performance
overall
segments
lower
Figure
may
cost
disk
segments
containing
are
to clean;
the
for
write
u. For
2590
the
utilization
that
disk
segments
so, the
reducing
note
of the
segments
average
about
that
order
the
the
of
delayed
to
it
is very
to be cleaned
and
and
a segment
practice
utilization
5 – 109ZO of the
[16]
logging,
.8 in
a segment
most
improved
FFS;
if the
at all
that
in
FFS.
important
fraction
blocks
than
blocks;
as a function
at
3 suggests
Unix
If
be read
cost
With
assumption
live
particularly
Ousterhout
be
Figure
the
LFS).
not
utilizes
probably
improved
tion
Sprite
the
10–20
of 4.
utilization
but
in
need
(1)
‘I-u
conservative
blocks,
it
workloads
of
cost
the
this
the
to recover
live
measurements).
this
It
the
3 graphs
cost
ing
just
file
made
entirety
= O) then
small
write
we
its
tried
(u
Figure
an
in
haven’t
blocks
on
formula
be read
2
—
be achieved
is
increased,
a trade-off
between
log-structured
the
disk
allow
the
file
space
space
systo
be
alloca-
efficiently.
Transactions
on Computer
Systems,
Vol.
10, No,
1, February
1992.
36
M. Rosenblum and J. K. Ousterhout
.
Write
~4co
cost
. . . .
L.......L...-[
......J.
. . . . . . . ..
..........
~_ Log-structured
12.0 --”/
/,
lo.o--” -----------go ....
--.--.——
————
~ FFs today
~..
;...
6.0 --~
4.0 --”;-””-”---------”--------”
-----------------------+-.------------------! F’FS improved
2.0...’
:---
/:
O.-J -- ~. . . . . . . .. T........
0.0
0.2
Fraction
..r.........f.........1.......... ...
0.4
0.6
0:8
alive in segment
1:0
cleaned
(u)
Fig, 3. Write cost as a function of u for small files In a log-structured
cost depends strongly on the utilization
of the segments that are cleaned,
segments cleaned, the more disk bandwidth
that is needed for cleaning
writing
new data. The figure also shows two reference points:
which
is our estimate
of
Unix FFS today, and “FFS Improved,”
an
improved
Unix
FFS.
Write
cost
for
Umx
FFS
is
not
file system, the write
The more live data in
and not available for
‘‘FFS to day,”
the
sensitive
to
represents
which
best
performance
the
amount
possible
of
disk
in
space
in
use,
The
system
the
key
to
achieving
is to force
segments
are
cleaner
can
overall
disk
section
describes
almost
Simulation
We
built
reflect
and
The
the
At
one
actual
but
work
a log-structured
file
where
nearly
most
empty,
segments.
a low
bimodal
helped
This
write
cost.
distribution
in
can
a file
be
(its
simulator
of
and
allows
The
the
a high
following
Sprite
Each
file
has
Files
are
divided
files;
the
time.
the
files
groups
it
LFS.
to
each
they
file
models
Computer
Systems,
two
hot
other
pattern
access
the
cost
files
patterns
files,
capacity
with
not
than
of cleaning.
of 4-kbyte
disk
does
harsher
of random
of the
different
model
much
number
likelihood
into
is called
but
is
reduce
overall
one
analyze
with
utilization.
new
data,
using
patterns:
equal
The
model
as a fixed
overwrites
could
simulator’s
effects
a particular
access
the
the
exploited
system
to produce
we
The
patterns
us to understand
pseudo-random
on
so that
conditions.
usage
of which
models
the
simulator
controlled
Uniform
TransactIons
in
distribution
or
empty
provides
such
system
Hot-and-cold
ACM
cost
empty
the
yet
low
segment
are
with
achieve
system
chosen
step,
a few
utilization
file
both
simulator
of two
always
under
file
it
number
each
full,
we
at
a bimodal
Results
a simple
locality,
performance
into
nearly
how
policies
reality),
disk
capacity
3.5
cleaning
high
the
group
are
is equally
10,
because
No.
its
only
likely
form
selected
One
is called
selected
a simple
Vol
of being
groups.
files
cold;
107o
are
it
in
1992
each
step.
contains
10%
of
selected
90%
of
90%
of
contains
of the
to be selected.
of locality.
1, February
group
time.
This
Within
access
The Design and Implementation
Write
of a Log-Structured
File System
.
37
cost
{.----------”: LFS hot-and-cold
12.0 “-~
&O --\
Go --~
0.0..
; .. ... .. .. ~.---.- ..-:.... .. .... . .. ... .. .. . .. .. ... ....
0.0
0.2
Disk
0.4
0.6
capacity
0:8
1:0
utilization
Fig. 4. Initial
simulation
results. The curves labeled “FFS today” and “FFS improved”
are
reproduced from Figure 3 for comparison, The curve labeled “No variance” shows the write cost
that would occur if all segments always had exactly the same utilization.
The “LFS uniform”
curve represents a log-structured
file system with uniform access pattern and a greedy cleaning
policy: the cleaner chooses the least-utilized
segments. The “LFS hot-and-cold”
curve represents
a log-structured
file system with locality of file access. It uses a greedy cleaning policy and the
cleaner also sorts the live data by age before writing
it out again. The x-axis is overall disk
capacity utilization,
which is not necessarily the same as the utilization
of the segments being
cleaned.
In
this
approach
traffic
is
hausted,
clean
the
then
until
The
simulates
segments
run
overall
modeled.
disk
capacity
simulator
runs
the
actions
is available
the
write
In
each
stabilized
and
is constant
all
of a cleaner
again.
cost
utilization
until
clean
until
run
the
all
and
no read
segments
a threshold
simulator
cold-start
was
are
ex-
number
of
allowed
to
variance
had
been
removed.
Figure
4 superimposes
curves
of
pattern
was
chose
the
cleaner
the
Figure
used.
the
In
the
The
same
attempt
order
that
from
used
pattern
the
is no
When
data:
in
the
reason
of
simulations
the
greedy
clean.
appeared
there
sets
simulations
a simple
to
to reorganize
they
two
uniform”
segments
not
access
results
“LFS
cleaner
least-utilized
did
uniform
3.
policy
out
blocks
being
expect
any
the
variance
than
would
the
access
it
live
were
segments
to
where
writing
live
onto
uniform
always
data
the
written
out
cleaned
(for
improvement
in
a
from
reorganization).
Even
with
lization
the
uniform
allows
overall
overall
utilization
write
capacity
capacity
of only
cost
have
disk
disk
drops
no live
The
access
was
the
55%.
patterns,
as for
patterns,
write
cost
and
formula
utilization
At
at all
the
overall
2.0;
this
and
hot-and-cold”
same
access
lower
utilization,
below
blocks
“LFS
the
random
a substantially
as
ACM
uniform”
Transactions
the
above.
except
need
write
The
that
on Computer
For
cleaned
capacity
that
don’t
shows
described
“LFS
disk
means
hence
curve
(1).
segments
in
of the
to be read
cost
when
cleaning
the
live
Systems,
at
an
under
cleaned
75%
average
2090
the
segments
in.
there
is locality
policy
for
blocks
were
Vol.
utifrom
example,
have
utilization
some
segment
be predicted
10, No.
this
sorted
1, February
in
curve
by
1992.
38
.
M, Rosenblum and J, K. Ousterhout
age before
tends
writing
to be
thought
that
segment
this
Figure
worse
degree
doesn’t
worse
cold
a very
until
point
numbers
large
After
locality
and
valuable
these
than
data
data;
we
distribution
it
space.
segment
Said
with
of
beneficial
space
cleaning
value
and
a hot
and
access
segment
patterns.
the
longer
estimated
by
To
this
test
clean.
free
space
The
segment
the
and
amount
to
cost
of
space
block
in
how
long
the
space-time
product
cleaning
the
back
the
that
live
—
cost
in
to
current
the
and
delay
the
segment.
data
without
older
The
in
the
knowing
the
the
a
cold
quickly
of the
the
from
it is less
as well
predicted
that
blocks
die
has
unusable
before
likely
might
the
must
is more
contrast,
stability
be
the
segment
free
time
unchanged,
is
the
formed
by
is
1 +
free
Computer
of free
We
likely
data).
The
used
age
to
the
and
data
stability
has
the
in
can
a
be
all
generated*
Vol.
10,
segments
of time
is just
1 –
recent
modified
The
two
of cost
these
age
read
factors,
of data
as an
of
—
—
1, February
1992
of
we
get
(1
-
any
estimate
of
is
The
segment,
the
space
u is the
cleaning
u)*age
l+U
No.
the
time
components.
to
with
the
u, where
block)
benefit
to
the
components:
most
cost
Systems,
the
two
amount
youngest
segments
of cleaning
space
these
unit
Combining
space
and
free.
multiplying
u (one
selecting
benefit
chooses
benefit
of the
stay
for
to the
segment
cost.
amount
(i.e.,
policy
according
be reclaimed
–
on
segments
the
” In
will
the
a new
the
to
will
The
segment
benefit
TransactIons
of cleaning
segment.
space
data
system
remain
segment
benefit
segment
the
the
on
in
tend
segment
a cold
a long
die
clustered
than
cold
the
again.
the
assumption
to
simulated
each
free.
of the
an
we
the
stay
utilization
Using
likely
them
cannot
are
in
point
of data.
rates
ratio
of free
is likely
age
stability
is
theory
policy
highest
it
reclaims
blocks
cleaning
reaccumulates
back
in-
slowly
segments
a cold
once
system
is based
the
and
it
them
the
cold
in
because
to “keep”
of the
hot
before
because
more
Unfortunately,
future
get
“takes
very
of time.
space
segment
reaccumulate;
let
that
Thus
threshold,
locality
is that
5 shows
a segment
segments
with
periods
Free
the
above
that
segments.
drops
more
result
long
time
segment
rapidly
many
realized
once
it will
of a segment’s
segment.
we
way,
data
a while
for
a hot
fragmented
will
blocks
a long
another
to clean
free
that
Figure
cleaning
just
varying
found
policy,
of all
the
tried
and
greedy
utilization
simulations
overall
data)
utilized
to
in
The
of
grouping
We
increased.
to linger
cleaner.
in
take
cold
becomes
least
the
the
5%
the
drops
5 shows
the
space
will
to
locality
the
“better”
no locality!
Under
tend
figures
by
free
cleaned
as the
becomes
of free
differently
with
accesses
result.
locality.
studying
be treated
ACM
that
of
segments
Figure
tie
write
result
Unfortunately,
time.
without
the
bimodal
eventually
cleaning
simulations
the
(cold)
(hot)
desired
worse
it
so these
the
up
and
segments.
long
around
data
long-lived
short-lived
to the
95%
utilization
segments,
free
that
from
lead
a system
nonintuitive
cleaned
the
been
means
than
(e.g.,
this
segment’s
cluding
surprising
locality
for
get
every
the
got
reason
for
This
segments
would
performance
of
performance
cold
again.
different
approach
4 shows
in
the
out
in
utilizations.
result
the
them
segregated
“
the
cost
of
u to
The Design and Implementation
of a Log-Structured
File System
.
39
Fraction of segments
~.~g .. ;.........!--!-....
0.007 ‘- I
f).f)~ --~
\
().005 --~
0.004 ‘-;
0.003 ‘-~
0.002 --~
(),001 --~
;...
O.000 --~
0.0
0:2
...........,---.----+.
1.0
0:8
0:4
Segment
0:6
utilization
Fig. 5. Segment utilization
distributions
with greedy cleaner. These figures show distributions
of segment utilizations
of the disk during the simulation.
The distribution
is computed by
measuring the utilizations
of all segments on the disk at the points during the simulation
when
segment cleaning was initiated.
The distribution
shows the utilizations
of the segments available to the cleaning algorithm.
Each of the distributions
corresponds to an overall disk capacity
utilization
of 75Y0. The “Uniform”
curve Corresponds to ‘‘LFS uniform”
in F@re
4 and
“H&aml-colcl”
corresponds to “LFS hot-and-cold”
in Figure 4. Locality causes the distribution
to
be more skewed towards the utilization
cleaned at a higher average utilization.
at which
cleaning
occurs; as a result,
segments
are
Fraction of segments
o.m8 ..L ........L...L
......
0.007 -“j
0.006 -j
0.005 --+
0.004 --j
.. LFS Cost-Benefit
0.003 --j
0.002
--~
().001 --j
0.000 --:
0.0
4
L . . . . . . . . . . . . . . . . . . . .!. .
0:2
0:4
Segment
0:6
0:8
1.0
utilization
Fig. 6. Segment utilization
distribution
with cost-benefit policy. This figure shows the distribution of segment utilizations
from the simulation
of a hot-and-cold access pattern with 75% overall
disk capacity utilization.
The “LFS Cost-Benefit”
curve shows the segment distribution
occurring when the cost-benefit policy is used to select segments to clean and live blocks are grouped
by age before being rewritten.
Because of this bimodal
segment distribution,
most of the
segments cleaned had utilizations
around 15!%. For comparison, the distribution
produced by the
greedy method selection policy is shown by the “LFS Greedy” curve reproduced from Figure 5.
ACM TransactIons on Computer Systems, Vol. 10, No. 1, February 1992
40
.
M. Rosenblum and J. K. Ousterhout
0.0
T..........r.............
. ..~ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..
0.0
0.2
0.4
Disk
0.6
capacity
0.8
1:0
utilization
7. Write cost, including
cost-benefit pohcy. This graph compares the write cost of the
greedy policy with that of the cost-benefit
policy for the hot-and-cold
access pattern
The
cost-benefit policy is substantially
better than the g-reedy policy, particularly
for disk capacity
utilizations
above 6070.
Fig.
We
call
this
cleaned
We
policy
reran
the
cost-benefit
and
6,
ments
that
we
about
7570
utilization
about
15$Z0 before
of the
segments
reduces
the
better
high
The
simulation
in
file
3.6
In
systems
Segment
order
data
to
time
the
of any
cleaner
LFS.
Sprite
5070
the
that
be
cold
reach
the
the
from
of
seg-
segments
at
to hot
files,
cost-benefit
greedy
and
FFS
even
a number
cost-benefit
of
most
policy
policy,
Unix
simulated
the
seen
a utilization
are
possible
the
with
be
distribution
that
over
We
found
can
cleans
writes
best
pattern
As
segments
7 shows
as
to
a
at
of other
policy
gets
even
the
number
the
choosing
is written,
are
deleted
or
ACM
TransactIons
blocks
on
is even
in
better
to
implement
Section
than
5.2,
the
the
predicted
Computer
cleaning
segment
of live
segment
us
be seen
cost-benefit
the
in
the
convinced
will
in
cost-benefit
behavior
of actual
Figure
7.
Table
called
block
As
LFS
support
when
bimodal
policy
utilizations.
and
the
90’%0 of the
Figure
experiments
Usage
structure
records
capacity
data.
hot
segments
access
live
cleaning
much
cold
increases.
Sprite
in
the
out-performs
of locality
as locality
hot-and-cold
on
Since
hot.
as
system
disk
approach
are
by
allows
segments.
the
until
them.
cost
kinds
The
waits
cleaning
it
hot
produced
for.
but
file
and
policy
cleaned
log-structured
relatively
under
hoped
write
than
age-sorting
cost-benefit
had
policy;
utilization
simulations
policy
the
cost-benefit
higher
Figure
degrees
the
at a much
bytes
usage
in the
segment.
and
are
the
segment
These
segments
to
Vol
If
10,
No.
Sprite
1, February
maintains
each
and
the
most
recent
are
used
by
values
bytes
the
LFS
For
values
The
of live
overwritten.
Systems,
two
clean.
count
policy,
table.
segment,
are
the
is decremented
count
1992
falls
to
modified
the
initially
zero
a
table
segment
set
when
when
files
then
the
The Design and Implementation
segment
table
can
are
checkpoint
In
be
order
to
the
LFS
single
modified
that
sort
of
does
are
summary
(see
age
Sprite
When
time
for
is
In
[3,
already
used
the
each
usage
stored
in
the
We
block
in
At
a file;
will
plan
for
information
segment.
estimate
times
be
present
it
keeps
a
incorrect
to modify
each
the
for
segment
block.
many
to
the
checkpoint.
the
made,
typical
reboot
correct
the
must
The
disk
have
during
to
so it
the
may
logs,
consistency.
in
in
locations
end
crashes.
any
system
scan
cost
configurations),
of the
of the
This
log.
checkpoints,
which
which
is used
in the
log
disk
all
of
of
these
and
it
is
define
known
in other
LFS
uses
consistent
to recover
are
be possible
is well
[6] and
Sprite
operations
it should
of logs
systems
systems,
last
Thus
benefit
database
logging
roll-forward,
file);
order
without
were
restore
file
expand.
at the
both
recovery:
and
changes
system
other
the
in
systems
on
a new
containing
file
to
performed
example,
operations
of minutes
are
operations
(for
these
systems
after
few
directory
disk
(tens
file
quickly
last
last
on
they
system,
to
and
file
has
systems
a two-pronged
states
information
of the
written
file
since
Checkpoints
A checkpoint
are
This
Unix
the
to advantage
approach
4.1
the
high
91. Like
last
file.
state
review
as storage
very
8,
41
segment
are
summary
to
entirety.
the
must
where
a log-structured
been
of the
blocks
segment
for
modified
traditional
to determine:
recover
their
writing
structures
higher
easy
in
the
written
times
inconsistent
system
determine
getting
age,
entire
occurs,
an
without
metadata
scans
the
blocks
of the
.
details).
block
to include
in
inconsistencies.
cannot
by
modified
modified
it
written
In
blocks
youngest
crash
left
operating
the
live
The
addresses
4 for
the
keep
not
and
File System
RECOVERY
have
been
cleaning.
the
not
a system
may
without
log,
Section
information
4. CRASH
the
to the
regions
records
files
reused
written
of a Log-Structured
is a position
consistent
checkpoint.
file
data
segment
and
First,
on
disk.
in the
pointer
During
segment
reboot,
information
and
a crash
during
regions,
and
a
LFS
the
there
block
be
updated.
the
structures
to create
of
the
the
log,
inode
addresses
plus
the
map
of
current
a
including
to a special
region
checkpoint
operation
last
uses
reads
operations
will
and
table,
checkpoint
is in the
blocks
contains
usage
data
time
regions
and
system
process
to the
a checkpoint
main-memory
checkpoint
file
information
region
segment
of the
a two-phase
its
time
not
modified
writes
all
uses
and
fixed
all
time
the
and
a
written.
Sprite
to initialize
LFS
inodes,
it
checkpoint
map
last
all
blocks,
Second,
The
inode
to the
out
indirect
table.
at which
Sprite
it writes
blocks,
usage
position
blocks
complete.
of the
one
ACM
checkpoint
During
with
alternate
the
Transactions
reboot,
most
are
between
region,
the
recent
on Computer
region
structures.
actually
them.
so if the
system
and
In
order
that
to handle
two
checkpoint
The
checkpoint
checkpoint
reads
uses
both
fails
the
checkpoint
time.
Systems,
Vol.
10, No.
1, February
1992.
42
M. Rosenblum and J K. Ousterhout
.
Sprite
file
LFS
performs
system
is
between
checkpoints
increases
the
interval
LFS
probably
this
overhead
4.2
too
when
file
but
of
the
is not
the
cost
of new
checkpoints
operating
but
of normal
operation.
seconds,
which
checkpointing
data
while
the
interval
checkpoint
periodic
time
long
a short
of thirty
to
as when
A
recovery;
interval
amount
as well
down.
writing
increases
on recovery
system
shut
during
alternative
a given
set a limit
the
overhead
forward
An
after
would
is
a checkpoint
short.
intervals
system
the
time
uses
checkpoints
log;
the
to roll
recovery
much
perform
the
needed
currently
at periodic
or
reduces
time
improves
Sprite
checkpoints
unmounted
has
been
reducing
to
written
the
at maximum
is
is
to
checkpoint
throughput.
Roll-Forward
In
principle
latest
it
would
checkpoint
checkpoint.
since
This
the
tion
last
the
would
presence
the
checkpoint,
This
file
file
automatically
then
on disk
table
from
the
checkpoint
after
will
adjusted
fied
The
to reflect
by
the
The
directory
entries
directory
entries
is deleted.
been
the
and
operation
code
count
the
tion
for
the
log
entry
log
for
of the
directory
the
for
entry
named
operation
appears
log;
in
in
the
to
when
not
yet
log
recovered
copy
of’ the
of the
segment
usage
written
the
will
live
also
of these
since
data
have
can
left
to
be
be identi-
to occur
while
and
These
LFS
guarantees
the
location
and
records
that
corresponding
inode.
ACM TransactIons on Computer Systems, Vol. 10, No 1, February
1992.
LFS
record
outputs
a
includes
of the
the
new
collectively
directory
directory
an
directory
directory),
each
has
versa.
the
are
of
file
containing
or vice
within
the
an inode
block
Sprite
The
i-number),
number
to zero
when
the
written,
the
position
between
of the
drops
inodes,
unlink),
the
consistency
count
change.
entry.
before
the
from
inode.
version
segments
a count
been
and
directory
or
new
in
the
count
(name
the
blocks.
restore
a crash
and
Sprite
data
(both
for
rename
directory
the
a new
the
indicates
read
of the
into
that
segments
summary
it
copy
to reflect
reference
has
written
block
map
blocks
contains
directories
each
link,
the
directory
inode
entry
were
log).
inode
a new
written
informa-
segment
without
of the
how
inode;
between
(create,
(i-number
contents
with
directory
in
the
it is possible
log
new
older
is
Each
new
utilizations
overwrites
in
to that
consistency
record
entry
inodes.
referring
corresponding
and
inodes
the
be adjusted
of
roll-forward
inode
data
utilizations
must
deletions
in
to the
To restore
special
they
Unfortunately,
written
The
of new
issue
the
the
in
the
a file
assumes
ignores
adjusts
that
a summary
to
new
data
that
roll-forward.
When
for
any
the
after
as much
segments
information
refers
file’s
utilizations
file
presence
final
map
code
it
log
reading
log
but
to recover
updates
discovered
checkpoint.
be zero;
roll-forward.
LFS
roll-forward
and
the
data.
the
also
the
uses
inode
are
code
the
is called
file
Sprite
the
is incomplete
roll-forward
read
LFS
inode,
the
simply
in
order
through
written
blocks
by
data
recovery
In
operation
incorporates
If data
inode,
The
the
Sprite
so that
crashes
any
be lost.
scans
recently
after
in instantaneous
This
of a new
system.
file’s
LFS
roll-forward
the
restart
discarding
would
checkpoint.
to recover
to
and
result
Sprite
last
During
blocks
safe
checkpoint
as possible,
after
be
region
the
reference
called
operablock
or
The Design and Implementation
During
roll-forward,
tency
between
inode
and
directory
directory
and/or
cause
entries
on inodes
ries,
block
inode
were
inode
this
the
case
functions,
both
the
usage
directory
entry
directory
log
will
made
appends
table
blocks
only
the
easy
to
and
reference
the
changed
to the
the
the
counts
and
that
writes
can’t
be
written;
addition
to
an
can
directo-
log
is never
provide
consisbut
updates
operation
In
43
operations
inode
removed.
it
ensure
appears
Roll-forward
which
be
to
entry
.
roll-forward
The
for
used
directories
them.
file
is
written,
from
File System
a log
program
include
of a new
log
if
operation.
segment
to
creation
the
not
recovery
and
region
is the
inodes:
to or removed
The
map,
checkpoint
operation
and
to complete
to be added
completed
directory
entries
to be updated.
inodes,
a new
the
directory
of a Log-Structured
in
its
atomic
other
rename
operation.
The
interaction
troduced
each
between
additional
checkpoint
log
is
must
consistent
required
are
began
was
the
being
of 1990
it
described
been
used
thirty
in
this
paper
not
yet
duction
disks
use
When
we
system
tional
file
segment
and
layout
that
either
In
code
project
this
FFS
FFS
or
the
directory
in
in-
particular,
operation
the
log.
This
modifications
by
disks
in
were
while
Unix
by
restore
since
they
but
rollpro-
all
the
a log-structured
turns
addition,
than
Logging
somewhat
more
include
file
than
out
to
a tradi-
be
no
complexity
the
complicated
be
are
The
discard
elimination
consistency.
to
fall
features
LFS,
and
additional
the
in
the
system.
that
LFS
has
FFS;
Sprite
it
the
which
of
to implement
Sprite
mid-1990
Since
reboot.
concerned
LFS
All
seconds)
they
by
partitions,
production
complicated
likely
LFS,
disk
(30
when
Sprite
and
system.
in
the
is no more
to
[81 are
1989
computing.
compensated
LFS
Sprite
different
however,
is
late
operating
interval
we
[12]:
in
implemented
more
required
in Sprite
Unix
Unix
been
checkpoint
FFS
[91 or Cedar
five
installed
reality,
but
LFS
day-to-day
checkpoint
last
Unix
policies
scans
Episode
have
the
cleaner,
roll-forward
for
be substantially
than
In
blocks
directory
network
manage
been
the
system.
complicated
to
a short
began
might
checkpoints
LFS.
LFS
Sprite
users
has
after
where
directory
of Sprite
of the
forward
information
and
Sprite
written.
as part
about
state
and
log
into
to prevent
implementation
has
by
a
inode
WITH THE SPRITE
operational
used
the
operation
issues
synchronization
5. EXPERIENCE
We
directory
represent
with
additional
checkpoints
the
synchronization
both
more
for
of the
checkpointing
the
and
code [131
fsck
file
systems
like
complicated
logging
the
bitmap
than
and
layout
code.
In
the
everyday
Unix
being
For
use
Sprite
FFS-like
used
are
example,
file
not
on
fast
the
20’%
faster
than
Most
of the
speedup
in
Sprite
LFS.
LFS
system
enough
modified
SunOS
Even
using
does
in
not
the
much
The
different
reason
to be disk-bound
Andrew
the
is attributable
with
feel
Sprite.
configuration
to the
synchronous
ACM Transactions
with
benchmark
removal
writes
is
the
[16],
presented
of the
of Unix
to the
that
users
the
current
Sprite
workloads.
LFS
in
than
machines
Section
is
only
5.1.
synchronous
writes
FFS,
bench-
the
on Computer Systems, Vol. 10, No. 1, February 1992.
44
M. Rosenblum and J. K. Ousterhout
,
mark
has
changes
5.1
a CPU
in
the
utilization
disk
of over
storage
80%,
limiting
the
speedup
possible
from
management.
Micro-Benchmarks
We
used
a collection
performance
is based
on Unix
realistic
workloads,
systems.
integer
IV
disk
onds
average
with
a
system
with
was
measurements
of cleaning
8 shows
SunOS
for
faster
and
while
the
disk’s
performance
faster
phases
of the
the
files
back;
the
log-structured
It
substantially
many
ual
disk
[14]
and
read
operations
should
for
a file
sequentially
seeks
after
Sprite
a
the
the
runs
5.2
test.
so
the
below
for
for
for
log;
each
large
block
(a newer
has
two
been
so its
systems
it
SunOS
written
for
cases.
turns
them
into
writes
because
performs
to
the
in
this
it
individ-
groups
Sprite
for
randomly;
performance
small
in all
of SunOS
except
get
SunOS.
many
SunOS
equivalent
the
performance
sequential
version
the
that
in
with
than
whereas
performance
in the
LFS,
1/0,
for
of
as CPUS
be expected
because
faster
85%
1.2~0
of 4-6
competitive
create
busy
means
on workloads
writes
is also
have
can
same
in the
the
disk
as
is also
in the
about
factor
bandwidth
random
it
a single
it
another
LFS
during
This
fast
densely
the
only
data.
provides
write
busy
deletes
as
read
files
kept
though
new
are
the
17%
and
times
Sprite
files
SunOS
efficiency
it also
reads,
ten
packs
disk
even
by
creates,
the
no improvement
is similar
in
the
improve
a higher
therefore
performance
case
during
benchmark.
contrast,
for
that
faster
into
An
used
each
Section
almost
system
used
designed
to the
blocks
require
Almost
has
kept
In
phase,
will
9 shows
writes
only
was
is
is because
file
CPU.
LFS
LFS
this
create
was
Sprite
sequential
groups
the
Figure
files.
is
LFS
the
bandwidth
large
that
delete
8b).
see
of a benchmark
and
Sprite
accesses,
performance;
LFS
benchmark
a
formatted
storage.
In
quiescent
and
millisec-
was
Sprite
size.
the
LFS
Figure
Although
file
results
of Sprite
(see
otherwise
Sprite
during
disk
usable
segment
during
HBA,
17.5
of
(8.7
overhead.
Sprite
potential
was
occurred
best-care
saturating
time
a one-megabyte
SCS13
the
while
of the
a Sun-4\260
a Sun
SunOS
files.
Furthermore,
phase
of
reading
by
system
represent
weaknesses
bandwidth,
SunOS,
best-case
file
do not
was
megabytes
small
create
created
log.
of
the
for
order
the
number
and
300
but
and
systems
transfer
LFS
used
cleaning
represent
Figure
was
and
measurements
a large
both
multiuser
no
strengths
both
the
whose
so they
of memory,
around
size
running
the
for
4.0.3,
synthetic
maximum
For
size
LFS
illustrate
having
block
are
used
to measure
it to SunOS
32 megabytes
time).
block
Sprite
they
MBytes/see
seek
eight-kilobyte
For
benchmarks
programs
compare
machine
system
four-kilobyte
The
but
(1.3
file
and
The
SPECmarks)
Wren
benchmark
LFS
FFS.
two
file
of small
of Sprite
writes
LFS).
case
The
of reading
case
is substantially
the
reads
lower
than
SunOS.
Figure
different
file
system
(sequential
tory,
9 illustrates
form
etc.);
the
of locality
achieves
reading
it then
ACM Transactions
fact
logical
of files,
pays
that
on disk
extra
than
locality
a tendency
on writes,
a log-structured
traditional
by
file
file
assuming
to
use
certain
multiple
if necessary,
on Computer Systems, Vol. 10, No, 1, February
system
systems.
access
files
to organize
1992.
produces
A traditional
within
patterns
a direc-
information
a
The Design and Implementation
of a Log-Structured
❑ SpriteLFS
Key:
❑
File System
.
45
SunOS
Files/see (predicted)
Files/see (measured)
180 [--------------------------------------------------675 [--------------------------------------------------
.
.
.
.
..
-.
.
...I.................................1111
I ... ... .. ... .. ... .. .. ... .. ... .. .. ... .. .
160
600
140
525
120
450
100
375
80
300
60
225
40
150
20
75
0
0
10000
I
. . . . . . . . . .---------- . . . . . . . . . . . . . .
‘--
r
10000 lK file create
file access
lK
... .. ..
y
m)
(a)
Fig. 8. Small-file
performance under Sprite LFS and SunOS. (a) measures a benchmark that
created 10000 one-kilobyte files, then read them back in the same order as created, then deleted
them. Speed is measured by the number of files per second for each operation on the two file
systems. The logging approach in Sprite LFS provides an order-of-magnitude
speedup for
creation and deletion. (h) estimates the performance of each system for creating files on faster
computers with the same disk. In SunOS the disk was 8570 saturated in (a), so faster processors
will not improve performance much. In Sprite LFS the disk was only 17% saturated in (a) while
the CPU was 100% utilized; as a consequence 1/0 performance will scale with CPU speed,
optimally
file
on disk
system
at the
same
logical
time
then
performance
on
from
tially
on
logical
Random
disk.
the
reads
occurred
would
have
in
the
much
5.2
Cleaning
micro-benchmark
same
should
file
efficiently
for
the
random
sequential
same
performance
very
differently.
order
as
the
If
read
the
same
locality
differently.
it
writes
writes
in
rereads
in
about
then
temporal
perform
because
matches
and
have
system.
will
or modified
locality
sequentially
system
systems
a log-structured
is created
If temporal
is written
handles
the
out
contrast,
that
order
more
the
However,
two
sequen-
to
achieve
efficiently.
systems,
if the
nonsequential
Sprite
them
even
nonsequential
writes
then
Sprite
faster.
Overheads
performance
overheads
more
it
about
laid
the
more
pays
are
been
that
file
then
In
on disk.
as a traditional
then
have
patterns.
information
closely
a file
writes
but
The
of the
files
SunOS
blocks
for
locality
random
reads
though
does
large
locality,
read
locality:
a log-structured
logical
handles
assumed
be grouped
as it
sequentially,
LFS
the
temporal
will
locality,
differs
for
achieves
(the
results
of Sprite
write
cost
of the
LFS
during
previous
because
the
ACM Transactions
section
they
benchmark
give
do not
runs
an
optimistic
include
was
any
1. O). In
view
cleaning
order
to
on Computer Systems, Vol. 10, No. 1, February 1992.
46
M. Rosenblum and J K. Ousterhout
.
SunOS
El
900 [-----""--"---"-""----"--"---------------------"""--""""-------------------"----------------------------Sprite
kilobytes/see
LFS
1--------------"----"
-"--"
"-"----'-m--------------------""--"---------------------------
800
700
600
500
400
300
200
100
0
Read
Write
Sequential
Write
Read
Reread
Sequential
Random
Fig. 9. Large-file performance under Sprite LFS and SunOS. The figure shows the speed of a
benchmark
that creates a 100-Mbyte file with sequential
writes, then reads the file back
sequentially,
then writes 100 Mbytes randomly
to the existing file, then reads 100 Mbytes
randomly from the file, and finally reads the file sequentially
again, The bandwidth
of each of
the five phases is shown separately, Sprite LFS has a higher write bandwidth
and the same read
bandwidth
as SunOS with the exception of sequential
reading
of a file that was written
randomly.
assess the
policy,
cost
we
over
of cleaning
recorded
a period
of several
/user6
Home
tion
the
effectiveness
about
our
months.
Five
directories
for
program
and
systems
Sprite
cost-benefit
were
file
systems
measured:
developers.
text
cleaning
log-structured
Workload
processing,
consists
electronic
of
communica-
simulations.
directories
processing
and
and
/src / kernel
Sources
/swap2
Sprite
client
virtual
memory
and
tions.
of the
production
development,
Home
Ipcs
and
statistics
Files
VLSI
project
circuit
binaries
for
the
workstation
Sprite
for
research
on
parallel
kernel.
swap
backing
tend
area
design.
store
to be large,
files.
for
Workload
40 diskless
sparse
and
consists
Sprite
accessed
of
workst
a-
consequen-
tially.
Itmp
Temporary
Table
period.
putting
behavior
predicted
capacity
II
shows
In
order
the
file
systems
use
file
simulations
in
ranged
from
totally
than
the
Transactions
on
empty.
average
Computer
Systems,
before
Sprite
workstations.
cleaning
over
waited
several
we
beginning
has
been
3.
Even
the
substantially
though
than
the
segments
nonempty
10,
No
The
1, February
a
overall
1992
four-month
months
after
measurements.
11 – ‘75Y0, more
utilizations.
Vol
40
during
Section
Even
disk
for
effects,
systems
the
were
ACM
into
utilizations
cleaned
area
gathered
start-up
production
far
less
storage
statistics
to eliminate
of the
by
file
the
half
than
overall
disk
of the
have
write
The
better
segments
utilizations
costs
ranged
The Design and Implementation
Table II.
I
Segment Cleaning
File system
1 /sw;p2
and Write
File System
Costs for Production
309 MB
13.3 MB/hour
68.1 KB
4701
65%
.
47
File Systems.
Write cost in Smite LFS file systems
Segments
Avg File
Avg Write
In Use
~i7~
Traffic
Cleaned Empty
10732
<B
75%
69%
3.2 MB/hoor
52%
—..
::
I
Statistics
of a Log-Structured
Write
66%
Atg
.133
.137
cost
1.4
1.6
.535
1.6
For each Sprite LFS file system the table lists the disk size, the average file size, the average
daily write traffic rate, the average disk capacity utilization,
the total number of segments
cleaned over a four-month
period, the fraction of the segments that were empty when cleaned,
theaverage
utilization
of thenonempty
segments that were cleaned, and the overall write cost
fortheperiod
of the measurements.
These write cost figures imply that the cleaning overhead
limits the long-term write performance to about 70% of the maximum
sequential write bandwidth.
Fraction
of segments
.~.............{..............+.............+............. ...
~
0.180
--
0.160
---
0.140 ‘-”
0.120 ‘-”
0.100 ‘-”
0.080
‘-
0.060
--
0.040 ‘-”
0.020 ““”~
(jm
-. ~-.
~1 ... ............................................................
0.0
0:2
0:4
0:6
Segment
0:8
1:0
utilization
Fig. 10. Segment utilization
in the /user6 tile system. This figure shows the distribution
of
shows large
segment utilizations
in a recent snapshot of the /user6 disk. The distribution
numbers of fully utilized segments and totally empty segments.
from
1.2
to
1.6,
simulations.
gathered
We
Sprite
were
in
ity
than
and
the
a
segments.
simulated
write
the
of the
two
long.
In
/user6
practice,
The
all
segments.
second
reference
ACM Transactions
the
difference
patterns
the
corresponding
segment
utilizations,
cleaning
costs
the
files
in
are
a substantial
there
and
deleting
in
of
why
First,
individual
of 2.5-3
disk.
reasons
to be written
segment,
costs
distribution
simulations.
tend
within
than
are
the
block
locality
empty
is that
there
they
to
shows
snapshot
in
a single
files,
longer
totally
10
that
LFS
in greater
much
comparison
a recent
believe
just
longer
in
Figure
deleted
will
best
This
where
produce
between
were
case
evenly
lower
in
simulations
number
as a whole.
In the
file
are
the
one
simulation
distributed
of
results
a file
or
is
more
and
real-
within
on Computer Systems, Vol. 10, No. 1, February 1992
48
M Rosenbium and J. K. Ousterhout
.
the
hot
are
almost
cold
and
cold
segments
the
very
every
cold
groups.
written
in
in
may
other
idle
periods,
We
this
be
done.
as
we
can
improve
not
clean
at a time,
hot
yet
data
5.3
crash
The
type
time
for
different
of
fifty
interval
and
roll-forward,
file
of
III
sizes
by
last
sizes
and
time
to disk,
in
both
terms
than
Overheads
to
we
segments
to
However,
inodes,
inode
overwritten
data
ACM
TransactIons
data
about
map
blocks,
written
The
to
on
the
Computer
an
to
the
one,
crashed.
A
checkpoint
During
map,
the
recovery
the
directory
between
II,
rate
one
and
Recovery
second
of files
can
150
for
be
From
a checkpoint
times
of
size
times
checkpoints.
recovery
write
by
different
created
infinite
number
in Table
average
grow
The
was
inode
crash.
written
traffic
in
disk.
the
recovery
updated.
with
the
of data
interval
of
around
one
megabytes/hour,
every
70
seconds
of
LFS
much
inode
log.
disk
Systems,
map
Vol.
to
consists
various
blocks
the
of file
information
segment
We
live
written
of the
and
of the
of the
data
on
13%
had
to the
crash
and
the
that
that
table
and
recovered.
system
varies
importance
of the
live
quickly.
the
would
of how
much
of the
blocks.
usage
shows
the
changes
production
various
interval
a program
to be added
the
of
before
used
observed
relative
in terms
of how
ggy.
if
LFS
to segregate
on
III
data
running
files
was
Table
of file
by
write
in Sprite
the
amounts
result
maximum
length.
shows
performed.
time
daily
interval
IV
checkpoint
had
would
recovery
Other
the
amount
maximum
5.4
on
checkpoint
checkpoint
Table
depends
segment
the
the
recovery
recovery
hour
Using
time
directory
files
the
that
the
an
been
LFS
wrote
and
file
as
and
Sprite
limiting
average
second.
of
to know
example,
many
ability
to
generated
created
between
large
installed
bursts
LFS
For
pracduring
of Sprite
of how
system’s
or
during
algorithms.
the
In
night
Sprite
not
of fixed-size
never
shows
has
being
were
created,
bounded
as
operations
the
written
at
enough
recover
megabytes
version
Table
over-optimistic,
performance
issue
impact
simulations,
to be cleaned.
a bit
available
with
the
policy
may
code
well
to
configurations
special
the
recovery
works
time
and
entries
tune
the
it
are
the
the
the
isolate
data.
code
rate
or
and
were
will
had
cleaning
experience
expect
In
thus
that
than
over-pessimistic.
of the
segments
we
think
5.1
anything,
of files
colder
system
them.
and
Section
if
enough
analyzed
we
file
clean
much
experience
but
the
scenarios.
ten,
clean
addition,
gain
cold
the
crash
that
have
in
are,
numbers
much
Recovery
Although
system,
In
never
LFS
section
large
are
modifications
perform
yet
are
reality
log-structured
and
to
carefully
from
Crash
so
do not
have
A
of Sprite
this
possible
activity.
in
received
in
be
there
segments
measurements
it
practice
segments
simulations).
eventually
measurements
tice
In
(cold
the
files
segment
If the
the
file
never
alone
10,
No.
log
data
accounts
1, February
this
all
blocks
and
log
of which
for
more
is
because
1992
written
on disk
represent.
to the
blocks,
that
of data
occupy
they
written
map
suspect
kinds
they
indirect
consists
tend
than
of
and
More
to
of
be
770 of all
the
short
The Design and Implementation
Table III.
of a Log-Structured
Recovery Time for Various
Sprite
LFS ~covery
File
File System
.
49
Crash Configurations.
time
in seconds
File Data Recovered
Size
1 MB
1 KB
10 MB
50 MB
21
132
1
10KB
<1
100KB
<1
I
I
3
1
I
17
8
The table shows the speed of recovery of one, ten, and fifty megabytes of fixed-size files. The
system measured was the same one used in Section 5.1. Recovery time is dominated by the
number of files to be recovered.
Table IV.
Disk Space and Log Bandwidth
Sprite
Block
LFS
6 file system contents
/user
type
Live
Data blocks*
Indirect
Usage of /user6.
data
[ Log bandwidth
98.0%
blocks*
Inode blocks*
85.2%
1 .0%
1.6%
0.2%
2.7%
For each block type, the table lists the percentage of the disk space in use on disk (Live data) and
the percentage of the log bandwidth
consumed writing
this block type (Log bandwidth).
The
block types marked with ‘*’ have equivalent
data structures in Unix FFS.
checkpoint
disk
interval
more
metadata
to
increase
than
drop
the
fashion,
differ
from
Sprite
unnecessary
for
segment
garbage
segment
in Sprite
when
LFS
system
LFS,
the
we
concept
different
storage
have appeared
media
these
inode
The
Sprite
expect
which
log
install
forces
bandwidth
roll-forward
metadata
to
overhead
for
recovery
and
interval.
file
LFS
nefit
substantially
log-structured
Sprite
ing
in
We
WORK
ideas from many
log-like
structures
tems on write-once
only
used
necessary.
checkpoint
6. RELATED
The
currently
often
[5,
20].
systems
map
LFS
the
file
cleaning
collectors
selection
separates
and
in
Besides
maintain
inodes
that
systems
for
the
and
files
ACM
the
into
the
Sprite
writing
quickly
LFS
design
age
in Sprite
borrow
on Computer
in
an
much
reading
of the
appendlike
files.
media
the
They
made
it
space.
LFS
acts
much
much
like
languages
[1].
during
segment
of blocks
generations
Transactions
and
nature
log
programming
sorting
changes
information
locating
write-once
used
for
all
indexing
to reclaim
approach
developed
and
management
systems.
File systems with
in several proposals
for building
file sys-
like
Systems,
The
generational
Vol.
10, No.
scavengcost-becleaning
garbage
1, February
1992.
50
M. Rosenblum and J. K, Ousterhout
.
collection
tion
schemes
schemes
generational
achieve
fact
[11]. A significant
and
Sprite
garbage
high
LFS
collectors,
performance
that
blocks
algorithms
for
can
difference
is that
whereas
in
belong
a file
random
system.
garbage
file
Sprite
in the
are
LFS
at a time
used
garbage
collec-
is possible
accesses
Also,
one
than
these
access
sequential
to at most
identifying
between
efficient
can
to use
systems
in the
necessary
much
for
to
exploit
the
simpler
programming
languages.
The
logging
database
crash
recovery
they
use
the
the
most
and
log.
database
mechanisms
by
changes
have
processed
read
database
The
been
Sprite
LFS
crash
[151.
repositories
because
the
operation
to
indexes
log
is
the
separate
point
Collecting
similar
to
techniques
the
at the
data
home
data
newest
in
the
the
concept
used
in
copy
file
of
of
copy,
group
not
changed
in
read
bytes
in
and
Sprite
in
the
in
it
space
logged
requests
are
without
written
to
roll
forward
is
than
and
simplified
redoing
ensures
the
that
the
log.
to
disk
database
systems
The
the
systems
LFS
Rather
writing
data
segment
are
database
recovery
database
space.
as
main
LFS.
LFS
data
the
compacted
data.
and
need
all
log
separate
when
the
commit
main-memory
repository
Sprite
of the
cache
final
The
Since
implementation
final
as the
greatly
used
how
The
of checkpoints
techniques
in
disk.
log
Sprite
LFS
on
be reclaimed
the
as in
mechanism
to
The
only
Sprite
data
do
be
for
the
reclaim
can
in
logging
view
log
locations.
blocks
recovery
similar
to
pioneered
systems
purpose.
they
can
log
Typically
entire
the
this
that
final
from
of the
use
LFS
the
differ
for
system
area,
than
is
Sprite
to their
data
log”
do not
schemes
write-ahead
database
state
means
to
use
but
the
is reserved
the
written
rather
the
a database
performance.
logs
a “redo
object
in
the
about
systems
of
log
from
hurting
using
the
[6],
and
area
is similar
systems
LFS
systems
data
occupied
LFS
performance
database
cleaning
Sprite
database
Sprite
“truth”
a separate
of these
in
all
high
Both
is that
data:
area
used
Almost
up-to-date
difference
for
scheme
systems.
in
large
systems
writes
[4]
is
and
to
[7, 22].
7. CONCLUSION
The basic principle
behind a log-structured
file system is a simple one: collect
large amounts
of new data in a file cache in main memory,
then write
the
data to disk
Implementing
areas
in a single
this idea
on disk,
Sprite
simple
but
LFS suggest
policy based
tured file system to
also works very well
no cleaning
overhead
their entirety.
The bottom
line is
both
large 1/0 that
is complicated
our
Transactions
on
simulation
analysis
that
low cleaning
on cost and benefit.
overheads
Although
and
our
bandwidth.
large free
experience
with
can be achieved
with
a
we developed
a log-struc-
support
workloads
with many small files, the approach
for large-file
accesses. In particular,
there is essentially
at all for very large files that are created and deleted in
that
a log-structured
of magnitude
more efficiently
possible
to take advantage
ACM
can use all of the disk’s
by the need to maintain
Computer
file
system
can use disks
than existing
file systems.
of several
more generations
Systems,
Vol
10,
No.
1, February
1992,
an order
This should make it
of faster processors
The Design
before
1/0
and Implementation
limitations
once
again
of a Log-Structured
threaten
the
File System
scalability
.
of
51
computer
systems.
ACKNOWLEDGMENTS
Diane
Jim
Green,
Mary
Mott-Smith
Baker,
provided
John
Hartman,
helpful
Douglis
participated
in the
log-structured
file systems.
Mike
comments
initial
Kupfer,
on drafts
discussions
and
Ken
Shirriff,
and
of this
paper.
Fred
experiments
examining
REFERENCES
in real time on a serial computer, A.I. Working
Paper 139,
1. BAKER, H. G. List processing
MIT-AI Lab, Boston, MA, Apr. 1977.
2. BAKER, M. G., HARTMAN, J. H., KUPFER, M. D., SHIRRIFF, K, W., AND OUSTERHOUT, J. K.
of the 13th Symposium
on
Measurements
of a distributed
file system. In Proceedings
Operating
Systems Principles
(Pacific Grove, Calif., Oct. 1991), ACM. pp. 198-212.
of
3. CHANG, A., MERGEN, M. F., RADER, R. K., ROBERTS, J. A., AND PORTER, S. L. Evolution
storage facilities in AIX Version 3 for RISC System/6000 processors. IBM J. Res. Deu. 34, 1
(Jan. 1990), 105-109.
4. DEWITT, D. J., KATZ, R. H., OLKEN, F., SHAPIRO. L. D., STONEBRAKERM. R., .ANDWOOD, D.
of SZGMOD
Implementation
techniques for main memory database systems. In Proceedings
1984 (Jun. 1984), pp. 1-8.
5. FINLAYSON, R. S. AND CHERITON, D. R. Log files: An extended file service exploiting
of the 11th Symposvum
on Operating
Systems Principles
write-once storage. In Proceedings
(Austin, Tex., Nov. 1987), ACM, pp. 129-148.
systems. In Operating
Systems,
An Aduanced
6. GRAY, J. Notes on data base operating
Course. Springer-Verlag,
Berlin, 1979, pp. 393-481.
database system, IEEE
7. HAGMANN, R. B. A crash recovery scheme for a memory-resident
Trans. Comput.
C-35, 9 (Sept. 1986), 000-000.
8. HAGMANN,
Proceedings
R. B.
of the
Reimplementing
11th
Symposium
the cedar
on
file
Operating
system
using
Systems
logging
Principles
and group
(Austin,
commit.
Tex,,
In
Nov.
1987), pp. 155-162.
9. KAZAR, M. L , LEVERET~, B. W., ANDERSON, O. T , APOSTO1.IDES,V., BOTTOS, B. A CHUTANI,
file system
S., EVERHART, C. F., MASON, W. A.j Tu, S. T., AND ZAYAS, E. R. Decorum
architectural
overview.
In Proceedings
of the USENIX
1990 Summer
Conference
(Anaheim,
Calif., Jun. 1990), pp. 151-164.
10. LAZOWSKA, E. D., ZAHORJAN, J., CHERITON, D. R., AND ZWAENEPOEL, W. File access perforTrans. Comput.
Syst. 4, 3 (Aug. 1986), 238-268.
mance of diskless workstations.
garbage
collector
based on the
lifetimes
of
11. LIEBERMAN, H AND HEWITT, C. A real-time
ACM. 26, 6 (June 1983), 419-429.
objects. Commun.
Comput.
Syst. 2, 3 (Aug. 1984),
A fast file system for Unix. Trans.
12. MCKUSICK, M. K.
181-197.
UNIX file
13. MCKUSICK, M K., JoY, W. N., LEFFLER, S. J., AND FABRY, R. S. Fsck–The
Manual—4.3
BSD Virtual
VAX-11
Version,
system check program. Unix System Manager’s
USENIX, Berkeley, Calif., Apr. 1986.
performance from a UNIX file system. In Proceed14. McVOY, L. AND KLEIMAN, S. Extent-like
ings of the USENIX
1991 Winter Conference
(Dallas, Tex., Jan, 1991),
15. OKI, B. M., LISKOV, B. H., AND SCHEIFLER, R. W. Reliable object storage to support atomic
of the 10th Symposium
on Operating
Systems
Principles
(Orcas
actions. In Proceedings
Island, Wash., Dec. 1985) ACM, pp. 147-159.
Why aren’t operating systems getting faster as fast as hardware? In
16. OUSTERHOUT, J. K.
Proceedings
of the USENIX
1990 Summer
Conference
(Anaheim,
Calif., Jun. 1990), pp.
247-256.
ACM Transactions
on Computer Systems, Vol. 10, No 1, February
1992.
52
.
M. Rosenblum and J. K. Ousterhout
17. OUSTERHOUT, J. K , CHERENSON, A, R., DOUGLIS, F., NELSON, M N , AND WELCH, B. B. The
Sprite Network Operating System. IEEE Comput. 21, 2 (Feb. 1988), 23-36.
18. OUSTERHOUT,J. K , DA COSTA, H., HARRISON, D., KUNZE, J A., KUPFER, M., AND THOMPSON,
J. G
A trace-driven
analysis of the Unix 4.2 BSD file system. In Proceedings
of the 10th
Symposium
on Operating
Systems prlnctples
(Orcas Island, Wash., 1985), ACM, pp. 15-24.
A case for redundant arrays of inexpensive
19. PATTERSON,D. A.j GIBSON, G., AND KATZ, R. H
disks (RAID). ACM SIGMOD 88, pp. 109-116, (Jun. 1988).
20. REED, D. AND SVOBODOVA, L. SWALLOW:
A distributed
data storage system for a local
Networks
for Computer
Communieatcons.
North-Holland,
1981, pp.
network.
In Local
355-373.
of
21. SANDBERG,R. Design and implementation
of the Sun network filesystem. In Proceedings
the USENZX
1985 Summer
Conference
(Portland, Ore , Jun. 1985) pp. 119-130.
22 SALEM, K. AND GARCIA-M• LINA, H,
Crash recovery mechamsms for main storage database
systems. CS-TR-034-86, Princeton Univ., Princeton, NJ 1986.
of the
23. SATYANARAYANAN, M
A study of file sizes and functional
lifetimes.
In Proceedings
8th Symposium
on Operating
Systems Principles
(Pacific Grove, Calif., Dec. 1981), ACM, pp.
96-108.
24, SELTZER, M. I., CHEN, P, M., AND OUSTERHOUT,J. K
Disk scheduling revisited. In Proceedings of the Winter 1990 USENIX
Tech meal Conference
(Jan. 1990), pp.
Received June 1991; revised
ACM
TransactIons
on
Computer
September
Systems,
1991; accepted September
Vol.
10,
No.
1, February
1991
1992.