CS262B Advanced Topics in Computer Systems
Spring 2009
David E. Culler
Paper Title: Rethink the Sync
Author: Edmund Nightingale, Kaushik Veeraraghavan, Peter Chen,
and Jason Flinn
Novel Idea (Describe in a sentence or two the new ideas
presented in the paper):
-A synchronous IO interface with performance similar to that of
asynchronous IO, achieved by lying to the app and telling it the IO was
durably committed, but then putting the application into a speculative
state in which all externalization (anything visible to the user) that
depends on the (invisibly asynchronous) IO is buffered until it is
*actually* durable.
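A minimal sketch of the buffering idea described above, as a toy in-memory
model rather than the paper's kernel implementation. The class and method
names (ExternallySyncFS, write, output, commit) are hypothetical and chosen
only for illustration:

```python
class ExternallySyncFS:
    """Toy model: writes return at once; dependent output waits for commit."""

    def __init__(self):
        self.pending = []    # writes acknowledged to the app but not yet durable
        self.durable = []    # writes that have reached "disk"
        self.buffered = []   # output held back because it depends on pending writes
        self.visible = []    # output the user has actually seen

    def write(self, data):
        # Returns immediately, as an asynchronous write would.
        self.pending.append(data)

    def output(self, message):
        # Output is externalized only if nothing it depends on is uncommitted.
        if self.pending:
            self.buffered.append(message)
        else:
            self.visible.append(message)

    def commit(self):
        # Make pending writes durable, then release the held output in order.
        self.durable.extend(self.pending)
        self.pending.clear()
        self.visible.extend(self.buffered)
        self.buffered.clear()


fs = ExternallySyncFS()
fs.write("record 1")
fs.output("saved record 1")
before_commit = list(fs.visible)   # the message is still held back here
fs.commit()
after_commit = list(fs.visible)    # the message is released after the commit
```

The application never blocks in write(); only the user-visible message is
delayed, which is the sense in which the interface "lies" to the app.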
Impact (What is the importance of these results. What impact
might they have on theory or practice of Computer Systems):
-So far as I can tell, nobody uses this, though the speculator work is
intriguing, with its roots obviously in architecture (branch
prediction). We cited the speculator work in the Hadoop paper last year.
Evidence (What reasoning, demonstration, analytical or empirical
analysis did they use to establish their results):
-They run an Apache build as well as two well-known benchmarks:
TPC-C (on MySQL) and SPECweb99.
Prior Work (What previously established results does it build upon and
how):
-Built on top of speculator project.
Competitive work (How do they compare their results to related prior or
contemporary work):
-This work was highly novel and intriguing with impressive measured
results. Similar work includes hierarchical storage systems that use
NV/battery-backed RAM for quick-returning IO calls.
Question (A question about the work to discuss in class):
-Would you be willing to program against this?
Criticism (A criticism of the work that merits discussion):
-Futzing with externalized IO is not very intuitive to think about, so I
would argue that it weighs heavily against their claim to simplify
how we think about IO (by making everything feel like synchronous IO),
especially since this is touted as a filesystem, which we think of as
living strictly underneath our applications in the architecture, not
encapsulating them.
Ideas for further work (Did this paper give you ideas for future work,
projects, or connections to other work?):
-Speculator + <insert arbitrary other systems topic here... at the
risk of being predictable... "clouds">
Novel Idea
Treat the user (output) as the entity you are providing guarantees to:
provide a synchronous-looking interface to that user, but run
asynchronously under the covers.
Main Result(s):
This system achieves nearly the same performance as asynchronous file
systems (within 7%) while providing a synchronous output interface to
the user. This design lets the user know what to expect
about the durability provided by the fs after a crash, since whatever
output has been observed (received through the network, etc.) stands for a
"commit" of data to disk.
Impact:
This is important in reconciling the guarantees offered by systems that
trade off performance and durability. You can make the system
perform well by running an asynchronous
system under the covers, but only report what's really durable.
Evidence:
They ran xsyncfs on several benchmarks (PostMark, Apache build, MySQL,
SPECweb99, etc.) and compared with ext3 in synchronous and asynchronous
modes.
Prior Work:
This essentially builds on the work done for asynchronous file systems.
Competitive work: No directly competing work.
Only current implementations of synchronous and asynchronous file systems.
Reproducibility
The actual implementation of the system is part of a larger
implementation of an operating system. I don't think the results
would be reproducible without considerable effort.
Question:
Given the various tracing mechanisms that had to be put in place to get
xsyncfs to properly buffer output to the user, how effective would this
method be in other systems?
Criticism:
How well does this adapt to system changes? It seems there's a
close dependence on the tracing aspects to determine when output should
be buffered. If something in the system changes, it seems like
this might be difficult to include in the file system code after the
fact.
Ideas for further work: <Did this paper give
you ideas for future work, projects, or connections to other work?>
Novel Idea: Traditionally, operating systems have considered only
the kernel state to be internal state, while application state has been
deemed external; this is called an application-centric paradigm.
The authors propose applying to file systems a user-centric paradigm
inspired by the notion of consistent recovery from fault tolerance, so
that internal state includes both kernel and application state, and
external state comprises merely that which is visible to the
user. The user-centric paradigm enables a compromise between
previously-existing synchronous and asynchronous file systems that
achieves near-asynchronous performance while presenting a synchronous
facade to the user, simplifying application development.
Main Result(s): The authors' externally synchronous xsyncfs
provides durability while nearly matching the performance of
asynchronous ext3, the Linux FS, on several benchmarks.
Impact: If it has not yet been included in Linux, it likely will
be.
Evidence: The authors implemented a file system, xsyncfs, based
on their principle of external synchrony. They ran several
benchmarks (each of which primarily exercised a different resource) to
compare its performance with synchronous and asynchronous versions of
ext3, with and without write barriers. They also determined the
speedup from their output-triggered commit strategy, where commits are
batched until an output request.
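A rough sketch of why batching commits until an output request can help.
This is a simplified model, not the authors' measurement: the "eager" policy
here commits every write individually, and the function name and event
encoding are assumptions made for illustration.

```python
def count_commits(events, output_triggered):
    """Count disk commits for a sequence of 'write' / 'output' events."""
    commits = 0
    pending = 0
    for event in events:
        if event == "write":
            pending += 1
            if not output_triggered:
                # Eager policy: commit each write as soon as it is issued.
                commits += 1
                pending = 0
        elif event == "output" and pending:
            # Output-triggered policy: one group commit covers all pending writes.
            commits += 1
            pending = 0
    return commits


events = ["write"] * 5 + ["output"] + ["write"] * 3 + ["output"]
eager = count_commits(events, output_triggered=False)    # 8 commits
batched = count_commits(events, output_triggered=True)   # 2 commits
```

The fewer outputs a workload produces per write, the more writes each group
commit can absorb; a workload that externalizes after nearly every write
gains little.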
Prior Work: The authors were influenced by Lamport's notion
presented in "Time, Clocks, and the Ordering of Events" of causal
ordering via the "happens before" relation. They also applied
ideas from their previous work, Speculator, in which they improved
performance by predicting the return value of RPCs to avoid
blocking. They also borrowed ideas from fault tolerance, most
notably one of the cornerstone ideas of their work, which is the
user-centric paradigm.
Competitive work: The main competition to providing durability
for asynchronous file systems involves special purpose hardware where
the volatile disk caches are backed by batteries. They assert
that there are no competing software solutions.
Reproducibility: Implementing a file system would take a lot of
work, but the ideas seem simple enough that I could get started.
I'm sure that they did a lot of engineering that they didn't include in
the paper to get such good performance.
Question: The speedup from output-triggered commits does not seem
remarkable, although the idea is quite intuitively appealing. How
could this be improved?
Criticism: The authors do not provide a good story for rolling
back in case of disk failures as they do in Speculator. They
assert that a checkpointing scheme like the one used in Speculator
would be overkill in this case and that the OS should really be
handling these failures rather than the application. I feel that
this was weak; they are definitely missing a good way to handle
failures. Being able to roll back when you have incorrectly taken
a speculative action is essential.
Ideas for further work: How would these ideas apply to a
distributed file system?
Novel Idea
This paper presents external synchrony, a filesystem model that
provides synchronous-I/O-style interfaces and guarantees with
performance more like that of asynchronous filesystems. Notably, this
is accomplished while preserving the existing APIs, eliminating the
issue of writing applications against this system.
Main Result(s):
The authors show that their system successfully runs and has high
performance. The performance of the system is on par with an
asynchronous filesystem while wildly faster than synchronous
filesystems, all while providing synchronous-style guarantees.
Impact:
This is an important advance in the field of transactional storage,
particularly as computing devices become more and more susceptible to
errors and disk I/O becomes the major bottleneck in production systems.
One would reasonably expect this style of design to pervade future
filesystems.
Evidence:
The authors ran a wide variety of empirical tests on their system,
demonstrating that it has excellent performance and works as
advertised. They show that high-I/O workloads do not significantly
hamper the effectiveness of their system and that even commodity
applications show major speedups.
Prior Work:
High-performance hardware (such as Rio and Conquest) has been providing
high-performance synchronous I/O for years. Work on log-based file
systems and write-ahead logging has focused on the similar issue
of preserving data consistency.
Competitive work:
The authors primarily compare their work to ext3 in its synchronous and
asynchronous modes. By demonstrating that the performance of their
system is superior to these implementations, they show the viability of
their approach.
External synchrony is different from conventional views of data
consistency in that it focuses only on data that is visible to the
outside world, rather than on artificial measures of correctness.
Reproducibility
The ext3 portions of the evaluation would be fairly easy to reproduce.
Running the tests over the xsyncfs system would just require a copy of
the filesystem, though I could guess that it is non-trivial to run.
Question:
I have a feeling that this doesn't really work, for some reason. Are
people using this technology? Does it have a chance of working its way
into conventional distributions?
Criticism:
The issue of losing work due to power loss does not seem to be
adequately solved by this approach. If the authors did not see this as
a focus of their work (which it appears they did not), it was probably
unwise to belabor the issue in their introduction.
Ideas for further work:
It seems like this work has implications for developing regions, where
system stability is a greater issue than it is here.
Novel Idea:
By only flushing writes to disk when the effects of that write are
visible, applications can get the latency of asynchronous writes with
the semantics and durability of synchronous writes.
Main Result(s) & Evidence:
The authors modified the Linux OS and ext3 filesystem to support
externally synchronous operation. They ran a number of benchmarks
including PostMark, Apache, MySQL, and SPECweb99, demonstrating high
performance. They argue for the durability guarantees provided by
their own file system, but don't perform any sort of fault injection
against it. They do demonstrate that ext3, even in supposedly
safe, synchronous modes, doesn't truly offer durability. The
implicit argument is that conceptually simple models of safe writes are
simply too expensive to offer in real systems.
Criticism:
The paper cites the limitation of the external synchrony approach in
dealing with partial failures in the form of bad disk writes.
(These, in theory, could be dealt with by the application, though the
paper points out this capacity is not implemented in existing file
systems.) More generally, this approach seems weak in the
presence of partial failures; i.e., the durability guarantees are
only provided when the dependency graph is maintained. Any
failure in the full chain of components could result in arbitrary data
loss.
Impact:
I actually really like this paper. It's got a nice Schroedinger's
cat angle to it. And while it appears that the only effect of a
crash is the appearance that the crash occurred earlier than it
actually did, I think the departure from intuitive semantics, the
weakness in the presence of certain types of failures, and the
availability of "transparent" hardware solutions are going to limit
adoption.
Ideas for further work:
Problems with partial failures aside, network output doesn't really
necessarily constitute "observed" output since the recipient is itself
a computer. If dependency information could be propagated across the
network, it seems like this could be extended.
Question:
If you write to a disk sector that's never read, can it be corrupted?
Prior & Competitive work:
The most direct competition to the paper is hardware solutions that
combine persistent storage media with some sort of "safe" fast-access
buffer (battery-backed RAM, etc.). These solutions can offer the
effects of asynchronous writes without OS or file system modifications.
Reproducibility:
The idea is relatively simple and while there may be some tricky
implementation details in preserving dependencies and ordering, it
seems relatively easy to test the core ideas.
Novel Idea:
We can efficiently provide durability and ordering guarantees if we
focus on ensuring the synchronization of user-perceived events. Current
approaches assume that these guarantees are too costly because they
attempt to provide them to applications.
Main Result(s):
The implementation of external synchronization achieves almost the same
performance as that provided by asynchronous implementations that
provide no guarantees. Furthermore, the synchronous solution that does
provide the same guarantees is much more costly.
Impact:
These results are significant because they fundamentally challenge
assumptions about how guarantees in file systems should be provided. As
such, new mechanisms can be developed that provide guarantees and
achieve high levels of performance.
Evidence:
The authors compare their xsyncfs to asynchronous ext3, synchronous ext3,
and synchronous ext3 with write barriers for a variety of workloads. These
include the Postmark benchmark, an Apache build benchmark, a MySQL
benchmark, and a SPECweb99 benchmark. In all cases except SPECweb99,
xsyncfs dramatically outperforms both synchronous ext3 configurations.
Prior Work:
This work builds upon previous advancements in file systems. For
example, xsyncfs makes use of write-ahead logging. This work also
borrows many ideas from fault tolerance and specifically focuses on the
difference between traditional fault tolerance approaches and those
currently used in file systems.
Competitive work:
Some work, such as the Conquest file system, also provides
high-performance synchronous I/O, but it requires specialized hardware
mechanisms. Other work, such as transactional file systems, provides not
only the durability and ordering guarantees provided by xsyncfs, but
also atomicity. However, these file systems require applications to be
modified.
Reproducibility:
It would be difficult to reproduce this work without the source code
used in the experiments because of the modifications made to the kernel
to track dependencies. Different implementations could lead to
different overheads.
Question:
Couldn’t delaying the output cause significant problems for certain
types of applications? For example, say an application expects that the
user will perceive a certain output within a certain period of time. If
it takes a significant amount of time before all of the necessary
modifications are committed, then couldn’t this cause problems for these
types of applications?
Criticism:
The MySQL benchmark assumes that many clients will be located on the
same machine as the server. This assumption results in a significant
performance improvement for xsyncfs. However, it seems as though this
assumption will not be valid in most cases. Also, the discussion of
speculative execution seemed to be unnecessary and distracted from the
discussion of xsyncfs.
Ideas for future work:
I would like to perform analysis to see if the ideas here could be used
in other types of file systems.
Novel Idea: External synchrony: synchronous updates only matter when
viewed by an external agent. Thus they build a system based on ext3 that
has similar performance benefits to async mode (only talking about
writes), but has the durability guarantees of sync'd updates. They do
it by buffering the outputs of any process tainted by a pending commit
until that transaction commits, similarly to their work with Speculator
in NFS. They use output as the signal for when to sacrifice
throughput for latency.
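A small sketch of the tainting idea mentioned above: a process that touches
uncommitted data inherits a dependency, the dependency spreads over IPC, and
output is held until the commit. This is an illustrative model only; the
Process class and the fs_write/send/emit/commit helpers are hypothetical
names, not the kernel's actual interfaces.

```python
class Process:
    """Toy process with taint tracking (all names here are hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.deps = set()   # ids of uncommitted transactions this process depends on
        self.held = []      # output buffered while deps is non-empty


def fs_write(proc, txn_id):
    # Modifying the file system taints the writer with the pending transaction.
    proc.deps.add(txn_id)


def send(sender, receiver):
    # IPC propagates taint: the receiver now causally depends on the sender's writes.
    receiver.deps |= sender.deps


def emit(proc, message, screen):
    # External output is released only by untainted processes.
    if proc.deps:
        proc.held.append(message)
    else:
        screen.append(message)


def commit(txn_id, procs, screen):
    # Once the transaction is durable, clear the taint and release held output.
    for proc in procs:
        proc.deps.discard(txn_id)
        if not proc.deps:
            screen.extend(proc.held)
            proc.held.clear()


screen = []
server, client = Process("server"), Process("client")
fs_write(server, txn_id=1)
send(server, client)                  # client inherits the dependency on txn 1
emit(client, "order placed", screen)  # held: the client is tainted
held_before = list(client.held)
commit(1, [server, client], screen)   # txn 1 is durable, so the output is released
```

With the client on the same machine as the server, the kernel can see the
whole dependency chain, which is the situation the MySQL criticism below
refers to.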
Main Result: They showed better performance than ext3 in sync mode
(with or without write barriers) and reasonably close performance to
async ext3.
Impact: Not sure. The main value for this may be the discussion about
what is the appropriate point to sync things. Normally, I'm in favor
of letting the process express its requirements, instead of the kernel
trying to guess when it wants the data committed. You might be able to
get more group commits this way too. Also, returning from an fsync
while data isn't write barriered seems more like a bug than a matter of
this philosophy.
Evidence: A few benchmarks: Postmark, building a webserver (which I found
funny - everyone has to have a webserver benchmark! the standard is
really a kernel build), then a real webserver benchmark.
Prior Work: Speculator, Lamport causal ordering, various systems with
battery backed RAM or NVRAM, ext3, and some fault tolerance work.
Competitive Work:
Reproducibility: Not easily. They used an ancient kernel and made a
lot of local modifications, since this is the sort of thing that gets
its hands everywhere.
Question: How much easier or more difficult would this be to implement
in a microkernel? Tendrils everywhere!
Criticism:
- They should have used ext3 async with write barriers. The only time
they did it was the hokey MySQL benchmark, and they should have done it
throughout. Kinda lame to claim they have better durability than async
ext3 when write barriers exist. Though to be honest, no one uses them
(see link below).
- Their MySQL benchmark shouldn't be taken very seriously. They went
out of their way to set up a group commit scenario by putting the
client on the server machine. They were able to use the kernel's
causal tainting of the client and waited until the client produced output,
which is cheating when we're talking about MySQL benchmarks. Yes, it's a
neat way to show a benefit of group commit, but they can't claim it's a
real workload.
- They were rather repetitive, and sometimes hid the details - like how
they use write barriers in xsyncfs. It's there, but if you skim too
fast over some repetitive parts, you can miss it.
- They take a while to say they flush commits every 5 seconds. When
they talk about durability, there are really two aspects - the external
synchrony as well as actually saving the data quickly. They took a
little too long (page 5) to really address the latter.
Ideas for further work:
- To what extent even write barriers are enough isn't clear. I've
heard of hard drives that lie about doing write barriers when under
heavy traffic, and also when power fails, the sector you are writing to
can be scribbled on with random data (as well as other places). I
haven't read about them in a while, but some of the
snapshotting/COW/checksummed filesystems supposedly can have consistent
views of the FS that you switch through atomically. You never have the
writes in an invalid order (detected with a checksum), and if the main
commit does not happen, you still have the old consistent view. I
think.
- Here's an interesting article I read a while back (and again for
this): http://lwn.net/Articles/283161/
- Think a bit about what is the right interface for the process, and
true async I/O (all calls non-blocking). The coder *knows* that
performance is sacrificed for synchrony - let them choose. Though that
still doesn't answer the "what is synchrony" question.
- Remember to use Postmark and RAMFS (among others) for filesystems.
- Think about this wrt to the microkernel question above.
Rethink the Sync
Summary: Neither sync nor async file systems really do what you want
them to, since sync is slow and async breaks ordering. We can
essentially add ordering to an otherwise async system by tracking
dependencies between disk operations and lazily flushing updates to
disk to maintain a bound on stale state or reduce latency.
Strengths: Speculator was an impressive piece of engineering; this has
the benefit of apparently being significantly simpler than that system,
to the point where it's conceivable it may see some adoption. xsyncfs
seems to perform comparably to an asynchronous design, modulo the overhead
of tracking causal dependencies. The discussion about working around
NCQ with write barriers was illuminating.
Weaknesses: Reasoning about the new consistency model seems less
obvious than what they're replacing. Is their "external synchrony"
model stronger than an "ordered asynchronous" model where file system
ops don't block but are committed in order? I think you need their
system to give you a total ordering for that latter model to make
sense, although perhaps there are other total orders that would work?
Rethink the Sync:
Novel Idea:
I think the primary novel idea here is the recognition that durability
doesn't really *mean* anything to an application, but only to a user of
an application, so if we can make things look synchronous to a user
it's as good as making it look synchronous to the application.
This involves a re-definition of the word 'durable', but it's still a
novel and interesting idea. They call this external synchrony.
Main Result:
The authors leverage Speculator, a prior project of theirs, to
implement a system that provides external synchrony with a very low
overhead as compared to async ext3.
Impact:
External synchrony is an interesting idea, and it follows the general
pattern of Chen's work which seems to see what one can get away with if
you hide effects from the user but let the system keep running. I
could imagine this being applied in other areas like caches as well.
Evidence:
It's clear that allowing an application to continue processing rather
than blocking for synchronous i/o will allow it to run faster.
They also provide some pretty good benchmarks to show that they come
within 7% of the performance of async ext3. What I'm not so sure
about is how many real-world apps would benefit from this, as many, if
not most, probably do something external almost immediately after writing
data.
Prior/Competitive Work:
Speculator is obviously an important prior project for this one.
Other work on non-volatile caches in HDDs should also be considered
competitive. I don't think there was other work on such a low-level,
optimistic approach, however.
Reproducibility:
Much like Speculator, reproducing this work would be quite
difficult. It's not quite as tough as Speculator since the
checkpointing code isn't needed, but it's still a huge modification to
the kernel to buffer all external i/o.
Question:
External i/o can be quite a lot of data quite quickly (writing to the
screen for instance). What impact does xsyncfs have on kernel
memory footprint?
Criticism:
Implementing xsyncfs is clearly a whole lot of work. In the real
world people don't seem to be too concerned with async
filesystems. Disks lose data for all sorts of reasons that no fs
can fix (forgetting bits, getting dropped, melting, etc.), so fixing this
one little case with all this effort doesn't seem to be all that useful.
Ideas for further work:
Optimistic concurrency is pretty well understood in the db
community. It would be interesting to look at what work there
overlaps and where else in common slow routines we could assume things,
go faster, and then roll back if our assumptions were wrong.
Multi-processor cache coherence seems a logical place to look, although
maintaining all the state for rolling back could be quite difficult.
Novel Idea
<Describe in a sentence or two the new ideas presented
in the paper>
The key idea of this paper is to write data to disk only when a
subsequent observable event happens. They define this as "external
synchrony." This allows programmers to write code in a synchronous
model but achieve asynchronous performance while maintaining some
durability guarantees.
Main Result(s):
<Describe in a sentence or two the main results
obtained in the paper>
Using external synchrony allows significant performance improvements,
within 7% of an asynchronous FS, while maintaining arguably better
durability. Specifically they note that due to disk caches,
synchronously committed data is not really durable. Based on this they
argue that the model of having all observable data committed is
better.
Impact:
<What is the importance of these results. What impact might they
have on theory or practice of Computer Systems>
Fundamentally we see a tradeoff between durability and performance.
Also the idea of delaying actions until they could be observed
externally seems interesting and more generally applicable.
Evidence:
<What reasoning, demonstration, analytical or empirical analysis
did they use to establish their results>
They show timings of file system benchmarks running on their
"externally synchronous" FS compared to synchronous and asynchronous
ext3. In most cases they perform slightly worse than async ext3 but
better than synchronous ext3. The worst results are on SPECweb99 where
throughput is below synchronous ext3 due to sending many network
responses which externalize data.
Prior Work:
<What previously established results does it build upon
and how>
This work is based on the Speculator project. The Speculator code is
used to track dependencies within the kernel.
Competitive work:
<How do they compare their results to related prior or
contemporary work>
Systems using non-volatile RAM can be seen as competitive work. These
systems allow the best of both worlds: true durability (not just
"external") and fast writes. The downside of course is the added cost.
Reproducibility
<Could you reproduce the findings? If so,
how? If not, why not?>
Unlikely. This work seems very technically complex and the necessary
OS instrumentation is not described in this paper.
Question:
<A question about the work to discuss in class>
Could a notification method be designed to tell the application what
data is really committed? It seems that each layer is hiding data from
the next: the disk buffers data but returns success on writes, and
xsyncfs buffers data but returns from blocking writes to the application.
Making the process more transparent should allow good performance
while letting the application choose the desired level of durability.
Criticism:
<A criticism of the work that merits discussion>
1) I think it was unfair of this paper to state that it provides write
durability while synchronous ext3 does not (e.g. Figure 3). This is
only true when using two different definitions of durability: complete
durability in the ext3 case vs external durability in the xsyncfs
case.
2) As mentioned in the paper, application recovery would be very
complex. How does an application perform error recovery if it has no
way of knowing which writes actually committed? At least in the
synchronous ext3 model the application is guaranteed that, with
reasonable probability, every write call that has returned would have
completed before the crash (or at worst the last few writes were
lost).
Durability vs. performance concerns have existed since FFS and the
original UNIX filesystem. Xsyncfs builds on Soft Updates to provide
consistent recovery (rather than journalling). Rio and Conquest use
battery-backed memory to make writes persistent. The output commit
problem is well studied in the fault tolerance community but less
commonly addressed in general-purpose applications and I/O. Highly
reproducible.
Paper Title: Rethink the Sync
Author: Nightingale et al.
Date: 2006
Novel Idea
The authors introduce a new model of local file I/O (termed external
synchrony) that resolves the tension between the durability of
synchronous I/O and the performance of asynchronous I/O. This is
accomplished by defining durability in terms of observable actions, and
buffering all output that is causally dependent on uncommitted
modifications. They use output-triggered commits to balance between
throughput and latency.
Main Result(s):
Durability is maintained even in the face of power outages. Performance
is within 7% of asynchronous ext3 (and two orders of magnitude faster
than synchronous ext3), even for I/O-intensive benchmarks.
Impact:
A relatively recent paper, so this is harder to judge. Only four
citations on ACM portal, but it won best paper at OSDI. Roughly 100
Google hits for "external synchrony".
Evidence:
The authors compare the durability and performance of xsyncfs, and
examine how xsyncfs affects the performance of applications that
synchronize explicitly, and the effect of output-triggered commits on
xsyncfs performance. All evaluations were performed on identical
computers. Xsyncfs data is durable both on writes and on fsyncs.
Performance was evaluated on a variety of benchmarks. Eager commit
favors latency too heavily over throughput, resulting in decreased
performance.
Prior Work:
Competitive work: eNVy is a filesystem that
stores data on flash-based NVRAM to improve write performance, and
xsyncfs could serve a similar role. OS transactions and transactional
filesystems provide similar durability and performance, but require
applications to be modified to specify transaction boundaries
explicitly.
Reproducibility
Question:
Does the choice between eager and output-driven commit change for
different applications?
Criticism:
No discussion of backing storage based on non-volatile flash-based
memory, and what effect its adoption would have on the design decisions
embodied in xsyncfs.
Ideas for further work: Address the above criticism.
Novel Idea:
Hide the output of processes with pending disk operations from the user
until these operations are durable to provide consistent durability
without modifying or stalling the application.
Main Result(s):
The above scheme can make all application writes durable with
substantially less overhead than existing approaches.
Impact:
Shows that default-durable writes can be provided practically for
unmodified applications.
Evidence:
The authors demonstrated that their scheme provided durability by
powering down systems and comparing external UDP-based logs with the
contents of its disk; they ran several synthetic benchmarks to compare
its performance to ext3 with various levels of
synchronization/journaling options.
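The check behind that durability experiment can be sketched as a comparison
between what an outside observer saw acknowledged and what survived on disk.
This is only an assumed shape for the comparison; the function name and the
record lists below are hypothetical, not data from the paper.

```python
def durability_violations(external_log, disk_contents):
    """Return records an outside observer saw acknowledged that are missing on disk."""
    on_disk = set(disk_contents)
    return [record for record in external_log if record not in on_disk]


# Hypothetical data: the remote logger saw acks for records 1-3 before power was cut.
external_log = [1, 2, 3]
disk_after_reboot = [1, 2, 3, 4]   # record 4 was durable but never acknowledged
lost = durability_violations(external_log, disk_after_reboot)

# A file system that acknowledged record 3 before it was durable would fail the check.
lost_unsafe = durability_violations(external_log, [1, 2])
```

Extra records on disk are fine; the guarantee is violated only when
something externally visible has no durable counterpart.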
Prior Work:
Speculator (from which the hide-visible-output hackery comes);
journaling filesystems.
Competitive work:
Hardware solutions (battery-backed RAM; flash; etc.); databases; making it
easy to program with asynchronous I/O and wait for its completion.
Reproducibility:
The durability experiment is not well-enough described to faithfully
reproduce (what was the workload? though they did only present
qualitative results); the others seem to be well enough described.
Question:
Should operating systems be exporting these default-durable interfaces
to applications (or, instead, should one require applications to be
explicit about their durability concerns and try to allow them to easily
express this)?
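For contrast, the explicit alternative raised in this question already
exists in the form of fsync. A minimal sketch using Python's standard
os.fsync call follows; the durable_append helper and the file name are
illustrative, and whether the data truly reaches the platter still depends
on the disk-cache and write-barrier behavior discussed in other reviews.

```python
import os
import tempfile


def durable_append(path, data):
    """Append bytes and ask the OS to flush them to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push its cache to the device


path = os.path.join(tempfile.mkdtemp(), "log.bin")
durable_append(path, b"commit 1\n")
durable_append(path, b"commit 2\n")
with open(path, "rb") as f:
    contents = f.read()
```

Here the application blocks on every call and pays the synchronous cost
itself, which is the trade-off external synchrony tries to hide.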
Criticism:
- Some quantitative results (e.g., how out of date was the data? how much
was written and how often was it corrupted?) from the durability
experiments would be nice.
- The authors do not measure how they affect the user-perceived latency
of transactions in the MySQL benchmark and of operations in the PostMark
benchmark.
Ideas for further work:
- Doing this at the language or library level on top of unmodified OSes.
- Providing applications finer-grained control in the presence of a
mechanism like this (and a default-durable policy).
- Doing this for a networked file system.