CS262B Advanced Topics in Computer Systems
Spring 2009

David E. Culler

Paper Title:  Rethink the Sync
Author:  Edmund Nightingale, Kaushik Veeraraghavan, Peter Chen, and Jason Flinn

Novel Idea (Describe in a sentence or two the new ideas presented in the paper):
-A synchronous IO interface with performance similar to that of asynchronous IO. This is achieved by lying to the app and telling it the IO was durably committed, then putting the application into a speculative state in which all externalization (anything visible to the user) that depends on the (invisibly asynchronous) IO is buffered until it is *actually* durable.
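
A minimal sketch of the buffering idea described above, assuming a toy in-memory model. The class and method names are hypothetical and are not taken from xsyncfs; the sketch only illustrates that writes return at once while dependent output is held until the data is durable.

```python
class ExternallySyncFS:
    """Toy model: writes return at once; output waits for durability."""

    def __init__(self):
        self.pending = []      # writes acknowledged to the app but not yet on disk
        self.disk = []         # writes that are actually durable
        self.held_output = []  # output that depends on pending writes
        self.visible = []      # output the user has actually seen

    def write(self, data):
        # Returns immediately, as an asynchronous write would.
        self.pending.append(data)

    def output(self, message):
        if self.pending:
            # The output depends on uncommitted writes, so hold it back.
            self.held_output.append(message)
        else:
            self.visible.append(message)

    def commit(self):
        # Make pending writes durable, then release the held output in order.
        self.disk.extend(self.pending)
        self.pending.clear()
        self.visible.extend(self.held_output)
        self.held_output.clear()

    def crash(self):
        # Pending writes and held output are lost, but the user never saw them.
        self.pending.clear()
        self.held_output.clear()
```

In this model a crash before commit() loses the write, but the user never observed any output claiming it succeeded, which is the guarantee the review describes.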

Impact (What is the importance of these results.  What impact might they have on theory or practice of Computer Systems):
-So far as I can tell, nobody uses this, though the Speculator work is intriguing, with its roots obviously in architecture (branch prediction). We cited the Speculator work in the Hadoop paper last year.

Evidence (What reasoning, demonstration, analytical or empirical analysis did they use to establish their results):
-They run an Apache build as well as two well-known benchmarks:  TPC-C (on MySQL) and SPECweb99

Prior Work (What previously established results does it build upon and how):
-Built on top of the Speculator project.

Competitive work (How do they compare their results to related prior or contemporary work):
-This work was highly novel and intriguing with impressive measured results. Similar work includes hierarchical storage systems that use NV/battery-backed RAM for quick-returning IO calls.

Question (A question about the work to discuss in class):
-Would you be willing to program against this?

Criticism (A criticism of the work that merits discussion):
-Futzing with externalized IO is not very intuitive to think about, so I would argue that it weighs heavily against their claim to simplify how we think about IO (by making everything feel like synchronous IO). This is especially so since it is touted as a filesystem, which we think of as living strictly underneath our applications in the architecture, not encapsulating them.

Ideas for further work (Did this paper give you ideas for future work, projects, or connections to other work?):
-Speculator + <insert arbitrary other systems topic here... at the risk of being predictable... "clouds">


Novel Idea
Treat the user (output) as the entity you are providing guarantees to, and provide a synchronous-looking interface to that user, but run asynchronously under the covers.
Main Result(s):
This system achieves nearly the same performance as asynchronous file systems (within 7%) while providing a synchronous output interface to the user.  This design lets the user know what to expect about the durability provided by the fs after a crash, since whatever output is observed (received through the network, etc.) corresponds to data that has been committed to disk.
Impact:
This is important in reconciling the guarantees offered by systems that trade off performance and durability.  You can make the system perform well by running an asynchronous system under the covers, but only report what's really durable.
Evidence:
They ran xsyncfs on several benchmarks (PostMark, Apache build, MySQL, SPECweb99, etc.) and compared it with ext3 in synchronous and asynchronous modes.
Prior Work:
This essentially builds on the work done for asynchronous file systems.
Competitive work:    No directly competing work.  Only current implementations of synchronous and asynchronous file systems.
Reproducibility
The actual implementation of the system is part of a larger implementation of an operating system.  I don't think the results would be reproducible without considerable effort.
Question:
Given the various tracing mechanisms that had to be put in place to get xsyncfs to properly buffer output to the user, how effective would this method be in other systems?
Criticism:
How well does this adapt to system changes?  It seems there's a close dependence on the tracing aspects to determine when output should be buffered.  If something in the system changes, it seems like this might be difficult to include in the file system code after the fact.
Ideas for further work:     <Did this paper give you ideas for future work, projects, or connections to other work?>



Novel Idea:  Traditionally, operating systems have considered only the kernel state to be internal state, while application state has been deemed external; this is called an application-centric paradigm.  The authors propose applying to file systems a user-centric paradigm inspired by the notion of consistent recovery from fault tolerance, so that internal state includes both kernel and application state, and external state comprises merely that which is visible to the user.  The user-centric paradigm enables a compromise between previously-existing synchronous and asynchronous file systems that achieves near-asynchronous performance while presenting a synchronous facade to the user, simplifying application development.
Main Result(s):  The authors' externally synchronous xsyncfs provides durability while nearly matching the performance of asynchronous ext3, the Linux FS, on several benchmarks.
Impact:  If it has not yet been included in Linux, it likely will be.
Evidence:  The authors implemented a file system, xsyncfs, based on their principle of external synchrony.  They ran several benchmarks (each of which primarily exercised a different resource) to compare its performance with synchronous and asynchronous versions of ext3, with and without write barriers.  They also determined the speedup from their output-triggered commit strategy, where commits are batched until an output request.
Prior Work:  The authors were influenced by Lamport's notion presented in "Time, Clocks, and the Ordering of Events" of causal ordering via the "happens before" relation.  They also applied ideas from their previous work, Speculator, in which they improved performance by predicting the return value of RPCs to avoid blocking.  They also borrowed ideas from fault tolerance, most notably one of the cornerstone ideas of their work, which is the user-centric paradigm.
Competitive work:  The main competition to providing durability for asynchronous file systems involves special purpose hardware where the volatile disk caches are backed by batteries.  They assert that there are no competing software solutions.
Reproducibility:  Implementing a file system would take a lot of work, but the ideas seem simple enough that I could get started.  I'm sure that they did a lot of engineering that they didn't include in the paper to get such good performance.
Question:  The speedup from output-triggered commits does not seem remarkable, although the idea is quite intuitively appealing.  How could this be improved?
Criticism:  The authors do not provide a good story for rolling back in case of disk failures as they do in Speculator.  They assert that a checkpointing scheme like the one used in Speculator would be overkill in this case and that the OS should really be handling these failures rather than the application.  I feel that this was weak; they are definitely missing a good way to handle failures.  Being able to roll back when you have incorrectly taken a speculative action is essential.
Ideas for further work:  How would these ideas apply to a distributed file system?

Novel Idea

This paper presents external synchrony, a filesystem model that provides synchronous-I/O-style interfaces and guarantees with performance more like that of asynchronous filesystems. Notably, this is accomplished while preserving the existing APIs, eliminating the issue of writing applications against this system.

Main Result(s):

The authors show that their system successfully runs and has high performance. The performance of the system is on par with an asynchronous filesystem while wildly faster than synchronous filesystems, all while providing synchronous-style guarantees.

Impact:

This is an important advance in the field of transactional storage, particularly as computing devices become more and more susceptible to errors and disk I/O becomes the major bottleneck in production systems. One would reasonably expect this style of design to pervade future filesystems.

Evidence:

The authors ran a wide variety of empirical tests on their system, demonstrating that it has excellent performance and works as advertised. They show that high-I/O workloads do not significantly hamper the effectiveness of their system and that even commodity applications show major speedups.

Prior Work:

High-performance hardware (such as Rio and Conquest) has been providing high-performance synchronous I/O for years. All work on log-based file systems and write-ahead logging has been focused on the similar issue of preserving data consistency.

Competitive work:

The authors primarily compare their work to ext3 in its synchronous and asynchronous modes. By demonstrating that the performance of their system is superior to these implementations, they show the viability of their approach.


External synchrony is different from conventional views of data consistency in that it focuses only on data that is visible to the outside world, rather than on artificial measures of correctness.

Reproducibility

The ext3 portions of the evaluation would be fairly easy to reproduce. Running the tests over the xsyncfs system would just require a copy of the filesystem, though I could guess that it is non-trivial to run.

Question:

I have a feeling that this doesn't really work, for some reason. Are people using this technology? Does it have a chance of working its way into conventional distributions?

Criticism:

The issue of losing work due to power loss does not seem to be adequately solved by this approach. If the authors did not see this as a focus of their work (which it appears they did not), it was probably unwise to belabor the issue in their introduction.

Ideas for further work:

It seems like this work has implications for developing regions, where system stability is a greater issue than it is here.

Novel Idea:

By only flushing writes to disk when the effects of that write are visible, applications can get the latency of asynchronous writes with the semantics and durability of synchronous writes.

Main Result(s) & Evidence:

The authors modified the Linux OS and ext3 filesystem to support externally synchronous operation.  They ran a number of benchmarks including PostMark, Apache, MySQL, and SPECweb99, demonstrating high performance.  They argue for the durability guarantees provided by their own file system, but don't perform any sort of fault injection against it.  They do demonstrate that ext3, even in supposedly safe, synchronous modes, doesn't truly offer durability.  The implicit argument is that conceptually simple models of safe writes are simply too expensive to offer in real systems.

Criticism:

The paper cites the limitation of the external synchrony approach in dealing with partial failures in the form of bad disk writes.  (These, in theory, could be dealt with by the application, though the paper points out this capacity is not implemented in existing file systems.)  More generally, this approach seems weak in the presence of partial failures.  That is, the durability guarantees are only provided when the dependency graph is maintained.  Any failure in the full chain of components could result in arbitrary data loss.

Impact:

I actually really like this paper.  It's got a nice Schroedinger's cat angle to it.  And while it appears that the only effect of a crash is the appearance that the crash occurred earlier than it actually did, I think the departure from intuitive semantics, the weakness in the presence of certain types of failures, and the availability of "transparent" hardware solutions are going to limit adoption.

Ideas for further work:

Problems with partial failures aside, network output doesn't really necessarily constitute "observed" output since the recipient is itself a computer. If dependency information could be propagated across the network, it seems like this could be extended.

Question:

If you write to a disk sector that's never read, can it be corrupted?

Prior & Competitive work:

The most direct competition to the paper is hardware solutions that combine persistent storage media with some sort of "safe" fast-access buffer (battery-backed RAM, etc.).  These solutions can offer the effects of asynchronous writes without OS or file system modifications.

Reproducibility:

The idea is relatively simple and while there may be some tricky implementation details in preserving dependencies and ordering, it seems relatively easy to test the core ideas.



Novel Idea:
We can efficiently provide durability and ordering guarantees if we focus on
ensuring the synchronization of user-perceived events.  Current approaches
assume that these guarantees are too costly because they attempt to provide
them to applications.
Main Result(s):   
The implementation of external synchronization achieves almost the same
performance as that provided by asynchronous implementations that provide no
guarantees. Furthermore, the synchronous solution that does provide the same
guarantees is much more costly.
Impact:           
These results are significant because they fundamentally challenge
assumptions about how guarantees in file systems should be provided. As
such, new mechanisms can be developed that provide guarantees and achieve
high levels of performance.
Evidence:
The authors compare their xsyncfs to asynchronous ext3, synchronous ext3,
and synchronous ext3 with write barriers for a variety of workloads. These
include the Postmark benchmark, an Apache build benchmark, a MySQL
benchmark, and a SPECweb99 benchmark. In all cases except SPECweb99, xsyncfs
dramatically outperforms both synchronous ext3 implementations.
Prior Work:       
This work builds upon previous advancements in file systems. For example,
xsyncfs makes use of write ahead logging. This work also borrows many ideas
from fault tolerance and specifically focuses on the difference between
traditional fault tolerance approaches and those currently used in file
systems. 
Competitive work:
Some work such as the Conquest file system also provides high-performance
synchronous I/O, but it requires specialized hardware mechanisms. Other
work, such as transactional file systems, provides not only the durability
and ordering guarantees provided by xsyncfs, but also atomicity. However,
these file systems require applications to be modified.
Reproducibility:
It would be difficult to reproduce this work without the source code used in
the experiments because of the modifications made to the kernel to track
dependencies. As such, different implementations could lead to different
overheads.
Question:
Couldn’t delaying the output cause significant problems for certain types of
applications? For example, say an application expects that the user will
perceive a certain output within a certain period of time. If it takes a
significant amount of time before all of the necessary modifications are
committed, then couldn’t this cause problems for these types of
applications?
Criticism:
The MySQL benchmark assumes that many clients will be located on the same
machine as the server. This assumption results in a significant performance
improvement for xsyncfs. However, it seems as though this assumption will
not be valid in most cases. Also, the discussion of speculative execution
seemed to be unnecessary and distracted from the discussion of xsyncfs.
Ideas for future work:
I would like to perform analysis to see if the ideas here could be used in
other types of file systems.


Novel Idea: External synchrony: synchronous updates only matter when
viewed by an external agent.  Thus they build a system based on ext3 that
has similar performance benefits to async mode (only talking about
writes), but has the durability guarantees of sync'd updates.  They do
it by buffering the outputs of any process tainted by a pending commit,
until that transaction commits, similarly to their work with Speculator
in NFS.  They are using the output to be a signal of when to sacrifice
throughput for latency.
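
The tainting described above can be sketched roughly as follows. This is a toy model under stated assumptions: the Process class, the integer transaction ids, and the function names are all hypothetical, and the real system tracks dependencies inside the kernel rather than in user-level objects.

```python
class Process:
    """Toy process that remembers which uncommitted transactions it depends on."""

    def __init__(self, name):
        self.name = name
        self.deps = set()  # ids of uncommitted transactions (hypothetical ids)


def fs_write(proc, txn_id):
    # A process that modifies the file system depends on that transaction.
    proc.deps.add(txn_id)


def ipc_send(sender, receiver):
    # Receiving data from a tainted process taints the receiver as well.
    receiver.deps |= sender.deps


def may_externalize(proc, committed):
    # Output may be released only once every dependency is durable.
    return proc.deps <= committed
```

With this model, a client that only talks to a tainted server is itself held back until the server's transaction commits, which is the effect the MySQL criticism below relies on.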

Main Result: They showed better performance than ext3 in sync mode
(with or without write barriers) and reasonably close performance to
async ext3.

Impact: Not sure.  The main value for this may be the discussion about
what is the appropriate point to sync things.  Normally, I'm in favor
of letting the process express its requirements, instead of the kernel
trying to guess when it wants the data committed.  You might be able to
get more group commits this way too.  Also, returning from an fsync
while data isn't write barriered seems more like a bug than a matter of
this philosophy.

Evidence: A few benchmarks, Postmark, making a webserver (which I found
funny - everyone has to have a webserver benchmark!  the standard is
really a kernel build), then a real webserver benchmark. 

Prior Work: Speculator, Lamport causal ordering, various systems with
battery backed RAM or NVRAM, ext3, and some fault tolerance work.

Competitive Work:

Reproducibility: Not easily.  They used an ancient kernel and made a
lot of local modifications, since this is the sort of thing that gets
its hands everywhere.

Question: How much easier or more difficult would this be to implement
in a microkernel?  Tendrils everywhere!

Criticism:
- They should have used ext3 async with write barriers.  The only time
they did it was the hokey MySQL benchmark, and should have done it
throughout.  Kinda lame to claim they have better durability than async
ext3 when write barriers exist.  Though to be honest, no one uses them.
(see link below).
- Their MySQL benchmark shouldn't be taken very seriously.  They went
out of their way to set up a group commit scenario, by putting the
client on the server machine.  They were able to use the kernel's
causal tainting of the client and waited until the client output, which
is cheating when we're talking about MySQL benchmarks.  Yes, it's a
neat way to show a benefit of group commit, but they can't claim it's a
real workload.
- They were rather repetitive, and sometimes hid the details - like how
they use write barriers in xsyncfs.  It's there, but if you skim too
fast over some repetitive parts, you can miss it.
- They take a while to say they flush commits every 5 seconds.  When
they talk about durability, there are really two aspects - the external
synchrony as well as actually saving the data quickly.  They took a
little too long (page 5) to really address the latter.

Ideas for further work:
- To what extent even write barriers are enough isn't clear.  I've
heard of hard drives that lie about doing write barriers when under
heavy traffic, and also when power fails, the sector you are writing to
can be scribbled on with random data (as well as other places).  I
haven't read about them in a while, but some of the
snapshotting/COW/checksummed filesystems supposedly can have consistent
views of the FS that you switch through atomically.  You never have the
writes in an invalid order (detected with a checksum), and if the main
commit does not happen, you still have the old consistent view.  I
think. 
- Here's an interesting article I read a while back (and again for
this): http://lwn.net/Articles/283161/
- Think a bit about what is the right interface for the process, and
true async I/O (all calls non-blocking).  The coder *knows* that
performance is sacrificed for synchrony - let them choose.  Though that
still doesn't answer the "what is synchrony" question.
- Remember to use Postmark and RAMFS (among others) for filesystems.
- Think about this wrt the microkernel question above.


Rethink the Sync

Summary: neither sync nor async file systems really do what you want
them to since sync is slow and async breaks ordering.  We can
essentially add ordering to an otherwise async system by tracking
dependencies between disk operations and lazily flushing updates to
disk to maintain a bound on stale state or reduce latency.

Strengths: Speculator was an impressive piece of engineering; this has
the benefit of apparently being significantly simpler than that system,
to the point where it's conceivable it may see some adoption.  xsyncfs
seems to perform comparably to an asynchronous design modulo the overhead
of tracking causal dependencies.  The discussion about working around
NCQ with write barriers was illuminating.

Weaknesses: Reasoning about the new consistency model seems less
obvious than what they're replacing.  Is their "external synchrony"
model stronger than an "ordered asynchronous" model where file system
ops don't block but are committed in order?  I think you need their
system to give you a total ordering for that latter model to make
sense, although perhaps there are other total orders that would work?

Rethink the Sync:
Novel Idea:
I think the primary novel idea here is the recognition that durability doesn't really *mean* anything to an application, but only to a user of an application, so if we can make things look synchronous to a user it's as good as making it look synchronous to the application.  This involves a re-definition of the word 'durable', but it's still a novel and interesting idea.  They call this external synchrony.

Main Result:
The authors leverage Speculator, a prior project of theirs, to implement a system that provides external synchrony with a very low overhead as compared to async ext3.

Impact:
External synchrony is an interesting idea, and it follows the general pattern of Chen's work which seems to see what one can get away with if you hide effects from the user but let the system keep running.  I could imagine this being applied in other areas like caches as well.

Evidence:
It's clear that allowing an application to continue processing rather than blocking for synchronous i/o will allow it to run faster.  They also provide some pretty good benchmarks to show that they come within 7% of the performance of async ext3.  What I'm not so sure about is how many real world apps would benefit from this as many to most probably do something external almost immediately after writing data.

Prior/Competitive Work:
Speculator is obviously an important prior project for this one.  Other work on non-volatile caches in HDDs should also be considered competitive.  I don't think there was other work on such a low-level, optimistic approach, however.

Reproducibility:
Much like Speculator, reproducing this work would be quite difficult.  It's not quite as tough as Speculator since the checkpointing code isn't needed, but it's still a huge modification to the kernel to buffer all external i/o.

Question:
External i/o can be quite a lot of data quite quickly (writing to the screen for instance).  What impact does xsyncfs have on kernel memory footprint?

Criticism:
Implementing xsyncfs is clearly a whole lot of work.  In the real world people don't seem to be too concerned with async filesystems.  Disks lose data for all sorts of reasons that no fs can fix (forgetting bits, getting dropped, melting, etc.), so fixing this one little case with all this effort doesn't seem to be all that useful.

Ideas for further work:
Optimistic concurrency is pretty well understood in the db community.  It would be interesting to look at what work there overlaps and where else in common slow routines we could assume things and go faster and then roll back if our assumptions were wrong.  Multi-processor cache coherence seems a logical place to look, although maintaining all the state for rolling back could be quite difficult.



Novel Idea
   <Describe in a sentence or two the new ideas presented in the paper>
The key idea of this paper is to write data to disk only when a
subsequent observable event happens. They define this as "external
synchrony." This allows programmers to write code in a synchronous
model but achieve asynchronous performance while maintaining some
durability guarantees.

Main Result(s):
   <Describe in a sentence or two the main results obtained in the paper>
Using external synchrony allows significant performance improvements,
within 7% of an asynchronous FS, while maintaining arguably better
durability. Specifically they note that due to disk caches,
synchronously committed data is not really durable. Based on this they
argue that the model of having all observable data committed is
better.

Impact:
   <What is the importance of these results.  What impact might they
have on theory or practice of Computer Systems>
Fundamentally we see a tradeoff between durability and performance.
Also the idea of delaying actions until they could be observed
externally seems interesting and more generally applicable.

Evidence:
   <What reasoning, demonstration, analytical or empirical analysis
did they use to establish their results>
They show timings of file system benchmarks running on their
"externally synchronous" FS compared to synchronous and asynchronous
ext3. In most cases they perform slightly worse than async ext3 but
better than synchronous ext3. The worst results are on SPECweb99 where
throughput is below synchronous ext3 due to sending many network
responses which externalize data.

Prior Work:
   <What previously established results does it build upon and how>
This work is based on the Speculator project. The Speculator code is
used to track dependencies within the kernel.

Competitive work:
   <How do they compare their results to related prior or contemporary work>
Systems using non-volatile RAM can be seen as competitive work. These
systems allow the best of both worlds, true durability (not just
"external") and fast writes. The downside of course is the added cost.

Reproducibility
   <Could you reproduce the findings?  If so, how?  If not, why not?>
Unlikely. This work seems very technically complex and the necessary
OS instrumentation is not described in this paper.

Question:
   <A question about the work to discuss in class>
Could a notification method be designed to tell the application what
data is really committed? It seems that each layer is hiding data from
the next: the disk buffers data but returns success on writes and
xsyncfs buffers data but returns from blocking writes to the application.
Making the process more transparent should allow good performance
while letting the application choose the desired level of durability.

Criticism:
   <A criticism of the work that merits discussion>
1) I think it was unfair of this paper to state that it provides write
durability while synchronous ext3 does not (e.g. Figure 3). This is
only true when using two different definitions of durability: complete
durability in the ext3 case vs external durability in the xsyncfs
case.

2) As mentioned in the paper, application recovery would be very
complex. How does an application perform error recovery if it has no
way of knowing which writes actually committed? At least in the
synchronous ext3 model the application is guaranteed that, with
reasonable probability, every write call that has returned would have
completed before the crash (or at worst the last few writes were
lost).



Durability vs. performance concerns have existed since FFS and the original UNIX filesystem. Xsyncfs builds on Soft Updates to provide consistent recovery (rather than journaling). Rio and Conquest use battery-backed memory to make writes persistent. The output commit problem is well studied in the fault tolerance community but less commonly addressed in general-purpose applications and I/O. Highly reproducible.

Paper Title:    Rethink the Sync
Author:     Nightingale et al.
Date:     2006

Novel Idea
The authors introduce a new model of local file I/O (termed external synchrony) that resolves the tension between the durability of synchronous I/O and the performance of asynchronous I/O. This is accomplished by defining durability in terms of observable actions, and buffering all output that is causally dependent on uncommitted modifications. They use output-triggered commits to balance between throughput and latency.
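
To make the throughput side of output-triggered commits concrete, here is a small counting sketch. It assumes a simplified event trace of "write" and "output" events, takes "eager" to mean committing right after every write, and ignores any timer-based flushing; the function name and trace format are hypothetical.

```python
def count_commits(events, eager):
    """Count disk commits for a trace under an eager or output-triggered policy."""
    commits = 0
    pending = 0  # writes buffered since the last commit
    for ev in events:
        if ev == "write":
            pending += 1
            if eager:
                # Eager policy (assumed): commit right after every write.
                commits += 1
                pending = 0
        elif ev == "output" and pending:
            # Output-triggered policy: commit the whole batch only when
            # some output depends on it.
            commits += 1
            pending = 0
    return commits


trace = ["write", "write", "write", "output", "write", "output"]
eager_commits = count_commits(trace, eager=True)
batched_commits = count_commits(trace, eager=False)
```

On this trace the eager policy issues 4 commits and the output-triggered policy issues 2, because consecutive writes are grouped into one commit until an output forces them to disk.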


Main Result(s):
Durability is maintained even in the face of power outages. Performance is within 7% of asynchronous ext3 (and two orders of magnitude faster than synchronous ext3), even for I/O-intensive benchmarks.
Impact:
A relatively recent paper, so this is harder to judge. Only four citations on ACM portal, but it won best paper at OSDI. Roughly 100 Google hits for "external synchrony".
Evidence:
The authors compare the durability and performance of xsyncfs, and examine how xsyncfs affects the performance of applications that synchronize explicitly, and the effect of output-triggered commits on xsyncfs performance. All evaluations were performed on identical computers. Xsyncfs data is durable both on writes and on fsyncs. Performance was evaluated on a variety of benchmarks. Eager commit favors latency too heavily over throughput, resulting in decreased performance.
Prior Work:
Competitive work:     eNVy is a filesystem that stores data on flash-based NVRAM to improve write performance, and xsyncfs could serve a similar role. OS transactions and transactional filesystems provide similar durability and performance, but require applications to be modified to specify transaction boundaries explicitly.
Reproducibility
Question:
Does the choice between eager and output-driven commit change for different applications?
Criticism:
No discussion of backing storage based on non-volatile flash-based memory, and what effect its adoption would have on the design decisions embodied in xsyncfs.
Ideas for further work:     Address the above criticism.



Novel Idea:
Hide the output of processes with pending disk operations from the user
until these operations are durable to provide consistent durability
without modifying or stalling the application.
Main Result(s):
The above scheme can make all application writes durable with
substantially less overhead than existing approaches.
Impact:
Shows that default-durable writes can be provided practically for
unmodified applications.
Evidence:
The authors demonstrated that their scheme provided durability by
powering down systems and comparing external UDP-based logs with the
contents of its disk; they ran several synthetic benchmarks to compare
its performance to ext3 with various levels of
synchronization/journaling options.
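
The check in the durability experiment can be pictured with a short sketch. This is only an illustration under assumptions: the record ids are made up, and the function is a hypothetical stand-in for whatever comparison the authors actually ran between the external log and the disk.

```python
def lost_after_crash(externally_logged, on_disk):
    """Return records an outside observer saw acknowledged but that are missing from disk."""
    durable = set(on_disk)
    return [record for record in externally_logged if record not in durable]
```

An empty result means every write that was reported to the outside world survived the power loss; extra records on disk that were never reported externally are harmless under this definition.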
Prior Work:
Speculator (from which the hide-visible-output hackery comes);
journaling filesystems.
Competitive work:
Hardware solutions (battery-backed RAM; flash; etc.); databases; making it
easy to program with asynchronous I/O and wait for its completion
Reproducibility:
The durability experiment is not well-enough described to faithfully
reproduce (what was the workload? though they did only present
qualitative results); the others seem to be well enough described.
Question:
Should operating systems be exporting these default-durable interfaces
to applications (or, instead, should one require applications to be
explicit about their durability concerns and try to allow them to easily
express this)?
Criticism:
- Some quantitative results (e.g., how out of date was the data? how much
was written and how often was it corrupted?) from the durability
experiments would be nice.
- The authors do not measure how they affect the user-perceived latency
of transactions in the MySQL benchmark and of operations in the PostMark
benchmark.
Ideas for further work:
- Doing this at the language or library level on top of unmodified OSes.
- Providing applications finer-grained control in the presence of a
mechanism like this (and a default-durable policy).
- Doing this for a networked file system.