Tornado: Maximizing Locality and Concurrency in a Shared Memory
Multiprocessor OS
Gamsa et al.


Novel Idea: Shared-memory multiprocessor (SMMP) OSes need to focus
more on locality and on minimizing sharing.  Their OS has three
mechanisms to address locality, most notably the Clustered Object. In
a clustered object, method calls get directed to local representatives
(per core, per cluster of cores, or even one for the whole system).
This allows quick, contention-free accesses on the common path, paying
a cost only when results from the representatives must be combined.
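
To make the dispatch idea concrete, here is a minimal C++ sketch (my
own approximation, not Tornado's actual interface) of a clustered
counter: each core has its own representative, increments touch only
local state, and the combining cost is paid only when a global total
is needed.

    // Sketch of a clustered object with one representative per core
    // (hypothetical approximation, not Tornado's real interface).
    #include <array>
    #include <atomic>

    constexpr int kNumCpus = 16;

    // Each representative keeps purely local state, padded to its own
    // cache line so cores don't false-share.
    struct alignas(64) CounterRep {
        std::atomic<long> count{0};
    };

    class ClusteredCounter {
    public:
        // Fast path: touch only the calling core's representative.
        void inc(int cpu) {
            reps_[cpu].count.fetch_add(1, std::memory_order_relaxed);
        }

        // Slow path: combining the per-core results costs a pass over all reps.
        long total() const {
            long sum = 0;
            for (const auto& r : reps_)
                sum += r.count.load(std::memory_order_relaxed);
            return sum;
        }

    private:
        std::array<CounterRep, kNumCpus> reps_;
    };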

Main Result: Their system showed good scalability in their tests,
which they implied was due to their locality and locking mechanisms.
It would have been nice to see that claim confirmed with
tracing/profiling/hardware counters.

Impact: The clustered object was later used in K42. It is also
somewhat similar to RCU. RCU (read-copy-update) is a synchronization
mechanism that allows very fast read-side accesses (practically no
slowdown) as well as concurrent updates. There are a lot of
similarities between RCU and both clustered objects and Tornado's
garbage collection. I am not sure whether RCU was influenced by
Tornado.
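
For comparison, here is a heavily simplified illustration of the RCU
pattern in C++ (generic atomics, not the Linux kernel API; the
grace-period machinery is elided): readers take no locks, and the
updater publishes a new version and defers freeing the old one until
readers are guaranteed to be done, which is where it resembles
Tornado's existence guarantees and garbage collection scheme.

    // Simplified RCU-style read/update pattern (illustration only).
    #include <atomic>
    #include <string>

    struct Config {
        std::string hostname;
        int timeout_ms;
    };

    std::atomic<Config*> g_config{new Config{"localhost", 100}};

    // Read side: no locks, no shared-memory writes -- just a pointer load.
    int read_timeout() {
        Config* c = g_config.load(std::memory_order_acquire);
        return c->timeout_ms;
    }

    // Update side: publish a new version, then wait for a "grace period"
    // (until no reader can still hold the old pointer) before freeing it.
    // Tornado's garbage collection plays a similar role: an object is
    // destroyed only after all temporary references to it have drained.
    void update_timeout(int new_timeout) {
        Config* old_cfg = g_config.load(std::memory_order_relaxed);
        Config* new_cfg = new Config{old_cfg->hostname, new_timeout};
        g_config.store(new_cfg, std::memory_order_release);
        // wait_for_grace_period();  // elided: epoch/quiescent-state tracking
        // delete old_cfg;           // safe only after the grace period
    }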

Evidence: Nano- and microbenchmarks on a homemade 16-processor NUMA
machine and on the SimOS simulator

Prior Work: Hurricane (predecessor system), dynamic memory allocator
from McKenney and Slingwine

Competitive Work: Hive, Disco

Reproducibility: Could ask them for the source and run it in SimOS. Or
take the ideas and rework part of an OS of your choosing.

Question: How common are in-core page faults? Both Tornado and Corey
use them as rationale for design choices.

Criticism: Their locking inside the object is nice, but doesn't seem
like a new idea. Also, they say that the traditional way to avoid
deletion races is to protect each reference with its own lock. It
sounds like they are saying you have to hold a lock / have exclusive
access while using something. But a simple reference-counting scheme,
like the one used in reader/writer locks, would work fine (and such
schemes are heavily used). You could even use an atomic instruction to
increment and decrement a reference count, which requires no locks.
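
A minimal sketch of the kind of scheme I mean (my own code, using C++
atomics): pinning an object is a single atomic increment, and
reclamation happens on the last decrement, with no lock held on the
fast path. (Getting safely from a shared pointer to that first
increment is the remaining subtlety.)

    // Lock-free reference counting for existence guarantees (sketch).
    #include <atomic>

    struct Object {
        std::atomic<int> refs{1};   // one reference held by the owning table
        // ... object state ...
    };

    // Pin the object so it cannot be freed underneath us.
    Object* acquire(Object* obj) {
        obj->refs.fetch_add(1, std::memory_order_relaxed);
        return obj;
    }

    // Unpin; the thread that drops the last reference reclaims the object.
    void release(Object* obj) {
        if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete obj;
    }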

Ideas for further work: One small comment of theirs is similar to some
of our ideas in the parlab. Specifically, they talk about how in some
cases it may be more efficient to ship work to a remote processor and
perform the operation local to the data, instead of working via shared
memory, to avoid the costs of contention. It depends on what those
cases are and how that tradeoff changes as we gain many more cores.
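
A toy C++ sketch of what "shipping the work" could look like (names
and structure are mine, not theirs): rather than locking and touching
another core's data and bouncing its cache lines, the caller enqueues
a small closure on that core's inbox and the owning core applies it to
its local data.

    // Toy "function shipping": push the operation to the core that owns
    // the data instead of operating on remote data via shared memory.
    #include <deque>
    #include <functional>
    #include <mutex>

    struct CoreState {
        long local_counter = 0;                        // data owned by this core
        std::mutex inbox_lock;                         // only the inbox is shared
        std::deque<std::function<void(CoreState&)>> inbox;
    };

    // A caller on some other core asks `dst` to do the update on its behalf.
    void ship_increment(CoreState& dst) {
        std::lock_guard<std::mutex> g(dst.inbox_lock);
        dst.inbox.push_back([](CoreState& s) { s.local_counter++; });
    }

    // Each core periodically drains its own inbox, touching only local data.
    void drain_inbox(CoreState& self) {
        std::deque<std::function<void(CoreState&)>> work;
        {
            std::lock_guard<std::mutex> g(self.inbox_lock);
            work.swap(self.inbox);
        }
        for (auto& fn : work) fn(self);
    }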

Novel Idea
The paper presented many ideas. First, it recognized how hardware trends will affect operating systems: improper sharing of data between processors on multiprocessor machines will have drastic performance consequences. So they posit an object-oriented approach to writing operating systems, where the object orientation implies:
  • implementations can be swapped in and out as an application sees fit
  • an object called a "clustered object" is used to interact with services and data
  • clustered objects should be replicated so independent tasks don't require sharing a clustered object between different cores
  • accessing clustered objects can be done lazily and through a level of indirection (see the sketch below)
They also discuss how memory allocation needs to be (re)considered in a parallel operating system; how synchronization is more than just locking, since it also has to provide existence guarantees (a tough problem without garbage collection, solved here with reference counting); and how interprocess communication, the primitive this micro-kernel-like operating system uses for accessing system services, needs special support.
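
Here is a rough C++ approximation (mine, not the paper's actual data structures) of the lazy, indirected access path: each core's slot in a per-object table either points at a local representative or is empty, and a miss on first use creates the representative on demand.

    // Lazy representative creation behind a level of indirection (sketch).
    // Assumes each core only ever touches its own slot in the table.
    #include <array>
    #include <memory>

    constexpr int kNumCpus = 16;

    struct Rep {
        long hits = 0;
        void handle_call() { hits++; }   // purely local work
    };

    class ClusteredObject {
    public:
        void call(int cpu) { rep_for(cpu).handle_call(); }

    private:
        Rep& rep_for(int cpu) {
            if (!table_[cpu])                           // "miss": no local rep yet
                table_[cpu] = std::make_unique<Rep>();  // create it lazily
            return *table_[cpu];
        }
        std::array<std::unique_ptr<Rep>, kNumCpus> table_;
    };
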
Main Result(s):
We can scale operating systems to multiprocessors by intelligently replicating kernel data structures while still keeping operating system primitives relatively cheap.
Impact:
For one, this paper helps recognize a significant issue in scaling operating systems to multiprocessors (i.e., data sharing/contention). In addition, it shows how intelligent replication can be used to handle that problem.
Evidence:
They provided microbenchmarks which evaluated thread creation, page faults, and file status. They showed how these microbenchmarks performed in existing operating systems designed for multiprocessors.
Prior Work:
This work built off of their previous operating system, Hurricane, although without reading that paper it is unclear exactly how Hurricane differs.
Competitive work: Eh, I'm not really convinced that they compare themselves with related work ... rather, they just state the related work. They do mention similar work that isn't exactly comparable because of differing environments (e.g., the idea of clustered objects in a distributed system with Ethernet-like latencies versus bus-like latencies).
Reproducibility
Other than getting their resources (code and hardware), their microbenchmarks seemed very straightforward.
Question:
In contemporary operating systems and hardware today, we get implicit data replication thanks to cache coherency (data sits in multiple caches in the read state). Do we need a special operating system primitive to provide replication or can we get away with just writing better software that takes into account what data structures need to get shared and providing either finer-grained locking or lock-free variants of the current algorithms?
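
As a point of reference for the "lock-free variants" option, here is a Treiber-style lock-free push in C++ (a standard textbook construction, not tied to the paper). Note that even with no locks, all writers still compare-and-swap on the same head pointer, so coherence traffic on that cache line remains, which is the cost explicit replication tries to avoid.

    // Lock-free (Treiber-style) stack push: fine without locks, but every
    // writer still contends on the shared head cache line.
    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    std::atomic<Node*> g_head{nullptr};

    void push(int value) {
        Node* n = new Node{value, g_head.load(std::memory_order_relaxed)};
        // On failure, n->next is reloaded with the current head; retry.
        while (!g_head.compare_exchange_weak(n->next, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
        }
    }
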
Criticism:
It is difficult to imagine that solving the kernel "data-sharing" problem is sufficient; don't we also need to provide similar mechanisms for applications? I didn't see any discussed.
Ideas for further work: In a current project I suffered from what these authors call the "existence guarantee" problem and arrived at a similar solution (one based on reference counting). It seems apparent that a language used to write a parallel operating system should have some support for doing this sort of thing ... perhaps another extension to the Ivy project?


Novel Idea:

To scale to multiple processors, an OS should keep independent
resources independent of each other (and of the processors not using
them), and should make explicit the choices that must be made when a
resource has to be shared between multiple processors.

Main Result(s):

By avoiding unnecessary contention with the clustered object
abstraction, the Tornado OS appears to scale well to many processors
where existing commercial OSs' performance declines substantially.

Impact:

It tells future operating system designers that they must think
about avoiding interprocessor contention to scale well to many
processors, rather than relying on the traditional approach of a
single, shared global data structure which is locked on access.

Evidence:

The authors implemented an OS which usually kept data and control
flow local to each processor, with synchronization limited to the
level of a logical object (handling specially the case of object
deallocation). They measured the performance of this OS using
microbenchmarks on 16-processor (sometimes simulated) machines in
comparison to existing OSs.

Prior Work:

The authors build upon a prior system of theirs, Hurricane, that
gained scalability by dividing a large multiprocessor into fixed-size
clusters of processors, but they learned to provide more flexibility
to applications than a fixed division of the hardware. Their 'clustered
objects' are derived from similar work on sharing data across distributed
systems, though they do not need (they think) to handle faults and
communication overhead as much.

Competitive work:

The other paper: focuses more on not sharing at all rather than on
sharing intelligently, and gets similar scalability gains over existing OSs.

Work analyzing existing OSs' performance on large SMPs and
proposing changes: certainly, such work is easier to deploy and may
avoid the mistake of spending lots of time scaling something whose
performance doesn't matter.

Hive: another from-scratch SMP OS, targets reliability and scalability instead of performance and scalability.



Reproducibility:

Not quickly; but it probably wouldn't be that hard to replace the
code managing some small set of resources in an OS with an
implementation using clustered objects and test if that performs as
expected.



Question:

How much would the feature of multiple implementations of an object actually be used in practice (if at all)?

How would one measure the overhead of the paper's garbage collection scheme versus the 'traditional' approach?



Criticism:

If their OS "supports most of the facilities and services", why are there no non-micro (i.e., macro) benchmarks?

Instruction counts (used several times by the authors to brag about
how good their implementation is) aren't a great measure of the cost
of an operation, especially when dealing with problems of contention
and sharing.



Ideas for further work:

- Networking Stacks in the Tornado Model?

- Choosing automatically between different implementations of an object? (1 rep/multireps)

- How much locking overhead in current OSs is for GC?