Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System
Gamsa et al.
Novel Idea: SMMPs need to focus more on locality and sharing. Their OS
has three mechanisms to address locality, most notably the Clustered
Object. In a clustered object, methods on an object get directed to
local representatives (per core, per cluster of cores, or even one for
the whole system). This allows quick accesses without contention,
paying the costs for when combining the results from the
Main Result: Their system showed good scalability for their tests,
which they implied was due to their locality and locking mechanisms.
Would have been nice to know for certain, e.g., via tracing/profiling/counters.
Impact: The clustered object was later used in K42. Also somewhat
similar to RCU. RCU (read-copy-update) is a synch mechanism that
allows very fast read-side accesses (practically no slowdown) as well as
concurrent updates. There are a lot of similarities between RCU and
Clustered objects and Tornado's garbage collection. Not sure if RCU
was influenced by Tornado.
Evidence: Nano- and microbenchmarks on a homemade 16-core NUMA machine
and on the SimOS simulator
Prior Work: Hurricane (predecessor system), dynamic memory allocator
from McKenney and Slingwine
Competitive Work: Hive, Disco
Reproducibility: Could ask them for the source and run it in SimOS. Or
take the ideas and rework part of an OS of your choosing.
Question: How common are in-core page faults? Both Tornado and Corey
use them as rationale for design choices.
Criticism: Their locking inside the object is nice, but doesn't seem
like a new idea. Also, they say that the traditional way to avoid
deletion races is to protect each reference with its own lock. It
sounds like they are saying you must hold a lock / have exclusive
access while using something. But a simple reference-counting scheme,
like those used in reader/writer locks, would work fine (and such
schemes are used heavily). You could even use an atomic instruction to
increment and decrement the reference count, which requires no locks.
Ideas for further work: One little comment of theirs is similar to some
of our ideas in the parlab. Specifically, they talk about how in some
cases it may be more efficient to ship work remotely and perform the
operation local to the data to avoid costs of contention, instead of
working via shared memory. Depends on what the cases are and how that
tradeoff changes as we gain many cores.
|The paper presented many ideas.
First, it recognized how hardware trends will affect operating
systems: improper sharing of data between processors on multiprocessor
machines will have drastic performance consequences.
So, they posit an object-oriented approach to writing operating
systems, where the object orientation implies:
- implementations can be swapped in and out as an application runs
- an object called a "clustered object" is used to interact
  with services and data
- clustered objects should be replicated so independent tasks
  don't require sharing a clustered object between different cores
- accessing clustered objects can be done lazily and through
  a level of indirection
They also discuss how memory allocation needs to be (re)considered in
a parallel operating system; how synchronization is more than just
locking, since it also deals with existence guarantees (a tough problem
without garbage collection, solved here using reference counting); and
how interprocess communication, the primitive used in this
microkernel-like operating system for accessing system services,
needs special support.
|We can scale operating systems to multiprocessors by
replicating kernel data structures while still keeping operating
system primitives relatively cheap.
|For one, this paper identifies a significant issue in scaling
operating systems to multiprocessors (i.e., data
sharing/contention). In addition, it shows how intelligent replication
can be used to handle the data sharing/contention problem.
|They provided microbenchmarks which evaluated thread creation, page
faults, and file stat operations. They showed how these microbenchmarks
performed in existing operating systems designed for multiprocessors.
|This work built off of their previous operating system, Hurricane,
although without reading that paper it is unclear exactly how much
carries over.
||Eh, I'm not really convinced that they compare themselves against
related work ... rather than just stating it. They do mention
similar work that isn't exactly comparable because of differing
environments (e.g., the idea of clustered objects in a distributed
system with Ethernet-like latencies versus bus-like latencies).
|Other than getting their resources (code and hardware), their
microbenchmarks seemed very straightforward.
|On contemporary operating
systems and hardware, we get implicit
data replication thanks to cache coherency (data sits in multiple
caches in the read state). Do we need a special operating system
primitive to provide replication or can we get away with just writing
better software that takes into account what data structures need to
get shared and providing either finer-grained locking or lock-free
variants of the current algorithms?
|It is difficult to imagine that
solving the kernel "data-sharing"
problem is sufficient; surely we also need to provide similar
mechanisms for applications ... of which I didn't see any discussed.
|Ideas for further work:
||In a current project I suffered from what these authors call
"existence guarantee" and provided a similiar solution (one based on
reference counting). It seems apparent that a language used to write a
parallel operating system should have some support for doing this sort
of thing ... perhaps another extension to the Ivy project?
To scale to multiple processors, an OS should keep independent
resources independent of each other and of the processors not using
them, and should make explicit the choices that must be made when a
resource is shared between multiple processors.
By avoiding unnecessary contention with the clustered object
abstractions, the Tornado OS appears to scale well to many processors
where existing commercial OSs' performance declines substantially.
It tells future operating system designers that they must think
about avoiding interprocessor contention to scale well to many
processors rather than relying on the traditional approach of a single,
shared global data structure which is locked on access.
The authors implemented an OS which usually kept data and flow
control local to each processor, with synchronization limited to the
level of a logical object (handling specially the case of object
deallocation). They measured the performance of this OS using
microbenchmarks on 16-processor (sometimes simulated) machines in
comparison to existing OSs.
The authors build upon a prior system of theirs, Hurricane, which
gained scalability by dividing a large multiprocessor into fixed-size
clusters of processors; here they learned to provide more flexibility
to applications than a fixed division of the hardware allows. Their
'clustered objects' are derived from similar work on sharing data
across distributed systems, though they believe they do not need to
handle faults and communication overhead to the same degree.
The other paper: focuses more on not sharing rather than sharing
intelligently, and gets similar scalability gains over existing OSs.
Work analyzing existing OS's performance on large SMPs and
proposing changes: certainly, such work is easier to deploy and may
avoid the mistake of spending lots of time scaling something whose
performance doesn't matter.
Hive: another from-scratch SMP OS, targets reliability and scalability instead of performance and scalability.
Not quickly; but it probably wouldn't be that hard to replace the
code managing some small set of resources in an OS with an
implementation using clustered objects and test if that performs as
well.
How much would (or would?) the feature of multiple implementations of an object be used in practice?
How would one measure the overhead of the paper's garbage collection scheme versus the 'traditional' approach?
If their OS "supports most of the facilities and services", why no non-micro benchmark?
Instruction count (used several times by the authors to brag about how
good their implementation is) isn't a great measure of the cost of an
operation, especially when dealing with problems of contention and
locality.
Ideas for further work:
- Networking Stacks in the Tornado Model?
- Choosing automatically between different implementations of an object? (1 rep/multireps)
- How much locking overhead in current OSs is for GC?