Computer Science 252: Graduate Computer Architecture

University of California
Dept. of Electrical Engineering and Computer Sciences
David E. Culler
Spring 2005

Project Ideas

Return to optimal caching: analyses of internet streams, neighbor table maintenance in ad hoc wireless networks (Woo, MobiCom), and other stream-oriented applications suggest a potential for re-examination of
cache designs.  One unusual aspect of these cache algorithms is that
they will sometimes decide to not insert the object causing the miss.
Indeed, an optimal cache policy is to discard the object that will not
be referenced for the longest time into the future.  Of course, the
future is hard to know.  LRU approximates this by making the
assumption that objects that have been accessed recently in the past
will be accessed soon in the future.  But notice that the object
associated with the current miss has very little history.  This
project would re-examine the ancient literature on optimal cache
policies and the modern design of embedded systems for routing, L4-L7
processing, security, etc. to form a systematic characterization of
stream-oriented cache design.
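
A minimal sketch (the reference trace and cache size below are made up)
comparing LRU with Belady's optimal policy, which evicts the block whose
next use is farthest in the future.  Note how the optimal policy can
effectively bypass the cache: if the missing object's next use is farther
away than that of every resident object, there is no point inserting it.

from collections import OrderedDict

def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)          # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False) # evict least recently used
            cache[x] = True
    return misses

def belady_misses(trace, capacity):
    # Precompute, for each position, the index at which the same block recurs next.
    next_use, last_seen = [len(trace)] * len(trace), {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], len(trace))
        last_seen[trace[i]] = i
    cache, misses = {}, 0                 # block -> index of its next use
    for i, x in enumerate(trace):
        if x in cache:
            cache[x] = next_use[i]
            continue
        misses += 1
        if len(cache) < capacity:
            cache[x] = next_use[i]
        else:
            victim = max(cache, key=cache.get)
            # Bypass: only insert if the new block is re-used sooner than the victim.
            if next_use[i] < cache[victim]:
                del cache[victim]
                cache[x] = next_use[i]
    return misses

trace = [1, 2, 3, 1, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2]
print(lru_misses(trace, 3), belady_misses(trace, 3))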

Power implications of architectural trade-offs: About 95% of the
results obtained in the last 30 years of computer architecture have
focused on increasing performance.  Recently we have seen a spate of
power-oriented optimizations for laptops, including voltage and
frequency scaling focused mostly on media streaming.  We also have
some very low-level optimizations, such as shutting off subsections of
the design, reducing address variations on pins, using high-threshold
logic to reduce leakage, and subsetting the cache in various ways.
More recently we have embedded processors providing hardwired units
for high-level operations, such as TCP offload, compression, security,
and packet processing.  In principle, performance and energy
efficiency are not necessarily opposing, in that they both involve
getting the most utility out of a certain amount of hardware
resources.  Energy optimizations pay attention to the utility of the
resource-time product.  Take a line of architectural research (such as
cache organization, prediction, ILP, multiprocessing, thread support,
IO processors, or instruction set design) and examine its impact from
an energy efficiency viewpoint.  Part of the challenge here is that
the metrics of comparison and the appropriate benchmarks are much less
straightforward than performance.
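
As one illustration of the metric problem, here is a tiny calculation
(the times and power numbers are hypothetical) showing that two
configurations can rank differently under energy, energy-delay product
(EDP), and ED^2P:

configs = {
    # name: (execution time in seconds, average power in watts) -- assumed values
    "baseline":       (1.00, 50.0),
    "voltage-scaled": (1.60, 20.0),
}

for name, (t, p) in configs.items():
    energy = p * t
    edp = energy * t          # energy-delay product
    ed2p = energy * t * t     # weights performance more heavily
    print(f"{name:15s} time={t:.2f}s energy={energy:.1f}J EDP={edp:.1f} ED^2P={ed2p:.1f}")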

Microcontroller analysis: microcontrollers seem to evolve on a roadmap
that is unrelated to the Amdahl-Case rule and oblivious to the
progress in instruction set architecture.  Perform a survey and
analysis to determine whether this makes technical sense or is marketing,
or perhaps naivete.  (Does it really make sense to leave out the
caches?  To have architectural registers realized as memory locations?)
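
A hedged back-of-the-envelope check of the Amdahl-Case rule of thumb
(roughly 1 byte of memory and 1 bit/s of I/O per instruction per second);
the microcontroller numbers below are invented for illustration, not taken
from any datasheet:

def amdahl_case_ratio(mips, mem_bytes, io_bits_per_s):
    # Memory and I/O provisioning as a fraction of the balanced-system rule of thumb.
    mem_ratio = mem_bytes / (mips * 1_000_000)       # 1 byte per instruction/s
    io_ratio = io_bits_per_s / (mips * 1_000_000)    # 1 bit/s per instruction/s
    return mem_ratio, io_ratio

# Hypothetical 8 MIPS microcontroller with 4 KB RAM and a 250 kbit/s radio.
mem_ratio, io_ratio = amdahl_case_ratio(mips=8, mem_bytes=4 * 1024, io_bits_per_s=250_000)
print(f"memory at {mem_ratio:.1%} of balanced, I/O at {io_ratio:.1%} of balanced")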

Multithreading for power efficiency: Multithreading has been developed
to hide long latencies and utilize thread-level parallelism.  If we go
back to the CDC 6600, multithreading (in its peripheral processors) was
developed as an efficient way to handle a collection of asynchronous
events associated with different peripheral devices.  It would seem to
have lots of application in wireless sensor networks, where the node is
either juggling many streams (sensors, network, memory), sleeping, or
transitioning between those states.
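
A minimal sketch of the scheduling decision that matters for such a node:
cycle through ready threads barrel-style and drop into a sleep state as
soon as nothing is pending.  The sensor/radio/flash event trace is an
assumption for illustration.

import collections

def run_node(event_trace):
    # event_trace: list of (cycle, thread_name) arrival events, assumed sorted by cycle.
    ready = collections.deque()
    events = collections.deque(event_trace)
    cycle, active_cycles = 0, 0
    while events or ready:
        while events and events[0][0] <= cycle:
            ready.append(events.popleft()[1])     # wake the thread that owns this event
        if ready:
            ready.popleft()                       # round-robin: one slot per ready thread
            active_cycles += 1
        else:
            cycle = events[0][0] - 1              # nothing ready: sleep until the next event
        cycle += 1
    return active_cycles, cycle

active, total = run_node([(0, "sensor"), (1, "radio"), (10, "flash"), (11, "radio")])
print(f"active {active}/{total} cycles -> duty cycle {active/total:.0%}")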

Deja Vu: Every study you can find in ISCA, ASPLOS, or MICRO
represents an effort to understand a general phenomenon in the context
of some particular technological snapshot.  Often there are many
dimensions to the design space, and a large subset of the parameters
are set implicitly by the authors' technological assumptions.  Take an
interesting study, try to reproduce it with similar assumptions, and
investigate its sensitivity to those assumptions.  Are there crossover
points in technological dimensions or workloads where the conclusions
flip?
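
A toy example of the kind of crossover in question (all numbers assumed):
a simple CPI model in which the conclusion "technique A beats technique B"
flips once the memory latency crosses a threshold.

def cpi(base_cpi, miss_rate, miss_penalty):
    return base_cpi + miss_rate * miss_penalty

# Technique A: lower base CPI but a worse miss rate; technique B: the opposite.  (Hypothetical.)
for latency in (50, 100, 200, 400):
    a = cpi(base_cpi=1.0, miss_rate=0.02, miss_penalty=latency)
    b = cpi(base_cpi=2.0, miss_rate=0.01, miss_penalty=latency)
    print(f"latency={latency:3d} cycles  A={a:.2f}  B={b:.2f}  winner={'A' if a < b else 'B'}")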

Storage technologies on the horizon: current designs are a complex
federation of linked subsystems based on the characteristics of SRAM,
DRAM, Flash, and magnetic disks.  The labs are busy working on
alternative storage media with different mixes of access time, leakage,
volatility, density, cost, and bandwidth.  Do a survey to identify
some of these new directions.  Pick one and explore its architectural
implications.

Programmable logic for low-power: it is generally accepted that direct
implementations of algorithms can be orders of magnitude more
efficient than interpreting a set of instructions that embody a
sequential process for computing the result.  However, instruction
interpretation is fully general and allows for algorithmic advance.
Programmable logic offers something of the best of both worlds -
direct implementation that can be reprogrammed.  It is becoming widely
used in "tethered" embedded computing, such as packet processors,
routers, and the like.  However, each logic gate involves SRAM to hold
the configuration and an expensive lookup table to realize the boolean
function.  Thus, there is a large hardware overhead and an especially
large leakage implication.  Develop a convincing cost model from the
literature and identify the boundaries of applicability.
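
A possible first cut at such a cost model (every constant below is an
assumption, not taken from the literature); the point is the structure --
configuration SRAM bits and leakage per logic element -- not the numbers:

def fpga_overhead(gates, lut_inputs=4, leakage_per_sram_bit_nw=10.0, leakage_per_gate_nw=2.0):
    # Map `gates` simple gates onto LUTs: one truth-table SRAM bit per LUT entry.
    sram_bits = gates * (2 ** lut_inputs)
    fpga_leakage = sram_bits * leakage_per_sram_bit_nw + gates * leakage_per_gate_nw
    asic_leakage = gates * leakage_per_gate_nw        # hardwired implementation, no config SRAM
    return sram_bits, fpga_leakage / asic_leakage

bits, ratio = fpga_overhead(gates=10_000)
print(f"{bits} configuration bits, ~{ratio:.0f}x leakage vs a hardwired implementation")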

Reliability-oriented design: Many believe that processors are fast
enough for all but the most demanding early adopters and that future
designs need to focus more on reliability.  The extensive checking
and protection schemes found in Sun's enterprise products, versus what we
see in the PC space, are one such example.  Generally, reliability
involves creating redundancy, coding, and checking.  There is reason
to believe that reliability and performance can be complementary.  For
example, we routinely create redundant copies of data in the storage
hierarchy in order to have the most frequently accessed data close to
the processor.  We have many arithmetic schemes that use redundant
encodings to allow faster arithmetic operations.  We can provide more
ILP resources than we can feed with useful instructions.  We have
elaborate memory consistency protocols to preserve ordering.  How can we
utilize these opportunities to gain reliability or to be able to
adjust the performance/reliability point on a common set of resources?
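
A tiny sketch of the duality being suggested: the same redundant copies
kept for performance can be reused for error detection, and three copies
allow correction by majority vote.  The single-bit-flip fault model is an
illustrative assumption.

def detect(copy_a, copy_b):
    # Duplication (two copies): detects a single corrupted copy but cannot tell which.
    return copy_a == copy_b

def correct(copy_a, copy_b, copy_c):
    # Triple modular redundancy: majority vote masks one corrupted copy.
    return copy_a if copy_a in (copy_b, copy_c) else copy_b

word = 0b1011_0010
faulty = word ^ (1 << 3)                 # inject a single bit flip into one copy
print("error detected:", not detect(word, faulty))
print("error corrected:", correct(faulty, word, word) == word)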

Security: We have recently gone to great lengths to have an NX (no
execute) bit introduced in x86 virtual address translations so that
viruses have a hard time writing instructions into the stack segment.
They can still bang the return address, but at least the kernel or
SUID application will seg fault with XP Service Pack 2.  Can you think
of a software-transparent mechanism that would provide much greater
protection?  For example, a simple extension of return address
prediction could trap when the return address is not to the point of
call.  You might do range protection of portions of the address space
for stack, data, and instruction access.
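
A minimal sketch of the return-address check (the toy "machine" and
addresses are assumptions): record the return address on a shadow stack at
call time and trap if a return does not go back to the point of call.

class ReturnAddressGuard:
    def __init__(self):
        self.shadow = []                       # shadow stack, invisible to software

    def on_call(self, return_address):
        self.shadow.append(return_address)

    def on_return(self, target_address):
        expected = self.shadow.pop()
        if target_address != expected:
            raise RuntimeError(f"trap: return to {target_address:#x}, expected {expected:#x}")

guard = ReturnAddressGuard()
guard.on_call(0x400123)                        # call site pushes its return address
guard.on_return(0x400123)                      # normal return: matches, no trap
guard.on_call(0x400200)
try:
    guard.on_return(0xdeadbeef)                # smashed return address: guard traps
except RuntimeError as err:
    print(err)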

Encrypted instructions: There have been a couple of studies looking at
keeping all memory contents encrypted and decoding/encoding them at
the processor/memory boundary.  A much simpler notion is just to
encrypt all the instructions.  Each processor has a private key.
System software that loads binaries onto the local disk encrypts the
executables.  Chunks can be decrypted as they enter the instruction
cache.  Even if the mapping were very simple, like an XOR, it would still
change the statistics of attack success quite a bit.  You could even
recode from time to time.
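
A minimal sketch of the XOR scheme (the key width and chunk granularity
are assumptions): the loader encrypts instruction words with the
processor's private key, and the fill path decrypts a chunk only as it
enters the instruction cache.

import secrets

def encrypt_binary(words, key):
    # Loader side: XOR every 32-bit instruction word with the processor's key.
    return [w ^ key for w in words]

def fill_icache_line(encrypted_line, key):
    # Fill path: decrypt a chunk only as it enters the instruction cache.
    return [w ^ key for w in encrypted_line]

key = secrets.randbits(32)                       # per-processor private key
program = [0x8B450C00, 0x03450800, 0xC3000000]   # three made-up instruction words
on_disk = encrypt_binary(program, key)
assert fill_icache_line(on_disk, key) == program
print("on-disk image:", [hex(w) for w in on_disk])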

Offloading non-TCP: There has been a lot of work on TCP offloading;
however, there are lots of other parts of the network stack that could
potentially be offloaded, especially in the wireless space.  Whereas
TOE is generally for performance, many of these other uses would be
for energy efficiency.

Patterson often says, "Good benchmarks lead to good designs."  The
same can be said of good design analysis tools, especially
simulators.  Tools like SimOS, SimpleScalar, and the like have been
instrumental in the mainstream architecture community.  Develop
analogous tools for power-constrained network embedded systems.
(PowerTOSSIM is a starting point.)
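
The core accounting such a tool needs is simple; here is a sketch (the
current draws and voltage are assumed, mote-class values) that attributes
energy to a trace of component power states, in the spirit of PowerTOSSIM:

STATE_CURRENT_MA = {            # assumed per-component current draw in each state
    ("cpu", "active"): 8.0,  ("cpu", "sleep"): 0.01,
    ("radio", "rx"):  19.7,  ("radio", "tx"):  17.4,  ("radio", "off"): 0.0,
}

def energy_mj(trace, voltage=3.0):
    # trace: list of (component, state, seconds).  Returns millijoules per component.
    totals = {}
    for component, state, seconds in trace:
        mj = STATE_CURRENT_MA[(component, state)] * voltage * seconds
        totals[component] = totals.get(component, 0.0) + mj
    return totals

trace = [("cpu", "active", 0.05), ("cpu", "sleep", 0.95),
         ("radio", "rx", 0.02), ("radio", "off", 0.98)]
print(energy_mj(trace))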

Adversarial simulation: In the hardware design world, we do many kinds
of automated low-level fault detection and high-level fault
injection.  However, most network protocols are developed in
simulation environments that mimic typical behavior.  Take an
established simulator, such as NS-2 or TOSSIM, and extend it to provide
robustness analysis of wireless protocols.  It could, for example,
introduce loss, duplication, variations in connectivity, or delays to
search for "worst case" behavior or even break protocols.  Part of the
idea of adversaries is that there is some criterion that limits the
power of the adversary.  An algorithm can be shown to be resilient
against an adversary of a given power.
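
A minimal sketch of a bounded adversary (the channel interface and the
tampering budget that limits its power are assumptions; in NS-2 or TOSSIM
this would wrap the link or radio model):

import random

class AdversarialChannel:
    def __init__(self, budget, seed=0):
        self.budget = budget                    # max number of packets it may tamper with
        self.rng = random.Random(seed)

    def deliver(self, packet, now):
        # Return a list of (delivery_time, packet) events for one sent packet.
        if self.budget > 0 and self.rng.random() < 0.5:
            self.budget -= 1
            action = self.rng.choice(["drop", "duplicate", "delay"])
            if action == "drop":
                return []
            if action == "duplicate":
                return [(now, packet), (now + self.rng.uniform(0, 1), packet)]
            return [(now + self.rng.uniform(1, 5), packet)]     # delay
        return [(now, packet)]                                  # untouched delivery

channel = AdversarialChannel(budget=3)
for t in range(5):
    print(t, channel.deliver(f"pkt{t}", now=float(t)))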

Isolation for software reliability: A class of new applications, like
electronic voting, have very low performance requirements, narrow
functionality, and very high robustness and accountability
requirements.  Functionality can be distributed over a federation of
processors with narrow interfaces, as well as "smart cards" that are
inserted and devices reached across the network.  For one such application,
formulate a decomposition that would allow subsystems to be shown to
be "oblivious" to other parts, such as a keyboard input or display
component that cannot understand the ballot.

Use of randomization: Randomization has been used effectively in
algorithms to achieve very good expected complexity, even when the
worst case may be awful.  In architecture, mechanisms that improve
typical performance are often more attractive than those that improve
peak performance for a very narrow set of applications.  On the other
hand, we like predictability and reproducibility.  We have seen some
nice examples, like skewed-associative caches, in which it is hard to
fall into degenerate cases, like an offset equal to the cache size, but
you pay a little in the nice cases, like purely sequential access.  Are
there places where simple randomization has a big payoff?
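
A minimal sketch of the effect described above (toy cache model, assumed
sizes): with a conventional modulo index, a stride equal to the cache size
maps every access to one set and thrashes, while a randomized (hashed)
index spreads the same stream across sets.

def misses(addresses, num_sets, ways, index_fn):
    sets = [[] for _ in range(num_sets)]            # each set holds LRU-ordered tags
    count = 0
    for addr in addresses:
        s, tag = index_fn(addr) % num_sets, addr
        if tag in sets[s]:
            sets[s].remove(tag)
        else:
            count += 1
            if len(sets[s]) >= ways:
                sets[s].pop(0)                      # evict the LRU entry of this set
        sets[s].append(tag)
    return count

def modulo_index(addr):
    return addr

def hashed_index(addr):                             # simple XOR-fold "randomization"
    return addr ^ (addr >> 5) ^ (addr >> 10)

NUM_SETS, WAYS = 32, 2
strided = [i * NUM_SETS for i in range(64)] * 4     # stride equal to the number of sets
print("modulo:", misses(strided, NUM_SETS, WAYS, modulo_index),
      "hashed:", misses(strided, NUM_SETS, WAYS, hashed_index))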

Ideas from David Wood:

1) Shared cache scheduling
Operating systems have long used two-level schedulers to manage competition between threads for a machine's finite physical memory resources. For example, many systems use a global paging strategy (e.g., clock) to manage page allocation between active threads and a swapping strategy to manage the number of active threads.  Swapping limits the competition for pages, reducing or eliminating potentially severe performance degradation due to page thrashing.

As Chip Multiprocessors (CMPs) move to highly shared caches (e.g., Sun's Niagara will have 32 threads sharing a 3MB L2 cache, http://www.theinquirer.net/?article=19423) the competition for these limited physical resources will become intense.  One can imagine workloads where increasing the number of threads actually decreases throughput due to cache thrashing, just as we have long seen with page thrashing. What mechanisms should be included in hardware caches and what policies should the OS implement to limit this problem and maximize throughput?
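
A toy model of this effect (the miss-rate step, penalties, and bandwidth
cap are all assumed): each active thread gets an equal share of the shared
L2, per-thread IPC collapses once its share falls below its working set,
and beyond some thread count aggregate throughput drops -- exactly the
regime a two-level scheduler would try to avoid.

def throughput(threads, l2_kb=3 * 1024, working_set_kb=256, m_low=0.001,
               m_high=0.05, miss_penalty=300, mem_misses_per_cycle=0.2):
    share = l2_kb / threads
    mpi = m_low if share >= working_set_kb else m_high   # misses per instruction
    ipc = 1.0 / (1.0 + mpi * miss_penalty)               # per-thread IPC, simple in-order model
    aggregate = threads * ipc
    demanded = aggregate * mpi                           # off-chip misses per cycle, all threads
    if demanded > mem_misses_per_cycle:                  # memory bandwidth saturates
        aggregate *= mem_misses_per_cycle / demanded
    return aggregate

for n in (1, 2, 4, 8, 12, 16, 32, 64, 128):
    print(f"{n:3d} threads -> aggregate IPC {throughput(n):.1f}")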

2) Low-power memory-level parallelism
As processor frequency increases relative to memory latency, the importance of exploiting memory-level parallelism (MLP) dominates most other performance factors. But power efficiency dominates everything, and most current MLP techniques are power inefficient. For example, current out-of-order processors rely on power-hungry CAM circuitry to execute (memory) instructions in parallel. This circuitry does not scale to the large window sizes needed to cope with current, much less future, memory latencies. Conventional prefetching (e.g., IBM Power 4/5's strided prefetches) generates memory-level parallelism, but only works well for workloads with predictable access patterns (e.g., regular scientific codes). Recent proposals such as run-ahead execution and continual flow pipelines show great promise in being able to exploit MLP. But because they require some instructions to be fetched and executed multiple times before being committed, some researchers have raised concerns about their power efficiency.  Do processors based on run-ahead execution or continual flow pipelines have potentially better---or potentially worse---power efficiency than conventional processors?
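
A back-of-the-envelope model of that question (every parameter assumed):
run-ahead overlaps more misses (higher MLP) but re-executes a fraction of
instructions, so energy per instruction rises even as CPI falls; the
energy-delay product is one way to compare the two.

def run(name, mlp, reexec_fraction, miss_rate=0.01, miss_penalty=400,
        base_cpi=1.0, energy_per_instr=1.0):
    cpi = base_cpi + miss_rate * miss_penalty / mlp       # misses overlapped by MLP
    epi = energy_per_instr * (1.0 + reexec_fraction)      # energy wasted on re-executed work
    print(f"{name:10s} CPI={cpi:.2f}  EPI={epi:.2f}  EDP={cpi * epi:.2f}")

run("baseline", mlp=1.5, reexec_fraction=0.0)
run("run-ahead", mlp=4.0, reexec_fraction=0.35)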

Ideas from David Wagner

Here are two potential project ideas, inspired by e-voting:

1) Write-once non-volatile storage: Design and build a non-volatile
storage device that has write-once semantics, so that once a block is
written to storage, it cannot be erased or modified. Would be useful
for e-voting (ballot storage), if it can be made cheap and reliable.
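
A minimal software sketch of the write-once semantics (the block interface
is an assumption; the actual project is about enforcing this in the storage
device itself, where software cannot bypass it):

class WriteOnceStore:
    def __init__(self, num_blocks, block_size=512):
        self.blocks = [None] * num_blocks
        self.block_size = block_size

    def write(self, index, data):
        if self.blocks[index] is not None:
            raise PermissionError(f"block {index} already written; modification refused")
        if len(data) > self.block_size:
            raise ValueError("data larger than block")
        self.blocks[index] = bytes(data)           # committed: can never be changed

    def read(self, index):
        return self.blocks[index]

store = WriteOnceStore(num_blocks=4)
store.write(0, b"ballot #1: candidate A")
print(store.read(0))
try:
    store.write(0, b"tampered ballot")             # second write to the same block
except PermissionError as err:
    print(err)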

2) Reliable one-way communication link: Design and build a transmission
link that allows data to be sent reliably from A to B, with no possibility
of any information flowing B to A. Has applications in computer security.
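
A minimal sketch of the core difficulty: with no channel from B back to A
there can be no acknowledgements, so reliability has to come from forward
redundancy.  The toy below uses simple repetition over an erasure channel
(loss rate and repeat count are assumptions; a real design would use proper
erasure/FEC coding):

import random

def send_one_way(frames, repeats=5, loss_rate=0.3, seed=1):
    rng = random.Random(seed)
    received = {}
    for seq, frame in enumerate(frames):
        for _ in range(repeats):                   # A blindly repeats; it never hears from B
            if rng.random() >= loss_rate:
                received[seq] = frame              # B keeps whichever copy arrives
    return [received.get(seq) for seq in range(len(frames))]

frames = [b"record-%d" % i for i in range(10)]
delivered = send_one_way(frames)
print(sum(f is not None for f in delivered), "of", len(frames), "frames delivered")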

Ideas from Eric Brewer (an honorary David)

Some ideas for developing regions:

 - anything related to sensors (MEMS sensors are a good place to start)

 - multi-microphone audio chip for better speech recognition (cancels
out background noise)

 - a better thin client design -- should share the power supply,
network RAM, as well as disks/servers/network

 - smart phone projects



Old Project Ideas: http://www.cs.berkeley.edu/~culler/cs252-s03/projects.html