Abstracts for George C. Necula

The EECS Research Summary for 2003

A Compatible Representation for CCured Metadata

Jeremy Condit
(Professor George C. Necula)
NSF Graduate Research Fellowship

CCured is a program transformation system that provides memory safety guarantees for C programs using a combination of static verification and run-time checks [1]. In order to support run-time checks, the instrumented program must store metadata with certain pointers; thus, one-word pointers in the original program are represented as multi-word pointers in the transformed program.

Unfortunately, this multi-word representation is incompatible with the original C representation used by libraries that were not compiled with CCured. To solve this problem, we have designed an alternative representation for CCured metadata in which metadata for a given data structure is stored in a separate data structure that has a similar shape. For each value in the original program, our transformed program contains two values: one for data and one for metadata. Each operation in the original program is split as well, producing one operation on the data value and a parallel operation on the metadata value. An inference algorithm allows us to minimize the number of values within the transformed program that must use this compatible representation.
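The contrast between the two representations can be sketched in C. This is an illustrative sketch only, with invented names, not CCured's actual generated code: the multi-word ("fat") pointer changes the layout a library expects, whereas the compatible representation keeps the original layout and mirrors it with a parallel metadata structure.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-word representation: data and bounds metadata travel together,
 * so the layout no longer matches what an uninstrumented library expects. */
struct fat_ptr {
    int *p;     /* the original pointer value */
    int *base;  /* start of the referenced block (metadata) */
    int *end;   /* one past the end of the block (metadata) */
};

/* Compatible representation: the data keeps its original one-word
 * fields, and the metadata lives in a separate structure of similar
 * shape that the library never sees. */
struct node      { int *val;  struct node *next; };
struct node_meta { int *base; int *end; struct node_meta *next; };

/* A run-time check consults the parallel metadata value. */
static int check_deref(int *p, const struct node_meta *m) {
    return m->base <= p && p < m->end;
}
```

Under this scheme, an operation such as `n = n->next` in the original program is paired with `m = m->next` on the metadata side, keeping the two structures in step.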

We are currently using this representation to improve the usability of CCured when applied to large, real-world programs. In addition, we are investigating possible applications of this technique outside the realm of CCured.

[1] G. C. Necula, S. McPeak, and W. Weimer, "CCured: Type-Safe Retrofitting of Legacy Code," ACM Symp. Principles of Programming Languages, Portland, OR, January 2002.

Send mail to the author: jcondit@eecs.berkeley.edu

CCured: Type-Safe Retrofitting of Legacy Code

Jeremy Condit, Matthew Harren, Scott McPeak, and Westley Weimer
(Professor George C. Necula)
Microsoft Research, (NSF) CCR-9875171, (NSF) ITR CCR-0085949, (NSF) ITR CCR-0081588, and NSF Graduate Research Fellowship

CCured is a program transformation system that adds memory safety and type safety guarantees to C programs by verifying statically that memory errors cannot occur and by inserting run-time checks where static verification is insufficient. Programs instrumented by CCured run with modest (typically 10-50%) slowdown, due to the effective combination of static and dynamic checking.
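The flavor of the inserted checks can be illustrated with a hedged C sketch; the helper names and pointer-kind terminology below are illustrative simplifications, not CCured's actual output. A pointer that is never used in arithmetic needs only a null check, while a pointer used in arithmetic carries bounds and needs a bounds check before each dereference.

```c
#include <assert.h>
#include <stdlib.h>

/* A pointer never used in arithmetic needs only a null check. */
static int *deref_safe(int *p) {
    if (p == NULL) abort();                      /* inserted run-time check */
    return p;
}

/* A pointer used in arithmetic carries bounds metadata and is
 * bounds-checked before each dereference. */
struct seq { int *p; int *base; int *end; };

static int deref_seq(struct seq s) {
    if (s.p < s.base || s.p >= s.end) abort();   /* inserted run-time check */
    return *s.p;
}
```

Static verification removes such checks wherever the analysis can prove they always succeed, which is what keeps the overall slowdown modest.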

Perhaps the greatest potential impact of CCured is in the domain of security-critical software. Memory safety is an absolute prerequisite for security, and it is the failure of memory safety that is most often to blame for insecurity in deployed software. Further, CCured's relatively modest performance cost makes it plausible for security-critical production systems to use binaries compiled with CCured's run-time checks enabled. We have used CCured to guarantee memory safety for several real-world, security-critical programs, such as bind, sendmail, OpenSSL, and an FTP daemon.

Recent extensions to CCured include: support for object-oriented programming styles using physical subtyping of structs and run-time type information in pointers; and support for interoperability with legacy code by changing the representation of metadata. See the related abstract "A Compatible Representation for CCured Metadata," also in this Research Summary.

More information: http://www.cs.berkeley.edu/~necula/ccured/

Send mail to the author: matth@eecs.berkeley.edu

Elkhound: A Fast, Practical GLR Parser Generator

Scott McPeak
(Professor George C. Necula)

Elkhound is a parser generator based on the Generalized LR (GLR) parsing algorithm. Because it uses GLR, Elkhound can handle any context-free grammar, including grammars that are ambiguous or require unbounded lookahead. Thanks to a novel improvement to the GLR algorithm, in which the parser chooses between GLR and ordinary LR on a token-by-token basis, Elkhound parsers are about as fast as LALR(1) parsers on deterministic portions of the input. Unlike existing GLR implementations, Elkhound allows the user to associate arbitrary action code with reductions and to directly manage subtree sharing and merging.

To demonstrate Elkhound's effectiveness, we used it to build a parser for C++, a language notorious for being difficult to parse. Our C++ parser is small (3500 lines), efficient, and maintainable, and it elegantly handles even the most troublesome constructs by delaying disambiguation until semantic analysis when necessary.

More information: http://www.cs.berkeley.edu/~smcpeak/elkhound

Send mail to the author: smcpeak@eecs.berkeley.edu

Use of Randomization in Program Analysis

Sumit Gulwani
(Professor George C. Necula)
Microsoft Research, (NSF) CCR-0081588, (NSF) CCR-0085949, and (NSF) CCR-9875171

Analyzing programs is provably hard, and we have learned to pay a price in lost completeness (or precision) as well as in the complexity and running time of the analysis. This project explores how randomization can be used to obtain simplicity and efficiency at the expense of making soundness probabilistic.

We have applied randomized techniques to the problem of discovering affine equalities in a program [1]. This problem arises in several program analyses, including compiler optimizations (e.g., constant propagation, common subexpression elimination, elimination of conditionals), loop-invariant discovery, and translation validation. The key idea of the randomized algorithm is to execute a code fragment on a few random inputs, but in such a way that all paths are covered on each run; this makes it possible to rule out invalid relationships even with very few runs. The algorithm is based on two main techniques. First, both branches of a conditional are executed on each run, and at join points we take an affine combination of the joining states. Second, in the branches of an equality conditional we adjust the data values on the fly to reflect the truth value of the guarding Boolean expression, which increases the number of affine equalities that the analysis discovers. The algorithm is simpler to implement than alternative deterministic versions, has better computational complexity, and has an extremely small probability of error even for a small number of runs. It illustrates how randomization can trade a small probability of unsoundness for reduced cost and complexity of program analysis. We are currently exploring whether these ideas can be extended to handle other analyses involving inequalities, non-linear assignments, memory expressions, and dependent conditionals.
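The affine-join idea can be sketched concretely. The following C fragment is an illustrative toy, not the published algorithm: it interprets a conditional by executing both branches and combining the resulting states with a random weight w in a prime field, so that a valid affine equality holds on every run while an invalid one survives only for rare unlucky weights.

```c
#include <stdint.h>

/* Random interpretation over the field Z_P; the fragment, variable
 * names, and prime are chosen for illustration. */
#define P 2147483647ULL  /* prime modulus */

static uint64_t mod(uint64_t a) { return a % P; }

/* Affine combination w*a + (1-w)*b in Z_P. */
static uint64_t combine(uint64_t w, uint64_t a, uint64_t b) {
    return mod(mod(w * a) + mod(mod(1 + P - w) * b));
}

/* One run of the fragment
 *     if (*) { x = 2; y = 1; } else { x = 3; y = 2; }
 * executing BOTH branches and joining their states with weight w. */
static void run_fragment(uint64_t w, uint64_t *x, uint64_t *y) {
    uint64_t x_then = 2, y_then = 1;   /* then-branch state */
    uint64_t x_else = 3, y_else = 2;   /* else-branch state */
    *x = combine(w, x_then, x_else);   /* join point: affine combination */
    *y = combine(w, y_then, y_else);
}
```

Here the valid invariant x = y + 1 is preserved by every affine combination, while a false candidate such as x = 2y is violated for all but one choice of w, so a handful of random runs suffices to discard it.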

We are also investigating whether randomized techniques can be helpful in the context of theorem proving. We have demonstrated this for the congruence closure problem involving uninterpreted functions and for the theory of linear equalities, and we have yet to explore whether these ideas can be extended to handle combinations of other convex theories as well.

[1] S. Gulwani and G. C. Necula, "Discovering Affine Equalities Using Random Interpretation," ACM Symp. Principles of Programming Languages, New Orleans, LA, January 2003.

Send mail to the author: gulwani@eecs.berkeley.edu

Verification and Analysis of Reactive Systems

Westley Weimer
(Professor George C. Necula)
Microsoft Research, (NSF) CCR-9875171, (NSF) ITR CCR-0085949, and (NSF) ITR CCR-0081588

Existing software systems are becoming larger and more complex without becoming safer. Many separate theorem-proving, model-checking, and program analysis techniques tackle the problem of verifying that a program meets a specification. Each of these three methods alone can only address certain subsets of this larger problem. Although a few projects have attempted to combine these three techniques, much unifying work remains to be done. In addition, such techniques often rely on simplifying assumptions (e.g., about pointer arithmetic or type safety) that may not hold in practice, especially for code written in lower-level languages. While previous approaches concentrated on stand-alone programs with a single entry point, my focus is on reactive systems. Such systems exist in various states over time (e.g., device drivers, web server modules, database extensions) and respond to requests. In such systems both the interfaces between modules and the internal behavior of modules are important. Finally, such techniques can be extended to produce more detailed and useful reports than the traditional error trace.

I propose a two-phase approach for verifying that such a system meets a specification. In the first phase the system is made type- and memory-safe by a combination of static analyses and run-time checks. In essence, this ensures that the system adheres to a universal specification of memory safety and paves the way for more complicated analyses that assume memory-safe programs. Existing techniques usually generate false positives or false negatives in the presence of legacy code that violates such assumptions.

In the second phase the transformed system can be checked against a more detailed specification. This checking process should verify local properties, the behavior of the system over time, and the correct use of interfaces. The verification will use techniques from program analysis, model checking, and theorem proving with an emphasis on automation. When the specification permits it, an error report should contain not just a path demonstrating the error but also detailed suggestions on how to deal with the problem in the original source. Few existing techniques address either what to do when a bug is found or how to model systems that exist over time. If the system does adhere to the specification, I propose to emit a machine-checkable proof of the fact.

More information: http://osq.cs.berkeley.edu

Send mail to the author: weimer@eecs.berkeley.edu