CS 261 Homework 1

Instructions

This problem set is due Friday, September 11, at 11:59pm.

Work on your own for this homework. You may use any source you like (including other papers or textbooks), but if you use any source not discussed in class, you must cite it.

Question 1

The famous web company, Gargle Inc., has hired you to design and implement a safe filter to sanitize untrusted HTML content. They have a webmail service, GargleMail. A GargleMail user can go to the GargleMail website and view their email using a web browser. Gargle Inc. wants to allow people to send HTML email to GargleMail users, but they don't want this to open a pathway for malicious HTML content to harm GargleMail users or their machines. This is complicated by the fact that web browsers are complex and interpret many kinds of active content that can have harmful side effects, and we must find some way to eliminate this risk.

You're going to write GargleMail a sanitizing filter that they can invoke on the command line, like this:

./htmlfilter < untrustedemail.html > safeforviewing.html

They will then display the resulting HTML file to the recipient of the email, serving it from the GargleMail webserver (e.g., in a frame). They have two goals:
Security:
The resulting content must not, under any circumstances, cause any harm to the GargleMail user's system. Viewing this HTML email should be as harmless as viewing an ASCII text file with, say, /bin/more; note that even if an attacker supplies the entire contents of an ASCII email, viewing it with /bin/more cannot harm your machine. In particular, reading an email from someone malicious (filtered through your HTML filter) should not cause any lasting side effects to the GargleMail user's machine that persist after the web browser is closed; it should not leak any confidential information (e.g., the contents of files on the user's hard disk, or information about what the user is viewing in another window with the same browser); and it should not endanger the integrity of the user's machine (e.g., we must not allow it to tamper with a different web document that the user is viewing in another window using the same browser).

Your scheme must not only be secure; it must also be verifiably secure. You will have to provide an assurance argument why it is reasonable to believe that your filter achieves this goal. The goal is to provide positive evidence of security, not just absence of evidence of insecurity; after all, the absence of evidence is not evidence of absence.

Functionality:
Ideally, your filter should retain as much of the useful HTML content from the original email as possible -- except, of course, where this might conflict with security.

For instance, a filter that ignores its input and always outputs the empty HTML page is not very useful. Thus, your solution should be at least minimally useful for viewing the textual content of HTML emails. Ideally, it would also be nice to see inline images. However, other content (e.g., scripts, Flash animations, etc.) doesn't need to be preserved and can be stripped from the original email.
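One way to read the trade-off above is as an allowlist: emit only what the policy explicitly permits and drop everything else. The following is a minimal sketch of that idea in Python, not a reference solution. The names ALLOWED_TAGS, DROP_CONTENT, AllowlistFilter and sanitize are hypothetical, the tag list is an arbitrary illustration, and the sketch makes no attempt to balance tags or to support inline images.

```python
import html
from html.parser import HTMLParser

# Hypothetical policy: these tags pass through, always with every attribute removed.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "pre", "blockquote"}
# Elements whose enclosed text is discarded, not just the tags themselves.
DROP_CONTENT = {"script", "style"}


class AllowlistFilter(HTMLParser):
    """Re-emits allowed tags without attributes and escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Attributes are never copied, so event handlers and URLs disappear.
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS and tag != "br":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text is re-escaped so it cannot be reinterpreted as markup.
            self.out.append(html.escape(data))


def sanitize(text):
    """Return a filtered copy of text containing only allowlisted markup."""
    parser = AllowlistFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Because the output is rebuilt from scratch rather than edited in place, anything the parser does not recognize is simply never written. Comments, declarations and unknown tags fall out of the output for that reason, which is the property an assurance argument would need to establish for whatever parser is actually used.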

Feel free to keep your implementation simple and to omit support for complex functionality. This is intended only as a proof of concept exercise. To keep this homework problem tractable, you can err on the side of omitting functionality in your implementation (though you should make sure to choose an approach that can be generalized to support as much functionality as possible, and argue why your approach generalizes).

Your code should be reasonably robust: it shouldn't crash on any input. Since GargleMail is going to run your program on malicious inputs, it would be embarrassing if any input caused your filter to crash uncleanly.
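As an illustration of the robustness requirement, the sketch below wraps an arbitrary sanitizing function so that a failure inside it degrades to escaped plain text instead of a crash. It assumes Python, assumes the input can be treated as UTF-8 with undecodable bytes replaced, and uses the hypothetical names run_filter and main; the sanitizing function itself is passed in and is not shown here.

```python
import html
import sys


def run_filter(sanitize, raw):
    """Apply sanitize to raw input bytes, failing closed on any error."""
    # Assumption: treat the input as UTF-8 and replace undecodable bytes,
    # so that malformed input cannot raise a decoding error.
    text = raw.decode("utf-8", errors="replace")
    try:
        result = sanitize(text)
        if not isinstance(result, str):
            raise TypeError("sanitizer did not return text")
        return result
    except Exception:
        # Fail closed: show the message as fully escaped plain text
        # rather than crashing or passing markup through unfiltered.
        return "<pre>" + html.escape(text) + "</pre>"


def main(sanitize):
    """Read untrusted HTML from stdin and write the filtered result to stdout."""
    raw = sys.stdin.buffer.read()
    out = run_filter(sanitize, raw)
    # Write bytes directly so the terminal locale cannot cause an encoding error.
    sys.stdout.buffer.write(out.encode("utf-8", errors="replace"))
    return 0
```

A script entry point would call main with the real sanitizing function and exit with its return value. The fallback deliberately gives up all formatting, which matches the priority of security over functionality.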

For this exercise, security matters more than functionality. If push comes to shove, choose security over functionality. My expectations for security are pretty high; feel free to sacrifice functionality if it enables you to achieve greater assurance that your scheme will be secure.

I want you to come up with a design, implement it, document your basic architecture and assurance argument, and submit both the document and the code. Your submission should contain at least three files:

README
Document the basic architecture you've used and the theory of operation for your scheme. Sketch the assurance argument why one should expect your scheme to be secure. This should be an ASCII text file, and it doesn't have to be too lengthy; a page or so should be enough. You might want to describe both the policy you are enforcing (e.g., the restrictions you're trying to place on the HTML content) as well as the method you're using for enforcing that policy (e.g., the implementation strategy for ensuring that the restrictions are fully and accurately enforced).
Makefile
A Makefile with everything needed to compile your program. If I run make, it should do everything needed to compile your program and finally generate in the current directory an executable file called htmlfilter. This program should read an untrusted HTML file from stdin and write a sanitized HTML file to stdout.
Source files
Include any source files needed to build the executable. Don't include the executable itself; I will run make myself. You can use pretty much any well-supported language you like (e.g., C, C++, Java, Perl, Python, Ruby, ML, OCaml, bash script) as long as it will work on my Linux system. However, to avoid any difficulties, please take care to make your program as portable as possible. I encourage you to test your code on some modern Linux system (feel free to use the EECS instructional Linux servers if that helps: ilinux1.eecs.berkeley.edu, ilinux2.eecs.berkeley.edu, etc.).
From within the directory where the above files are found, run
tar cf your-lastname.tar .
Then, email this file as an attachment to cs261hw1 at taverner.cs.berkeley.edu by the due date. I will be using automated scripts to run your programs, so please do follow the above framework. If it helps, here is reference code that demonstrates the required format: ref.tar.

Feel free to keep your implementation simple. If you are writing more than a few hundred lines of code, you're probably working too hard.

Some hints: You may want to review an HTML primer or reference document to refresh your memory about the format of HTML and the semantics of various aspects of HTML. You'll probably need to strip out all JavaScript, as by default it can cause side effects and violate the security policy outlined above (for instance, it could interfere with the GargleMail web site and have other undesirable effects). You'll probably also need to do something about other executable content like Flash or Java, as by default they tend to have similar powers.
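Script does not only arrive inside script tags; it can also appear in URL-valued attributes, for example through the javascript: scheme. If a design chooses to keep any URL at all (say, for images), one possible check is a scheme allowlist such as the sketch below. The names ALLOWED_SCHEMES and is_safe_url are hypothetical, and the choice of http and https is an assumption for illustration only: whether fetching remote content is acceptable under the security policy is a separate design decision.

```python
from urllib.parse import urlsplit

# Assumption for illustration: only absolute http(s) URLs are ever kept.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url):
    """Return True only for absolute URLs whose scheme is allowlisted."""
    # Reject whitespace and control characters outright, since a browser
    # may ignore them when deciding what the scheme is.
    if any(ch.isspace() or ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    # An empty scheme (relative URL) is rejected along with everything
    # that is not explicitly allowed.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

The check is written as an allowlist rather than a list of known-bad schemes, so a scheme the author never thought of is rejected by default.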

You can use third-party libraries (e.g., HTML parsers, etc.) if you like.