This problem set is due Friday, September 11, at 11:59pm.
Work on your own for this homework. You may use any source you like (including other papers or textbooks), but if you use any source not discussed in class, you must cite it.
The famous web company, Gargle Inc., has hired you to design and implement a safe filter to sanitize untrusted HTML content. They have a webmail service, GargleMail. A GargleMail user can go to the GargleMail website and view their email using a web browser. Gargle Inc. wants to allow people to send HTML email to GargleMail users, but they don't want this to open a pathway for malicious HTML content to harm GargleMail users or their machines. This is complicated by the fact that web browsers are complex and interpret many kinds of active content that can have harmful side effects, and we must find some way to eliminate this risk.
You're going to write GargleMail a sanitizing filter that they can invoke on the command line, like this:
./htmlfilter < untrustedemail.html > safeforviewing.html
They will then display the resulting HTML file to the recipient of the email, serving it from the GargleMail webserver (e.g., in a frame). They have two goals:
Your scheme must not only be secure; it must also be verifiably secure. You will have to provide an assurance argument why it is reasonable to believe that your filter achieves this goal. The goal is to provide positive evidence of security, not just absence of evidence of insecurity; after all, the absence of evidence is not evidence of absence.
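One possible source of positive evidence is an independent check on the filter's output. The following is a minimal Python sketch, not a required design: it re-parses a sanitized page and reports anything outside an allowlist. The names ALLOWED, OutputAuditor, and audit are hypothetical, and the allowlist contents are placeholders. Note that Python's html.parser does not tokenize HTML exactly the way a browser does, so a clean audit is supporting evidence only and does not replace an argument about your design.

```python
from html.parser import HTMLParser

# Hypothetical allowlist (an assumption for illustration):
# tag name -> attribute names permitted on that tag.
ALLOWED = {"p": set(), "b": set(), "i": set(), "br": set(), "img": {"src", "alt"}}


class OutputAuditor(HTMLParser):
    """Collects every construct in a page that falls outside ALLOWED."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            self.violations.append(f"tag <{tag}>")
            return
        for name, _value in attrs:
            if name not in ALLOWED[tag]:
                self.violations.append(f"attribute {name} on <{tag}>")

    def handle_endtag(self, tag):
        if tag not in ALLOWED:
            self.violations.append(f"end tag </{tag}>")

    # Comments, declarations, and processing instructions are not expected
    # in sanitized output, so this sketch flags them as well.
    def handle_comment(self, data):
        self.violations.append("comment")

    def handle_decl(self, decl):
        self.violations.append("declaration")

    def handle_pi(self, data):
        self.violations.append("processing instruction")

    def unknown_decl(self, data):
        self.violations.append("unknown declaration")


def audit(html_text: str) -> list:
    """Return a list of violations; an empty list means none were found."""
    auditor = OutputAuditor()
    auditor.feed(html_text)
    auditor.close()
    return auditor.violations
```

Running a check like this over the output for a corpus of hostile inputs can back up an assurance argument, but it cannot establish that no bad input exists.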
For instance, a filter that ignores its input and always outputs the empty HTML page is not very useful. Thus, your solution should be at least minimally useful for viewing the textual content of HTML emails. Ideally, it would also display inline images. However, other content (e.g., scripts and Flash animations) doesn't need to be preserved and can be stripped from the original email.
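To make the trade-off concrete, here is a minimal sketch in Python of one possible approach: rebuild the output from scratch, emitting only tags and attributes on an allowlist and escaping all text. This is an illustration under stated assumptions, not the expected solution. The names ALLOWED, SAFE_URL_PREFIXES, AllowlistFilter, and sanitize are hypothetical, as are the allowlist contents and the choice to accept only http and https image URLs.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist (an assumption for illustration):
# tag name -> attribute names that may be kept on that tag.
ALLOWED = {
    "p": set(), "br": set(), "b": set(), "i": set(), "em": set(),
    "strong": set(), "ul": set(), "ol": set(), "li": set(),
    "blockquote": set(), "pre": set(),
    "img": {"src", "alt"},
}
VOID = {"br", "img"}                 # tags that take no end tag
DROP_CONTENT = {"script", "style"}   # drop these tags and the text inside them
SAFE_URL_PREFIXES = ("http://", "https://")  # assumed policy for image URLs


class AllowlistFilter(HTMLParser):
    """Re-emits only allowlisted markup; everything else is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return
        kept = {}
        for name, value in attrs:
            # Keep the first allowlisted occurrence of each attribute.
            if name in kept or name not in ALLOWED[tag] or value is None:
                continue
            if name == "src" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept[name] = escape(value, quote=True)
        rendered = "".join(f' {n}="{v}"' for n, v in kept.items())
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0 and tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped, so it cannot introduce new markup.
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def sanitize(html_text: str) -> str:
    parser = AllowlistFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<p>Hi<script>alert(1)</script><b onclick="x()">there</b></p>') yields '<p>Hi<b>there</b></p>'. The sketch does not balance tags, and loading remote images has privacy implications that it ignores; a real design would need to address both and argue why the allowlist itself is safe.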
Feel free to keep your implementation simple and to omit support for complex functionality. This is intended only as a proof of concept exercise. To keep this homework problem tractable, you can err on the side of omitting functionality in your implementation (though you should make sure to choose an approach that can be generalized to support as much functionality as possible, and argue why your approach generalizes).
Your code should be reasonably robust: it shouldn't crash on any input. Since GargleMail is going to run your program on malicious inputs, it would be embarrassing if any input caused your filter to crash uncleanly.
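One way to approach this robustness requirement, sketched in Python under stated assumptions: read the input as raw bytes, decode it leniently, and fall back to a fixed empty page if anything inside the filter raises. The names run_filter, escape_everything, and EMPTY_PAGE are hypothetical, and escape_everything is only a stand-in for a real sanitizer. Failing closed with an empty page and a nonzero exit status is a design choice assumed here, not something the assignment mandates.

```python
import sys
from html import escape

# Fallback output used when the filter itself fails (an assumed policy).
EMPTY_PAGE = "<html><body></body></html>\n"


def escape_everything(text: str) -> str:
    """Stand-in sanitizer for this sketch: renders all input as plain text."""
    return escape(text, quote=True)


def run_filter(sanitize, stdin=None, stdout=None) -> int:
    """Read bytes, sanitize, write bytes. Returns a process exit status."""
    if stdin is None:
        stdin = sys.stdin.buffer
    if stdout is None:
        stdout = sys.stdout.buffer
    try:
        raw = stdin.read()
        # Invalid byte sequences become U+FFFD instead of raising.
        text = raw.decode("utf-8", errors="replace")
        payload = sanitize(text).encode("utf-8", errors="replace")
        status = 0
    except Exception:
        # Fail closed: emit a known-safe page rather than partial output.
        payload = EMPTY_PAGE.encode("ascii")
        status = 1
    stdout.write(payload)
    stdout.flush()
    return status


# Typical use as the program's entry point (hypothetical):
#     sys.exit(run_filter(escape_everything))
```

Building the whole output in memory before writing means a failure midway never leaves a half-written page on stdout, at the cost of buffering the entire email.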
I want you to come up with a design, implement it, document your basic architecture and assurance argument, and submit both the document and the code. Your submission should contain at least three files:
tar cf your-lastname.tar .
Then, email this file as an attachment to cs261hw1 at taverner.cs.berkeley.edu by the due date. I will be using automated scripts to run your programs, so please do follow the above framework. If it helps, here is reference code that demonstrates the required format: ref.tar.
Feel free to keep your implementation simple. If you are writing more than a few hundred lines of code, you're probably working too hard.
You can use third-party libraries (e.g., HTML parsers) if you like.