Berkeley Electrical Engineering and Computer Sciences
The Internet, with its billions of documents, images, and audio and video snippets, has become so vast that only automated programs can hope to make sense of it all. Indeed, programs that search for documents and videos or recommend products based on preferences have become central to how people experience the Internet.

Yet thus far, search engines and recommendation systems have accomplished a lot while understanding very little. Imagine how much more useful it would be if Netflix could accurately predict someone's taste in movies based on their rental ratings or if YouTube could correctly summarize the content of an unlabeled video clip. Toward this end, Berkeley computer scientists are assembling systems that add intelligence to search and recommendation systems.

The Internet is packed with reviews of artists, performers, movies, products, and services. The tricky part is separating the wheat from the chaff, a task that, for now, falls squarely on the shoulders of the searcher. "With dissemination virtually free, what’s really expensive is attention," said Robert Wilensky, a Berkeley EECS professor emeritus. Wilensky and his students created a collaborative filtering algorithm to identify trend-spotting reviewers: people who spend a lot of time sifting through information and whose judgment matches the overall consensus. In one case study, the reviewers highlighted by the algorithm were the same ones that users on Epinions.com ranked highly.
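The article does not describe how Wilensky's algorithm works internally, so the following is only a rough sketch of the general idea of rewarding reviewers whose scores track the consensus. The function name `rank_reviewers`, the `(reviewer, item, score)` input format, and the `min_reviews` cutoff are all assumptions made for illustration, and the scoring rule (mean absolute distance from each item's average score) is a deliberately simple stand-in.

```python
from collections import defaultdict
from statistics import mean


def rank_reviewers(ratings, min_reviews=2):
    """Rank reviewers by how closely their scores match the consensus.

    ratings: iterable of (reviewer, item, score) tuples (hypothetical format).
    Returns a list of (reviewer, mean_error) pairs, best match first.
    """
    ratings = list(ratings)

    # Consensus for each item is taken to be the plain average of its scores.
    scores_by_item = defaultdict(list)
    for _reviewer, item, score in ratings:
        scores_by_item[item].append(score)
    consensus = {item: mean(scores) for item, scores in scores_by_item.items()}

    # Record how far each reviewer's score falls from the consensus.
    errors = defaultdict(list)
    for reviewer, item, score in ratings:
        errors[reviewer].append(abs(score - consensus[item]))

    # Keep only reviewers who have rated enough items to be judged.
    ranked = [
        (reviewer, mean(errs))
        for reviewer, errs in errors.items()
        if len(errs) >= min_reviews
    ]
    return sorted(ranked, key=lambda pair: pair[1])


if __name__ == "__main__":
    sample = [
        ("ann", "album", 5), ("ann", "film", 1),
        ("bob", "album", 1), ("bob", "film", 5),
        ("cat", "album", 5), ("cat", "film", 1),
        ("dan", "album", 4),
    ]
    for reviewer, error in rank_reviewers(sample):
        print(reviewer, round(error, 2))
```

A real trend-spotting measure would presumably also account for when a review was written, since agreeing with the consensus before it forms is more telling than agreeing afterward. This sketch ignores timing entirely.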

Internet search engines have gotten good at processing text, but they are less successful at deciphering images or video segments. The engine used by YouTube relies on users' often-sketchy text tags to classify pictures and clips. To create a more effective tool for video search, vision expert Jitendra Malik has teamed up with Dan Klein, whose expertise is in natural language processing. Together with graduate students Slav Petrov, Alex Berg, and Arlo Faria, they have created a program that uses machine learning techniques to automatically label video segments by recognizing images, spoken words, and sounds.

The researchers are testing their program on a challenge data set provided by the National Institute of Standards and Technology, consisting of video clips of such things as crowds, television weathermen, and soccer games. The computer vision component of the program classifies images by the orientations of the boundary edges between regions and how those edges are arranged. An image of a crowd, for instance, contains a lot of vertical lines, while an image of a soccer game contains a few long horizontal lines (the field) as well as many short vertical lines (the players).
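To make the edge-orientation idea concrete, here is a minimal sketch that summarizes a grayscale image as a histogram of edge orientations. It is not the researchers' program. The function name `edge_orientation_histogram`, the list-of-rows image format, and the `bins` and `threshold` parameters are assumptions chosen for illustration, and simple finite differences stand in for whatever edge detector the real system uses.

```python
import math


def edge_orientation_histogram(image, bins=4, threshold=0.1):
    """Summarize an image by the orientations of its edges.

    image: list of rows of grayscale values (hypothetical format).
    Returns `bins` fractions of edge strength. With bins=4, bin 0 is
    centered on horizontal edges, bin 2 on vertical edges, and bins 1
    and 3 on the two diagonals.
    """
    height, width = len(image), len(image[0])
    bin_width = 180.0 / bins
    hist = [0.0] * bins

    # Central differences on interior pixels approximate the gradient.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude <= threshold:
                continue  # too weak to count as an edge

            # The gradient points across the edge, so rotate by 90 degrees
            # to get the direction the edge runs, folded into [0, 180).
            orientation = (math.degrees(math.atan2(gy, gx)) + 90.0) % 180.0

            # Shift by half a bin so 0 and 90 degrees sit at bin centers.
            index = int(((orientation + bin_width / 2.0) % 180.0) // bin_width)
            hist[index] += magnitude

    total = sum(hist)
    return [value / total for value in hist] if total else hist


if __name__ == "__main__":
    # Left half dark, right half bright: a single vertical boundary.
    vertical_edge = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
    print(edge_orientation_histogram(vertical_edge))
```

Under these assumptions, a crowd scene would put most of its weight in the vertical bin, while a soccer field would show a stronger horizontal share. A classifier could then be trained on such histograms, although the article does not say which features or learning method the team actually uses.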

The natural language processing element of the program can recognize the spoken words that are often associated with a particular type of scene. The program can also distinguish noises, such as the roar of a crowd or of an airplane.