At times, it seems that progress in understanding primate vision has stalled in the earliest stages of the ventral and dorsal visual pathways. In early vision, many basic recognition and servomotor tasks are facilitated by retinotopic mappings that preserve spatial structure through multiple layers of neural circuitry. However, despite the miraculous engineering of natural selection, sooner or later, continuity in time and space is disrupted by movement of the eye, the object of interest, or both. Primates have to construct their representations of the world by composing a pastiche of retinal image fragments, but little is known about exactly how they accomplish this.
Meanwhile, the emphasis in computer vision is gradually moving away from the still image, yet the field remains largely constrained by the power of a single multi-core workstation with several gigabytes of memory, or by relatively simple, embarrassingly parallel computations on clusters of such machines. One notable exception is research on extracting structure from motion, which has given rise to web applications such as Google's StreetView and Microsoft Live Labs' Photosynth. These applications, and the mashups they engender, are made possible by the power of distributed computing to sift through enormous amounts of data.
Our group at Google is interested in learning to annotate video and still images to support image content search. We are designing an agent that has access to virtually unlimited video, much of it with accompanying metadata in the form of tags, movie scripts, commentary, audio transcriptions, etc. This agent will have the ability to select its training data by extracting content from several Google applications, including YouTube, Google Images, and StreetView. It will have access to Google infrastructure for determining relationships involving text tokens, including multi-word n-grams. We are also considering giving our agent some capability to test hypotheses, either by exploiting an interface to the Google Image Labeler (which offers functionality similar to the ESP Game, Peekaboom, and LabelMe) or by interacting in Second Life. The agent's behavior is guided by its performance in predicting labels for still images and video fragments that accord well with human-generated labels.
The analogy to the prisoners in Plato's cave stems from the observation that our images are flat projections, like the shadows cast upon the wall of the cave. Our metadata, like the voices of the pedestrians on the bridge, will on occasion bear little or no relationship to the objects whose shadows appear on the wall. The prisoners have the advantage over our simple agent of possessing language, but the shadows are a poor substitute for the real objects. Moreover, the prisoners cannot change the viewing angle, probe the objects with a stick, or otherwise intervene so as to (easily) glean the physical properties of the objects.
In addition to motivating the project, the talk will summarize the primary machine-learning and computer-vision challenges, describe some of the technologies we are developing, and discuss the central role played by distributed computing and very large datasets.