UCB Visual Object and Activity Recognition Class CS 294-43

Prof. Trevor Darrell, trevor@eecs.berkeley.edu


Spring 2011

See sites.google.com/site/ucbcs29443/ for the course archive. (Contact the instructor for access.)


This course will cover computer vision techniques for object and category recognition, as well as recognition of human activity from video streams.  Recognition of individual objects or activities (the coffee cup on your desk, a particular chair in your office, a video of you riding your bike) or generic categories (any cup, chair, or cycling event) is an essential capability for a variety of robotics and multimedia applications.  The advent of standardized datasets and evaluation regimes has spurred considerable innovation in this arena, with performance on benchmark evaluations increasing dramatically in recent years.  This course will review methods that have achieved success on such datasets, and will also consider the techniques needed for real-time interactive application on robots or mobile devices, e.g. domestic service robots or mobile phones that can retrieve information about objects in the environment based on visual observation.  This class will be based exclusively on readings from the recent literature, including those appearing at the CVPR, ICCV, and NIPS conferences.


The format of the course this year will be primarily discussion-based, with each class beginning with a short overview of the topic by the instructor, followed by detailed student-led presentations and structured critique of selected papers.  All students will be expected to actively discuss each paper each week.  Class size will be limited to those who have preregistered, or to 16 students, whichever is greater, to foster an environment conducive to discussion.


Each week will focus on a different subtopic of object and activity recognition, covering three to five different papers from the recent literature.  These papers will be presented jointly by two or three students, one acting as primary presenter and the other student(s) as discussant(s).  Each student will be expected to act as presenter once and as discussant once during the term.  The presenting students will choose the papers from the list suggested for that subtopic, or they are welcome to suggest other papers.


Students are expected to be involved in a related research project during the term, and to be experimenting with a technique covered during the course.  (Graduate students who are not actively involved in a research project outside of the course can work on a class project specific to this course or joint with another course; undergraduates who are not actively involved in a related research project are not allowed in the course.)  Students will be expected to present their research progress during the term in a ten-minute presentation in the last class.  Grades will be based entirely on in-class presentations and participation.


This course will meet once a week, Fridays from 10 a.m. to 12 noon, in the 7th floor conference room (Newton room) of Sutardja Dai Hall.




Prerequisites: prior Computer Vision and Machine Learning courses, or permission of instructor. Advanced undergraduates are allowed only with permission of instructor and if they are actively participating in a related research project.  Students should already be familiar with, or be willing to learn on their own, topics such as basic image processing in MATLAB, Optic Flow, Edge Detection, Support Vector Machines, Gaussian Mixture Models, and Hidden Markov Models.  Students must also be able to read and understand, at a basic level, recent conference papers in the computer vision literature.

DRAFT Syllabus (class members, please see the Google site for the most up-to-date version):

January 28, 2011               Global Features               

Background readings:

A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, May 2001. http://dx.doi.org/10.1023/A:1011139631724

A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," ICCV 2003, pp. 726-733 vol.2. http://dx.doi.org/10.1109/ICCV.2003.1238420

N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005, pp. 886-893. http://dx.doi.org/10.1109/CVPR.2005.177

Contemporary readings:

P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Cascade Object Detection with Deformable Part Models", CVPR 2010. http://dx.doi.org/10.1109/CVPR.2010.5539906

T. Deselaers and V. Ferrari, "Global and efficient self-similarity for object classification and detection", CVPR 2010. http://dx.doi.org/10.1109/CVPR.2010.5539775

February 4, 2011               Local Features

Background readings:

D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, November 2004. http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94

T. Lindeberg, "Feature detection with automatic scale selection," International Journal of Computer Vision, vol. 30, no. 2, pp. 79-116, November 1998. http://dx.doi.org/10.1023/A:1008045108935

J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in Proceedings of British Machine Vision Conference, vol. 1, London, 2002, pp. 384-393. http://citeseer.ist.psu.edu/608213.html

K. Mikolajczyk and C. Schmid, "Scale & affine invariant interest point detectors," Int. J. Comput. Vision, vol. 60, no. 1, pp. 63-86, October 2004. http://dx.doi.org/10.1023/B:VISI.0000027790.02288.f2

I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107-123, September 2005. http://dx.doi.org/10.1007/s11263-005-1838-7

Contemporary readings:

L. Bo, X. Ren, and D. Fox, "Kernel Descriptors for Visual Recognition", NIPS 2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0821.pdf

L. Bourdev, S. Maji, T. Brox, and J. Malik, "Detecting People Using Mutually Consistent Poselet Activations", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15567-3_13

February 11, 2011            Bag-of-Words and Correspondence Kernels

Background readings:

C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka, "Visual categorization with bags of keypoints," in ECCV International Workshop on Statistical Learning in Computer Vision, 2004. http://www.xrce.xerox.com/Publications/Attachments/2004%2D010/2004_010.pdf

K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," ICCV, vol. 2, 2005, pp. 1458-1465 Vol. 2. http://dx.doi.org/10.1109/ICCV.2005.239

S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," CVPR, vol. 2, 2006, pp. 2169-2178. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1641019

Contemporary readings:

S. Maji and A. C. Berg, "Max-margin additive classifiers for detection", ICCV 2009, http://dx.doi.org/10.1109/ICCV.2009.5459203

A. Vedaldi and A. Zisserman, "Efficient Additive Kernels via Explicit Feature Maps", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539949

A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539881

February 18, 2011            Segmentation and Region Proposals      

Background readings:

J. Shotton, M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and segmentation," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. http://dx.doi.org/10.1109/CVPR.2008.4587503

Contemporary readings:

Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, "Layered Object Detection for Multi-Class Segmentation", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540070

F. Li, J. Carreira and C. Sminchisescu, "Object Recognition as Ranking Holistic Figure-Ground Hypotheses", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539839

B. Alexe, T. Deselaers, V. Ferrari, "What is an object?", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540226

B. Packer, S. Gould, and D. Koller, "A Unified Contour-Pixel Model for Figure-Ground Segmentation", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15555-0_25

I. Endres and D. Hoiem, "Category Independent Object Proposals", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15555-0_42

March 4, 2011    Descriptor Sparse Coding and Topic Models                          

Background reading:

Olshausen B. and Field D. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research (1997) vol. 37 (23) pp. 3311-3325 http://www.chaos.gwdg.de/~michael/CNS_course_2004/papers_max/OlshausenField1997.pdf

Contemporary readings:

Raina et al. Self-taught learning: Transfer learning from unlabeled data. ICML (2007). http://dx.doi.org/10.1145/1273496.1273592

Fritz M., Black M., Bradski G., Karayev S., Darrell T. An Additive Latent Feature Model for Transparent Object Recognition. NIPS (2009) http://books.nips.cc/papers/files/nips22/NIPS2009_0397.pdf

Wang et al. Locality-constrained Linear Coding for Image Classification. CVPR (2010) http://dx.doi.org/10.1109/CVPR.2010.5540018

March 11, 2011 Hashing and Metric Learning      

Background readings:

G. Shakhnarovich, P. Viola, and T. Darrell, "Fast pose estimation with parameter-sensitive hashing," ICCV 2003, http://dx.doi.org/10.1109/ICCV.2003.1238424

A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification", ICCV 2007, http://dx.doi.org/10.1109/ICCV.2007.4408839

Contemporary readings:

P. Jain, B. Kulis, and K. Grauman, "Fast Similarity Search for Learned Metrics", CVPR 2008/PAMI 2009, http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.151

B. Kulis and T. Darrell, "Learning to Hash with Binary Reconstructive Embeddings", NIPS 2009, http://books.nips.cc/papers/files/nips22/NIPS2009_0971.pdf

March 18, 2011  Temporal Models                             

Background readings:

J. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision. 79(3): 299-318. 2008 Available: http://dx.doi.org/10.1007/s11263-007-0122-4

Contemporary readings:

K. Prabhakar, S. Oh, P. Wang, G. D. Abowd, J. Rehg, "Temporal Causality for the Analysis of Visual Events", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539871

A. Yao, J. Gall, L. Van Gool, "A Hough Transform-Based Voting Framework for Action Recognition", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539883

J.C. Niebles, C. Chen, and L. Fei-Fei, "Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15552-9_29

D. Weinland, M. Ozuysal and P. Fua, "Making Action Recognition Robust to Occlusions and Viewpoint Changes", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15558-1_46

P. Matikainen, M. Hebert and R. Sukthankar, "Representing Pairwise Spatial and Temporal Relations for Action Recognition", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15549-9_37

T. Lan, Y. Wang, W. Yang and G. Mori, "Beyond Actions: Discriminative Models for Contextual Group Activities", NIPS 2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0115.pdf

April 1, 2011        Image and Text Models

Background readings:

K. Barnard and D. Forsyth, "Learning the Semantics of Words and Pictures," International Conference on Computer Vision, vol 2, pp. 408-415, 2001, http://doi.ieeecomputersociety.org/10.1109/ICCV.2001.937654

D. Blei and M. Jordan, "Modeling Annotated Data", SIGIR '03 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, http://dx.doi.org/10.1145/860435.860460

T. Berg and D. Forsyth, "Animals on the Web", CVPR 2006, http://dx.doi.org/10.1109/CVPR.2006.57

Contemporary readings:

Chong Wang, D. Blei, Fei-Fei Li, "Simultaneous image classification and annotation," CVPR 2009, http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206800

K. Saenko and T. Darrell, "Filtering Abstract Senses From Image Search Results", NIPS 2009, http://books.nips.cc/papers/files/nips22/NIPS2009_1143.pdf

A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D. Forsyth, "Every Picture Tells a Story: Generating Sentences from Images", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_2

B. Siddiquie and A. Gupta, "Beyond Active Noun Tagging: Modeling Contextual Interactions for Multi-Class Active Learning", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540044

April 8, 2011        Crowdsourcing and Active Learning          

Background readings:

L. von Ahn and L. Dabbish, "Labeling images with a computer game", SIGCHI 2004, http://dx.doi.org/10.1145/985692.985733

A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Active Learning with Gaussian Processes for Object Categorization" ICCV 2007. http://doi.ieeecomputersociety.org/10.1109/ICCV.2007.4408844

Contemporary readings:

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A Large-Scale Hierarchical Image Database". In CVPR, 2009. http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206848

S. Vijayanarasimhan, P. Jain, K. Grauman, "Far-sighted active learning on a budget for image and video recognition", CVPR 2010. http://dx.doi.org/10.1109/CVPR.2010.5540055

P. Welinder, S. Branson, S. Belongie, P. Perona, "The Multidimensional Wisdom of Crowds", NIPS 2010. http://books.nips.cc/papers/files/nips23/NIPS2010_0577.pdf

S. Branson, C. Wah, B. Babenko, F. Schroff, P. Welinder, P. Perona, S. Belongie, "Visual Recognition with Humans in the Loop", ECCV 2010. http://dx.doi.org/10.1007/978-3-642-15561-1_32               

April 15, 2011     Scene and Image Context             

Background readings:

A. Torralba, K. P. Murphy, and W. T. Freeman, "Contextual models for object detection using boosted random fields," in Advances in Neural Information Processing Systems 17 (NIPS), 2005, pp. 1401-1408. http://dspace.mit.edu/handle/1721.1/6740

D. Hoiem, A. A. Efros, and M. Hebert, "Putting objects in perspective," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, 2006, pp. 2137-2144. http://dx.doi.org/10.1109/CVPR.2006.232

L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1-8. http://dx.doi.org/10.1109/ICCV.2007.4408872

Contemporary readings:

S. Bao, M. Sun, S. Savarese, "Toward coherent object detection and scene layout understanding", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540229

B. Yao and L. Fei-Fei. "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540235

A. Gupta, A. Efros and M. Hebert, "Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics". ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_35           

April 22, 2011     Taxonomies and Sub-category Recognition                           

Background readings:

A. Zweig and D. Weinshall, "Exploiting object hierarchy: Combining models from different category levels," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1-8. Available: http://dx.doi.org/10.1109/ICCV.2007.4409064

G. Griffin and P. Perona, "Learning and using taxonomies for fast visual categorization," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1-8. Available: http://dx.doi.org/10.1109/CVPR.2008.4587410

J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros, "Unsupervised discovery of visual object class hierarchies," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1-8. Available: http://dx.doi.org/10.1109/CVPR.2008.4587622

Contemporary readings:

L.-J. Li, C. Wang, Y. Lim, D. Blei and L. Fei-Fei. "Building and Using a Semantivisual Image Hierarchy", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540027

M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, "What helps where – and why? Semantic relatedness for knowledge transfer", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540121

April 29, 2011     Domain Adaptation

K. Saenko, B. Kulis, M. Fritz, and T. Darrell, "Adapting Visual Category Models to New Domains", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_16

A. Bergamo and L. Torresani, "Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach", NIPS 2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0093.pdf

L. Cao, Z. Liu, T. Huang, "Cross-dataset action detection", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539875