UCB Visual Object and Activity Recognition Class CS 294-43
Prof. Trevor Darrell, trevor@eecs.berkeley.edu
Spring 2011
See sites.google.com/site/ucbcs29443/ for course archive. (Contact Instructor for access.)
This course will cover computer vision techniques for object and category recognition, as well as recognition of human activity from video streams. Recognition of individual objects or activities (the coffee cup on your desk, a particular chair in your office, a video of you riding your bike) or generic categories (any cup, chair, or cycling event) is an essential capability for a variety of robotics and multimedia applications. The advent of standardized datasets and evaluation regimes has spurred considerable innovation in this arena, with performance on benchmark evaluations increasing dramatically in recent years. This course will review methods that have achieved success on such datasets, and will also consider the techniques needed for real-time interactive application on robots or mobile devices, e.g. domestic service robots or mobile phones that can retrieve information about objects in the environment based on visual observation. This class will be based exclusively on readings from the recent literature, including those appearing at the CVPR, ICCV, and NIPS conferences.
The format of the course this year will primarily be discussion based, with each class beginning with a short overview of the topic by the instructor followed by detailed student-led presentations and structured critique of selected papers. All students will be expected to actively discuss each paper each week. Class size will be limited to those who have preregistered, or to 16 students, whichever is greater, to foster an environment conducive to discussion.
Each week will focus on a different subtopic of object and activity recognition, covering three to five different papers from the recent literature. These papers will be presented jointly by two or three students, one acting as a primary presenter and the other student(s) as discussant. Each student will be expected to act as presenter once and as discussant once during the term. The presenting students will choose the papers from the list suggested for that subtopic, or they are welcome to suggest other papers.
Students are expected to be involved in a related research project during the term, and to be experimenting with a technique covered during the course. (Graduate students who are not actively involved in a research project outside of the course can work on a class project specific to this course or joint with another course; undergraduates who are not actively involved in a related research project are not allowed in the course.) Students will be expected to present their research progress during the term in a ten-minute presentation in the last class. Grades will be based entirely on in-class presentations and participation.
This course will meet once a week, Fridays 10am-12 noon, in the 7th floor conference room (Newton room) of Sutardja Dai Hall.
THE FIRST CLASS WILL BE JAN 28th. THE INTRODUCTION CLASS WHICH WOULD HAVE BEEN SCHEDULED JAN 21st WILL HAPPEN VIRTUALLY -- PLEASE CONTACT THE INSTRUCTOR IF YOU ARE NOT ALREADY ON THE EMAIL LIST.
Prerequisites: prior Computer Vision and Machine Learning courses, or permission of instructor. Advanced undergraduates allowed only with permission of instructor and if they are actively participating in a related research project. Students should already be familiar with or be willing to learn on their own: basic image processing in MATLAB; Optic Flow; Edge Detection; Support Vector Machines; Gaussian Mixture Models; Hidden Markov Models, etc.; students must be able to read and understand at a basic level recent conference papers in the computer vision literature.
DRAFT Syllabus (class members please see google site for most up-to-date version):
January 28, 2011 Global Features
Background readings:
A. Oliva
and A. Torralba, "Modeling the shape of the
scene: A holistic representation of the spatial envelope," International
Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, May 2001.
http://dx.doi.org/10.1023/A:1011139631724
A. Efros,
A. C. Berg, G. Mori, and J. Malik, "Recognizing
action at a distance," ICCV 2003, vol. 2, pp. 726-733.
http://dx.doi.org/10.1109/ICCV.2003.1238420
N. Dalal
and B. Triggs, "Histograms of oriented gradients
for human detection," in CVPR '05: Proceedings of the 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005,
pp. 886-893. http://dx.doi.org/10.1109/CVPR.2005.177
Contemporary readings:
P. F. Felzenszwalb, R. B. Girshick, and
D. McAllester, "Cascade Object Detection with
Deformable Part Models", CVPR 2010.
http://dx.doi.org/10.1109/CVPR.2010.5539906
T. Deselaers
and V. Ferrari, "Global and efficient self-similarity for object
classification and detection", CVPR 2010.
http://dx.doi.org/10.1109/CVPR.2010.5539775
February 4, 2011 Local Features
Background readings:
D. G. Lowe, "Distinctive image
features from scale-invariant keypoints,"
International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, November
2004. http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94
T. Lindeberg,
"Feature detection with automatic scale selection," International
Journal of Computer Vision, vol. 30, no. 2, pp. 79-116, November 1998.
http://dx.doi.org/10.1023/A:1008045108935
J. Matas,
O. Chum, M. Urban, and T. Pajdla, "Robust wide
baseline stereo from maximally stable extremal
regions," in Proceedings of British Machine Vision Conference, vol. 1,
London, 2002, pp. 384-393. http://citeseer.ist.psu.edu/608213.html
K. Mikolajczyk
and C. Schmid, "Scale & affine invariant
interest point detectors," Int. J. Comput.
Vision, vol. 60, no. 1, pp. 63-86, October 2004.
http://dx.doi.org/10.1023/B:VISI.0000027790.02288.f2
I. Laptev, "On space-time
interest points," International Journal of Computer Vision, vol. 64, no.
2-3, pp. 107-123, September 2005. http://dx.doi.org/10.1007/s11263-005-1838-7
Contemporary readings:
L. Bo, X. Ren,
and D. Fox, "Kernel Descriptors for Visual Recognition", NIPS 2010,
http://books.nips.cc/papers/files/nips23/NIPS2010_0821.pdf
L. Bourdev,
S. Maji, T. Brox, and J. Malik, "Detecting People Using Mutually Consistent Poselet Activations", ECCV 2010,
http://dx.doi.org/10.1007/978-3-642-15567-3_13
February 11, 2011 Bag-of-words and Correspondence Kernels
Background readings:
C. Dance, J. Willamowski,
L. Fan, C. Bray, and G. Csurka, "Visual
categorization with bags of keypoints," in ECCV
International Workshop on Statistical Learning in Computer Vision, 2004. http://www.xrce.xerox.com/Publications/Attachments/2004%2D010/2004_010.pdf
K. Grauman
and T. Darrell, "The pyramid match kernel: discriminative classification
with sets of image features," ICCV 2005, vol. 2, pp. 1458-1465.
http://dx.doi.org/10.1109/ICCV.2005.239
S. Lazebnik,
C. Schmid, and J. Ponce, "Beyond bags of
features: Spatial pyramid matching for recognizing natural scene
categories," CVPR, vol. 2, 2006, pp. 2169-2178.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1641019
Contemporary readings:
S. Maji
and A. C. Berg, "Max-margin additive classifiers
for detection", ICCV 2009, http://dx.doi.org/10.1109/ICCV.2009.5459203
A. Vedaldi
and A. Zisserman, "Efficient Additive Kernels
via Explicit Feature Maps", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539949
A. Kovashka
and K. Grauman, "Learning a hierarchy of
discriminative space-time neighborhood features for human action
recognition", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539881
February 18, 2011 Segmentation and Region Proposals
Background readings:
J. Shotton,
M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and
segmentation," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. http://dx.doi.org/10.1109/CVPR.2008.4587503
Contemporary readings:
Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes,
"Layered Object Detection for Multi-Class Segmentation", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5540070
F. Li, J. Carreira
and C. Sminchisescu, "Object Recognition as
Ranking Holistic Figure-Ground Hypotheses", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5539839
B. Alexe,
T. Deselaers, V. Ferrari, "What is an object?", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540226
B. Packer, S. Gould, and D. Koller, "A Unified Contour-Pixel Model for
Figure-Ground Segmentation", ECCV 2010,
http://dx.doi.org/10.1007/978-3-642-15555-0_25
I. Endres
and D. Hoiem, "Category Independent Object
Proposals", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15555-0_42
March 4, 2011 Descriptor Sparse Coding and Topic Models
Background reading:
Olshausen
B. and Field D. Sparse coding with an overcomplete
basis set: A strategy employed by V1? Vision Research
(1997) vol. 37 (23) pp. 3311-3325 http://www.chaos.gwdg.de/~michael/CNS_course_2004/papers_max/OlshausenField1997.pdf
Contemporary readings:
Raina et
al. Self-taught learning: Transfer learning from unlabeled data. ICML (2007). http://dx.doi.org/10.1145/1273496.1273592
Fritz M., Black M., Bradski G., Karayev S., Darrell
T. An Additive Latent Feature Model for Transparent Object
Recognition. NIPS (2009)
http://books.nips.cc/papers/files/nips22/NIPS2009_0397.pdf
Wang et al. Locality-constrained
Linear Coding for Image Classification. CVPR (2010) http://dx.doi.org/10.1109/CVPR.2010.5540018
March 11, 2011 Hashing and Metric Learning
Background readings:
G. Shakhnarovich,
P. Viola, and T. Darrell, "Fast pose estimation with parameter-sensitive
hashing," ICCV 2003, http://dx.doi.org/10.1109/ICCV.2003.1238424
A. Frome,
Y. Singer, F. Sha, and J. Malik,
"Learning Globally-Consistent Local Distance Functions for Shape-Based
Image Retrieval and Classification", ICCV 2007,
http://dx.doi.org/10.1109/ICCV.2007.4408839
Contemporary readings:
P. Jain, B. Kulis,
and K. Grauman, "Fast Similarity Search for Learned
Metrics", CVPR 2008/PAMI 2009,
http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.151
B. Kulis
and T. Darrell, "Learning to Hash with Binary Reconstructive
Embeddings", NIPS 2009,
http://books.nips.cc/papers/files/nips22/NIPS2009_0971.pdf
March 18, 2011 Temporal Models
Background readings:
J. Niebles,
H. Wang, and L. Fei-Fei, "Unsupervised learning
of human action categories using spatial-temporal words," International
Journal of Computer Vision, vol. 79, no. 3, pp. 299-318, 2008.
http://dx.doi.org/10.1007/s11263-007-0122-4
Contemporary readings:
K. Prabhakar,
S. Oh, P. Wang, G. D. Abowd, and J. Rehg,
"Temporal Causality for the Analysis of Visual Events", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5539871
A. Yao, J. Gall, L. Van Gool, "A Hough Transform-Based Voting Framework for
Action Recognition", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5539883
J.C. Niebles,
C. Chen, and L. Fei-Fei, "Modeling Temporal
Structure of Decomposable Motion Segments for Activity Classification",
ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15552-9_29
D. Weinland, M. Ozuysal and P. Fua, "Making
Action Recognition Robust to Occlusions and Viewpoint Changes", ECCV 2010,
http://dx.doi.org/10.1007/978-3-642-15558-1_46
P. Matikainen,
M. Hebert and R. Sukthankar, "Representing Pairwise Spatial and Temporal Relations for Action
Recognition", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15549-9_37
T. Lan, Y. Wang, W. Yang and G. Mori, "Beyond
Actions: Discriminative Models for Contextual Group Activities", NIPS
2010, http://books.nips.cc/papers/files/nips23/NIPS2010_0115.pdf
April 1, 2011 Image and text models
Background readings:
K. Barnard and D. Forsyth,
"Learning the Semantics of Words and Pictures," International
Conference on Computer Vision, vol 2, pp. 408-415,
2001, http://doi.ieeecomputersociety.org/10.1109/ICCV.2001.937654
D. Blei
and M. Jordan, "Modeling Annotated Data", SIGIR '03 Proceedings of
the 26th annual international ACM SIGIR conference on Research and development
in information retrieval,
http://dx.doi.org/10.1145/860435.860460
T. Berg and D. Forsyth,
"Animals on the Web", CVPR 2006, http://dx.doi.org/10.1109/CVPR.2006.57
Contemporary readings:
C. Wang, D. Blei,
and L. Fei-Fei, "Simultaneous image classification
and annotation," CVPR 2009,
http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206800
K. Saenko
and T. Darrell, “Filtering Abstract Senses From Image
Search Results”, NIPS 2009,
http://books.nips.cc/papers/files/nips22/NIPS2009_1143.pdf
A. Farhadi,
M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D.
Forsyth, "Every Picture Tells a Story: Generating Sentences from
Images", ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_2
B. Siddiquie
and A. Gupta, "Beyond Active Noun Tagging: Modeling Contextual
Interactions for Multi-Class Active Learning", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5540044
April 8, 2011 Crowdsourcing and Active Learning
Background readings:
L. von Ahn
and L. Dabbish, "Labeling images with a computer
game", SIGCHI 2004, http://dx.doi.org/10.1145/985692.985733
A. Kapoor,
K. Grauman, R. Urtasun, and
T. Darrell, "Active Learning with Gaussian Processes for Object
Categorization" ICCV 2007.
http://doi.ieeecomputersociety.org/10.1109/ICCV.2007.4408844
Contemporary readings:
J. Deng, W. Dong,
R. Socher, L.-J. Li, K. Li,
and L. Fei-Fei. "ImageNet:
A Large-Scale Hierarchical Image Database". In CVPR,
2009. http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206848
S. Vijayanarasimhan,
P. Jain, K. Grauman, "Far-sighted active
learning on a budget for image and video recognition", CVPR 2010.
http://dx.doi.org/10.1109/CVPR.2010.5540055
P. Welinder,
S. Branson, S. Belongie, P. Perona,
"The Multidimensional Wisdom of Crowds", NIPS 2010.
http://books.nips.cc/papers/files/nips23/NIPS2010_0577.pdf
S. Branson, C. Wah,
B. Babenko, F. Schroff, P. Welinder, P. Perona, S. Belongie, "Visual Recognition with Humans in the
Loop", ECCV 2010. http://dx.doi.org/10.1007/978-3-642-15561-1_32
April 15, 2011 Scene and Image Context
Background readings:
A. Torralba,
K. P. Murphy, and W. T. Freeman, "Contextual models for object detection
using boosted random fields," in Advances in Neural Information Processing
Systems 17 (NIPS), 2005, pp. 1401-1408.
http://dspace.mit.edu/handle/1721.1/6740
D. Hoiem,
A. A. Efros, and M. Hebert, "Putting objects in
perspective," in Computer Vision and Pattern Recognition, 2006 IEEE
Computer Society Conference on, vol. 2, 2006, pp. 2137-2144.
http://dx.doi.org/10.1109/CVPR.2006.232
L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in
Computer Vision, 2007. ICCV 2007. IEEE 11th
International Conference on, 2007, pp. 1-8.
http://dx.doi.org/10.1109/ICCV.2007.4408872
Contemporary readings:
S. Bao,
M. Sun, S. Savarese, "Toward coherent object
detection and scene layout understanding", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5540229
B. Yao and L. Fei-Fei.
"Modeling Mutual Context of Object and Human Pose in Human-Object
Interaction Activities.", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5540235
A. Gupta, A. Efros
and M. Hebert, "Blocks World Revisited: Image Understanding Using
Qualitative Geometry and Mechanics". ECCV 2010,
http://dx.doi.org/10.1007/978-3-642-15561-1_35
April 22, 2011 Taxonomies and Sub-category Recognition
Background readings:
A. Zweig and D. Weinshall,
"Exploiting object hierarchy: Combining models from different category
levels," in Computer Vision, 2007. ICCV 2007.
IEEE 11th International Conference on, 2007, pp. 1-8. Available: http://dx.doi.org/10.1109/ICCV.2007.4409064
G. Griffin and P. Perona, "Learning and using taxonomies for fast visual
categorization," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1-8. Available: http://dx.doi.org/10.1109/CVPR.2008.4587410
J. Sivic,
B. C. Russell, A. Zisserman, W. T. Freeman, and A. A.
Efros, "Unsupervised discovery of visual object
class hierarchies," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1-8. Available:
http://dx.doi.org/10.1109/CVPR.2008.4587622
Contemporary readings:
L.-J. Li, C.
Wang, Y. Lim, D. Blei and L. Fei-Fei.
"Building and Using a Semantivisual Image
Hierarchy", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5540027
M. Rohrbach,
M. Stark, G. Szarvas, I. Gurevych,
and B. Schiele, "What helps where – and why? Semantic
relatedness for knowledge transfer", CVPR 2010,
http://dx.doi.org/10.1109/CVPR.2010.5540121
April 29, 2011 Domain Adaptation
K. Saenko,
B. Kulis, M. Fritz, and T. Darrell, "Adapting
Visual Category Models to New Domains", ECCV 2010,
http://dx.doi.org/10.1007/978-3-642-15561-1_16
A. Bergamo and L. Torresani, "Exploiting weakly-labeled Web images to
improve object classification: a domain
adaptation approach", NIPS 2010,
http://books.nips.cc/papers/files/nips23/NIPS2010_0093.pdf
L. Cao, Z. Liu, T. Huang, "Cross-dataset action detection", CVPR 2010, http://dx.doi.org/10.1109/CVPR.2010.5539875