We develop three-dimensional shape contexts as part of an approach to 3D object recognition from point clouds. 3D shape contexts are semi-local descriptions of object shape centered at points on an object's surface, and are a natural extension of 2D shape contexts introduced by Belongie, Malik, and Puzicha for recognition in 2D images. 3D shape contexts are joint histograms of point density parameterized by radius, azimuth, and elevation. These features are similar in spirit to spin images, which have shown good performance in 3D object recognition tasks. Spin images are two-dimensional descriptors, summing over the azimuth angle, whereas shape contexts preserve information in all three dimensions.
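The descriptor can be sketched as follows: bin the neighbors of a basis point by log radius, azimuth, and elevation. This is a minimal illustration, not the paper's implementation; the bin counts and radial limits here are arbitrary choices for demonstration.

```python
import numpy as np

def shape_context_3d(points, center, r_min=0.1, r_max=2.0,
                     n_r=5, n_az=12, n_el=6):
    """Joint histogram of point density in (radius, azimuth, elevation)
    around `center`. Radial bins are log-spaced so the descriptor is
    more sensitive to nearby structure; all bin counts are illustrative."""
    d = points - center
    r = np.linalg.norm(d, axis=1)
    mask = (r >= r_min) & (r <= r_max)
    d, r = d[mask], r[mask]
    az = np.arctan2(d[:, 1], d[:, 0])             # azimuth in [-pi, pi]
    el = np.arcsin(np.clip(d[:, 2] / r, -1, 1))   # elevation in [-pi/2, pi/2]
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    i_r = np.clip(np.searchsorted(r_edges, r, side='right') - 1, 0, n_r - 1)
    i_az = np.clip(((az + np.pi) / (2 * np.pi) * n_az).astype(int), 0, n_az - 1)
    i_el = np.clip(((el + np.pi / 2) / np.pi * n_el).astype(int), 0, n_el - 1)
    hist = np.zeros((n_r, n_az, n_el))
    np.add.at(hist, (i_r, i_az, i_el), 1.0)       # unbuffered accumulation
    return hist
```

A full pipeline would also fix the north pole along the surface normal at the basis point and normalize for point density, which this sketch omits.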
To recognize objects, we compute shape contexts at a few randomly chosen points on a query image, and find their nearest neighbors in a stored set of shape contexts for a set of sample points on 3D object models. The model with the smallest combined distances is taken to be the best match. Finding nearest neighbors in high dimensions is computationally expensive, so we explore the use of clustering and locality sensitive hashing to speed up the search while maintaining accuracy. Results are shown for both full 3D models and simulated range data.
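The matching step described above can be sketched with brute-force nearest-neighbor search standing in for the clustering and LSH speedups; the function and its scoring are an illustrative reading of the text, not the paper's code.

```python
import numpy as np

def match_model(query_descs, model_descs):
    """Score each model by summing, over the query's descriptors, the
    distance to that descriptor's nearest neighbor among the model's
    stored descriptors; the model with the smallest combined distance
    is taken as the best match."""
    best, best_cost = None, np.inf
    for name, descs in model_descs.items():
        # pairwise distances: query descriptors x stored descriptors
        dists = np.linalg.norm(query_descs[:, None, :] - descs[None, :, :],
                               axis=2)
        cost = dists.min(axis=1).sum()
        if cost < best_cost:
            best, best_cost = name, cost
    return best, best_cost
```

In practice the stored descriptors would be high-dimensional flattened histograms, which is exactly why the abstract turns to clustering and locality sensitive hashing for the search.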
Figure 1: Visualization of the histogram bins of the 3D shape context
In this work we study the problem of combining region and boundary cues for natural image segmentation. We employ a large database of manually segmented images in order to learn an optimal affinity function between pairs of pixels. These pairwise affinities can then be used to cluster the pixels into visually coherent groups. Region cues are computed as the similarity in brightness, color, and texture between image patches. Boundary cues are incorporated by looking for the presence of an "intervening contour," a large gradient along a straight line connecting two pixels.
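The intervening-contour cue can be sketched as follows: walk the straight line between two pixels and let the affinity decay with the largest gradient encountered. The function name, sampling scheme, and decay constant are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def intervening_contour_affinity(grad_mag, p, q, sigma=0.1):
    """Affinity between pixels p and q that decays with the maximum
    gradient magnitude sampled along the straight line joining them;
    a large intervening gradient suggests a boundary separates them."""
    # sample enough points to touch every pixel the segment crosses
    n = max(abs(q[0] - p[0]), abs(q[1] - p[1])) + 1
    rows = np.linspace(p[0], q[0], n).round().astype(int)
    cols = np.linspace(p[1], q[1], n).round().astype(int)
    max_grad = grad_mag[rows, cols].max()
    return np.exp(-max_grad / sigma)
```

The exponential form is one common choice for turning a boundary response into a similarity in [0, 1].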
We first use the dataset of human segmentations to individually optimize parameters of the patch and gradient features for brightness, color, and texture cues. We then quantitatively measure the power of different feature combinations by computing the precision and recall of classifiers trained using those features. The mutual information between the output of the classifiers and the same-segment indicator function provides an alternative evaluation technique that yields identical conclusions.
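The precision-and-recall evaluation mentioned above reduces to a simple computation once classifier outputs are thresholded against the same-segment indicator; this helper is a generic sketch, not the paper's evaluation harness.

```python
import numpy as np

def precision_recall(pred, truth):
    """Precision and recall of binary same-segment predictions
    against ground-truth same-segment labels."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)            # correctly predicted same-segment pairs
    precision = tp / max(pred.sum(), 1)  # fraction of predictions that are right
    recall = tp / max(truth.sum(), 1)    # fraction of true pairs recovered
    return precision, recall
```

Sweeping the classifier's threshold and recording these two numbers traces out the precision-recall curves used to compare feature combinations.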
As expected, the best classifier makes use of brightness, color, and texture features, in both patch and gradient forms. We find that for brightness, the gradient cue outperforms the patch similarity. In contrast, using color patch similarity yields better results than using color gradients. Texture is the most powerful of the three channels, with both patches and gradients carrying significant independent information. Interestingly, the proximity of the two pixels does not add any information beyond that provided by the similarity cues. We also find that the convexity assumptions made by the intervening contour approach are supported by the ecological statistics of the dataset.
Figure 1: Pixel affinity images. The first row shows an image with one pixel selected. The remaining rows show the similarity between that pixel and all other pixels in the image, where white is most similar. Rows 2-4 show our patch-only, contour-only, and patch+contour affinity models. Rows 5 and 6 show the pixel similarity as given by the ground-truth data, where white corresponds to more agreement between humans. Row 6 simply shows the same-segment indicator function, while row 5 is computed using intervening contour on the human boundary maps.
Figure 2: Performance of humans compared to our best pixel affinity models. The dots show the precision and recall of each of 1366 human segmentations in the 250-image test set when compared to the other humans' segmentations of the same image. The large dot marks the median recall (99%) and precision (63%) of the humans. The iso-F-measure curve at F=77% is extended from this point to represent the frontier of human performance for this task. The three remaining curves represent our patch-only model, contour-only model, and patch+contour model. Neither patches nor contours alone are sufficient, as there is significant independent information in the two cues. The model used throughout the paper is a logistic function with quadratic terms, which performed best among the classifiers tried on this dataset.
Spectral graph theoretic methods have recently shown great promise for the problem of image segmentation. However, due to the computational demands of such methods, applications to large problems such as spatiotemporal data and high resolution imagery have been slow to appear. The contribution of this work is a method that substantially reduces the computational requirements of grouping algorithms based on spectral partitioning, making it feasible to apply them to very large grouping problems. Our approach is based on a technique for the numerical solution of eigenfunction problems known as the Nyström method. This method allows extrapolation of the complete grouping solution using only a small number of "typical" samples. In doing so, we successfully exploit the fact that there are far fewer coherent groups in a scene than pixels.
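The core of the Nyström extrapolation can be sketched in a few lines: eigendecompose only the small block of affinities among the samples, then extend those eigenvectors to all remaining points. This is a bare one-shot variant for illustration, without the normalization steps a normalized-cuts pipeline would add.

```python
import numpy as np

def nystrom_eigenvectors(A, B, k):
    """Approximate the leading k eigenvectors of a full affinity matrix
    over n + m points, given only A (n x n affinities among the n
    samples) and B (n x m affinities between samples and the rest).
    Only A is eigendecomposed; the remaining entries are extrapolated."""
    vals, U = np.linalg.eigh(A)        # eigh returns ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]   # keep the k leading eigenpairs
    vals, U = vals[idx], U[:, idx]
    U_rest = B.T @ U / vals            # Nystrom extension to unsampled points
    return vals, np.vstack([U, U_rest])
```

The cost is dominated by the n x n eigendecomposition and one matrix product, rather than an eigensolve over all pixels, which is the source of the speedup claimed above.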
The goal of this work is to accurately detect and localize boundaries in natural scenes using local image measurements. We formulate features that respond to characteristic changes in brightness, color, and texture associated with natural boundaries. In order to combine the information from these features in an optimal way, we train a classifier using human labeled images as ground truth. The output of this classifier provides the posterior probability of a boundary at each image location and orientation. We present precision-recall curves showing that the resulting detector significantly outperforms existing approaches. Our two main results are (1) that cue combination can be performed adequately with a simple linear model, and (2) that a proper treatment of texture is required to detect boundaries in natural images.
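The first main result, that cue combination works adequately with a simple linear model, amounts to a logistic over the local features; the sketch below assumes hypothetical fitted weights and is not the paper's trained classifier.

```python
import numpy as np

def logistic_boundary_prob(features, weights, bias):
    """Posterior probability of a boundary at a location and orientation,
    from a linear combination of local cues (e.g. brightness, color,
    and texture gradients) passed through a logistic. The weights and
    bias would be fit to the human-labeled ground truth."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
```

With zero weights the output is the uninformed prior of 0.5; training pushes the weights so that strong texture and brightness gradients drive the probability toward 1.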
Figure 1: Two decades of boundary detection. The performance of our boundary detector compared to classical boundary detection methods and to the human subjects' performance. A precision-recall curve is shown for five boundary detectors: (1) Gaussian derivative (GD); (2) Gaussian derivative with hysteresis thresholding (GD+H), the Canny detector; (3) a detector based on the second moment matrix (2MM); (4) our grayscale detector that combines brightness and texture (BG+TG); and (5) our color detector that combines brightness, color, and texture (BG+CG+TG). Each detector is represented by its precision-recall curve, which measures the tradeoff between accuracy and noise as the detector's threshold varies. Each curve's F-measure, which ranges from zero to one, is given in the caption; the F-measure is a summary statistic for a precision-recall curve. The points on the plot show the precision and recall of each ground-truth human segmentation when compared to the other humans. The median F-measure for the human subjects is 0.80. The solid curve shows the F=0.80 curve, representing the frontier of human performance for this task.
Figure 2: Local image features. In each row, the first panel shows an image patch. The following panels show feature profiles along the patch's horizontal diameter. The features are raw image intensity, brightness gradient BG, color gradient CG, raw texture gradient TG, and localized texture gradient TG. The vertical red line in each profile marks the patch center. The scale of each feature has been chosen to maximize performance on the set of training images: 2% of the image diagonal (5.7 pixels) for CG and TG, and 1% of the image diagonal (3 pixels) for BG. The challenge is to combine these features in order to detect and localize boundaries.
Figure 3: Boundary images for three grayscale detectors. Columns 2-4 show P_b images for the Canny detector, the second moment matrix (2MM), and our brightness+texture detector (BG+TG). The human segmentations are shown for comparison. The BG+TG detector benefits from operating at a large scale without sacrificing localization, and from suppressing edges in the interior of textured regions.
The problem we consider in this project is to take a single two-dimensional image containing a human body, locate the joint positions, and use these to estimate the body configuration and pose in three-dimensional space. The basic approach is to store a number of exemplar 2D views of the human body in a variety of different configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labelled for future use. The test shape is then matched to each stored view, using the technique of shape context matching. Assuming that there is a stored view sufficiently similar in configuration and pose, the correspondence process will succeed. The locations of the body joints are then transferred from the exemplar view to the test shape. Given the joint locations, the 3D body configuration and pose are then estimated. We present results of our method on a corpus of human pose data.