

Background Estimation

  In basic terms, we define the background as the stationary portion of a scene. Many applications simply require that there be introductory frames in the sequence which contain only background elements. If pure background frames are available, pixel-wise statistics in color and depth can be computed directly. The more difficult case is computing the background model in sequences which always contain foreground elements.
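For the simple case of pure background frames, the direct computation can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the frame layout and function name are assumptions, with each pixel stored as an (r, g, b, z) tuple.

```python
# Illustrative sketch: per-pixel mean and variance of (R,G,B,Z)
# over frames known to contain only background.
# frames: list of H x W images; each pixel is an (r, g, b, z) tuple.

def pixel_statistics(frames):
    h, w = len(frames[0]), len(frames[0][0])
    n = len(frames)
    means, variances = [], []
    for y in range(h):
        mrow, vrow = [], []
        for x in range(w):
            obs = [f[y][x] for f in frames]  # all observations at (y, x)
            mean = tuple(sum(o[c] for o in obs) / n for c in range(4))
            var = tuple(sum((o[c] - mean[c]) ** 2 for o in obs) / n
                        for c in range(4))
            mrow.append(mean)
            vrow.append(var)
        means.append(mrow)
        variances.append(vrow)
    return means, variances
```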

We model each pixel as an independent statistical process. We record the (R,G,B,Z) observations at each pixel over a sequence of frames in a multidimensional histogram. We then use a clustering method to fit the data with an approximation of a mixture of Gaussians. For ease of computation, we assume a covariance matrix of the form $\Sigma=\sigma^2I$. At each pixel, one of the clusters is selected as the background process; the others are considered to be caused by foreground processes. In the general case, where the depth measurements at the pixel are largely valid, the background is simply represented by the mode which is farthest in range and covers at least $T\%$ of the data temporally. We use T=10. In general, the temporal coverage required for good background estimation when depth is available can be much lower than in a color-only estimate, because the background is inherently behind the foreground. We need only ensure that the deepest mode is a reliable process, and not due to noise.
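The mode-selection rule can be made concrete with a short sketch. This is a hypothetical rendering, not the paper's code: the cluster representation (a dictionary holding a (r, g, b, z) mean and a temporal-coverage weight) and all names are assumptions for illustration.

```python
T = 0.10  # minimum temporal coverage for a reliable background mode

def select_background_mode(clusters, coverage_threshold=T):
    """Return the deepest cluster covering at least `coverage_threshold`
    of the frames, or None if no cluster qualifies.
    Each cluster: {"mean": (r, g, b, z), "weight": fraction of frames}."""
    candidates = [c for c in clusters if c["weight"] >= coverage_threshold]
    if not candidates:
        return None
    # Deepest mode = largest z among sufficiently supported clusters.
    return max(candidates, key=lambda c: c["mean"][3])

# Example: a foreground person (60% of frames, z = 1.5 m) occludes a
# wall (25% of frames, z = 4.0 m); a spurious noise mode covers 5%.
clusters = [
    {"mean": (90, 80, 70, 1.5), "weight": 0.60},   # foreground person
    {"mean": (30, 40, 200, 4.0), "weight": 0.25},  # background wall
    {"mean": (0, 0, 0, 9.0), "weight": 0.05},      # range noise, below T
]
bg = select_background_mode(clusters)
# The wall is chosen even though it appears in only 25% of frames.
```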

However, if the range at the pixel is undefined in a significant portion of the data (a larger fraction than is represented by the deepest mode), then we do not have sufficient data to model the background range, and we tag the background range as invalid (e.g., corresponding to a uniform distribution). We then cluster the data in color space and use the largest (most common) mode to define the background color.
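The fallback logic can be sketched as follows. Again this is an illustrative sketch under assumed names and data structures, not the paper's code: `range_clusters` are modes fit in (R,G,B,Z) over frames with valid depth, `color_clusters` are modes fit in (R,G,B) over all frames, and weights are fractions of the whole sequence.

```python
INVALID_RANGE = None  # stands in for a uniform distribution over depth

def background_model(range_clusters, color_clusters, invalid_fraction, T=0.10):
    """Return an (r, g, b, z) background estimate, with z = INVALID_RANGE
    when too little valid depth data is available at this pixel."""
    reliable = [c for c in range_clusters if c["weight"] >= T]
    if reliable:
        deepest = max(reliable, key=lambda c: c["mean"][3])
        # Enough valid depth: use the deepest reliable mode directly.
        if invalid_fraction <= deepest["weight"]:
            return deepest["mean"]
    # Otherwise tag the range invalid and take the largest color mode.
    dominant = max(color_clusters, key=lambda c: c["weight"])
    return (*dominant["mean"], INVALID_RANGE)
```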

As long as there is sufficient data representing the background at any given pixel over the sequence, the background can be estimated in the presence of foreground elements. In traditional color-based background estimation, which models the background color as the mode of the color histogram at each pixel, the background must be present at a given pixel in the majority of the frames for correct background estimation. A significant advantage of the use of color and depth space in the background estimation process is that, at pixels for which depth is usually valid, we can correctly estimate depth and color of the background when the background is represented in only a minority of the frames. For pixels which have significant invalid range, we fall back to the same majority requirement as color-only methods.

It is important to note the advantage of using a multi-dimensional representation. When the background range or color is estimated independently, the background mode is more easily contaminated with foreground statistics. Take, for example, standard background range estimation [2] for a scene in which people are walking across a floor. Their shoes (foreground) come into close proximity with the floor (background) as they walk, so the mode of the data representing the floor depth will be biased to some extent by the shoe data. Similarly, in standard background color estimation, for a scene in which a person in a greenish-blue shirt (foreground) walks in front of a blue wall (background), the blue background color mode will be biased slightly toward green. However, assuming that the shoe is a significantly different color than the floor in the first case, and that the person walks at a significantly different depth from the wall in the second, the combined range/color histogram modes for foreground and background will not overlap, resulting in more accurate estimates of the background statistics in both cases.
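The shoe-and-floor case can be illustrated numerically. The toy computation below is an assumption-laden stand-in for the actual estimators (it pools samples and separates them by a hand-picked color cutoff rather than by clustering), but it shows how depth-only estimation is biased while a joint color/depth view is not.

```python
# Toy illustration: a beige floor at 3.0 m seen in 90% of frames, and a
# dark shoe at 2.95 m, nearly overlapping the floor in depth but not in
# color. Samples are (r, g, b, z) tuples.
floor = [(180, 160, 140, 3.00)] * 90
shoe = [(20, 20, 20, 2.95)] * 10

# Depth-only: shoe samples fall inside the floor's depth mode, biasing it.
depths = [s[3] for s in floor + shoe]
depth_only = sum(depths) / len(depths)  # pulled below the true 3.0 m

# Joint color+depth: the colors are well separated, so the floor-colored
# samples form their own mode, and its depth estimate is uncontaminated.
floor_like = [s for s in floor + shoe if s[0] > 100]  # illustrative cutoff
joint = sum(s[3] for s in floor_like) / len(floor_like)
```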

Preprocessing of range data

  Video from a pair of cameras is used to estimate the distance of the objects in the scene using the census stereo algorithm [12]. We have implemented the census algorithm on a single PCI card, multi-FPGA reconfigurable computing engine [10]. This stereo system is capable of computing 32 stereo disparities on 320 by 240 images at 57 frames per second, or approximately 140 million pixel-disparities per second. Using a commercial PCI frame-grabber, the system runs at 30 frames per second and 73 million pixel-disparities per second. These processing speeds compare quite favorably with other real-time stereo implementations such as [5].
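The quoted throughput figures follow directly from the image size, disparity count, and frame rate:

```python
# Arithmetic behind the throughput figures: 320x240 images, 32 disparities.
width, height, disparities = 320, 240, 32
pixel_disparities = width * height * disparities  # per frame

at_57_fps = pixel_disparities * 57  # ~140 million pixel-disparities/s
at_30_fps = pixel_disparities * 30  # ~73 million pixel-disparities/s
```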

The stereo range data typically includes some erroneous values which are not marked as invalid. These errors often take the form of isolated regions which are either much farther or much nearer than their surroundings. Since the census algorithm uses a neighborhood-based comparison when computing disparity between the two views, disparities for an image region of uniform depth that is small relative to the effective correlation window are not likely to represent true distances in the scene. Therefore, before either background estimation or subsequent segmentation, we process the range data to remove these artifacts using non-linear morphological smoothing [8,9].
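A simple grey-scale open-close sequence illustrates how small outlier regions can be removed; this is a generic morphological sketch under assumed names, not the specific non-linear smoothing of [8,9].

```python
# Toy sketch: removing isolated range outliers with grey-scale morphology.
# A depth map is a list of rows of float distances (meters).

def _local(img, op, r=1):
    """Apply `op` (min or max) over each pixel's (2r+1)^2 neighborhood."""
    h, w = len(img), len(img[0])
    return [[op(img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]

def smooth_range(depth):
    # Opening (erode, then dilate) removes small regions much farther
    # (larger depth) than their surroundings; closing (dilate, then
    # erode) removes small regions that are much nearer.
    opened = _local(_local(depth, min), max)
    return _local(_local(opened, max), min)

# A 5x5 floor at 3.0 m with a one-pixel far outlier is flattened:
spiky = [[3.0] * 5 for _ in range(5)]
spiky[2][2] = 9.0
flat = smooth_range(spiky)
```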

Background Estimation Results

In Figure 3, we show an example of the background computed from 60 frames sampled from a 780-frame sequence. The top row shows typical images from the sequence; no frames in the sequence contained only background. The bottom row shows the background range and color representation, in which all the foreground elements have been effectively removed.


 
Figure:  The top row shows sample images from a 780-frame sequence which contained no frames without people in the foreground. The bottom row shows the background model estimated from this sequence. These examples use an intensity and range model space.

These examples were computed with an off-line implementation of this background estimation algorithm. We are currently working on extensions which will allow dynamic background estimation based on the previous N frames (to allow for slow changes in the background), as well as an estimate of multiple background processes at each pixel, similar to [3], but using higher dimensional Gaussians.



G. Gordon, T. Darrell, M. Harville, and J. Woodfill. "Background estimation and removal based on range and color," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, June 1999.