We first review the classic brightness change constraint equation and its application to motion estimation under perspective camera projection. We then introduce and develop a second, analogous constraint that operates directly on depth information. Next, we show how to combine these constraints across image pixels into a single linear system which may be solved efficiently for 3-D motion parameters. Finally, we discuss a spatial coordinate shift that greatly improves the motion estimation results.
In all of the following derivations, we will denote the coordinates of a point in 3-D space as X = [X, Y, Z]^T, and the 3-D velocity of this point as V = [Vx, Vy, Vz]^T. When we project this point onto the camera image plane via some camera projection model, it will be located at the 2-D image coordinate (x, y). The 3-D motion of the point in space will induce a corresponding 2-D motion of the projected point in the image plane, and we will express these 2-D velocities as v = [vx, vy]^T.
The brightness change constraint equation (BCCE) for image velocity estimation arises from the assumption that intensities undergo only local translations from one frame to the next in an image sequence. This assumption is only approximately true in practice, in that it ignores phenomena such as occlusions, disocclusions, and changes in intensity due to changes in lighting. The assumption may be expressed for frames at times t and t+1 as follows:
I(x, y, t) = I(x + vx(x,y,t), y + vy(x,y,t), t+1)    (1)
I(x,y,t) is the image intensity, and vx(x,y,t) and vy(x,y,t) are the x- and y-components of the 2-D velocity field of object motion after projection onto the image plane. If we further assume that the time-varying image intensity is well approximated by a first-order Taylor series expansion, we can expand the right side of the above equation to obtain
Ix vx + Iy vy + It = 0    (2)

where Ix, Iy, and It denote the partial derivatives of I with respect to x, y, and t.
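As a minimal numerical sketch of this linearization (our own illustration, not from the original text), consider a linear intensity ramp translated between frames; the first-order expansion is exact for such an image, so the residual of (2) vanishes. The array sizes, ramp slopes, and velocity values below are arbitrary:

```python
import numpy as np

h, w = 32, 32
y, x = np.mgrid[0:h, 0:w].astype(float)
vx, vy = 1.5, -0.75                    # assumed constant 2-D image velocity
I0 = 0.4 * x + 0.9 * y                 # frame at time t: a linear intensity ramp
I1 = 0.4 * (x - vx) + 0.9 * (y - vy)   # frame at t+1: I0 translated by (vx, vy)

Iy_, Ix_ = np.gradient(I0)             # spatial derivatives (np.gradient is row-major)
It_ = I1 - I0                          # temporal derivative (forward difference)
residual = Ix_ * vx + Iy_ * vy + It_   # left side of (2); ~0 at every pixel
```

For real imagery the residual is nonzero wherever the intensity surface is not locally linear, which is one reason the constraint is combined over many pixels rather than trusted at any single one.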
This equation constrains image plane velocities, but we are
interested in solving for 3-D world velocities. For a perspective camera
with focal length f, the relationship between the two sets of velocities
may be derived from the perspective camera projection equations:
x = fX/Z and y = fY/Z. Taking the derivatives of these equations with respect to time yields
vx = (f Vx - x Vz) / Z,   vy = (f Vy - y Vz) / Z    (3)
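A quick self-check of this relationship, assuming the perspective model x = fX/Z, y = fY/Z (the point, velocity, and focal length below are arbitrary illustrative values):

```python
import numpy as np

def project(X, f):
    """Perspective projection: (x, y) = (f*X/Z, f*Y/Z)."""
    return np.array([f * X[0] / X[2], f * X[1] / X[2]])

def image_velocity(X, V, f):
    """2-D image velocity induced by 3-D velocity V at X = (X, Y, Z), per (3)."""
    x, y = project(X, f)
    return np.array([(f * V[0] - x * V[2]) / X[2],
                     (f * V[1] - y * V[2]) / X[2]])
```

The analytic velocities can be compared against a finite-difference derivative of the projected position, which is how the test below validates the formula.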
We wish to further constrain the 3-D velocities according to rigid body motion. Any rigid body motion can be expressed in terms of the instantaneous object translation T = [tx, ty, tz]^T and the instantaneous rotation of the object about an axis ω = [ωx, ωy, ωz]^T, where ω/|ω| describes the orientation of the axis of rotation, and |ω| is the magnitude of rotation per unit time. For small rotations,
V = T + ω × X    (4)
Equivalently, in matrix form,

V = Q φ    (5)

where φ = [tx, ty, tz, ωx, ωy, ωz]^T and

Q = [ 1  0  0    0    Z   -Y ]
    [ 0  1  0   -Z    0    X ]
    [ 0  0  1    Y   -X    0 ]
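The matrix form can be sketched as follows (a NumPy illustration; `Q_matrix` and `phi` are our own labels, not from the text). The check confirms that Qφ reproduces T + ω × X:

```python
import numpy as np

def Q_matrix(X):
    """3x6 matrix relating phi = [tx ty tz wx wy wz] to the 3-D velocity
    of point X = (X, Y, Z), as in equation (5)."""
    Xw, Yw, Zw = X
    return np.array([
        [1.0, 0.0, 0.0,  0.0,  Zw, -Yw],
        [0.0, 1.0, 0.0, -Zw,  0.0,  Xw],
        [0.0, 0.0, 1.0,  Yw, -Xw,  0.0],
    ])
```

Each row of Q simply expands one component of T + ω × X, so the equivalence with equation (4) holds exactly, not only for small motions; the small-rotation assumption enters when these instantaneous velocities are used to approximate the finite motion between frames.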
Substituting the right side of (3) for the image velocities in (2), and then the right side of (5) for V, produces a single linear equation relating image intensity derivatives to rigid body motion parameters under perspective projection at a single pixel:
(1/Z) [ f Ix,  f Iy,  -(x Ix + y Iy) ] Q φ = -It    (6)
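The per-pixel coefficient row can be sketched as below (our own consistency check, not from the original text): with V = Qφ, the row times φ should equal Ix·vx + Iy·vy from equations (2) and (3).

```python
import numpy as np

def Q_matrix(X):
    Xw, Yw, Zw = X
    return np.array([[1, 0, 0,   0,  Zw, -Yw],
                     [0, 1, 0, -Zw,   0,  Xw],
                     [0, 0, 1,  Yw, -Xw,   0]], dtype=float)

def intensity_row(Ix, Iy, x, y, f, X):
    """1x6 coefficient row of equation (6), so that row @ phi = -It at one pixel."""
    c = np.array([f * Ix, f * Iy, -(x * Ix + y * Iy)]) / X[2]
    return c @ Q_matrix(X)
```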
Much of the previous work on motion and pose estimation from intensity data has used this constraint and variations on it. However, in most of that work, the depth values which appear in the equation are not known, and one must use non-linear estimation techniques to solve for the motion (see [10] for examples). Alternatively, the estimation can be reduced to a linear system through the use of a generic shape model of the object being tracked [2,3]. By using depth measurements directly in our linear constraint equation, we are able to avoid the non-linear computations required by the former class of approaches, as well as reduce the object shape errors inherent in the latter class of approaches.
Assuming that video-rate depth information is available to us, we can relate changes in the depth image over time to rigid body motion in a manner similar to that shown for intensity information above. For rigid objects, an object point which appears at a particular image location (x, y) at time t will appear at location (x + vx, y + vy) at time t+1. The depth values at these corresponding locations in image space and time should therefore be the same, except for any depth translation that the object point undergoes between the two frames. This can be expressed in a form similar to (1):
Z(x,y,t) + Vz(x,y,t) = Z(x + vx(x,y,t), y + vy(x,y,t), t+1)    (7)
Assuming, as we did for intensity, that the time-varying depth is well approximated by a first-order Taylor series expansion, we can expand the right side of (7) to obtain

Zx vx + Zy vy + Zt = Vz    (8)

where Zx, Zy, and Zt are the partial derivatives of Z with respect to x, y, and t.
Use of the perspective camera projection relationship (3) to relate image velocities to 3-D world velocities yields

(1/Z) [ f Zx,  f Zy,  -(x Zx + y Zy + Z) ] V = -Zt

Finally, we again constrain the 3-D world velocities to rigid body motion by introducing the matrix Q of equation (5), which produces the depth analog of (6):

(1/Z) [ f Zx,  f Zy,  -(x Zx + y Zy + Z) ] Q φ = -Zt    (9)
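The depth coefficient row admits the same kind of consistency check as the intensity row (again our own illustration): with V = Qφ, the row times φ should equal Zx·vx + Zy·vy - Vz, which by (8) is -Zt.

```python
import numpy as np

def Q_matrix(X):
    Xw, Yw, Zw = X
    return np.array([[1, 0, 0,   0,  Zw, -Yw],
                     [0, 1, 0, -Zw,   0,  Xw],
                     [0, 0, 1,  Yw, -Xw,   0]], dtype=float)

def depth_row(Zx, Zy, x, y, f, X):
    """1x6 coefficient row of equation (9), so that row @ phi = -Zt at one pixel."""
    c = np.array([f * Zx, f * Zy, -(x * Zx + y * Zy + X[2])]) / X[2]
    return c @ Q_matrix(X)
```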
In many applications, we can approximate the camera projection model as orthographic instead of perspective without introducing significant error in 3-D world coordinate estimation. For the pose tracking algorithms discussed in this paper, use of orthographic projection greatly simplifies the constraint equations derived in the previous sections, thereby making the solution of linear systems of these equations much less computationally intensive.
Derivation of the orthographic analogs of equations (6) and (9) is straightforward. We replace the perspective projection relationship with the orthographic projection equations x = X and y = Y, which in turn imply that vx = Vx and vy = Vy. Hence, equation (3) is replaced by the much simpler equation
vx = Vx,   vy = Vy    (10)

Substituting (10) in place of (3) in the derivations above, the intensity constraint (6) becomes

[ Ix,  Iy,  0 ] Q φ = -It    (11)

and the depth constraint (9) becomes

[ Zx,  Zy,  -1 ] Q φ = -Zt    (12)
We can write intensity and depth constraint equations of the form of equations (6) and (9) for each pixel location that pertains to the object of interest. Because the intensity constraint equations (6) are linear, we can combine them across N pixels by stacking the equations in matrix form:

A_I φ = b_I

where the ith row of A_I is the 1×6 vector obtained by multiplying out the left side of equation (6) at pixel i, and the ith element of b_I is -It at pixel i. The I subscripts on A_I and b_I indicate that they use only the intensity constraints. We can collect the depth constraint equations (9) into an analogous linear system:

A_Z φ = b_Z

Provided that N > 6, the least-squares method may be used to solve either of these systems independently for the motion parameters φ.
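The stacking and least-squares step can be sketched with synthetic data (here using the orthographic depth rows for brevity; the point cloud, depth gradients, and motion parameters below are fabricated, and the per-pixel -Zt values are generated from the true motion so that exact recovery is expected):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = np.array([0.1, -0.2, 0.05, 0.01, -0.02, 0.03])  # [T; omega], arbitrary

N = 50
pts = rng.uniform(-1.0, 1.0, size=(N, 3)) + np.array([0.0, 0.0, 3.0])  # points ahead of camera
Zx = rng.uniform(-0.5, 0.5, N)     # synthetic depth gradients at each pixel
Zy = rng.uniform(-0.5, 0.5, N)

A = np.zeros((N, 6))
b = np.zeros(N)
for i, (Xw, Yw, Zw) in enumerate(pts):
    Q = np.array([[1, 0, 0,   0,  Zw, -Yw],
                  [0, 1, 0, -Zw,   0,  Xw],
                  [0, 0, 1,  Yw, -Xw,   0]], dtype=float)
    A[i] = np.array([Zx[i], Zy[i], -1.0]) @ Q   # orthographic depth row
    b[i] = A[i] @ phi_true                      # -Zt implied by the true motion

phi_est, *_ = np.linalg.lstsq(A, b, rcond=None)
```

With N = 50 well-spread points the 6 columns of A are linearly independent, so the consistent system is recovered exactly; with real, noisy derivatives the residual would be nonzero and the solution a genuine least-squares estimate.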
The intensity and depth linear systems A_I φ = b_I and A_Z φ = b_Z may also be combined into a single linear system for constraining the motion parameters:
[ A_I  ]       [ b_I  ]
[ λA_Z ] φ  =  [ λb_Z ]    (13)
The scaling factor λ controls the weighting of the depth constraints relative to the intensity constraints. When one expects depth to be more reliable than intensity, such as under fast-changing lighting conditions, one might set λ to a value higher than 1; under other conditions, such as when depth information is much noisier than intensity, one might prefer lower values of λ. The least-squares solution to the above equation is
φ = (A^T A)^{-1} A^T b    (14)

where A and b denote the stacked matrix and vector on the left and right sides of (13), respectively.
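A minimal sketch of the weighted combination and its normal-equations solution (`solve_combined` and `lam` are our hypothetical names):

```python
import numpy as np

def solve_combined(A_I, b_I, A_Z, b_Z, lam=1.0):
    """Least-squares solution (14) of the combined system (13); lam is the
    scaling factor weighting the depth constraints against intensity."""
    A = np.vstack([A_I, lam * A_Z])
    b = np.concatenate([b_I, lam * b_Z])
    return np.linalg.solve(A.T @ A, A.T @ b)   # (A^T A)^{-1} A^T b
```

Forming the 6x6 normal equations is cheap here because φ has only six parameters; for badly conditioned A, a QR- or SVD-based solver such as `np.linalg.lstsq` is the numerically safer choice.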
The image pixel data used to build the above linear system are taken only from a restricted region of image support. This support region is the set of pixel locations which we believe correspond to the object, and for which intensity, depth, and their derivatives are well-defined.
The motions estimated between pairs of consecutive frames are simply added together to form an estimate of cumulative object motion over time. It is beneficial to supplement this tracking algorithm with a parallel scheme that decides when the accumulated error has become substantial, and reinitializes the object pose estimate at those times.
We improve the numerical stability of the least-squares solution by translating all of the 3-D spatial coordinates to the centroid of the supported samples, which we denote X_c. This transformation affects only the matrix Q, and the motion parameter vector will compensate for the change. That is, we can rewrite equation (6) as
(1/Z) [ f Ix,  f Iy,  -(x Ix + y Iy) ] Q' φ' = -It    (15)

where Q' is the matrix Q of equation (5) evaluated at the shifted coordinates X - X_c (with X_c the centroid), and φ' = [tx', ty', tz', ωx', ωy', ωz']^T contains the motion parameters expressed in the shifted coordinate system.
Equation (9) can be modified similarly. We combine these shifted intensity and depth equations into a single linear system and solve for the motion parameters by least squares, as described above. These motion parameters are in the coordinate system of the object centroid; we would like to transform them back to motion parameters in the camera coordinate system. The 3-D velocities in the shifted coordinate system are described by V = T' + ω' × (X - X_c), where X_c is the centroid. Since these velocities must also equal the velocities in the camera coordinate system, given by equation (4), it is straightforward to show that
ω = ω',   T = T' - ω' × X_c    (16)
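The back-transformation can be sketched and verified numerically (the function name is our own; the check confirms that the centroid-shifted and camera-frame parameterizations induce the same 3-D velocity at an arbitrary point):

```python
import numpy as np

def to_camera_frame(T_shift, w_shift, centroid):
    """Map motion parameters from centroid-shifted coordinates back to the
    camera coordinate system, following equation (16)."""
    w = np.asarray(w_shift, dtype=float)          # rotation is unchanged
    T = np.asarray(T_shift, dtype=float) - np.cross(w, centroid)
    return T, w
```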