MONTE CARLO SAMPLING FOR VISUAL POSE TRACKING
Jehoon Lee¹, Romeil Sandhu¹, and Allen Tannenbaum¹,²
¹Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
²Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
[email protected], [email protected], [email protected]
ABSTRACT
In this paper, we present a visual pose tracking algorithm based on
Monte Carlo sampling of the special Euclidean group SE(3) and knowledge of a 3D model. In general, the relative pose of an object in 3D
space can be described by sequential transformation matrices at each
time step. Thus, the objective of this work is to find a transformation matrix in SE(3) so that the projection of an object transformed
by this matrix coincides with the object of interest in the 2D image
plane. To do this, first, a set of these transformation matrices is
randomly generated via an autoregressive model. Next, a 3D transformation is performed on a 3D model by these matrices. Finally, a
region-based energy model is designed to evaluate the optimality of
a transformed model’s projection. Experimental results demonstrate
the robustness of the proposed method in several tracking scenarios.
Index Terms— Visual tracking, 2D-3D pose estimation, Monte
Carlo sampling
1. INTRODUCTION
Visual tracking is one of the major topics in the computer
vision community. Numerous algorithms have been proposed in recent years to locate and track objects of interest; see [1] and the references therein. In this paper, we address the problem of 2D-3D pose
estimation. This problem consists of determining an object's pose (position and orientation) in 3D, relative to a calibrated camera, from
a 2D image sequence; the relative pose can be described in terms
of trajectories on a transformation matrix [2]. Thus, the objective of
this paper is to find a 3D transformation matrix to optimally estimate
the pose of an object of interest. To this end, we propose a Monte
Carlo based sampling method to estimate a 3D transformation matrix for 2D object tracking and 3D pose estimation. In addition, we
take advantage of knowledge of a 3D model of the object. This allows the tracker to capture various aspects (or quasi-deformations)
of an object with respect to its dynamic motion, whereas methods
based on learning a collection of 2D shape priors have difficulty
describing them completely.
Several attempts to estimate the pose of a 3D object based on a
2D image sequence have been presented. For example, in [3], the
pose is estimated using local features, such as visible edges. These
local feature-based methods have difficulty dealing with a cluttered background due to their sensitivity to noise. To make up for
this weakness, the authors in [4, 5] present a region-based model for
2D-3D pose estimation. In particular, [5] presents a variational approach to jointly perform 2D segmentation and 3D pose estimation,
based on minimizing a region-based energy functional using a gradient descent flow.

[This work was supported in part by grants from NSF, AFOSR, ARO, as well as by a grant from NIH (NAC P41 RR-13218) through Brigham and Women's Hospital. This work is part of the National Alliance for Medical Image Computing (NAMIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics/.]
The proposed algorithm is related to the work in [5]. However, unlike the gradient-flow approach of [5], we find the optimal pose of an object of interest by randomly generating a set of
transformation matrices and then evaluating the proposed statistical energy model for each transformation matrix.
To construct 3D transformation matrices,
based on the gradient flow. To construct 3D transformation matrices,
Monte Carlo sampling of the matrix group SE(3) is used, which is
inspired by some filtering schemes on Lie groups in literature [6, 7].
This approach takes into account the underlying geometry of SE(3)
rather than employing local coordinate-based sampling, because the
set of three-dimensional transformations is not a vector space but
a curved Lie group [7]. The work in [6] presents
a pose tracking method working on SE(3) in a Bayesian framework
and discusses minimum mean-square error on SE(3) and SO(3). The
authors in [7] sequentially draw samples from the two-dimensional affine
group via particle filtering and propagate them through an autoregressive
process developed for the state dynamics. Our work addresses a more
general case of the work in [7] since we focus not only on tracking
2D motion but also estimating a 3D pose.
2. PRELIMINARIES
2.1. Notation and Transformation Matrix
It is assumed that the camera is calibrated and the object motion
is relatively smooth. Let $X = [X, Y, Z]^T$ be the coordinates of
a point in $\mathbb{R}^3$ with respect to the reference frame of the camera. The calibrated camera is modeled as an ideal perspective projection
$\pi : \mathbb{R}^3 \mapsto \Omega$, $X \mapsto x$, where $\Omega \subset \mathbb{R}^2$ is the domain of an image
$I(x)$ and $x = [x, y]^T = [X/Z, Y/Z]^T$ denotes coordinates in $\Omega$.
$S$ is a smooth surface representing the shape of interest in $\mathbb{R}^3$, and
its projected region is $R = \pi(S) \subset \Omega$; $R^c = \Omega \setminus R$ is the complementary region of $R$. Similarly, the curve $\hat{c} = \pi(C) \subset \Omega$ is the
projection of the curve $C \subset S$ in 3D, where $\hat{c}$ denotes the boundary of
$R$ in 2D: $\hat{c} = \partial R$. Let $X_0 \in \mathbb{R}^3$ and $X \in \mathbb{R}^3$ be the coordinates
of points on $S_0$ and $S$, respectively, where $S_0$ is the reference surface in 3D. Since the camera is calibrated, the pose of
the object with respect to the camera is represented by $T \in SE(3)$,
a $4 \times 4$ transformation matrix of a 3D rigid body motion such
that $S = T(S_0)$. This yields the following expression in a point-wise
fashion: $X = T(X_0) = RX_0 + t$, where $R \in SO(3)$ is a $3 \times 3$
rotation matrix and $t \in \mathbb{R}^3$ is a $3 \times 1$ translation vector. The transformation
$T$ is represented in homogeneous coordinates as
$$ T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}. $$
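To make the notation concrete, the homogeneous transform and the projection $\pi$ can be sketched in a few lines of numpy (a minimal illustration; `make_T` and `project` are our own names, not from the paper):

```python
import numpy as np

def make_T(R, t):
    """Assemble the 4x4 homogeneous transform T from R in SO(3), t in R^3."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def project(X):
    """Ideal perspective projection pi: [X, Y, Z] -> [X/Z, Y/Z]."""
    return X[:2] / X[2]

# Example: rotate 90 degrees about the z-axis, then translate along z.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T = make_T(Rz, np.array([0.0, 0.0, 2.0]))

X0 = np.array([1.0, 0.0, 2.0, 1.0])  # point on S0, homogeneous coordinates
X = T @ X0                            # X = R X0 + t
x = project(X[:3])                    # image coordinates [x, y]
```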
Lie generators for 3D rigid motion span the tangent space to the
Lie group manifold. Each generator corresponds to one geometric
transformation, namely translations along the x, y, and z axes and rotations
about the x, y, and z axes, respectively, given by (see [8] for the
detailed derivation):
$$
B_1 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \;
B_2 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \;
B_3 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix},
$$
$$
B_4 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \;
B_5 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \;
B_6 = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \tag{1}
$$
Several papers on distance metrics on SE(3) can be found in the
literature [6, 7, 9]. For example, one way of evaluating the
mean of special orthogonal matrices in SO(3) is presented in [9]. The
rotation mean $\bar{R} \in SO(3)$ of $N$ rotations is explicitly evaluated as:
$$
\bar{R} = \begin{cases} V U^T & \text{for } \det(G^T) \geq 0, \\ V D U^T & \text{otherwise,} \end{cases} \tag{2}
$$
where $D = \mathrm{diag}(1, 1, -1)$, $G = \frac{1}{N} \sum_{i=1}^{N} R_i$, and the transpose $G^T$ is
decomposed via the singular value decomposition as $G^T = U \Sigma V^T$. For the translational
components, the arithmetic mean $\bar{t} \in \mathbb{R}^3$ is used. Thus, the sample mean on SE(3) is
$\bar{T} = \begin{bmatrix} \bar{R} & \bar{t} \\ 0 & 1 \end{bmatrix}$.
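As a sketch, the sample mean of equation (2) can be implemented directly with an SVD (numpy; `so3_mean` and `se3_mean` are hypothetical names, and unweighted averaging is assumed):

```python
import numpy as np

def so3_mean(Rs):
    """Rotation mean of equation (2): project the arithmetic mean G
    of the rotations back onto SO(3) via the SVD of G^T."""
    G = sum(Rs) / len(Rs)
    U, _, Vt = np.linalg.svd(G.T)        # G^T = U Sigma V^T
    V = Vt.T
    if np.linalg.det(G.T) >= 0:
        return V @ U.T
    return V @ np.diag([1.0, 1.0, -1.0]) @ U.T

def se3_mean(Ts):
    """Sample mean on SE(3): SVD-projected rotation mean plus the
    arithmetic mean of the translational components."""
    R_bar = so3_mean([T[:3, :3] for T in Ts])
    t_bar = np.mean([T[:3, 3] for T in Ts], axis=0)
    T_bar = np.eye(4)
    T_bar[:3, :3] = R_bar
    T_bar[:3, 3] = t_bar
    return T_bar
```

The SVD projection returns the orthogonal matrix closest to the arithmetic mean in the Frobenius sense, which is why a naive entrywise average of rotation matrices is avoided.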
3. PROPOSED METHOD
3.1. Motion Model
In general, the information we have about the dynamics of a system is incomplete. One way to overcome the difficulties associated with this
lack of information is to use a stochastic model for the system dynamics. One can choose a random walk model for its simplicity.
However, since the random walk model requires a sufficiently large
number of samples to be practical, a Newtonian motion model [6]
or an infinitesimal constant velocity model [7] can be chosen as a
more efficient way to reach the optimal solution. In this work,
extending the work in [7], we construct a first-order autoregressive
(AR) model on SE(3) based on motion variation to determine how a
transformation matrix is generated, which is given by
$$ T_k = T_{k-1} \exp\!\left( a \log\!\left( T_{k-2}^{-1} T_{k-1} \right) + du_k \sqrt{\Delta k} \right), \tag{3} $$
where $\exp(\cdot)$ and $\log(\cdot)$ denote the matrix exponential and the matrix logarithm, respectively [2], $a$ in equation (3) is the AR process
parameter, and $\Delta k$ is a discrete time interval. $du_k$ is the process
noise, determined by the degree of variation of the object's motion and
the Lie generators $B_l$ $(l = 1, \ldots, 6)$ as follows:
$$ du_k = \sum_{l=1}^{6} \rho_{(l)}\, \tau_{(l)}\, B_l, \qquad \tau \sim \mathcal{N}(0, \sigma_k), \tag{4} $$
where $\mathcal{N}(\cdot)$ represents the normal distribution, and $\rho_{(l)}$ and $\tau_{(l)}$
denote the $l$th elements of $\rho \in \mathbb{R}^6$ and $\tau \in \mathbb{R}^6$, respectively. $\rho$
is a user-defined diffusion weight vector; it controls the range of
transformations over which we generate samples. In our experiments,
$\rho = [0.5, 0.5, 0.5, 0.1, 0.1, 0.1]^T$ is used; this can be relaxed to
search a larger space at additional computational cost. To compute
the covariance matrix $\sigma_k$, we represent $T$ as a six-dimensional vector
by using twist coordinates for the rotation $R$ and appending the translation $t$:
$\lambda = [t_x, t_y, t_z, \theta_x, \theta_y, \theta_z]$. Thus, the covariance matrix $\sigma_k \in
\mathbb{R}^{6 \times 6}$ is defined as:
$$ \sigma_k = E\left[ (\hat{\lambda}_{k-1}^i - \lambda_{k-1})(\hat{\lambda}_{k-2}^i - \lambda_{k-2})^T \right], \tag{5} $$
where $\hat{\lambda}^i$ $(i = 1, \ldots, N)$ are the parameters of the predicted transformation matrices in the Monte Carlo framework, which is described in
Section 4.

Fig. 1. Graph of motion variation. A solid line and a dotted line
denote the covariance variation of the sequences given in Figure 4(a)
and Figure 4(d), respectively.
The AR process in equation (3) determines the distribution of
samples in the sample space according to the motion variance of the
tracked object during tracking. For example, the elephant in the sequence shown in Figure 4(d) stays in the same position throughout
the sequence. On the other hand, the elephant in the sequence shown
in Figure 4(a) dynamically changes its pose due to a moving camera. The graph in Figure 1 shows that the degree of motion
variation in the sequence of Figure 4(d) is much smaller than in the sequence of Figure 4(a). Naturally, the computed variance
for a moving object is much higher than for a stationary object.
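Drawing candidate transforms from the AR model of equations (3)-(4) can be sketched as follows (a sketch under stated assumptions: scipy provides the matrix exponential and logarithm, and `generators` and `sample_Tk` are our own names, not from the paper):

```python
import numpy as np
from scipy.linalg import expm, logm

def generators():
    """The six se(3) generators B_l of equation (1): translations along,
    then rotations about, the x, y, and z axes."""
    Bs = []
    for i in range(3):                       # B1..B3: translations
        B = np.zeros((4, 4)); B[i, 3] = 1.0
        Bs.append(B)
    for (i, j) in [(2, 1), (0, 2), (1, 0)]:  # B4..B6: rotations
        B = np.zeros((4, 4)); B[i, j] = 1.0; B[j, i] = -1.0
        Bs.append(B)
    return Bs

def sample_Tk(T_prev, T_prev2, rho, sigma, a=0.5, dt=1.0):
    """Draw one candidate transform from the AR model of equation (3)."""
    tau = np.random.multivariate_normal(np.zeros(6), sigma)  # tau ~ N(0, sigma_k)
    du = sum(r * t * B for r, t, B in zip(rho, tau, generators()))
    drift = a * logm(np.linalg.inv(T_prev2) @ T_prev).real
    return T_prev @ expm(drift + du * np.sqrt(dt))
```

Because the noise lives in the Lie algebra and is pushed onto the group through the exponential map, every sample is a valid rigid-body transform, with no re-orthogonalization needed.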
3.2. Energy Model
The proposed energy model to evaluate the optimality of a candidate
transformation is simply defined as:
$$ E_R = \int_{R} r_i(I(x), \hat{c})\, d\Omega + \int_{R^c} r_o(I(x), \hat{c})\, d\Omega, \tag{6} $$
where $r_i : \chi, \Omega \mapsto \mathbb{R}$ and $r_o : \chi, \Omega \mapsto \mathbb{R}$ are mapping functions
that measure the quality inside ($R$) and outside ($R^c$) the curve,
respectively. Here, $\chi$ is the space corresponding to the photometric
variable of interest. Intuitively, we want to drive the pose estimation
so that the projected curve's interior and exterior distributions are
maximally different. Any functions $r_i$ and $r_o$ that statistically describe the region properties of $R$ and $R^c$ can be used. For example,
one can choose the mean intensity model or the distinct Gaussian
model. In our work, ri and ro are defined as:
$$ r_i = \log(\Sigma_i) + \frac{(I(x) - \mu_i)^2}{\Sigma_i}, \qquad r_o = \log(\Sigma_o) + \frac{(I(x) - \mu_o)^2}{\Sigma_o}, $$
where $\Sigma_i$ and $\Sigma_o$ are covariances, and $\mu_i$ and $\mu_o$ are intensity averages for $R$ and $R^c$, respectively ($\Sigma_{i/o} \in \mathbb{R}^{3 \times 3}$ and $\mu_{i/o} \in \mathbb{R}^3$ for a
color image).
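For a grayscale image, the region energy of equation (6) with the Gaussian statistics above reduces to the following sketch (numpy; `region_energy` is a hypothetical name, a boolean mask stands in for the projected region $R$, and scalar variances replace the 3x3 color covariances):

```python
import numpy as np

def region_energy(img, mask):
    """Region-based energy E_R of equation (6) for a grayscale image:
    the r_i statistic summed over R (mask) plus r_o summed over R^c."""
    inside, outside = img[mask], img[~mask]
    mu_i, var_i = inside.mean(), inside.var() + 1e-12  # guard against log(0)
    mu_o, var_o = outside.mean(), outside.var() + 1e-12
    e_in = np.sum(np.log(var_i) + (inside - mu_i) ** 2 / var_i)
    e_out = np.sum(np.log(var_o) + (outside - mu_o) ** 2 / var_o)
    return e_in + e_out
```

A lower energy indicates a candidate pose whose projection better separates the object statistics from the background statistics.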
Fig. 2. 3D point models used in this paper. From left to right: the
letters “ABC”, a red car, and a grey elephant.

Fig. 3. Robustness to several noise levels: (a) noise level $\sigma_n^2 = 10\%$, (b) $\sigma_n^2 = 50\%$, (c) $\sigma_n^2 = 100\%$. In (c), the letters “ABC”
and the background are barely visually distinguishable (frame order:
from left to right).

Table 1. Percent(%)-absolute errors for the letters “ABC” sequence
as given in Figure 3. T.avg, T.std, and T.med (R.avg, R.std, and
R.med) denote the average, standard deviation, and median of the
translation (rotation) error, respectively.

  σn²     T.avg   T.std   T.med   R.avg   R.std   R.med
  10%     0.64    0.40    0.56    6.73    5.00    4.72
  50%     0.83    0.55    0.83    6.57    5.43    3.96
  100%    1.57    1.01    1.49    8.89    8.48    5.05
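For reference, the percent-absolute error reported in Table 1 (and defined in Section 5) can be computed with a small numpy helper (the function name is ours):

```python
import numpy as np

def percent_abs_error(v_measured, v_truth):
    """Percent-absolute error %-e = ||v_m - v_t|| / ||v_t|| * 100,
    as used for the translation and rotation errors in Table 1."""
    v_m = np.asarray(v_measured, dtype=float)
    v_t = np.asarray(v_truth, dtype=float)
    return np.linalg.norm(v_m - v_t) / np.linalg.norm(v_t) * 100.0
```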
Now, we assume that the statistical characteristics of the segmented object do not change between consecutive frames. This
assumption allows the tracker to find the curve whose statistical properties are most similar to those of the previously selected curve. Thus, we
adopt the Bhattacharyya distance to evaluate the degree of similarity
between intensity distributions of the tracked object in consecutive
frames. The Bhattacharyya distance between two probability density
functions, $p_1$ and $p_2$, is defined by [10]
$$ D_B = -\log \int_{\mathbb{R}^N} \sqrt{p_1(x)\, p_2(x)}\, dx. \tag{7} $$
Now, we represent $p_1$ and $p_2$ as normalized histograms $h_1$ and
$h_2$, respectively. Histograms are relatively invariant to object motion, i.e., translation, rotation, and scale. Thus, metrics based on
histograms can be good indicators of the similarity between two distributions, for example, the intensity distributions of an object in consecutive video frames. A histogram can be computed by concatenating or averaging the histograms of each pre-defined color channel for
color imagery. For example, one can obtain a histogram
using only pixels with large saturation and value in the HSV color
space [11]. In our work, we use the pixels inside $\hat{c}$ for histogram computation, averaging the histograms of the R, G, and B channels. Now,
given two histograms $h_1(x)$ and $h_2(x)$ with $x \in \mathbb{R}$, we have the
following similarity measure based on equation (7):
$$ E_B = 1 - \int \sqrt{h_1(x)\, h_2(x)}\, dx. \tag{8} $$
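In discrete form, equation (8) is a sum over histogram bins. A minimal numpy sketch (function names are ours; intensities are assumed normalized to [0, 1]):

```python
import numpy as np

def normalized_hist(pixels, bins=32):
    """Normalized intensity histogram over [0, 1]."""
    h, _ = np.histogram(pixels, bins=bins, range=(0.0, 1.0))
    return h / h.sum()

def bhattacharyya_energy(h1, h2):
    """E_B of equation (8): 0 for identical normalized histograms,
    1 for histograms with disjoint support."""
    return 1.0 - np.sum(np.sqrt(h1 * h2))
```

Because histograms discard spatial layout, this term is largely insensitive to translation, rotation, and scale of the tracked region, which matches the motivation in the text.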
The $E_B$ lies within $[0, 1]$; as the degree of similarity between the histograms increases, $E_B$ goes to zero.
Now, the optimal energy model is defined by adding the two energy
terms from equations (6) and (8):
$$ E_{image} = E_R + E_B. \tag{9} $$
Therefore, the optimal projection (or projected curve) minimizes $E_{image}$.

4. POSE TRACKING ALGORITHM

We summarize the overall tracking approach by embedding the previously proposed algorithms into a Monte Carlo framework.

(a) Initialize $T_{k=0,1}$ and $h_{k=0,1}$.
(b) Generate $N$ sample transformation matrices $\{T_k^i\}_{i=1}^{N}$ from
equation (3): $T_k^i = T_{k-1} \exp\!\big( a \log(T_{k-2}^{-1} T_{k-1}) + du_k \sqrt{\Delta k} \big)$.
Note that the transformation matrix at the current step $k$ depends on
the transformations up to two steps back, $k-1$ and $k-2$. Also,
$T_k^i$ refers to the estimated transformations of the previous time
steps, $T_{k-1}$ and $T_{k-2}$, not to each $i$th sample; this differs
from the particle-filtering-based work of [7].
(c) Perform the 3D transformation on $X_0$: $X^i = T_k^i(X_0)$.
(d) Project the transformed 3D model onto the 2D image plane, $R^i = \pi(X^i)$, and compute the histogram $h^i$ of $R^i$.
(e) Evaluate the energy in equation (9) for each projected model:
$E_{image}^i = E_R(R^i, I_k) + E_B(h_{k-1}, h^i)$.
(f) Estimate the optimal transformation using $T_k^* = \arg\min_{T_k^i} E_{image}^i$,
or by computing the sample mean on SE(3), $\bar{T} = \begin{bmatrix} \bar{R} & \bar{t} \\ 0 & 1 \end{bmatrix}$,
where $\bar{t} = \frac{1}{N} \sum_{i=1}^{N} w^i t^i$ and $G = \frac{1}{N} \sum_{i=1}^{N} w^i R^i$ in equation (2);
$w^i$ is the normalized value of $e^{-E_{image}^i}$.
(g) Iterate steps (b) to (f) until $E_{image}$ converges or
the energy difference between frames is small enough.
(h) Let the histogram of the selected model be $h_k$, set $k = k + 1$, and go to step (b).

5. EXPERIMENTAL RESULTS

The proposed tracking algorithm is tested on several illustrative sequences; color images are used, except for the synthetic sequence “ABC”, which is grey scale. The 3D models used in
our tests are shown in Figure 2. The number of generated transformation matrices is held fixed at $N = 100$ across all experiments.
As the number of transformation matrices increases, more accurate
results are expected, but the computational cost also grows.

First, to demonstrate the robustness of the proposed method to noise, synthetic sequences are created with different levels of Gaussian noise whose variance $\sigma_n^2$ ranges from 10% to 100%,
as shown in Figure 3. For quantitative evaluation of the tracking results, percent(%)-absolute errors are computed for
both translation and rotation at each noise level; we use
$\%\text{-}e = \frac{\|v_m - v_t\|}{\|v_t\|} \times 100$,
where $v_m$ and $v_t$ are the measured and ground-truth translation and
rotation vectors, respectively. As the noise level increases, the errors
also increase, as shown in Table 1. However, tracking is maintained throughout all sequences, and all errors stayed below 2%
for translation and 9% for rotation.

Figure 4 shows tracking results for several sequences in which
drastic changes of an object's pose, caused by a moving camera, were
captured. As one can see in Figure 4, the pose of the objects was successfully obtained despite the dynamic pose changes and the cluttered
environment. Note the severe scale changes of the red car in Figure 4(c). In Figure 4(d), the track of the elephant is
maintained under occlusion by the marker throughout the sequence.

Fig. 4. Tracking (a), (b), and (d) a grey elephant, and (c) a red car in
noisy and cluttered environments. Panels show: (a) frames 3, 59, 190;
(b) frames 3, 216, 432; (c) frames 50, 90, 135, 275, 376, 560;
(d) frames 91, 109, 130.

6. CONCLUSION

In this note, we proposed a method for 2D-3D visual pose tracking
via Monte Carlo sampling on the special Euclidean group SE(3). As
shown in the experiments, the proposed method gives promising results
in optimally tracking the object of interest in 2D and estimating its
pose in 3D in cluttered environments. Moreover, the use of
a known 3D model allows the tracker to capture the quasi-deformations of an object more accurately than 2D-based trackers.

However, the proposed method has some limitations, which we
intend to overcome in future work. Our framework suffers from
high computational complexity due to the nature of sampling-based
methods. One possible remedy is to exploit the system dynamics effectively; i.e., the number of samples can be
adapted at each time step according to the variation of the
object's motion. For example, the pose of a stationary object can be
obtained from a small sample size.

7. REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Computing Surveys (CSUR), vol. 38, no. 4, 2006.
[2] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision, Springer-Verlag, 2003.
[3] T. Drummond and R. Cipolla, “Real-time visual tracking of complex structures,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 932–946, 2002.
[4] C. Schmaltz, B. Rosenhahn, T. Brox, D. Cremers, J. Weickert, L. Wietzke, and G. Sommer, “Region-based pose tracking,” Pattern Recognition and Image Analysis, pp. 56–63, 2007.
[5] S. Dambreville, R. Sandhu, A. Yezzi, and A. Tannenbaum, “A geometric approach to joint 2D region-based segmentation and 3D pose estimation using a 3D shape prior,” SIAM Journal on Imaging Sciences, 2009.
[6] A. Srivastava, “Bayesian filtering for tracking pose and location of rigid targets,” Proceedings of SPIE Signal Processing, Sensor Fusion, and Target Recognition, vol. 4052, pp. 160–171, 2000.
[7] J. Kwon and F. C. Park, “Visual tracking via particle filtering on the affine group,” The International Journal of Robotics Research, vol. 29, no. 2-3, p. 198, 2010.
[8] E. Bayro-Corrochano and J. Ortegón-Aguilar, “Lie algebra approach for tracking and 3D motion estimation using monocular vision,” Image and Vision Computing, vol. 25, no. 6, pp. 907–921, 2007.
[9] M. Moakher, “Means and averaging in the group of rotations,” SIAM Journal on Matrix Analysis and Applications, vol. 24, no. 1, pp. 1–16, 2003.
[10] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.
[11] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in Proc. European Conference on Computer Vision, 2002, vol. I, pp. 661–675.