
Learning of Graphical Models and Efficient
Inference for Object Class Recognition
Martin Bergtholdt, Jörg Kappes, and Christoph Schnörr
Computer Vision, Graphics, and Pattern Recognition Group
Department of Mathematics and Computer Science
University of Mannheim, 68131 Mannheim, Germany
{bergtholdt,jkappes,schnoerr}@uni-mannheim.de
Abstract. We focus on learning graphical models of object classes from
arbitrary instances of objects. Large intra-class variability of object appearance is dealt with by combining statistical local part detection with
relations between object parts in a probabilistic network. Inference for
view-based object recognition is done either with A∗ -search employing
a novel and dedicated admissible heuristic, or with Belief Propagation,
depending on the network size.
Our approach is applicable to arbitrary object classes. We validate this
for “faces” and for “articulated humans”. In the former case, our approach shows performance equal to or superior to that of dedicated face recognition approaches. In the latter case, widely different poses and object
appearances in front of cluttered backgrounds can be recognized.
1 Introduction
Recent research on class-specific object recognition from arbitrary viewpoints
has focused on the high intra-class variability of object instances in connection
with the recognition of cars, airplanes, motor-bikes [1, 2], quadrupeds (cows and
horses) [3], faces [1, 2], and humans [4–11].
Approaches can be roughly classified into global/holistic and local methods.
Global methods model the distribution of objects as a whole using learned templates [4, 5] for example, while local methods use local object features and parts
in order to better cope with false detections due to occlusions, image clutter, and
noise by exploiting recent research on interest point detection and distinctive image features [12, 13]. In this context, object features or parts may be organized as
“bags of keypoints” ignoring geometric structure entirely [14], or with additional
structural constraints between parts [1, 2, 6–11], enabled by the recent progress
concerning the inference in graphical models [15]. Often, the relative geometric
locations of parts are distinctive for an object class.
In our work, we exploit both local parts and structure for object class recognition. Rather than using computationally convenient tree-models [10, 6] which
capture only a small fraction of dependencies explicitly, we employ more powerful graphical models to represent relevant relations between parts and to cope
with uncertainties due to clutter, occlusion, and noise. While the corresponding increased computational complexity of inference for object recognition was
an obstacle in previous work relying on conventional methods, up-to-date approximate inference algorithms, including Loopy Belief-Propagation or Tree-Reweighted Belief-Propagation, have proved to yield high-quality maximum a
posteriori (MAP) optima at moderate computational costs [16].
Fig. 1. Left, Middle: Recognition of humans in cluttered background. Edges indicate
relations between parts, not pose (see text). Right: Recognition of faces.
In this paper, we present a general approach to object class recognition.
Based on the probabilistic graphical model described in section 2, we explain
how part detectors are learned as well as relations between parts in terms of
geometry and appearance (section 3). The inference algorithms are described in
section 4. Besides the well-known belief propagation (BP), and related to [17], we
contribute a novel admissible heuristic for applying A∗ -search as an alternative
to BP. For sufficiently small networks, the latter always converges, thus returns the global optimum, and does so with less run-time than BP. On the other
hand, BP reliably infers highly probable configurations also for larger networks
in fixed time. The general applicability of our framework is validated in section 5
for two object classes, “faces” and “articulated humans”. Despite its generality,
our approach compares favorably with dedicated face detection algorithms.
2 Probabilistic Graphical Model
We want to locate an object with S parts in an image I, with image domain ΩI ⊂ Z × Z. The location of part s is denoted as xs ∈ ΩI. The configuration of the entire model is therefore X = (x1, …, xS) ∈ Ω = ΩI × … × ΩI = ΩI^S, and we want to find the best configuration X̂ as a MAP-estimate:

    X̂ = arg max_{X∈Ω} P(X | I, G)        (1)
G refers here to our prior model hypothesis that an object is defined by a pairwise
Markov Random Field (MRF) with associated probabilistic graphical structure
G = (V, E, λ) where object parts are nodes in V and relations between parts
are edges in E; λ denotes a parameter vector for the geometric prior, which is
learned using training data. We use dense graphs to model the complex relations
between parts.
To simplify presentation, we omit G in the following derivations. Using Bayes’
rule, we can factor the posterior probability for the configuration P (X|I) as
    P(X|I) = P(I|X)P(X) / Σ_{X∈Ω} P(I|X)P(X) ∝ P(I|X)P(X)        (2)
The first term will be denoted as the appearance or data term, the second as
geometry or shape term. Because we only use unary and binary constraints, the
posterior can also be written as a Gibbs distribution p(X|I) ∝ exp(−E(X|I)) with
corresponding energy E and potential functions ψs, ψst:

    E(X|I) = Σ_{s∈V} ψs(xs) + Σ_{st∈E} ψst(xs, xt)        (3)
Section 3 explains how ψs , ψst depend on I, G. We point out that each sample space Ωs comprises all locations in the image and that ψs , ψst are general
functions learned from data. Therefore, global optimization with polynomial
complexity, e.g. by computing graph cuts [18], cannot be applied, and we have
to resort to approximate inference (cf. section 4).
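To make (3) concrete, here is a minimal sketch (not the authors' code) of how the energy of a candidate configuration can be evaluated once the unary and pairwise potentials have been tabulated over discrete candidate locations; all variable names are illustrative.

```python
import numpy as np

def energy(config, unary, pairwise, edges):
    """E(X|I) = sum_s psi_s(x_s) + sum_st psi_st(x_s, x_t) for one configuration.

    config   : dict part -> index of the chosen candidate location
    unary    : dict part -> 1D array, unary[s][i] = psi_s of candidate i
    pairwise : dict (s, t) -> 2D array, pairwise[(s, t)][i, j] = psi_st(i, j)
    edges    : list of (s, t) pairs, the edge set E of the graph
    """
    e = sum(unary[s][config[s]] for s in unary)
    e += sum(pairwise[s, t][config[s], config[t]] for (s, t) in edges)
    return e

# toy example: two parts with three candidate locations each, one edge
unary = {0: np.array([0.2, 1.0, 0.5]), 1: np.array([0.7, 0.1, 0.9])}
pairwise = {(0, 1): np.random.rand(3, 3)}
print(energy({0: 0, 1: 1}, unary, pairwise, edges=[(0, 1)]))
```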
Geometry term The geometry for our MRF-representation of the object comprises pairwise terms on edges only
    P(X) ∝ Π_{st∈E} H_{dst}(|xs − xt|) · H_{γst}(∡(xs − xt))        (4)

where H·(·) denote independent 1D histograms for relative edge-length dst = |xs − xt| and absolute edge-direction γst = ∡(xs − xt) with respect to the x-axis, learned from the training set with 30 bins each.
Concerning global object parameters (scale, rotation, and translation), we note that, by using only pairwise relations, our representation is already invariant to global translation. The use of absolute angles makes our model rotation
variant. We assume, however, that our images are registered with respect to the
ground-plane such that the horizon is parallel to the x-axis. To account for the
scale dependency of relative lengths, we scale-normalize the training images. In a
new image we treat scale as hidden and consequently compute the MAP-estimate
over a set of discrete scales σ ∈ {σ1 , . . . , σL }.
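As an illustrative sketch (not the authors' implementation), the pairwise geometric term in (4) can be realized with two 30-bin histograms per edge, learned from the scale-normalized training configurations; their negative log-values then act as the pairwise potentials ψst. The histogram ranges used below are assumptions.

```python
import numpy as np

N_BINS = 30  # as in the paper; the value ranges below are illustrative assumptions

def learn_edge_histograms(points_s, points_t, max_len=300.0):
    """Learn 1D histograms of relative length and absolute direction for one
    edge (s, t) from training part locations (arrays of shape (N, 2))."""
    d = points_t - points_s
    lengths = np.hypot(d[:, 0], d[:, 1])
    angles = np.arctan2(d[:, 1], d[:, 0])              # direction w.r.t. the x-axis
    h_len, len_edges = np.histogram(lengths, bins=N_BINS, range=(0, max_len), density=True)
    h_ang, ang_edges = np.histogram(angles, bins=N_BINS, range=(-np.pi, np.pi), density=True)
    return (h_len, len_edges), (h_ang, ang_edges)

def geometric_potential(xs, xt, hist_len, hist_ang, eps=1e-6):
    """psi_st(xs, xt) = -log H_d(|xs - xt|) - log H_gamma(angle(xs - xt))."""
    d = np.asarray(xt, float) - np.asarray(xs, float)
    l, a = np.hypot(*d), np.arctan2(d[1], d[0])
    bl = np.clip(np.digitize(l, hist_len[1]) - 1, 0, N_BINS - 1)
    ba = np.clip(np.digitize(a, hist_ang[1]) - 1, 0, N_BINS - 1)
    return -np.log(hist_len[0][bl] + eps) - np.log(hist_ang[0][ba] + eps)

# usage with synthetic training locations for one edge
pts_s = np.random.rand(100, 2) * 200
pts_t = pts_s + np.array([40.0, 5.0]) + np.random.randn(100, 2)
hl, ha = learn_edge_histograms(pts_s, pts_t)
print(geometric_potential((10, 10), (52, 16), hl, ha))
```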
Appearance term We assume that the image likelihood factors as
    P(I|X) ∝ Π_{s∈V} p(I|xs) Π_{st∈E} p(I|xs, xt)        (5)

where the individual terms are functions learned from extracted features:

    p(I|xs) ≈ Probs(fs(I, xs)),    p(I|xs, xt) ≈ Probst(fst(I, xs, xt))        (6)
Fig. 2. Left: Human with 11 labeled parts. Right-top: Face with 5 labeled parts.
Right-bottom: Patch geometry for edge “head/left hand”.
Probs (fs (I, x)) is our approximation to the image likelihood for observing part
s at location x against background (likewise for edges). Under the assumption
that the presence or absence of a part at a certain image location only depends
on a small neighborhood of that location, we compute features fs (I, x) from
image patches in windows of fixed size, see section 3.1, and use a support vector
machine (SVM) with Gaussian kernel and probabilistic outputs [19] to compute
Probs (fs (I, xs )). We have used the implementation of [20], performing grid-search to learn optimal SVM parameters (C and γ) using cross-validation.
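The paper uses LIBSVM [20]; as a rough, hedged sketch of the same idea, scikit-learn's SVC with an RBF kernel and Platt-style probability outputs can be combined with a small grid-search over C and γ. The feature matrix, labels, and grid values below are placeholders, not the actual training data or settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (n_samples, 128) patch feature vectors, y: 1 for the part, 0 for background
# (placeholder data; in the paper these come from labeled image patches)
X = np.random.rand(200, 128)
y = np.random.randint(0, 2, size=200)

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
grid = GridSearchCV(SVC(kernel="rbf", probability=True),  # Platt-style probabilistic outputs
                    param_grid, cv=5)
grid.fit(X, y)

# Prob_s(f_s(I, x)): probability that a patch at location x shows part s
p_part = grid.predict_proba(X[:5])[:, 1]
print(grid.best_params_, p_part)
```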
Assuming independence of part appearances is certainly not true for very self-similar object parts, e.g. symmetrical body parts like eyes, hands, and feet. But the assumption keeps the model tractable, and with the additional geometric information these ambiguities can in most cases be resolved. Additionally, our SVM-detector will (and should!) give positive detections around the true location of parts due to strong local correlation. To remedy this effect, we use non-maxima suppression when sampling candidates from the image, see section 3.2, so that the assumptions hold approximately.
3 Supervised Learning and Implementation Details
3.1 Appearance features
Features suitable for our general setting have to meet certain criteria: to facilitate implementation, one type of feature extractor is to be used for all parts. We also require robustness to changes in illumination and color; small occlusions and clutter; and minor variations in spatial transformations (translation,
rotation, and scale).
A suitable feature descriptor meeting these criteria has been proposed in [12],
and variants have already proved successful for object detection compared to
other descriptors [5]. The features we use are defined as follows: for each pixel in a
sliding window, we compute its gradient orientation θ ∈ [0, π], i.e. modulo π, and
for each block of 8 × 8 pixels we accumulate the orientation into one histogram
with 8 orientation bins.
Each image patch located over an object part, see fig. 2, is resized to 32 × 32
pixels for which we compute 4 × 4 blocks with 8 orientations, yielding a feature
vector of size 4 × 4 × 8 = 128. As proposed in [12], we used trilinear interpolation
among neighboring bins in (x, y, θ) to obtain smooth histograms and normalize
the feature vectors to unit length. This whole procedure significantly reduces the
dimensionality (e.g. 128 vs. 32 × 32 × 3 = 3072), while meeting our requirements.
See fig. 2 for an example labeling of a human and a face image. The white frames
correspond to the image patches used for learning.
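A simplified sketch of this descriptor is given below (hard binning only; the trilinear interpolation of [12] is omitted, and magnitude-weighted voting is an assumption on our part).

```python
import numpy as np

def part_descriptor(patch):
    """Simplified 4x4x8 = 128-dim orientation-histogram descriptor for a
    32x32 grayscale patch (hard binning, no trilinear interpolation)."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    theta = np.arctan2(gy, gx) % np.pi                 # orientation modulo pi
    bins = np.minimum((theta / np.pi * 8).astype(int), 7)
    feat = np.zeros((4, 4, 8))
    for by in range(4):                                # 4x4 blocks of 8x8 pixels
        for bx in range(4):
            sl = (slice(by * 8, by * 8 + 8), slice(bx * 8, bx * 8 + 8))
            for o in range(8):
                feat[by, bx, o] = mag[sl][bins[sl] == o].sum()
    feat = feat.ravel()
    return feat / (np.linalg.norm(feat) + 1e-9)        # normalize to unit length

desc = part_descriptor(np.random.rand(32, 32))
print(desc.shape)   # (128,)
```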
We have not precomputed local orientation or scale information, but rely on
learning these remaining variations from the training set using the SVM. To this
end, we increased the number of training samples by a factor of 10 for faces and by a factor of 20 for humans, randomly varying the scale in the interval [0.8, 1.2] and
the orientation in ±10◦ and ±20◦ for faces and humans respectively. We then
computed probabilistic SV-classifiers for each part against background.
For the pairwise appearance information, we propose the following: For each
edge in the graph, we sample an oriented patch using the image locations of the
two incident parts xs , xt and their respective diameters, see the bottom right
image in fig. 2 for an illustration. Each patch is then resized to 32×32 pixels. The
feature vector is computed in the same way as for the single parts, and SVM-learning then yields pairwise appearance probabilities. Note that appearance is
computed on all edges of the model graph, not only for the physical links, thus
adding necessary redundant information to the model representation. Moreover,
the geometry for the pairwise sampling is defined by the incident part-candidates,
so features are invariant to rotation and foreshortening along the edge, yielding
in general stronger classifiers than the individual part-classifiers.
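One possible realization of the oriented pairwise patches is sketched below; it is an assumption-laden illustration (the paper scales the patch by the parts' diameters, while here the patch width is simply taken proportional to the edge length). Sampling along a frame spanned by the edge makes the patch invariant to in-plane rotation and to foreshortening along the edge.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def edge_patch(image, xs, xt, width_ratio=0.5, size=32):
    """Sample a size x size patch in an oriented frame spanned by the edge xs -> xt.

    image : 2D grayscale array; xs, xt : (x, y) locations of the two incident parts.
    """
    xs, xt = np.asarray(xs, float), np.asarray(xt, float)
    u = xt - xs                                  # along-edge axis (length = edge length)
    n = np.array([-u[1], u[0]]) * width_ratio    # normal axis (assumed width, see lead-in)
    a, b = np.meshgrid(np.linspace(0, 1, size), np.linspace(-0.5, 0.5, size))
    coords = xs[None, None, :] + a[..., None] * u + b[..., None] * n
    # map_coordinates expects (row, col) = (y, x) order
    return map_coordinates(image, [coords[..., 1], coords[..., 0]], order=1)

patch = edge_patch(np.random.rand(240, 320), xs=(50, 60), xt=(200, 120))
print(patch.shape)   # (32, 32)
```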
We have found that, for a multi-scale image analysis, it is necessary to speed
up the process of feature generation. We have therefore changed the order of
computation in that we first computed for the entire image at each 8 × 8 pixel
block the corresponding histogram of orientations. For a single image location
we then used linear interpolation in (x, y, θ) to obtain the 4×4×8 feature-vector.
3.2 Determining the Effective Configuration Space
Based on the probabilistic model, we compute a feasible subset of the entire
space, the effective configuration space. We sample candidate part-locations in
the image using non-maxima suppression to account for local correlation of the
image likelihood terms (6), and compress probabilistically the remaining hypotheses into a single node for each part, where the missing information is provided by prior estimates as
    Ps(·|I) = α E_{xs}{exp(−ψs(xs))},    Pst(·, ·|I) = α E_{xs,xt}{exp(−ψst(xs, xt))}        (7)
We take the expectation over our training set and set the penalty parameter α
in our experiments manually (see section 5).
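A greedy non-maxima suppression over a dense per-part probability map is one plausible way to obtain such candidates; the suppression radius, threshold, and candidate count below are assumptions for illustration.

```python
import numpy as np

def sample_candidates(prob_map, n_max=50, radius=8, threshold=0.1):
    """Greedy non-maxima suppression on a dense probability map prob_map[y, x].

    Repeatedly pick the strongest remaining location and suppress a disk of the
    given radius around it, so that nearby correlated detections do not all
    become separate candidates.
    """
    p = prob_map.copy()
    ys, xs = np.mgrid[0:p.shape[0], 0:p.shape[1]]
    candidates = []
    while len(candidates) < n_max:
        y, x = np.unravel_index(np.argmax(p), p.shape)
        if p[y, x] < threshold:
            break
        candidates.append((x, y, p[y, x]))
        p[(ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2] = 0.0
    return candidates

cands = sample_candidates(np.random.rand(120, 160))
print(len(cands))
```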
4 Inference
As the best fit X̂ of the model to a given image, we consider the MAP-configuration.
We used two approaches to the combinatorial inference problem (1): Loopy Belief Propagation [21] (BP) and A∗ -search [22, 23] (A∗ ) using a novel tree-based
admissible heuristic. Concerning BP, we refer the reader to the literature [21,
23].
A∗ -Algorithm with a Novel Admissible Tree-Heuristic. The A∗ -algorithm is an
established technique for searching the optimal solution in terms of the shortest
path within a graph, representing the whole configuration space [22].
Its performance depends on devising a heuristic for estimating the “future
costs” of unexplored paths between two nodes (configurations) for the problem
at hand. In order to find the global MAP optimum, the heuristic has to be
admissible, i.e., it always returns a lower bound for the cost of some unexplored
path of configurations. While this guarantee of global optimality holds once the search terminates, polynomial time complexity is not guaranteed.
In previous work [23] we introduced this technique for graphical models with
the admissible heuristic (8).
    min_{x_{V\B}, x_B=b} { Σ_{s∈V\B} ψs(xs) + Σ_{st∈E11} ψst(xs, xt) } + Σ_{st∈E12} min_{xst} ψst(xs, xt)        (8)
Here, B denotes the subset of already processed nodes in V . E11 is the set of
tree-edges which are not in B × B, and E12 contains all edges neither in B × B
nor in E11 .
A much tighter lower bound is achieved, however, by defining E21 as the
union of E11 and all edges in B × (V \ B), and E22 as the set of all edges neither
in B × B nor in E21 . This leads to the novel admissible heuristic:
    min_{x_{V\B}, x_B=b} { Σ_{s∈V\B} ψs(xs) + Σ_{st∈E21} ψst(xs, xt) } + Σ_{st∈E22} min_{xst} ψst(xs, xt)        (9)
Whereas for (8) it is possible to compute lookup tables in advance, (9) requires
re-computation in every exploration step of the A∗ -algorithm. In spite of this
apparent disadvantage, the gained tightness of the bound more than compensates
for the computational cost, as far fewer exploration steps are necessary to compute
the MAP. Moreover, with (8) it is very difficult to cope with hidden/missing
nodes.
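To illustrate the search mechanism, the following sketch runs A* over part-to-candidate assignments. Its heuristic is a simpler, looser admissible bound (best unary value per unassigned part plus the minimum of every not-yet-fixed pairwise table) rather than the tree-based bounds (8)/(9), but it shows how admissibility guarantees the MAP once the first complete assignment is popped. All names and the toy data are illustrative.

```python
import heapq
import numpy as np

def astar_map(unary, pairwise, order=None):
    """A*-search for the MAP assignment of parts to candidate locations.

    unary    : list of 1D arrays, unary[s][i] = psi_s for candidate i of part s
    pairwise : dict (s, t) -> 2D array of psi_st values (s < t)
    """
    S = len(unary)
    order = order or list(range(S))

    def heuristic(assigned):
        # admissible lower bound on the cost still to come
        h = sum(unary[s].min() for s in range(S) if s not in assigned)
        h += sum(tab.min() for (s, t), tab in pairwise.items()
                 if s not in assigned or t not in assigned)
        return h

    def partial_cost(assigned):
        g = sum(unary[s][i] for s, i in assigned.items())
        g += sum(tab[assigned[s], assigned[t]] for (s, t), tab in pairwise.items()
                 if s in assigned and t in assigned)
        return g

    heap = [(heuristic({}), 0, {})]
    tiebreak = 1
    while heap:
        f, _, assigned = heapq.heappop(heap)
        if len(assigned) == S:                       # goal: all parts placed
            return assigned, partial_cost(assigned)
        s = order[len(assigned)]                     # next part to expand
        for i in range(len(unary[s])):
            nxt = {**assigned, s: i}
            g = partial_cost(nxt)
            heapq.heappush(heap, (g + heuristic(nxt), tiebreak, nxt))
            tiebreak += 1

# toy example: 3 parts, 4 candidates each, fully connected graph
rng = np.random.default_rng(0)
unary = [rng.random(4) for _ in range(3)]
pairwise = {(0, 1): rng.random((4, 4)), (0, 2): rng.random((4, 4)), (1, 2): rng.random((4, 4))}
print(astar_map(unary, pairwise))
```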
5 Experiments and Discussion
Data sets. We have used three object data sets and one background set for
learning and evaluation of our model: The Caltech face dataset [2] consisting
of 450 frontal faces. We used the first 216 frames (14 subjects) as training and
the last 234 frames (14 subjects with 3 additional artificial paintings) for testing.
The BioID face dataset [24] consisting of 1521 face images of 23 subjects,
featuring more variation in pose than the Caltech dataset. This dataset was used
for testing only, using the model learned from the Caltech set. The human
dataset, consisting of 894 images from various consumer cameras and Google
image search. From the 894 frames we used 624 for training and 270 for testing.
Background was obtained from 45 images without people/faces, but featuring
scenes where people normally occur. For faces we chose α = 0.01 and for humans
α = 1.
Fig. 3. Recognition examples. Top row: BioID faces frame#(rank): B11(1465),
B416(1130), B428(568), B1464(484). Bottom row: Caltech faces C289(166),
C327(38), C402(202), C429(92). A ◦ denotes a found part, × are geometrically inferred
missed parts. None of these persons was part of the training set. Note the difficulty
of these particular images due to partial occlusion (C289, C327), illumination (C327),
and “abstractness” (C429).
Optimization. We applied A∗ -search and BP to all recognition experiments.
While A∗ always converged and detected faces quickly (mean: 0.008 seconds),
BP needs less run-time on average for the larger network (complete graph with
|V | = 11) used to recognize humans. Run-times vary between 20 seconds and
0.5 hours for A∗ , and between 3 seconds and 2 minutes for BP.
Quality measure. To measure the quality of our results, let ms = |xs − x⋆s| / |x⋆l-eye − x⋆r-eye| denote the point-to-point error for part s relative to the distance of the eyes,
where the ⋆ denotes the ground truth location. Images in fig. 4 are ranked by
the maximal error of a single part mmax = maxs∈V ms in descending order, so
ranking is from worst=1 to best=1521 (BioID), 234 (Caltech). To compare our
results in table 1 to [25] on the BioID dataset, we also included the measure me4 = (1/4) Σ_{s∈V′} ms, where V′ = {l-eye, r-eye, l-mouth, r-mouth}, i.e., our original
nodes without the nose. We assume a hit if the quality measure is below a given
tolerance. Comparable hit-rates reported by [25] (estimated electronically from
their plots) are ≈ 0.94 for me4 < 10% and ≈ 0.97 for me4 < 15%, where we
achieved hit-rates of 0.9178 and 0.9908, respectively. We also give mean error
and variances for each part over the whole image set.
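These measures reduce to a few lines of code once estimated and ground-truth locations are available; the following is a small sketch with an assumed part ordering.

```python
import numpy as np

def part_errors(est, gt, l_eye=0, r_eye=1):
    """m_s: point-to-point error per part, relative to the ground-truth eye distance.

    est, gt : arrays of shape (S, 2) with estimated / ground-truth (x, y) locations.
    """
    eye_dist = np.linalg.norm(gt[l_eye] - gt[r_eye])
    return np.linalg.norm(est - gt, axis=1) / eye_dist

def hit_rate(all_errors, tolerance):
    """Fraction of images whose measure (e.g. m_e4 or m_max) is below the tolerance."""
    return float(np.mean(np.asarray(all_errors) < tolerance))

# example: parts ordered as l-eye, r-eye, nose, l-mouth, r-mouth (assumed labeling)
est = np.array([[100, 50], [140, 52], [120, 70], [105, 95], [135, 96]], float)
gt  = np.array([[102, 51], [139, 50], [121, 72], [107, 94], [134, 97]], float)
m = part_errors(est, gt)
me4 = m[[0, 1, 3, 4]].mean()        # eyes and mouth corners, without the nose
print(me4, hit_rate([me4], tolerance=0.10))
```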
For the Caltech face dataset, we used the same training images as [17],
whereas for testing they excluded frames 328–336 (smaller scale) and 400, 402,
403 (paintings) which we kept in our test set. Note that we search faces at multiple scales and our method generalizes to the paintings, see e.g. fig. 3. [17] report
a hit-rate of 0.92, but without mentioning a corresponding tolerance level or
quality measure. In our tests we achieved a hit rate of 0.92 for mmax < 16.7.
The BioID dataset was processed with exactly the same model learned from
the 216 training images of the Caltech set. Typical examples for the two test
sets are given in fig. 3. Some images with low rank are shown in fig. 4. For the
Caltech set these are actually the ones with largest mmax .
BioID faces
Tolerance     3%    6%    9%    12%   15%   18%   21%   24%   27%   30%   Mean Error  Var Error
Left eye      0.06  0.53  0.85  0.96  0.99  0.99  0.99  0.99  0.99  0.99   5.62%      0.52%
Right eye     0.26  0.56  0.80  0.90  0.93  0.96  0.98  0.99  0.99  0.99   6.41%      0.41%
Nose          0.07  0.29  0.52  0.72  0.82  0.89  0.93  0.96  0.98  0.99  10.06%      0.65%
Left mouth    0.17  0.49  0.72  0.87  0.95  0.98  0.98  0.99  0.99  0.99   7.11%      0.49%
Right mouth   0.22  0.54  0.74  0.87  0.95  0.98  0.99  0.99  0.99  0.99   6.75%      0.48%
me4           0.06  0.53  0.85  0.96  0.99  0.99  0.99  0.99  0.99  0.99   6.47%      0.33%

Caltech faces (test set only)
Left eye      0.57  0.86  0.91  0.95  0.98  0.98  0.98  0.98  0.98  0.98   4.52%      1.93%
Right eye     0.50  0.83  0.88  0.94  0.98  0.98  0.98  0.98  0.98  0.98   4.82%      1.80%
Nose          0.29  0.63  0.78  0.85  0.93  0.97  0.98  0.98  0.98  0.98   6.85%      1.99%
Left mouth    0.14  0.48  0.65  0.85  0.94  0.97  0.98  0.98  0.98  0.98   8.31%      2.31%
Right mouth   0.20  0.51  0.68  0.85  0.95  0.98  0.98  0.98  0.99  0.99   7.73%      2.29%
me4           0.13  0.66  0.94  0.98  0.99  0.99  0.99  0.99  0.99  0.99   6.34%      1.97%
Table 1. Hit rates for the different face parts in the BioID and Caltech datasets. The
increased errors for the nose in BioID as compared to Caltech are, in our opinion, not
caused by bad localization, but by a slightly different labeling scheme of our training set
compared to the provided labels of BioID. Overall we can attest excellent performance
of our general approach on these two unknown datasets.
Fig. 4. Bad recognitions. For each image pair, the left image shows the part candidates, the right image the MAP-result. Top row: BioID frame#(rank): B486(1), B600(40), B1294(2).
Bottom row: Caltech C320(2), C325(1), C417(3). For the ranking see text.
Recognition of humans is much harder, because geometry is much less constrained, and because object parts are far less discriminative without the context.
Locating an elbow or a hand in an image (without color information) turned out
to be quite challenging. The contextual information provided by our graphical
model, however, helps a lot for resolving the ambiguities caused by false detections – compare the images with part candidates only and their corresponding
MAP-result in fig. 5. However, a similar quality as for the face data sets cannot
be expected; see table 2 for the hit-rates, where the tolerance levels for humans
are relative to the distance between chest and hip. Failures are mainly due to
the unknown scale, or due to a complete breakdown of part detectors. Fig. 5,
bottom, shows some intricate examples.
Tolerance   10%   20%   30%   40%   50%   Mean Error  Var Error
Head        0.47  0.69  0.83  0.87  0.91   20.68%      7.89%
Chest       0.63  0.79  0.88  0.91  0.91   18.07%      6.54%
Elbows      0.28  0.49  0.67  0.80  0.86   28.45%      7.72%
Hands       0.29  0.43  0.50  0.60  0.66   47.33%     24.82%
Hip         0.37  0.65  0.81  0.91  0.93   20.02%      4.05%
Knees       0.51  0.74  0.81  0.86  0.93   17.23%      3.33%
Feet        0.46  0.69  0.84  0.87  0.88   21.27%      7.11%
me11        0.15  0.49  0.76  0.80  0.89   26.20%      3.44%
Table 2. Hit rates for the human data set. We have only used the images from the test
set with a single human. Recognizing humans is much harder than faces. Especially
hands are very hard to detect without color and the geometric prior cannot always
resolve their position.
Fig. 5. Positive and negative recognition results for humans. Left images: part candidates; right images: MAP-result with BP.
6 Conclusion
We presented a general approach to object class recognition. Our work demonstrates
the feasibility of view-based object recognition even for articulated humans if a
sufficiently rich database is available for learning. The evaluation for different
object classes showed a performance competitive to approaches that only work
for a specific object class.
Our future work will focus on the real-time performance of all components,
and on an approach to enlarge the learning database with minimal supervision.
References
1. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient
learning and exhaustive recognition. In: CVPR. (2005)
2. Weber, M., Welling, M., Perona, P.: Unsupervised learning of models for recognition. In: ECCV. (2000) 18–32
3. Kumar, M.P., Torr, P.H.S., Zisserman, A.: Extending pictorial structures for object
recognition. In: BMVC. (2004) 789–798
4. Gavrila, D., Philomin, V.: Real-time object detection using distance transforms.
In: Proc. Intelligent Vehicles Conf. (1998)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR. (2005) 886–893
6. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition.
IJCV 61(1) (2005) 55–79
7. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: ECCV, Springer (2004)
8. Ren, X., Berg, A., Malik, J.: Recovering human body configurations using pairwise
constraints between parts. In: ICCV. (2005)
9. Sigal, L., Isard, M., Sigelman, B., Black, M.: Attractive people: Assembling loose-limbed models using non-parametric belief propagation. In: NIPS. (2003)
10. Ronfard, R., Schmid, C., Triggs, B.: Learning to parse pictures of people. In:
ECCV. (2002) 700–714
11. Ramanan, D., Forsyth, D.A., Zisserman, A.: Strike a pose: Tracking people by
finding stylized poses. In: CVPR. Volume 1. (2005) 271–278
12. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2)
(2004) 91–110
13. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. IJCV
60(1) (2004) 63–86
14. Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering objects
and their locations in images. In: ICCV. IEEE (2005)
15. Frey, B., Jojic, N.: A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE PAMI 27(9) (2005) 1392–1416
16. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A.,
Tappen, M., Rother, C.: A comparative study of energy minimization methods for
Markov random fields. In: ECCV. (2006)
17. Pham, T., Smeulders, A.: Object recognition with uncertain geometry and uncertain part detection. CVIU 99(2) (2005) 241–258
18. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph
cuts? IEEE PAMI 26(2) (2004) 147–159
19. Platt, J.: Probabilistic outputs for support vector machines and comparison to
regularized likelihood methods. In: Advances in Large Margin Classifiers. MIT
Press (2000) 61–74
20. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. (2001)
21. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free-energy approximations
and generalized belief propagation algorithms. IEEE Trans. Information Theory
51(7) (2005) 2282–2312
22. Hart, P., Nilsson, N., Raphael, B.: A formal basis for the heuristic determination
of minimum cost paths. IEEE Tr. Syst. Sci. Cybernetics 4 (1968) 100–107
23. Bergtholdt, M., Kappes, J., Schnörr, C.: Graphical knowledge representation for human detection. In: International Workshop on The Representation and Use of
Prior Knowledge in Vision. (2006)
24. Jesorsky, O., Kirchberg, K., Frischholz, R.: Robust face detection using the Hausdorff distance. In Bigun, J., Smeraldi, F., eds.: Audio and Video based Person
Authentication, Springer (2001) 90–95
25. Cristinacce, D., Cootes, T.F., Scott, I.: A multi-stage approach to facial feature
detection. In: BMVC. (2004)