
Learning of Graphical Models and Efficient
Inference for Object Class Recognition
Martin Bergtholdt, Jörg Kappes, and Christoph Schnörr
Computer Vision, Graphics, and Pattern Recognition Group
Department of Mathematics and Computer Science
University of Mannheim, 68131 Mannheim, Germany
{bergtholdt,jkappes,schnoerr}@uni-mannheim.de
Abstract. We focus on learning graphical models of object classes from
arbitrary instances of objects. Large intra-class variability of object appearance is dealt with by combining statistical local part detection with
relations between object parts in a probabilistic network. Inference for
view-based object recognition is done either with A∗ -search employing
a novel and dedicated admissible heuristic, or with Belief Propagation,
depending on the network size.
Our approach is applicable to arbitrary object classes. We validate this
for “faces” and for “articulated humans”. In the former case, our approach shows performance equal to or superior to that of dedicated face recognition approaches. In the latter case, widely different poses and object
appearances in front of cluttered backgrounds can be recognized.
1 Introduction
Recent research on class-specific object recognition from arbitrary viewpoints
has focused on the high intra-class variability of object instances in connection
with the recognition of cars, airplanes, motor-bikes [1, 2], quadrupeds (cows and
horses) [3], faces [1, 2], and humans [4–11].
Approaches can be roughly classified into global/holistic and local methods.
Global methods model the distribution of objects as a whole using learned templates [4, 5] for example, while local methods use local object features and parts
in order to better cope with false detections due to occlusions, image clutter, and
noise by exploiting recent research on interest point detection and distinctive image features [12, 13]. In this context, object features or parts may be organized as
“bags of keypoints” ignoring geometric structure entirely [14], or with additional
structural constraints between parts [1, 2, 6–11], enabled by the recent progress
concerning the inference in graphical models [15]. Often, the relative geometric
locations of parts are distinctive for an object class.
In our work, we exploit both local parts and structure for object class recognition. Rather than using computationally convenient tree-models [10, 6] which
capture only a small fraction of dependencies explicitly, we employ more powerful graphical models to represent relevant relations between parts and to cope
with uncertainties due to clutter, occlusion, and noise. While the corresponding increased computational complexity of inference for object recognition was
an obstacle in previous work relying on conventional methods, up-to-date approximate inference algorithms, including Loopy Belief-Propagation or Tree-Reweighted Belief-Propagation, have proved to yield high-quality maximum a
posteriori (MAP) optima at moderate computational costs [16].
Fig. 1. Left, Middle: Recognition of humans in cluttered background. Edges indicate
relations between parts, not pose (see text). Right: Recognition of faces.
In this paper, we present a general approach to object class recognition.
Based on the probabilistic graphical model described in section 2, we explain
how part detectors are learned as well as relations between parts in terms of
geometry and appearance (section 3). The inference algorithms are described in
section 4. Besides the well-known belief propagation (BP), and related to [17], we
contribute a novel admissible heuristic for applying A∗ -search as an alternative
to BP. For sufficiently small networks, the latter always converges, thus returns the global optimum, and does so with less run-time than BP. On the other
hand, BP reliably infers highly probable configurations also for larger networks
in fixed time. The general applicability of our framework is validated in section 5
for two object classes, “faces” and “articulated humans”. Despite its generality,
our approach compares favorably with dedicated face detection algorithms.
2 Probabilistic Graphical Model
We want to locate an object with S parts in an image I, with image domain ΩI ⊂ Z × Z. The location of part s is denoted as xs ∈ ΩI. The configuration of the entire model is therefore X = (x1, …, xS) ∈ Ω = ΩI × … × ΩI = ΩI^S, and we want to find the best configuration X̂ as a MAP-estimate:

    X̂ = arg max_{X∈Ω} P(X | I, G)        (1)
G refers here to our prior model hypothesis that an object is defined by a pairwise
Markov Random Field (MRF) with associated probabilistic graphical structure
G = (V, E, λ) where object parts are nodes in V and relations between parts
are edges in E; λ denotes a parameter vector for the geometric prior, which is
learned using training data. We use dense graphs to model the complex relations
between parts.
To simplify presentation, we omit G in the following derivations. Using Bayes’
rule, we can factor the posterior probability for the configuration P (X|I) as
    P(X|I) = P(I|X)P(X) / Σ_{X∈Ω} P(I|X)P(X) ∝ P(I|X)P(X)        (2)
The first term will be denoted as the appearance or data term, the second as
geometry or shape term. Because we only use unary and binary constraints, the
posterior can also be written as a Gibbs distribution p(X|I) ∝ exp(−E(X|I)) with
corresponding energy E and potential functions ψs, ψst:

    E(X|I) = Σ_{s∈V} ψs(xs) + Σ_{st∈E} ψst(xs, xt)        (3)
Section 3 explains how ψs , ψst depend on I, G. We point out that each sample space Ωs comprises all locations in the image and that ψs , ψst are general
functions learned from data. Therefore, global optimization with polynomial
complexity, e.g. by computing graph cuts [18], cannot be applied, and we have
to resort to approximate inference (cf. section 4).
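To make (3) concrete, here is a minimal sketch (not the authors' code) of how the energy of a candidate configuration can be evaluated once the unary and pairwise potentials have been tabulated over discrete candidate locations; all variable names are illustrative.

```python
import numpy as np

def energy(config, unary, pairwise, edges):
    """E(X|I) = sum_s psi_s(x_s) + sum_st psi_st(x_s, x_t) for one configuration.

    config   : dict part -> index of the chosen candidate location
    unary    : dict part -> 1D array, unary[s][i] = psi_s of candidate i
    pairwise : dict (s, t) -> 2D array, pairwise[(s, t)][i, j] = psi_st(i, j)
    edges    : list of (s, t) pairs, the edge set E of the graph
    """
    e = sum(unary[s][config[s]] for s in unary)
    e += sum(pairwise[s, t][config[s], config[t]] for (s, t) in edges)
    return e

# toy example: two parts with three candidate locations each, one edge
unary = {0: np.array([0.2, 1.0, 0.5]), 1: np.array([0.7, 0.1, 0.9])}
pairwise = {(0, 1): np.random.rand(3, 3)}
print(energy({0: 0, 1: 1}, unary, pairwise, edges=[(0, 1)]))
```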
Geometry term The geometry for our MRF-representation of the object comprises pairwise terms on edges only
    P(X) ∝ Π_{st∈E} H_{dst}(|xs − xt|) · H_{γst}(∡(xs − xt))        (4)

where H·(·) denote independent 1D histograms for relative edge-length dst = |xs − xt| and absolute edge-direction γst = ∡(xs − xt) with respect to the x-axis, learned from the training set with 30 bins each.
Concerning global object parameters (scale, rotation, and translation), we note that, by using only pairwise relations, our representation is already invariant to global translation. The use of absolute angles makes our model rotation
variant. We assume, however, that our images are registered with respect to the
ground-plane such that the horizon is parallel to the x-axis. To account for the
scale dependency of relative lengths, we scale-normalize the training images. In a
new image we treat scale as hidden and consequently compute the MAP-estimate
over a set of discrete scales σ ∈ {σ1 , . . . , σL }.
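As an illustrative sketch (not the authors' implementation), the pairwise geometric term in (4) can be realized with two 30-bin histograms per edge, learned from the scale-normalized training configurations; their negative log-values then act as the pairwise potentials ψst. The histogram ranges used below are assumptions.

```python
import numpy as np

N_BINS = 30  # as in the paper; the value ranges below are illustrative assumptions

def learn_edge_histograms(points_s, points_t, max_len=300.0):
    """Learn 1D histograms of relative length and absolute direction for one
    edge (s, t) from training part locations (arrays of shape (N, 2))."""
    d = points_t - points_s
    lengths = np.hypot(d[:, 0], d[:, 1])
    angles = np.arctan2(d[:, 1], d[:, 0])              # direction w.r.t. the x-axis
    h_len, len_edges = np.histogram(lengths, bins=N_BINS, range=(0, max_len), density=True)
    h_ang, ang_edges = np.histogram(angles, bins=N_BINS, range=(-np.pi, np.pi), density=True)
    return (h_len, len_edges), (h_ang, ang_edges)

def geometric_potential(xs, xt, hist_len, hist_ang, eps=1e-6):
    """psi_st(xs, xt) = -log H_d(|xs - xt|) - log H_gamma(angle(xs - xt))."""
    d = np.asarray(xt, float) - np.asarray(xs, float)
    l, a = np.hypot(*d), np.arctan2(d[1], d[0])
    bl = np.clip(np.digitize(l, hist_len[1]) - 1, 0, N_BINS - 1)
    ba = np.clip(np.digitize(a, hist_ang[1]) - 1, 0, N_BINS - 1)
    return -np.log(hist_len[0][bl] + eps) - np.log(hist_ang[0][ba] + eps)

# usage with synthetic training locations for one edge
pts_s = np.random.rand(100, 2) * 200
pts_t = pts_s + np.array([40.0, 5.0]) + np.random.randn(100, 2)
hl, ha = learn_edge_histograms(pts_s, pts_t)
print(geometric_potential((10, 10), (52, 16), hl, ha))
```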
Appearance term We assume that the image likelihood factors as
    P(I|X) ∝ Π_{s∈V} p(I|xs) Π_{st∈E} p(I|xs, xt)        (5)

where the individual terms are functions learned from extracted features:

    p(I|xs) ≈ Probs(fs(I, xs)),    p(I|xs, xt) ≈ Probst(fst(I, xs, xt))        (6)
Fig. 2. Left: Human with 11 labeled parts. Right-top: Face with 5 labeled parts.
Right-bottom: Patch geometry for edge “head/left hand”.
Probs (fs (I, x)) is our approximation to the image likelihood for observing part
s at location x against background (likewise for edges). Under the assumption
that the presence or absence of a part at a certain image location only depends
on a small neighborhood of that location, we compute features fs (I, x) from
image patches in windows of fixed size, see section 3.1, and use a support vector
machine (SVM) with Gaussian kernel and probabilistic outputs [19] to compute
Probs (fs (I, xs )). We have used the implementation of [20], performing grid-search to learn optimal SVM parameters (C and γ) using cross-validation.
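The paper uses LIBSVM [20]; as a rough, hedged sketch of the same idea, scikit-learn's SVC with an RBF kernel and Platt-style probability outputs can be combined with a small grid-search over C and γ. The feature matrix, labels, and grid values below are placeholders, not the actual training data or settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (n_samples, 128) patch feature vectors, y: 1 for the part, 0 for background
# (placeholder data; in the paper these come from labeled image patches)
X = np.random.rand(200, 128)
y = np.random.randint(0, 2, size=200)

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
grid = GridSearchCV(SVC(kernel="rbf", probability=True),  # Platt-style probabilistic outputs
                    param_grid, cv=5)
grid.fit(X, y)

# Prob_s(f_s(I, x)): probability that a patch at location x shows part s
p_part = grid.predict_proba(X[:5])[:, 1]
print(grid.best_params_, p_part)
```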
Assuming independence of part appearances is certainly not true for very self-similar object parts, e.g. symmetrical body parts like eyes, hands, and feet. But the assumption keeps the model tractable, and with the additional geometric information these ambiguities can in most cases be resolved. Additionally, our SVM-detector will (and should!) give positive detections around the true location of parts due to strong local correlation. To remedy this effect, we use non-maxima suppression when sampling candidates from the image, see section 3.2, so that the assumptions hold approximately.
3 Supervised Learning and Implementation Details
3.1 Appearance features
Features suitable for our general setting have to meet certain criteria: to facilitate implementation, one type of feature extractor is to be used for all parts. We also require robustness to changes in illumination and color; small occlusions and clutter; and minor variations in spatial transformations (translation,
rotation, and scale).
A suitable feature descriptor meeting these criteria has been proposed in [12],
and variants have already proved successful for object detection compared to
other descriptors [5]. The features we use are defined as follows: for each pixel in a
sliding window, we compute its gradient orientation θ ∈ [0, π], i.e. modulo π, and
for each block of 8 × 8 pixels we accumulate the orientation into one histogram
with 8 orientation bins.
Each image patch located over an object part, see fig. 2, is resized to 32 × 32
pixels for which we compute 4 × 4 blocks with 8 orientations, yielding a feature
vector of size 4 × 4 × 8 = 128. As proposed in [12], we used trilinear interpolation
among neighboring bins in (x, y, θ) to obtain smooth histograms and normalize
the feature vectors to unit length. This whole procedure significantly reduces the
dimensionality (e.g. 128 vs. 32 × 32 × 3 = 3072), while meeting our requirements.
See fig. 2 for an example labeling of a human and a face image. The white frames
correspond to the image patches used for learning.
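A simplified sketch of this descriptor is given below (hard binning only; the trilinear interpolation of [12] is omitted, and magnitude-weighted voting is an assumption on our part).

```python
import numpy as np

def part_descriptor(patch):
    """Simplified 4x4x8 = 128-dim orientation-histogram descriptor for a
    32x32 grayscale patch (hard binning, no trilinear interpolation)."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    theta = np.arctan2(gy, gx) % np.pi                 # orientation modulo pi
    bins = np.minimum((theta / np.pi * 8).astype(int), 7)
    feat = np.zeros((4, 4, 8))
    for by in range(4):                                # 4x4 blocks of 8x8 pixels
        for bx in range(4):
            sl = (slice(by * 8, by * 8 + 8), slice(bx * 8, bx * 8 + 8))
            for o in range(8):
                feat[by, bx, o] = mag[sl][bins[sl] == o].sum()
    feat = feat.ravel()
    return feat / (np.linalg.norm(feat) + 1e-9)        # normalize to unit length

desc = part_descriptor(np.random.rand(32, 32))
print(desc.shape)   # (128,)
```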
We have not precomputed local orientation or scale information, but rely on
learning these remaining variations from the training set using the SVM. To this
end, we increased the number of training samples by a factor of 10 for faces and by a factor of 20 for humans, randomly varying the scale in the interval [0.8, 1.2] and
the orientation in ±10◦ and ±20◦ for faces and humans respectively. We then
computed probabilistic SV-classifiers for each part against background.
For the pairwise appearance information, we propose the following: For each
edge in the graph, we sample an oriented patch using the image locations of the
two incident parts xs , xt and their respective diameters, see the bottom right
image in fig. 2 for an illustration. Each patch is then resized to 32×32 pixels. The
feature vector is computed in the same way as for the single parts, and SVM-learning then yields pairwise appearance probabilities. Note that appearance is
computed on all edges of the model graph, not only for the physical links, thus
adding necessary redundant information to the model representation. Moreover,
the geometry for the pairwise sampling is defined by the incident part-candidates,
so features are invariant to rotation and foreshortening along the edge, yielding
in general stronger classifiers than the individual part-classifiers.
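One possible realization of the oriented pairwise patches is sketched below; it is an assumption-laden illustration (the paper scales the patch by the parts' diameters, while here the patch width is simply taken proportional to the edge length). Sampling along a frame spanned by the edge makes the patch invariant to in-plane rotation and to foreshortening along the edge.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def edge_patch(image, xs, xt, width_ratio=0.5, size=32):
    """Sample a size x size patch in an oriented frame spanned by the edge xs -> xt.

    image : 2D grayscale array; xs, xt : (x, y) locations of the two incident parts.
    """
    xs, xt = np.asarray(xs, float), np.asarray(xt, float)
    u = xt - xs                                  # along-edge axis (length = edge length)
    n = np.array([-u[1], u[0]]) * width_ratio    # normal axis (assumed width, see lead-in)
    a, b = np.meshgrid(np.linspace(0, 1, size), np.linspace(-0.5, 0.5, size))
    coords = xs[None, None, :] + a[..., None] * u + b[..., None] * n
    # map_coordinates expects (row, col) = (y, x) order
    return map_coordinates(image, [coords[..., 1], coords[..., 0]], order=1)

patch = edge_patch(np.random.rand(240, 320), xs=(50, 60), xt=(200, 120))
print(patch.shape)   # (32, 32)
```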
We have found that, for a multi-scale image analysis, it is necessary to speed
up the process of feature generation. We have therefore changed the order of
computation in that we first computed for the entire image at each 8 × 8 pixel
block the corresponding histogram of orientations. For a single image location
we then used linear interpolation in (x, y, θ) to obtain the 4×4×8 feature-vector.
3.2 Determining the Effective Configuration Space
Based on the probabilistic model, we compute a feasible subset of the entire
space, the effective configuration space. We sample candidate part-locations in
the image using non-maxima suppression to account for local correlation of the
image likelihood terms (6), and compress probabilistically the remaining hypotheses into a single node for each part, where the missing information is provided by prior estimates as
    Ps(·|I) = α E_{xs}{exp(−ψs(xs))},    Pst(·, ·|I) = α E_{xs,xt}{exp(−ψst(xs, xt))}        (7)
We take the expectation over our training set and set the penalty parameter α
in our experiments manually (see section 5).
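A greedy non-maxima suppression over a dense per-part probability map is one plausible way to obtain such candidates; the suppression radius, threshold, and candidate count below are assumptions for illustration.

```python
import numpy as np

def sample_candidates(prob_map, n_max=50, radius=8, threshold=0.1):
    """Greedy non-maxima suppression on a dense probability map prob_map[y, x].

    Repeatedly pick the strongest remaining location and suppress a disk of the
    given radius around it, so that nearby correlated detections do not all
    become separate candidates.
    """
    p = prob_map.copy()
    ys, xs = np.mgrid[0:p.shape[0], 0:p.shape[1]]
    candidates = []
    while len(candidates) < n_max:
        y, x = np.unravel_index(np.argmax(p), p.shape)
        if p[y, x] < threshold:
            break
        candidates.append((x, y, p[y, x]))
        p[(ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2] = 0.0
    return candidates

cands = sample_candidates(np.random.rand(120, 160))
print(len(cands))
```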
4 Inference
As the best fit X̂ of the model to a given image, we consider the MAP-configuration.
We used two approaches to the combinatorial inference problem (1): Loopy Belief Propagation [21] (BP) and A∗ -search [22, 23] (A∗ ) using a novel tree-based
admissible heuristic. Concerning BP, we refer the reader to the literature [21,
23].
A∗ -Algorithm with a Novel Admissible Tree-Heuristic. The A∗ -algorithm is an
established technique for searching the optimal solution in terms of the shortest
path within a graph, representing the whole configuration space [22].
Its performance depends on devising a heuristic for estimating the “future
costs” of unexplored paths between two nodes (configurations) for the problem
at hand. In order to find the global MAP optimum, the heuristic has to be
admissible, i.e., it always returns a lower bound for the cost of some unexplored
path of configurations. While this guarantee of global optimality holds once the search terminates, polynomial time complexity is not guaranteed.
In previous work [23] we introduced this technique for graphical models with
the admissible heuristic (8).
    min_{x_{V\B}, x_B=b} { Σ_{s∈V\B} ψs(xs) + Σ_{st∈E11} ψst(xs, xt) } + Σ_{st∈E12} min_{xst} ψst(xs, xt)        (8)
Here, B denotes the subset of already processed nodes in V . E11 is the set of
tree-edges which are not in B × B, and E12 contains all edges neither in B × B
nor in E11 .
A much tighter lower bound is achieved, however, by defining E21 as the
union of E11 and all edges in B × (V \ B), and E22 as the set of all edges neither
in B × B nor in E21 . This leads to the novel admissible heuristic:
    min_{x_{V\B}, x_B=b} { Σ_{s∈V\B} ψs(xs) + Σ_{st∈E21} ψst(xs, xt) } + Σ_{st∈E22} min_{xst} ψst(xs, xt)        (9)
Whereas for (8) it is possible to compute lookup tables in advance, (9) requires
re-computation in every exploration step of the A∗ -algorithm. In spite of this
apparent disadvantage, the gained tightness of the bound more than compensates
for the computational cost, as far fewer exploration steps are necessary to compute
the MAP. Moreover, with (8) it is very difficult to cope with hidden/missing
nodes.
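To illustrate the search mechanism, the following sketch runs A* over part-to-candidate assignments. Its heuristic is a simpler, looser admissible bound (best unary value per unassigned part plus the minimum of every not-yet-fixed pairwise table) rather than the tree-based bounds (8)/(9), but it shows how admissibility guarantees the MAP once the first complete assignment is popped. All names and the toy data are illustrative.

```python
import heapq
import numpy as np

def astar_map(unary, pairwise, order=None):
    """A*-search for the MAP assignment of parts to candidate locations.

    unary    : list of 1D arrays, unary[s][i] = psi_s for candidate i of part s
    pairwise : dict (s, t) -> 2D array of psi_st values (s < t)
    """
    S = len(unary)
    order = order or list(range(S))

    def heuristic(assigned):
        # admissible lower bound on the cost still to come
        h = sum(unary[s].min() for s in range(S) if s not in assigned)
        h += sum(tab.min() for (s, t), tab in pairwise.items()
                 if s not in assigned or t not in assigned)
        return h

    def partial_cost(assigned):
        g = sum(unary[s][i] for s, i in assigned.items())
        g += sum(tab[assigned[s], assigned[t]] for (s, t), tab in pairwise.items()
                 if s in assigned and t in assigned)
        return g

    heap = [(heuristic({}), 0, {})]
    tiebreak = 1
    while heap:
        f, _, assigned = heapq.heappop(heap)
        if len(assigned) == S:                       # goal: all parts placed
            return assigned, partial_cost(assigned)
        s = order[len(assigned)]                     # next part to expand
        for i in range(len(unary[s])):
            nxt = {**assigned, s: i}
            g = partial_cost(nxt)
            heapq.heappush(heap, (g + heuristic(nxt), tiebreak, nxt))
            tiebreak += 1

# toy example: 3 parts, 4 candidates each, fully connected graph
rng = np.random.default_rng(0)
unary = [rng.random(4) for _ in range(3)]
pairwise = {(0, 1): rng.random((4, 4)), (0, 2): rng.random((4, 4)), (1, 2): rng.random((4, 4))}
print(astar_map(unary, pairwise))
```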
5 Experiments and Discussion
Data sets. We have used three object data sets and one background set for
learning and evaluation of our model: The Caltech face dataset [2] consisting
of 450 frontal faces. We used the first 216 frames (14 subjects) as training and
the last 234 frames (14 subjects with 3 additional artificial paintings) for testing.
The BioID face dataset [24] consisting of 1521 face images of 23 subjects,
featuring more variation in pose than the Caltech dataset. This dataset was used
for testing only, using the model learned from the Caltech set. The human
dataset, consisting of 894 images from various consumer cameras and Google
image search. From the 894 frames we used 624 for training and 270 for testing.
Background was obtained from 45 images without people/faces, but featuring
scenes where people normally occur. For faces we chose α = 0.01 and for humans
α = 1.
Fig. 3. Recognition examples. Top row: BioID faces frame#(rank): B11(1465),
B416(1130), B428(568), B1464(484). Bottom row: Caltech faces C289(166),
C327(38), C402(202), C429(92). A ◦ denotes a found part, × are geometrically inferred
missed parts. None of these persons was part of the training set. Note the difficulty
of these particular images due to partial occlusion (C289, C327), illumination (C327),
and “abstractness” (C429).
Optimization. We applied A∗ -search and BP to all recognition experiments.
While A∗ always converged and detected faces quickly (mean: 0.008 seconds),
BP needs less run-time on average for the larger network (complete graph with
|V | = 11) used to recognize humans. Run-times vary between 20 seconds and
0.5 hours for A∗ , and between 3 seconds and 2 minutes for BP.
Quality measure. To measure the quality of our results, let ms = |xs − x⋆s| / |x⋆l-eye − x⋆r-eye| denote the point-to-point error for part s relative to the distance of the eyes,
where the ⋆ denotes the ground truth location. Images in fig. 4 are ranked by
the maximal error of a single part mmax = maxs∈V ms in descending order, so
ranking is from worst=1 to best=1521 (BioID), 234 (Caltech). To compare our
results in table 1 to [25] on the BioID dataset, we also included the measure me4 = (1/4) Σ_{s∈V′} ms, where V′ = {l-eye, r-eye, l-mouth, r-mouth}, i.e., our original
nodes without the nose. We assume a hit if the quality measure is below a given
tolerance. Comparable hit-rates reported by [25] (estimated electronically from
their plots) are ≈ 0.94 for me4 < 10% and ≈ 0.97 for me4 < 15%, where we
achieved hit-rates of 0.9178 and 0.9908, respectively. We also give mean error
and variances for each part over the whole image set.
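These measures reduce to a few lines of code once estimated and ground-truth locations are available; the following is a small sketch with an assumed part ordering.

```python
import numpy as np

def part_errors(est, gt, l_eye=0, r_eye=1):
    """m_s: point-to-point error per part, relative to the ground-truth eye distance.

    est, gt : arrays of shape (S, 2) with estimated / ground-truth (x, y) locations.
    """
    eye_dist = np.linalg.norm(gt[l_eye] - gt[r_eye])
    return np.linalg.norm(est - gt, axis=1) / eye_dist

def hit_rate(all_errors, tolerance):
    """Fraction of images whose measure (e.g. m_e4 or m_max) is below the tolerance."""
    return float(np.mean(np.asarray(all_errors) < tolerance))

# example: parts ordered as l-eye, r-eye, nose, l-mouth, r-mouth (assumed labeling)
est = np.array([[100, 50], [140, 52], [120, 70], [105, 95], [135, 96]], float)
gt  = np.array([[102, 51], [139, 50], [121, 72], [107, 94], [134, 97]], float)
m = part_errors(est, gt)
me4 = m[[0, 1, 3, 4]].mean()        # eyes and mouth corners, without the nose
print(me4, hit_rate([me4], tolerance=0.10))
```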
For the Caltech face dataset, we used the same training images as [17],
whereas for testing they excluded frames 328–336 (smaller scale) and 400, 402,
403 (paintings) which we kept in our test set. Note that we search faces at multiple scales and our method generalizes to the paintings, see e.g. fig. 3. [17] report
a hit-rate of 0.92, but without mentioning a corresponding tolerance level or
quality measure. In our tests we achieved a hit rate of 0.92 for mmax < 16.7.
The BioID dataset was processed with exactly the same model learned from
the 216 training images of the Caltech set. Typical examples for the two test
sets are given in fig. 3. Some images with low rank are shown in fig. 4. For the
Caltech set these are actually the ones with largest mmax .
BioID faces
Tolerance     3%    6%    9%    12%   15%   18%   21%   24%   27%   30%   Mean Error  Var Error
Left eye      0.06  0.53  0.85  0.96  0.99  0.99  0.99  0.99  0.99  0.99   5.62%      0.52%
Right eye     0.26  0.56  0.80  0.90  0.93  0.96  0.98  0.99  0.99  0.99   6.41%      0.41%
Nose          0.07  0.29  0.52  0.72  0.82  0.89  0.93  0.96  0.98  0.99  10.06%      0.65%
Left mouth    0.17  0.49  0.72  0.87  0.95  0.98  0.98  0.99  0.99  0.99   7.11%      0.49%
Right mouth   0.22  0.54  0.74  0.87  0.95  0.98  0.99  0.99  0.99  0.99   6.75%      0.48%
me4           0.06  0.53  0.85  0.96  0.99  0.99  0.99  0.99  0.99  0.99   6.47%      0.33%

Caltech faces (test set only)
Left eye      0.57  0.86  0.91  0.95  0.98  0.98  0.98  0.98  0.98  0.98   4.52%      1.93%
Right eye     0.50  0.83  0.88  0.94  0.98  0.98  0.98  0.98  0.98  0.98   4.82%      1.80%
Nose          0.29  0.63  0.78  0.85  0.93  0.97  0.98  0.98  0.98  0.98   6.85%      1.99%
Left mouth    0.14  0.48  0.65  0.85  0.94  0.97  0.98  0.98  0.98  0.98   8.31%      2.31%
Right mouth   0.20  0.51  0.68  0.85  0.95  0.98  0.98  0.98  0.99  0.99   7.73%      2.29%
me4           0.13  0.66  0.94  0.98  0.99  0.99  0.99  0.99  0.99  0.99   6.34%      1.97%
Table 1. Hit rates for the different face parts in the BioID and Caltech datasets. The
increased errors for the nose in BioID as compared to Caltech are, in our opinion, not
caused by bad localization, but by a slightly different labeling scheme of our training set
compared to the provided labels of BioID. Overall we can attest excellent performance
of our general approach on these two unknown datasets.
Fig. 4. Bad recognitions. For each image pair, the left image shows the part candidates, the right image the MAP-result. Top row: BioID frame#(rank): B486(1), B600(40), B1294(2).
Bottom row: Caltech C320(2), C325(1), C417(3). For the ranking see text.
Recognition of humans is much harder, because geometry is much less constrained, and because object parts are far less discriminative without the context.
Locating an elbow or a hand in an image (without color information) turned out
to be quite challenging. The contextual information provided by our graphical
model, however, helps a lot for resolving the ambiguities caused by false detections – compare the images with part candidates only and their corresponding
MAP-result in fig. 5. However, a similar quality as for the face data sets cannot
be expected; see table 2 for the hit-rates, where the tolerance levels for humans
are relative to the distance between chest and hip. Failures are mainly due to
the unknown scale, or due to a complete breakdown of part detectors. Fig. 5,
bottom, shows some intricate examples.
Tolerance   10%   20%   30%   40%   50%   Mean Error  Var Error
Head        0.47  0.69  0.83  0.87  0.91   20.68%      7.89%
Chest       0.63  0.79  0.88  0.91  0.91   18.07%      6.54%
Elbows      0.28  0.49  0.67  0.80  0.86   28.45%      7.72%
Hands       0.29  0.43  0.50  0.60  0.66   47.33%     24.82%
Hip         0.37  0.65  0.81  0.91  0.93   20.02%      4.05%
Knees       0.51  0.74  0.81  0.86  0.93   17.23%      3.33%
Feet        0.46  0.69  0.84  0.87  0.88   21.27%      7.11%
me11        0.15  0.49  0.76  0.80  0.89   26.20%      3.44%
Table 2. Hit rates for the human data set. We have only used the images from the test
set with a single human. Recognizing humans is much harder than faces. Especially
hands are very hard to detect without color and the geometric prior cannot always
resolve their position.
Fig. 5. Positive and negative recognition results for humans. Left images: part candidates; right images: MAP-result with BP.
6 Conclusion
We presented a general approach to object class recognition. Our work demonstrates
the feasibility of view-based object recognition even for articulated humans if a
sufficiently rich database is available for learning. The evaluation for different
object classes showed a performance competitive to approaches that only work
for a specific object class.
Our future work will focus on the real-time performance of all components,
and on an approach to enlarge the learning database with minimal supervision.
References
1. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient
learning and exhaustive recognition. In: CVPR. (2005)
2. Weber, M., Welling, M., Perona, P.: Unsupervised learning of models for recognition. In: ECCV. (2000) 18–32
3. Kumar, M.P., Torr, P.H.S., Zisserman, A.: Extending pictorial structures for object
recognition. In: BMVC. (2004) 789–798
4. Gavrila, D., Philomin, V.: Real-time object detection using distance transforms.
In: Proc. Intelligent Vehicles Conf. (1998)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR. (2005) 886–893
6. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition.
IJCV 61(1) (2005) 55–79
7. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: ECCV, Springer (2004)
8. Ren, X., Berg, A., Malik, J.: Recovering human body configurations using pairwise
constraints between parts. In: ICCV. (2005)
9. Sigal, L., Isard, M., Sigelman, B., Black, M.: Attractive people: Assembling loose-limbed models using non-parametric belief propagation. In: NIPS. (2003)
10. Ronfard, R., Schmid, C., Triggs, B.: Learning to parse pictures of people. In:
ECCV. (2002) 700–714
11. Ramanan, D., Forsyth, D.A., Zisserman, A.: Strike a pose: Tracking people by
finding stylized poses. In: CVPR. Volume 1. (2005) 271–278
12. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2)
(2004) 91–110
13. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. IJCV
60(1) (2004) 63–86
14. Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering objects
and their locations in images. In: ICCV. IEEE (2005)
15. Frey, B., Jojic, N.: A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE PAMI 27(9) (2005) 1392–1416
16. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A.,
Tappen, M., Rother, C.: A comparative study of energy minimization methods for
Markov random fields. In: ECCV. (2006)
17. Pham, T., Smeulders, A.: Object recognition with uncertain geometry and uncertain part detection. CVIU 99(2) (2005) 241–258
18. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph
cuts? IEEE PAMI 26(2) (2004) 147–159
19. Platt, J.: Probabilistic outputs for support vector machines and comparison to
regularized likelihood methods. In: Advances in Large Margin Classifiers. MIT
Press (2000) 61–74
20. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. (2001)
21. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free-energy approximations
and generalized belief propagation algorithms. IEEE Trans. Information Theory
51(7) (2005) 2282–2312
22. Hart, P., Nilsson, N., Raphael, B.: A formal basis for the heuristic determination
of minimum cost paths. IEEE Tr. Syst. Sci. Cybernetics 4 (1968) 100–107
23. Bergtholdt, M., Kappes, J., Schnörr, C.: Graphical knowledge representation for human detection. In: International Workshop on The Representation and Use of
Prior Knowledge in Vision. (2006)
24. Jesorsky, O., Kirchberg, K., Frischholz, R.: Robust face detection using the Hausdorff distance. In Bigun, J., Smeraldi, F., eds.: Audio and Video based Person
Authentication, Springer (2001) 90–95
25. Cristinacce, D., Cootes, T.F., Scott, I.: A multi-stage approach to facial feature
detection. In: BMVC. (2004)