Learning Appearance Models for Road Detection
José M. Alvarez, Mathieu Salzmann and Nick Barnes
NICTA
{jose.alvarez, mathieu.salzmann, nick.barnes}@nicta.com.au

Abstract— We introduce an approach to image-based road detection that exploits the availability of unannotated training images to learn an appearance model. Our approach allows us to remove the standard assumption that the lower part of the input image belongs to the road surface, which does not always hold and often yields strongly biased appearance models. Instead, we exploit this assumption in the training images, which yields a much more general appearance model. We then use the learned model to classify the pixels of an input image as road or background without requiring any assumptions about this image. Our experimental evaluation shows the benefits of our approach over existing methods in challenging real-world driving scenarios.

I. INTRODUCTION

Road detection is key to autonomous driving systems, as well as to many driver assistance tasks, such as lane keeping, collision avoidance and road following [1], [2]. Furthermore, it can serve as a preprocessing step for other challenging problems, such as pedestrian or vehicle detection [3]. In this paper, we tackle the problem of detecting roads in monocular color images acquired with a mobile platform in real driving situations. This scenario yields a challenging computer vision problem due to the lack of control over the environment and to the large variability of road appearance arising from the different road types, as well as from varying lighting and weather conditions. To achieve robustness to such high intra-class variability, many low-level cues, such as color [4], [5], [6], [7], [8], [9], [10], [11], texture [12], [13], [14] or a combination of them [15], [16], have been proposed.

The standard approach to road detection consists of modeling the road appearance by learning the distribution of these cues from image pixels labeled as road. Detection is then performed by computing the image cue at each pixel of the input image and, based on the corresponding distribution value, classifying the pixel as either road or background. To deal with the highly dynamic nature of road scenes, i.e., the fact that the road appearance may vary significantly between two images, state-of-the-art algorithms usually build the road model online by exploiting the assumption that the lower part of the input image belongs to the road surface. Indeed, for most common camera placements, the lower part of the image corresponds to an area about 4 meters away from the camera, which often contains road. Nevertheless, this assumption suffers from two major drawbacks: (1) in some scenarios, the lower part of the image may not belong to the road, e.g., Fig. 1(a); (2) even if it does belong to the road, the pixels in this area may not be representative of the entire road in the input image, e.g., Fig. 1(b). In both cases, the road model built on this assumption would fail to detect most of the road in the input image.

Fig. 1. Common road detection algorithms assume that the central-lower part of the image belongs to the road surface. However, this assumption has two main drawbacks: (a) in some scenarios, that area does not belong to the road surface; (b) pixels in that area are not representative of the entire road in the image.

In this paper, we propose to overcome the above-mentioned drawbacks by making use of the availability of training images to build the road appearance model.
To avoid requiring fully labeled training images, we exploit the assumption that the lower part of the training images contains road pixels. However, we do not make any such assumption for the input image, on which we seek to detect the road. As a consequence, our approach is unaffected by the presence of background objects in the lower part of the input image. Furthermore, exploiting multiple training images prevents the resulting road model from being biased towards the specific appearance of a single road region.

More specifically, we represent each pixel in the lower part of the training images as a linear combination of different color planes. The weights of this combination are learned so as to minimize the variance of the resulting representation, which corresponds to making road pixels in this new space as uniform as possible. Given a new input image, we compute the different color planes at each pixel, combine them according to the learned weights, and determine whether the pixel is road or background based on the value of the distribution learned from training data, as well as on how uniform the region around the pixel is. No assumption about the input image is required.

We evaluated our algorithm on several video sequences obtained from a vehicle driving in real-world conditions. In practice, we used early images of each sequence as training data to build a distribution and detect the road in images later in the sequence. Our experiments demonstrate the robustness of our algorithm in challenging scenarios, and show that it outperforms state-of-the-art methods.

II. RELATED WORK

Road detection is a key component of driver assistance systems, and is a challenging problem in computer vision since images are acquired in an outdoor scenario using a mobile platform. Low-level cues such as color [7], [8], [9], [10], [11], texture [13], [14] or a combination of them [14], [15] have been widely used. Among these cues, color has become the common choice for low-level road detection, since texture is scale dependent and may fail due to the strong perspective effect present in road images. Furthermore, color imposes fewer physical restrictions and provides powerful information for detecting the road even in the absence of shape information. The main challenge of color-based approaches is dealing with the high intra-class variability due to the highly dynamic nature of the scenes. Common approaches model the road appearance by exploiting variant and invariant properties of different color spaces, such as HSV in [7], normalized rg in [9], or a physics-based illuminant-invariant color space in [11], [4]. More recently, in [17], a linear combination of color planes is learned to reduce the variability of the road texture.

Learning algorithms are usually based on positive (i.e., road) or negative (i.e., background) samples. However, for our problem, learning a sufficiently general background representation is not feasible. Instead, only a road model is usually built. This model typically exploits road samples obtained from the lower part of the image. That is, most existing algorithms assume that this part of the image belongs to the road surface [11], [7], [17]. However, this assumption has two main problems: (1) the bottom part of the image may not correspond to road pixels; (2) pixels within that area may not represent the high variability of the road appearance. To increase the number of training samples, some algorithms rely on manually annotated data.
For instance, labeled data is used in [17] to build a road model. This information is then combined with information from the current image to ensure adaptation to the current image conditions. Other algorithms exploit the sequential nature of the data and use road detection results from previous images as training data [7], [9]. Unfortunately, the former approach requires human interaction to annotate the images. The latter relies on the robustness of the algorithm and thus may propagate errors. Therefore, in the next section, we propose an algorithm to learn the appearance of the road that requires minimal assumptions. The algorithm learns from positive examples obtained without any human supervision.

Fig. 2. Examples of road images used to build our training set. As expected, the lower part of the image corresponds to the road surface in most cases.

III. LEARNING A ROAD APPEARANCE MODEL

In this section, we present our approach to learning an appearance model for road pixels from training images. We first describe how we exploit the training images, and then discuss our pixel representation and the resulting road model.

A. Training Data for Road Modeling

Our goal is to build an appearance model for road pixels while avoiding the usual assumption that the lower part of the input image belongs to the road surface, which does not always hold and may strongly bias the model. To this end, we propose to exploit training images to learn the model. The standard approach to exploiting training data consists of manually labeling the pixels in the training images as positive (i.e., road) or negative (i.e., background). This, however, quickly becomes time-consuming. Furthermore, obtaining sufficiently many examples to cover the background class would require a huge number of training images. To overcome these issues, we propose to only use positive (road) examples collected automatically from the training images. These examples can be obtained by relying on the assumption that the lower part of the training images belongs to the road surface most of the time (see Fig. 2). Note that this entails no assumption on the input image.

The training pixels obtained as described above depict a large variability of road appearances and could directly be used to build the appearance model. However, some appearances will be much less frequent than others (e.g., strong shadows). Therefore, we propose a pruning algorithm to balance the appearance of road samples in the training set and thus reduce the effect of dominating samples that would bias the distribution. Our pruning process consists of two steps (see Fig. 3): first, we over-segment the lower part of each training image into superpixels [18] and compute the modes (maximum histogram value) of the RGB values of each superpixel; then we cluster these modes using mean shift with a fixed bandwidth in the RGB color space. The reason for selecting this color space is its sensitivity to lighting variations, i.e., small changes in lighting produce large changes in the average RGB color, which makes it ideal for retaining a diversified set of road samples. The final training set is obtained by taking the pixels of the superpixel closest to each cluster center.

Fig. 3. Overview of our pruning algorithm. We first segment the lower part of the training images into superpixels, and then cluster the mean RGB values of these superpixels [18]. We only retain the superpixel closest to each cluster center as a training example.
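To make the pruning step concrete, the following is a minimal Python sketch of how such a training set could be assembled. It is not the authors' implementation: it substitutes SLIC superpixels (scikit-image) for the TurboPixels segmentation [18] used in the paper, approximates each superpixel's RGB mode by per-channel histogram peaks, and the fraction of the image treated as the "lower part" is an assumption. The only value taken from the paper is the mean-shift bandwidth suggested in Algorithm 1.

```python
import numpy as np
from skimage.segmentation import slic      # stand-in for TurboPixels [18]
from sklearn.cluster import MeanShift

def build_training_set(images, n_segments=200, bandwidth=0.025):
    """Collect road training pixels from the lower part of each image, then
    prune them by clustering superpixel color modes (Sec. III-A).
    `images` are float RGB arrays in [0, 1]; parameter values are assumptions
    except the bandwidth, which follows Algorithm 1."""
    modes, pixel_sets = [], []
    for img in images:
        lower = img[int(0.7 * img.shape[0]):]          # assumed "lower part"
        labels = slic(lower, n_segments=n_segments)    # over-segmentation
        for s in np.unique(labels):
            pix = lower[labels == s].reshape(-1, 3)
            mode = []
            for c in range(3):                         # per-channel histogram peak
                hist, edges = np.histogram(pix[:, c], bins=32, range=(0, 1))
                mode.append(edges[np.argmax(hist)])
            modes.append(mode)
            pixel_sets.append(pix)
    modes = np.asarray(modes)
    ms = MeanShift(bandwidth=bandwidth).fit(modes)     # cluster modes in RGB
    training = []
    for center in ms.cluster_centers_:
        # keep only the superpixel whose mode is closest to each cluster center
        closest = np.argmin(np.linalg.norm(modes - center, axis=1))
        training.append(pixel_sets[closest])
    return np.vstack(training)                         # N x 3 road RGB samples
```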
B. Road Pixel Representation

Given the training set described above, we can now compute a low-level cue for each training pixel and build the appearance model. Here, we consider a simple road model, where the probability of a pixel belonging to the road is given by

p(y) \sim \mathcal{N}(\mu_d, \sigma_d) ,    (1)

where y is some representation of the pixel. To achieve better discriminative power, as well as robustness to lighting and weather conditions, we seek to find a pixel representation whose variance is minimal for all road pixels. In other words, our goal is to find a low-level cue such that all road pixels have similar appearance. Following [17], such a representation can be obtained by exploiting the variant and invariant properties of different color planes using a weighted linear combination, as depicted by Fig. 4. More specifically, let x_{ij}, 1 \leq i \leq N, 1 \leq j \leq P, be the value of the jth color plane for the ith training pixel. We represent each pixel by a single value y_i, such that

y_i = \sum_{j=1}^{P} w_j x_{ij} ,    (2)

where w = [w_1, \ldots, w_P] represents the contribution of each color plane to the final combination. Our goal now becomes that of finding the weights w_j that minimize the variance \sigma_d of the road pixels. To this end, we note that the mean pixel value \mu_d can be expressed as

\mu_d = \frac{1}{N} \sum_{i=1}^{N} y_i .    (3)

Note that, to minimize the effect of lighting variations, \mu_d can also be computed from the centers of the clusters found by mean shift. The variance can then be written as

\sigma_d = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu_d)^2
         = \frac{1}{N-1} \sum_{i=1}^{N} \Big( \sum_{j=1}^{P} w_j x_{ij} - \frac{1}{N} \sum_{k=1}^{N} \sum_{j=1}^{P} w_j x_{kj} \Big)^2
         = \frac{1}{N-1} \sum_{i=1}^{N} \Big( \sum_{j=1}^{P} w_j (x_{ij} - \bar{x}_j) \Big)^2
         = \sum_{j,k=1}^{P} w_j w_k \sigma_{jk} ,    (4)

where \bar{x}_j is the mean value over all training pixels for color plane j, and \sigma_{jk} is the covariance between the jth and kth color planes. In this paper, we assume that the road has a stochastic (random) texture and thus do not take the pixel location into account. The weights of the color planes can then be obtained by solving the optimization problem

\text{minimize}_{w} \quad \sum_{j,k=1}^{P} w_j w_k \sigma_{jk}
\text{subject to} \quad \sum_{j=1}^{P} w_j = 1 ,    (5)
\quad\quad\quad\quad -1 \leq w_j \leq 1 , \; \forall j ,

where we further enforce the weights to sum up to 1, and bound them. This problem is a quadratic program, whose global minimum can be found using available software, such as Matlab's quadprog function. Our final appearance model is then defined by the mean and variance given in Eq. (3) and Eq. (4), as well as by the normalized histogram h_d of the training pixel values y_i centered at \mu_d. Our algorithm to build the road model is given in Algorithm 1.

Fig. 4. Building a road model. Each training pixel is encoded as a linear combination of multiple color planes. The weights of the combination are computed so as to minimize the variance of the resulting values across the training samples.

Algorithm 1 Learning a Road Model
1) Compute an over-segmentation of the training images using an edge-preserving superpixel algorithm (e.g., [18]).
2) Extract the superpixels corresponding to the central-lower part of each image.
3) Compute the modes of the RGB values of each superpixel.
4) Cluster these modes using mean shift with a fixed bandwidth (e.g., 0.025).
5) Select the superpixel closest to each cluster centroid.
6) Build the training set consisting of the pixels in the selected superpixels.
7) Compute w = [w_1, \ldots, w_P]^T from Eq. (5).
8) Compute \mu_d from Eq. (3).
9) Compute \sigma_d from Eq. (4).
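As an illustration of how the weights can be obtained in practice, the sketch below solves the quadratic program of Eq. (5) with scipy's SLSQP routine rather than Matlab's quadprog; since the objective is a convex quadratic and the constraints are linear, any generic solver reaches the same global minimum. The shape of the color-plane matrix X and the histogram bin count are assumptions made for the example, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def learn_weights(X):
    """Learn color-plane weights minimizing the variance of the combined
    road representation (Eqs. 4-5). X is an N x P matrix whose rows are the
    P color-plane values of the N pruned training pixels (assumed layout)."""
    P = X.shape[1]
    S = np.cov(X, rowvar=False)                  # P x P covariance sigma_jk

    def variance(w):                             # objective: w^T S w  (Eq. 4)
        return w @ S @ w

    res = minimize(variance,
                   x0=np.full(P, 1.0 / P),                        # uniform start
                   method='SLSQP',
                   bounds=[(-1.0, 1.0)] * P,                      # -1 <= w_j <= 1
                   constraints=[{'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0}])  # sum to 1
    return res.x

def road_model(X, w, bins=64):
    """Mean, variance and normalized histogram of the combined values y_i
    (Eqs. 1, 3, 4). The bin count is an assumption."""
    y = X @ w
    hist, edges = np.histogram(y, bins=bins, density=True)
    return y.mean(), y.var(ddof=1), (hist, edges)
```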
IV. ROAD DETECTION WITH AN APPEARANCE MODEL

We now present our road detection algorithm, which makes use of the road model introduced in the previous section (see Fig. 5). Note that our road detection algorithm does not make any assumption about the image currently being analyzed. As a first step, the input image is converted into a number of different color planes (e.g., R, G, B, nr, ng, o1, o2, L, a, b, H, S, V, as derived in Table I). These color representations are then combined at the pixel level using the weights learned from training data. This yields a single-channel image ỹ where road areas tend to be uniform and similar to the road model. For each pixel p in this image, we can then make use of our road model to define a classifier that determines whether pixel p belongs to the road or to the background. In particular, we use the classification rule

C(p) = (\text{std}(\tilde{y}_p) < \sigma_d) \wedge (h_d(\tilde{y}_p) > \theta) ,    (6)

where ỹ_p is the image value at pixel p, h_d is the normalized histogram of training pixel values, and std(·) is the standard deviation of the image value in a small region around pixel p. In particular, we used a fixed-size square neighborhood around p. In our experiments, the parameter θ was set to 0.01, but the results are insensitive to its specific value. According to this rule, pixel p is labeled as road if C(p) > 0.

Fig. 5. Overview of our road detection algorithm. Given an input image, we convert it to multiple color planes, which are then combined using the weights learned from the training images. We then use the resulting image in conjunction with the learned appearance model to determine whether a pixel belongs to the road or background class. Based on the detection results, the current image may be added to the training set to improve the model for future images.

TABLE I
DERIVATION OF THE NORMALIZED rgb, OPPONENT, HSV AND CIE-Lab COLOR SPACES FROM RGB VALUES.

Normalized rg: r = R/(R+G+B), g = G/(R+G+B).
Opponent color space: O1 = R − G, O2 = B − (R+G)/2.
HSV: V1 = (R − G)/\sqrt{2}, V2 = (R + G − 2B)/\sqrt{6}, V3 = (R + G + B)/\sqrt{3}; H = \arctan(V1/V2), S = \sqrt{V1^2 + V2^2}, V = V3.
CIE-Lab: [X, Y, Z]^T = [0.490 0.310 0.200; 0.177 0.812 0.011; 0.000 0.010 0.990] [R, G, B]^T; L = 116 (Y/Y_n)^{1/3} − 16, a = 500 [(X/X_n)^{1/3} − (Y/Y_n)^{1/3}], b = 200 [(Y/Y_n)^{1/3} − (Z/Z_n)^{1/3}], where X_n, Y_n and Z_n correspond to the white reference point.

The first term in Eq. (6) classifies pixels according to how uniform the region around the pixel is. This term follows the analysis based on uniformity proposed in [17]. However, uniformity is limited to the resolution of the histogram (i.e., the bins may not yield enough resolution), and often leads to false positive detections (e.g., the sky is often uniform). The second term in Eq. (6) removes many of these false positives by classifying the pixels according to their transformed image value compared to the expected road value, µ_d. For non-road regions, these values are expected to be dissimilar.
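A possible implementation of the classification rule of Eq. (6) is sketched below in Python. The local standard deviation is computed with box filters, and θ = 0.01 follows the text; the window size and the histogram representation are assumptions, since they are not specified above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def classify_road(planes, w, sigma_d, hist, edges, win=5, theta=0.01):
    """Apply the classifier of Eq. (6) to an input image.
    planes : H x W x P stack of color planes, w : learned weights,
    sigma_d: road variance from Eq. (4), (hist, edges): road histogram h_d.
    The window size `win` is an assumption (not given in the text)."""
    y = planes @ w                                    # combined image  y~
    # local standard deviation via box filters: sqrt(E[y^2] - E[y]^2)
    mean = uniform_filter(y, size=win)
    mean_sq = uniform_filter(y * y, size=win)
    local_std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    # histogram lookup h_d(y~_p): map each pixel value to its bin
    bins = np.clip(np.digitize(y, edges) - 1, 0, len(hist) - 1)
    h_val = hist[bins]
    # Eq. (6): uniform neighborhood AND likely under the road histogram;
    # sigma_d is used directly as the uniformity threshold, as written in Eq. (6)
    return (local_std < sigma_d) & (h_val > theta)
```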
Although any training set could be used, in our experiments we exploited the fact that images were acquired as video sequences. Therefore, to detect the road in image t of the sequence, we used images up to t − K as training images. Note that this still entails no assumption on the current image, only on the previous images in the sequence. While some of these images may violate the assumption that their lower part contains road pixels, the combination of multiple training images proved robust enough to still learn an effective road model. Furthermore, after detecting the road in image t, we can decide whether or not to add it to the training set for image t + 1 by analyzing the central-lower part of the road mask. If this area is mainly road according to the current model, the image is included as training data; otherwise, it is discarded. This process minimizes the presence of outliers in the training set, and thus offers some control over the error propagation of our online algorithm.
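The decision of whether image t should augment the training set for image t + 1 can be read as a simple check on the detected mask. The sketch below is one hedged interpretation of this step; the extent of the central-lower window and the required road fraction are assumptions rather than values given in the paper.

```python
import numpy as np

def keep_for_training(road_mask, min_road_fraction=0.8):
    """Decide whether the current detection should augment the training set.
    road_mask: H x W boolean output of classify_road().
    The 0.8 fraction and the extent of the central-lower window are
    assumptions; the paper does not specify them."""
    h, w = road_mask.shape
    window = road_mask[int(0.8 * h):, int(0.3 * w):int(0.7 * w)]
    return window.mean() >= min_road_fraction
```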
V. EXPERIMENTS

In this section, we validate the proposed algorithm using road images acquired in real driving situations. Our dataset consists of images acquired using an onboard camera mounted on a vehicle. To obtain our results, we used different color planes: the R, G, and B channels and the color planes defined in Table I. The number of previous frames used to learn the road model was set to a fixed value K.

As a first baseline, we compared our algorithm against the texture-less descriptor proposed in [17]. Similarly to our approach, this method models the image pixels as a linear combination of multiple color planes, whose weights minimize the variance of the resulting representation across the training set. However, in that work, detection is achieved by only analyzing the uniformity (texture) of small areas. Furthermore, no pruning of the training set is performed. As a second baseline, we performed detection using a non-parametric classifier on different color planes, in particular, the hue, saturation and intensity planes from the HSV color space (Table I). Hue and saturation are well known to be relatively invariant to lighting variations and shadows [7], [19], while, on the other hand, the intensity is sensitive to lighting variations. For these color planes, the road model was taken as the normalized histogram built from the training samples. This histogram was then used as an estimate of the probability of each image pixel belonging to the road class.

For each baseline, we used two different training sets. First, as in our approach, we utilized the pixels from the lower part of the previous images in the sequence, but without any pruning, since pruning is part of our contributions. Second, we used only the lower part of the current image as training data. The baselines are referred to as Alv-prev, Alv-cur, {H, S, V}-prev and {H, S, V}-cur, respectively.

Fig. 6 shows the receiver operating characteristic (ROC) curves and the area under the curve (AUC) for our method and the different baselines, averaged over 50 input images. Note that our approach significantly outperforms the other methods. The relatively poor performance of Alv-prev and Alv-cur comes from the fact that the sky is often detected as road due to its uniformity. On average, the performance of the baselines degrades when using previous images as training data rather than just the current image. In contrast, our approach is able to leverage the information in previous images to achieve better accuracy than the baselines.

Fig. 6. Quantitative comparison of our approach against different baselines. We show the ROC curves on the left and the area under the curve (AUC) and the equal error rate (EER) on the right, averaged over 50 test images. Note that our approach yields better accuracy than the baselines. EER is defined as the intersection between the curve and the line where error rates are equal, i.e., (1 − TPR) = FPR. [ROC plot: True Positive Rate (Sensitivity) vs. False Positive Rate (1 − Specificity); legend: Ours (EER=0.154, AUC=0.907), Alv-prev (EER=0.320, AUC=0.737), H-prev (EER=0.274, AUC=0.771), S-prev (EER=0.254, AUC=0.813), V-prev (EER=0.330, AUC=0.720), Alv-cur (EER=0.314, AUC=0.746), H-cur (EER=0.267, AUC=0.788), S-cur (EER=0.264, AUC=0.798), V-cur (EER=0.290, AUC=0.771).]

For qualitative evaluation, rather than showing simple road scenes where many methods perform adequately, in Fig. 7 we show images with more complex road shapes, and with shadows or other cars on the road. Additionally, results on images where the lower part does not belong to the road surface are shown in Fig. 8. These results are provided without any post-processing steps (e.g., morphological operations). Note that our algorithm recovers road areas despite shadows and the presence of other vehicles in different areas of the image. This confirms the ability of our algorithm to learn a robust road model without relying on samples from the current image. From these results, it is also clear that our approach outperforms the baselines.

Fig. 7. Typical qualitative results (columns: Input, Ground truth, Ours, Alv-prev, H-prev, S-prev, V-prev, Alv-cur, H-cur, S-cur, V-cur). We compare our results to those obtained with different baselines. Results referred to as -prev use the same training set as our approach. Results referred to as -cur use the lower part of the image as training set, as shown in the input image. Note that our detections correspond much more closely to the ground-truth labels.

It can be observed that Alv-prev and Alv-cur tend to detect road pixels, but also other uniform areas (e.g., sky). This is mainly due to the low resolution of the histogram used to estimate the uniformity. Furthermore, Alv-cur fails to recover the road when the lower part of the image does not belong to the road surface (see Fig. 8). In most cases, {H, S, V}-prev tends to fail to detect the road due to the large variations in the appearance of the road. In contrast, {H, S, V}-cur is not able to properly characterize the road due to the lack of generality of the pixels in the lower part of the image. Furthermore, {H, S, V}-cur completely fails when the lower part does not belong to the road surface.

Fig. 8. Qualitative results on images where the lower part does not correspond to the road (columns: Input, Ground truth, Ours, Alv-prev, H-prev, S-prev, V-prev, Alv-cur, H-cur, S-cur, V-cur). Note that approaches learning a model from the lower part of the image fail to detect the road. In contrast, our method still yields good detections.

Failure analysis reveals that our algorithm fails to distinguish uniformly colored areas when they are similar to the road surface (e.g., gray vehicles). This is to be expected, since our algorithm only uses appearance cues. Results could be improved by adding a post-processing step to select areas connected to the lower part of the image.

VI. CONCLUSIONS

In this paper, we have introduced an approach to road detection based on an appearance model learned from training data. Our algorithm exploits unlabeled training images to collect positive examples, and removes the restrictive assumption that the lower part of the input image contains road pixels. Given training images, pixels are automatically selected so as to model the diversity of road appearance. These pixels are encoded using a linear combination of color planes, whose weights yield minimum variance in the road areas. Detection is then performed by comparing the input image pixels against the learned model.
Experiments conducted in real driving situations demonstrate the ability of our algorithm to detect the road despite the presence of shadows and other objects in the scene, and show the benefits of our approach over existing methods. In the future, we intend to study the use of training images coming from different sequences, thus removing the sequential aspect of our approach, with the hope of building an even more general appearance model.

ACKNOWLEDGEMENTS

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy, and the Australian Research Council (ARC) through the ICT Centre of Excellence Program.

REFERENCES

[1] C. Thorpe, M. Hebert, T. Kanade, and S. Shafer, "Vision and navigation for the Carnegie-Mellon Navlab," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 10, no. 3, pp. 362–373, May 1988.
[2] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision (IJCV), vol. 74, no. 3, pp. 287–302, 2007.
[3] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 99, 2011.
[4] B. Kim, J. Son, and K. Sohn, "Illumination invariant road detection based on learning method," in ITSC'11, Oct. 2011, pp. 1009–1014.
[5] G. K. Siogkas and E. S. Dermatas, "Random-walker monocular road detection in adverse conditions using automated spatiotemporal seed selection," IEEE Trans. on Intel. Transp. Systems (ITS), vol. PP, no. 99, pp. 1–12, 2012.
[6] C. Oh, J. Son, and K. Sohn, "Illumination robust road detection using geometric information," in ITSC'12, Sept. 2012, pp. 1566–1571.
[7] M. Sotelo, F. Rodriguez, and L. Magdalena, "Virtuous: vision-based road transportation for unmanned operation on urban-like scenarios," IEEE Trans. on Intelligent Transportation Systems (ITS), vol. 5, no. 2, pp. 69–83, June 2004.
[8] Y. He, H. Wang, and B. Zhang, "Color-based road detection in urban traffic scenes," IEEE Trans. on Intelligent Transportation Systems (ITS), vol. 5, no. 24, pp. 309–318, 2004.
[9] C. Tan, T. Hong, T. Chang, and M. Shneier, "Color model-based real-time learning for road following," in ITSC'06: Procs. IEEE Intl. Conf. on Intel. Transp. Systems, 2006, pp. 939–944.
[10] J. M. Alvarez, T. Gevers, and A. M. Lopez, "Learning photometric invariance from diversified color model ensembles," in CVPR'09, 2009, pp. 565–572.
[11] J. M. Alvarez and A. Lopez, "Road detection based on illuminant invariance," IEEE Trans. on Intelligent Transportation Systems (ITS), vol. 12, no. 1, pp. 184–193, 2011.
[12] H. Kong, J. Y. Audibert, and J. Ponce, "General road detection from a single image," IEEE Trans. on Image Processing (TIP), vol. 19, no. 8, pp. 2211–2220, 2010.
[13] C. Rasmussen, "Grouping dominant orientations for ill-structured road following," in CVPR'04: Procs. of the IEEE Conf. on Computer Vision and Pattern Recognition, Washington, DC, 2004, pp. 470–477.
[14] P. Lombardi, M. Zanin, and S. Messelodi, "Switching models for vision-based on-board road detection," in ITSC'05: Procs. IEEE Intl. Conf. on Intel. Transp. Systems, Vienna, Austria, 2005, pp. 67–72.
[15] P. Sturgess, K. Alahari, L. Ladicky, and P. H. S. Torr, "Combining appearance and structure from motion features for road scene understanding," in BMVC'09, 2009.
[16] S. Yun, Z. Guo-ying, and Y. Yong, "A road detection algorithm by boosting using feature combination," in IV'07: Procs. of the IEEE Intel. Vehicles Symposium, June 2007, pp. 364–368.
[17] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, "Road scene segmentation from a single image," in ECCV'12: Procs. of the European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 7578, 2012, pp. 376–389.
[18] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, and K. Siddiqi, "TurboPixels: Fast superpixels using geometric flows," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 31, no. 12, 2009.
[19] C. Rotaru, T. Graf, and J. Zhang, "Color image segmentation in HSI space for automotive applications," Journal of Real-Time Image Processing, pp. 1164–1173, 2008.