International Journal on Applications in Science, Engineering & Technology
Volume 1, Issue 3, 2015, pp. 45-51
www.ijaset.org

Perception-Guided Models For Image Cropping And Photo Aesthetic Assessment

M. Chandliya*, C. Kannan
Department of Electrical and Electronics Engineering, Arunai Engineering College, Tamil Nadu, India
* [email protected]

Abstract- Image cropping is widely used in the printing industry, telephotography, and cinematography. Conventional approaches suffer from three challenges. First, they neglect the role of semantic content, which is often far more important than low-level visual features in photo aesthetics. Second, existing cropping models lack a sequential ordering, whereas humans look at semantically important regions sequentially when viewing a photo. Third, photo aesthetic quality evaluation remains a challenging task in multimedia and computer vision. To address these challenges, we propose semantics-aware image cropping, which crops an image by simulating the process of humans sequentially perceiving the semantically important regions of a photo. In particular, a weakly supervised learning paradigm is developed to project the local aesthetic descriptors (graphlets in this paper) into a low-dimensional semantic space. Since humans generally perceive only a few prominent regions in a photo, a sparsity-constrained graphlet ranking algorithm is proposed that seamlessly incorporates both low-level and high-level visual cues. Finally, we learn a probabilistic aesthetic measure based on such actively viewing paths (AVPs) from training photos that are judged aesthetically pleasing by multiple users. The experimental results show that: 1) the AVPs are 87.65% consistent with real human gaze shifting paths, as verified by eye-tracking data; and 2) our photo aesthetic measure outperforms many of its competitors.

Index Terms- Actively viewing, Gaze shifting, Graphlet path, Multimodal, Photo cropping.

I. INTRODUCTION

Photo aesthetic quality evaluation is a useful technique in multimedia applications. For example, a successful photo management system should rank photos based on the human perception of photo aesthetics, so that users can conveniently select their favorite pictures into albums. Furthermore, an efficient photo aesthetics prediction algorithm can help photographers crop an aesthetically pleasing sub-region from an original, poorly framed photo. However, photo aesthetics evaluation is still a challenging task for the following three reasons.

Semantics is an important cue to describe photo aesthetics, but state-of-the-art models cannot exploit semantics effectively. Typically, a photo aesthetic model employs only a few heuristically defined semantics chosen for a specific data set, determined by whether the photos in that data set are dominated by objects such as sky and water. In addition, the semantics are typically detected using a set of external object detectors, e.g., a human face detector.

Eye-tracking experiments show that humans allocate gazes to important regions in a consecutive manner. Existing cropping models fail to encode such a shifting sequence, i.e., the path linking different graphlets.

Psychological studies have shown that both bottom-up and top-down visual features draw the attention of the human eye. However, current models typically fuse multiple types of features in a linear or nonlinear way, where the cross-feature information is not well utilized. Even worse, these integration schemes cannot emphasize the visually/semantically salient regions within a photo.

To solve these problems, a sparsity-constrained ranking algorithm jointly discovers visually/semantically important graphlets along the human gaze shifting path, based on which a photo aesthetic model is learned. An overview of our proposed aesthetic model is presented in Fig. 1. By transferring the semantics of image labels into the different graphlets of a photo, we represent each graphlet by a couple of low-level and high-level visual features. Then, a sparsity-constrained framework is proposed to integrate multiple types of features for calculating the saliency of each graphlet. In particular, by constructing the matrices containing the visual/semantic features of the graphlets in a photo, the proposed framework seeks the consistently sparse elements from the joint decompositions of the multiple-feature matrices into pairs of low-rank and sparse matrices. Compared with previous methods that linearly or nonlinearly combine multiple global aesthetic features, our framework can seamlessly integrate multiple visual/semantic features for salient graphlet discovery. These discovered graphlets are linked into a path, termed an actively viewing path (AVP), to simulate a human gaze shifting path. Finally, we employ a Gaussian mixture model (GMM) to learn the distribution of AVPs from the aesthetically pleasing training photos. The learned GMM can be used as the aesthetic measure, since it quantifies the amount of AVPs that are shared between the aesthetically pleasing training photos and the test image.

Fig. 1. Diagram of the photo aesthetic model.

1.1. Objectives

The main objectives of this paper are:
1. A sparsity-constrained ranking framework that discovers the visually/semantically important graphlets that draw the attention of the human eye, by seamlessly combining a few low-level and high-level visual features.
2. The actively viewing path (AVP), a new aesthetic descriptor that mimics the way humans actively allocate gazes to visually/semantically important regions in a photo.

1.2. Our Approach

Aesthetic evaluation has been broadly applied to various problems other than conventional image cropping, such as image quality assessment, object rearrangement in images, and view finding in large scenes. While a generic aesthetics-based approach is sensible for evaluating the attractiveness of a cropped image, we argue in this paper that it is an incomplete measure for determining an ideal crop of a given input image, as it accounts only for what remains in the cropped image, and not for what is removed or changed from the original image. Aesthetics-based methods do not directly weigh the influence of the starting composition on the ending composition, or which of the original image regions are most suitable for a crop boundary to cut through. They also do not explicitly identify the distracting regions in the input image, or model the lengths to which a photographer will go to remove them at the cost of sacrificing compositional quality. Though some of these factors may be implicitly included in a perfect aesthetics metric, it remains questionable whether existing aesthetics measures can effectively capture such considerations in manual cropping.

In this work, we present a technique that directly accounts for these factors in determining the crop boundaries of an input image. We propose several features that model what is removed or changed in an image by a crop. Together with some standard aesthetic properties, the influence of these features on crop solutions is learned from training sets composed of 1000 image pairs, before and after cropping by three expert photographers. Through analysis of the manual cropping results, the image areas that were cut away, and the compositional relationships between the original and cropped images, our method is able to generate effective crops that are shown to surpass representative attention-based and aesthetics-based techniques.

1.3. Proposed Method

This paper proposes semantics-aware photo cropping, which crops a photo by simulating the process of humans sequentially perceiving the semantically important regions of a photo, as shown in the proposed block diagram in Fig. 2. We first project the local features onto the semantic space, which is constructed based on the category information of the training photos.

Fig. 2. Proposed block diagram.

Recently, several photo cropping and photo assessment approaches have been proposed, which are briefly reviewed in the rest of this section. To describe the spatial interaction of image patches, probabilistic local patch integration-based cropping models have been proposed. These approaches extract local patches within each candidate cropped photo, and then probabilistically integrate them into a quality measure to select the cropped photo.

II. RELATED WORK

2.1. Aesthetics and originality

Aesthetics means: "Concerned with beauty and art and the understanding of beautiful things". The originality score given to some photographs can also be hard to interpret, because what seems original to some viewers may not be so for others. Depending on the experiences of the viewers, the originality scores for the same photo can vary considerably. Thus the originality score is subjective to a large extent as well.

Fig. 3. Aesthetics scores can be significantly influenced by the semantics.

One of the first observations made on the gathered data was the strong correlation between the aesthetics and originality ratings for a given image. Aesthetics and originality ratings are approximately linearly correlated with each other. This can be due to a number of factors.
Many users quickly rate a batch of photos in a given day. They tend not to spend too much time trying to distinguish between these two parameters when judging a photo. More often than not, they rate photographs based on a general impression. Typically, a very original concept leads to good aesthetic value, while beauty can often be characterized by originality in view angle, color, lighting, or composition. Also, because the ratings are averages over a number of people, disparity among individuals may not be reflected strongly in the averages. Hence there is generally not much disparity in the average ratings.

2.2. Photo Aesthetics Quality Evaluation

In recent years, many photo aesthetic quality evaluation methods have been proposed. Roughly, these methods can be divided into two categories: global feature-based approaches and local patch integration-based approaches.

Global feature-based approaches design global low-level and high-level visual features that represent photo aesthetics in an implicit manner. Examples include a group of high-level visual features, such as image simplicity based on the spatial distribution of edges, designed to imitate human perception of photo aesthetics; visual features such as shape convexity, designed to capture photo aesthetics; a set of high-level attribute-based predictors to evaluate photo aesthetics; and a GMM-based hue distribution together with a prominent lines-based texture distribution to represent the global composition of a photo. To capture the local composition of a photo, regional features describing human faces, region clarity, and region complexity were developed. Experiments have also shown that two generic descriptors outperform many specifically designed photo aesthetic descriptors. It is worth noting the limitations of the above approaches: 1) the approach based on category-dependent regional feature extraction requires that photos be classified with 100% accuracy into one of seven categories, a prerequisite that is infeasible in practice; 2) the attributes are designed manually and are data set dependent, so it is difficult to generalize them to different data sets; and 3) all these global low-level and high-level visual features are designed heuristically, and there is no strong indication that they can capture the complicated spatial configurations of different photos.

Local patch integration-based approaches extract patches within a photo and then integrate them to measure photo aesthetic quality. One model uses the omni-range context, i.e., the spatial distribution of arbitrary pairwise image patches, to model photo composition. The learned omni-range context priors are combined with other cues, such as the patch number, to form a posterior probability that measures the aesthetics of a photo. One limitation of this model is that only the binary correlation between image patches is considered. To describe high-order spatial interactions of image patches, another approach first detects multiple subject regions in a photo, where each subject region is a bounding rectangle containing the salient parts of an object. Then, an SVM classifier is trained for each subject region. Finally, the aesthetics of a test photo is computed by combining the scores of the SVM classifiers corresponding to the photo's internal subject regions.

III. LOW-LEVEL AND HIGH-LEVEL LOCAL AESTHETICS DESCRIPTION

3.1. The Concept of Graphlets

There are usually a number of components (e.g., the human and the red track in Fig. 3) in a photo. Among these components, a few spatially neighboring ones and their spatial interactions capture the local aesthetics of a photo. Since a graph is a powerful tool to describe the relationships among objects, we use it to model the spatial interactions of the components in a photo. Our technique is to segment a photo into a set of atomic regions, and then construct graphlets to characterize the local aesthetics of this photo. In particular, a graphlet is a small-sized graph defined as:

$$G = (V, E)$$

where $V$ is a set of vertices representing spatially neighboring atomic regions and $E$ is a set of edges, each of which connects a pair of spatially adjacent atomic regions. We call a graphlet with $t$ vertices a $t$-sized graphlet. It is worth emphasizing that the number of graphlets within a photo increases exponentially with the graphlet size. Therefore, only small graphlets are employed.

Fig. 4. An example of differently sized graphlets extracted from a photo.

In this work, we characterize each graphlet in both color and texture channels. To describe the spatial interactions of its atomic regions, we employ a $t \times t$ adjacency matrix:

$$M_s(i, j) = \begin{cases} \theta(R_i, R_j) & \text{if } R_i \text{ and } R_j \text{ are spatially adjacent} \\ 0 & \text{otherwise} \end{cases}$$

where $\theta(R_i, R_j)$ is the horizontal angle of the vector from the center of atomic region $R_i$ to the center of atomic region $R_j$. Based on the three matrices $M_r^c$, $M_r^t$, and $M_s$, the color and texture channels of a graphlet are described by $M_c = [M_r^c, M_s]$ and $M_t = [M_r^t, M_s]$, respectively.

3.2. Extraction of graphlet and representation

An image usually contains multiple semantic components, each spanning several superpixels. Given a superpixel set, two observations can be made. First, the appearance and spatial structure of the superpixels collaboratively contribute to their homogeneity. Second, the more their appearance and spatial structure correlate with a particular semantic object, the stronger their homogeneity. Compared with the stripe-distributed yellow superpixels, the triangularly distributed yellow superpixels are unique to the Egyptian pyramid, and thus they should be assigned a stronger homogeneity. Compared with the stripe-distributed yellow superpixels, the stripe-distributed blue superpixels appear more commonly in semantic objects such as lakes and rivers, which indicates that they are weakly correlated with any particular semantic object, and thus they should be assigned a weaker homogeneity.

We propose to use graphlets to capture the appearance and spatial structure of superpixels. The graphlets are obtained by extracting connected subgraphs from a region adjacency graph (RAG). The size of a graphlet is defined as the number of its constituent superpixels. In this work, only small-sized graphlets are adopted because: 1) the number of all possible graphlets increases exponentially with the graphlet size; 2) the graphlet embedding implicitly extends the homogeneity beyond single small-sized graphlets; and 3) empirical results show that the segmentation accuracy stops increasing when the graphlet size increases from 5 to 10, so small-sized graphlets are descriptive enough. Let $T$ denote the maximum graphlet size; we extract graphlets of all sizes ranging from 2 to $T$. The graphlet extraction is based on depth-first search, which is computationally efficient.

3.3. Evaluate probabilistic aesthetic measure by perception guidance

Based on the above discussion, the top-ranked graphlets are the salient ones that can draw the attention of the human eye. That is, humans first fixate on the most salient graphlet in a photo, then shift their gaze to the second most salient one, and so on. Inspired by the scan path used in human eye-tracking experiments, we propose an actively viewing path (AVP) to mimic the sequential manner in which biological species perceive a visual scene. The procedure of generating an AVP from a photo is described in Fig. 5. Note that all the AVPs from a data set have the same number of graphlets $K$. Typically, we set $K$ to 4, and its influence on photo aesthetics prediction is evaluated in our experiments.

Fig. 5. An illustration of AVP generation based on the top-ranked graphlets.

Given a set of aesthetically pleasing training photos $\{I_1, \dots, I_H\}$ and a test image $I^*$, they are highly correlated through their respective AVPs $P$ and $P^*$. Thus, a probabilistic graphical model is utilized to describe this correlation. As shown in Fig. 6, the graphical model contains two types of nodes: observable nodes (blue color) and hidden nodes (gray color). More specifically, it can be divided into four layers. The first layer represents all the training photos, the second layer denotes the AVPs extracted from the training photos, the third layer represents the AVP of the test photo, and the last layer denotes the test photo. The correlation between the first and the second layers is $p(P \mid I_1, \dots, I_H)$, the correlation between the second and the third layers is $p(P^* \mid P)$, and the correlation between the third and the fourth layers is $p(I^* \mid P^*)$.

Fig. 6. An illustration of the probabilistic graphical model.

According to the formulation above, photo aesthetics can be quantified as the similarity between the AVPs from the test photo and those from the aesthetically pleasing training photos. The similarity is interpreted as the amount of AVPs that can be transferred from the training photos into the test image. That is, the aesthetic quality of the test photo $I^*$ can be formulated as:

$$\gamma(I^*) = p(I^* \mid I_1, \dots, I_H) = p(I^* \mid P^*) \cdot p(P^* \mid P) \cdot p(P \mid I_1, \dots, I_H)$$

The first of these probabilities, $p(I^* \mid P^*)$, is computed as

$$p(I^* \mid P^*) = \prod_{G^* \in P^*} p(I^* \mid G^*) = \prod_{G^* \in P^*} \frac{p(G_1^*, \dots, G_T^* \mid I^*)\, p(I^*)}{p(G_1^*, \dots, G_T^*)} \propto \prod_{G^* \in P^*} p(G_1^*, \dots, G_T^* \mid I^*)\, p(I^*) \propto \prod_{G^* \in P^*} \prod_{t=1}^{T} \prod_{j=1}^{Y_t} p(G_t^*(j) \mid I^*),$$

where $T$ is the maximum graphlet size, $Y_t$ is the number of $t$-sized graphlets in the test photo $I^*$, and $G_t^*(j)$ is the $j$-th $t$-sized graphlet of the AVP from the test photo. $p(G_t^*(j) \mid I^*)$ denotes the probability of extracting graphlet $G_t^*(j)$ from the test photo $I^*$.

IV. EXPERIMENTAL AND RESULT ANALYSIS

In this section, we evaluate the effectiveness of the proposed semantics-aware photo cropping through experiments. We discuss the influence of the parameters on the cropping results, and present a qualitative and quantitative comparison between the proposed active graphlet path and the human gaze shifting path. Because no standard dataset has yet been released for evaluating cropping performance, we compile our own photo cropping dataset. The total training data contain approximately 6000 highly aesthetic and 6000 low-aesthetic photos, which are crawled from two online photo sharing websites, Photosig and Flickr. In our experiment, from the whole 6000*2 images, we randomly select half of the highly aesthetic photos and half of the low-aesthetic ones as training data, and leave the rest for testing.

A Gibbs sampling-based parameter inference is adopted to output the most qualified cropped photo. To represent each subject or background region, a 512-dimensional edge histogram and a 256-dimensional color histogram are extracted. Then a region-level SVM classifier is trained on the concatenated 768-dimensional feature vector and further used to score the quality of each subject region. The scores from all subject regions are concatenated into a feature vector, which is used to train an image-level SVM classifier for scoring the quality of each candidate cropped photo.

4.1. Comparative Study

In this section, we evaluate the proposed active graphlet path-based cropping (AGPC) in comparison with several well-known cropping methods: sparse coding of saliency maps (SCSM), sensation-based photo cropping (SBPC), omni-range context-based cropping (OCBC), personalized photo ranking (PPR), describable attribute for photo cropping (DAPC), and graphlet-transferring based photo cropping (GTPC). SCSM selects the cropped region that can be decoded by the dictionary learned from training saliency maps with minimum error.
SBPC selects the cropped region with the maximum quality score, which is computed by probabilistically integrating the SVM scores corresponding to each detected subject in a photo. OCBC integrates the prior of the spatial distribution of arbitrary pairwise image patches into a probabilistic model to score each candidate cropped photo, and the one with the maximum score is selected as the cropped photo. GTPC extracts different-sized graphlets from each photo and then embeds them into equal-length feature vectors using linear discriminant analysis (LDA). Finally, the post-embedding graphlets are transferred into the cropped one.

Note that the photo quality evaluation methods, i.e., PPR and DAPC, only output a quality score for each photo. Thus, it is impossible to compare our approach with them directly, because our approach outputs the cropped region from each original photo. Fortunately, it is straightforward to convert each of those photo quality evaluation methods into a photo cropping method. In particular, we sequentially sample a number of candidate cropped photos from the original photo. Then, we use one of these photo quality evaluation methods to score each candidate cropped photo, and the best-scoring one is output as the cropped photo.

Among the compared approaches, the source codes of SCSM and PPR are available. For SCSM, all the parameter settings are the same as those in the publications. For PPR, we use the executable program and keep the parameter settings unchanged. For GTPC and AGPC, the maximum graphlet size T is 5, the dimensionality of post-embedding graphlets d is 20, and the number of actively discovered graphlets K is 4 (only for AGPC). For OCBC, we use UFC-based segmentation to decompose each training photo into a number of atomic regions. Then all training atomic regions are clustered into 1000 centers by k-means. For each pair of k-means centers, we use a five-component GMM to model its distribution. Given a candidate cropped photo, the probability of its pairwise atomic regions is computed based on the GMM and further integrated into a probabilistic quality measure.

4.2. Performance under Different Parameter Settings

This experiment studies how the free parameters affect the cropping result. There are three free parameters:
1. The maximum graphlet size T
2. The dimensionality of post-embedding graphlets d
3. The number of actively selected graphlets K

TIME CONSUMPTION OF THE COMPARED CROPPING METHODS
DAPC SCSM OCBC AGPC RESOLUTION (in sec) (in sec) (in sec) (in sec) 6.624 800*600 14.321 45.541 10.018 1024*768 30.231 1600*1200 54.678 1000*200 6.564 2000*400 25.998 14.556 93.44 69.343 197.64 4.541 19.987 4.341 3.454 16.784 46.874 7.774 12.226

V. SIMULATION RESULTS

5.1. Cropping In Order To Emphasize The Semantic Contents and Evaluating The Photo Aesthetic Quality

5.2. Sequential Order Of Cropping

5.3. Cropping In Order To Emphasize the Primary Subject

VI. CONCLUSION

A new method is proposed to crop a photo by simulating the procedure of humans sequentially perceiving the semantics of a photo. In particular, a so-called active graphlet path is constructed to mimic the process by which humans actively look at semantically important components in a photo, so that these components are retained in the cropped image. Experimental results show that the active graphlet path accurately predicts human gaze shifting, and is more indicative of photo aesthetics than conventional saliency maps. The cropped photos produced by our approach outperform those of its competitors in both qualitative and quantitative comparisons. Further, we propose, for the first time, a probabilistic model to maximally transfer the paths from a training set of aesthetically pleasing photos into the cropped one. Extensive comparative experiments demonstrate the effectiveness of our approach.
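As an illustration of the graphlet construction in Sections 3.1 and 3.2, the following Python sketch enumerates all connected sets of 2 to T atomic regions from a region adjacency graph by depth-first growth, and fills the spatial matrix M_s for one graphlet. The function names (`extract_graphlets`, `spatial_matrix`), the dictionary-based graph representation, and the use of `atan2` for the horizontal angle θ are assumptions made for this sketch; it is not the implementation used in the experiments.

```python
import math


def extract_graphlets(adjacency, max_size):
    """Enumerate connected region sets of size 2..max_size.

    adjacency: dict mapping a region id to the set of its adjacent region ids
    (an assumed, symmetric encoding of the region adjacency graph).
    Returns a set of frozensets, one per graphlet.
    """
    found = set()

    def grow(current):
        # Depth-first growth: record the current set, then extend it by one
        # neighboring region at a time so that it always stays connected.
        if len(current) >= 2:
            found.add(current)
        if len(current) == max_size:
            return
        frontier = set().union(*(adjacency[v] for v in current)) - current
        for v in frontier:
            extended = current | {v}
            if extended not in found:  # already expanded via another order
                grow(extended)

    for v in adjacency:
        grow(frozenset([v]))
    return found


def spatial_matrix(graphlet, adjacency, centers):
    """Build the t x t matrix M_s of Section 3.1 for one graphlet.

    centers: dict mapping a region id to its (x, y) center (assumed input).
    Entry (i, j) is the angle of the vector from region i's center to
    region j's center if the two regions are adjacent, and 0 otherwise.
    """
    nodes = sorted(graphlet)
    t = len(nodes)
    m = [[0.0] * t for _ in range(t)]
    for a, i in enumerate(nodes):
        for b, j in enumerate(nodes):
            if j in adjacency[i]:
                (xi, yi), (xj, yj) = centers[i], centers[j]
                m[a][b] = math.atan2(yj - yi, xj - xi)
    return m
```

For a chain of three regions 0-1-2 and T = 3, `extract_graphlets` returns the two adjacent pairs and the full triple. The number of returned sets grows quickly with T, which is consistent with the restriction to small graphlets stated in Section 3.2.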
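The crop selection protocol of Section 4.1 (sample candidate crops, score each one, keep the best) can be sketched in Python as follows. The helper names, the sliding-window sampling with a fixed crop size and stride, and the toy scoring function that counts how many graphlets of an AVP fall inside a crop are all hypothetical stand-ins. In particular, the toy score is not the GMM-based aesthetic measure of Section 3.3; any quality function that maps a crop to a number could be passed in its place.

```python
def build_avp(saliency, k=4):
    """Return the ids of the k most salient graphlets, most salient first.

    saliency: dict mapping a graphlet id to its saliency score (assumed to
    come from some ranking step; ties are broken by id for determinism).
    """
    ranked = sorted(saliency, key=lambda g: (-saliency[g], g))
    return ranked[:k]


def sample_crops(width, height, crop_w, crop_h, stride):
    """Yield candidate crops (left, top, right, bottom), scanned row by row."""
    for top in range(0, height - crop_h + 1, stride):
        for left in range(0, width - crop_w + 1, stride):
            yield (left, top, left + crop_w, top + crop_h)


def avp_coverage(crop, avp, centers):
    """Toy score: number of AVP graphlets whose center lies inside the crop."""
    left, top, right, bottom = crop
    return sum(
        1
        for g in avp
        if left <= centers[g][0] < right and top <= centers[g][1] < bottom
    )


def best_crop(crops, score):
    """Return the highest-scoring candidate (the first one on ties)."""
    return max(crops, key=score)


if __name__ == "__main__":
    # Hypothetical saliency scores and graphlet centers for a 100 x 60 photo.
    saliency = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5, "e": 0.1}
    centers = {"a": (70, 20), "b": (10, 10), "c": (80, 30),
               "d": (90, 40), "e": (5, 50)}
    avp = build_avp(saliency, k=3)
    crops = sample_crops(100, 60, crop_w=40, crop_h=40, stride=10)
    print(avp, best_crop(crops, lambda c: avp_coverage(c, avp, centers)))
```

Separating the sampler from the scorer mirrors the way the quality-only methods are turned into cropping methods in Section 4.1: the same candidate set can be ranked by any of the compared quality measures.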