
International Journal on Applications in Science, Engineering & Technology
Volume.1, Issue.3, 2015, pp.45-51
www.ijaset.org
Perception-Guided Models For Image Cropping And
Photo Aesthetic Assessment
M. Chandliya*, C. Kannan
Department of Electrical and Electronics Engineering, Arunai Engineering College, Tamil Nadu, India
* [email protected]
Abstract- Image cropping is widely used in the printing industry, telephotography, and cinematography. Conventional approaches suffer from three challenges. First, the role of semantic content is not emphasized, even though it is often far more important than low-level visual features in photo aesthetics. Second, existing cropping models lack a sequential ordering; in contrast, humans look at semantically important regions sequentially when viewing a photo. Third, photo aesthetic quality evaluation remains a challenging task in the multimedia and computer vision fields. To address these challenges, we propose semantics-aware image cropping, which crops an image by simulating the process of humans sequentially perceiving the semantically important regions of a photo. In particular, a weakly supervised learning paradigm is developed to project the local aesthetic descriptors (graphlets in this paper) into a low-dimensional semantic space. Since humans generally perceive only a few prominent regions in a photo, a sparsity-constrained graphlet ranking algorithm is proposed that seamlessly incorporates both low-level and high-level visual cues. Finally, we learn a probabilistic aesthetic measure based on such actively viewing paths (AVPs) from training photos labeled as aesthetically pleasing by multiple users. The experimental results show that: 1) the AVPs are 87.65% consistent with real human gaze shifting paths, as verified by eye-tracking data; and 2) our photo aesthetic measure outperforms many of its competitors.
Eye-tracking experiments show that humans allocate gazes to important regions in a consecutive manner. Existing cropping models fail to encode such a shifting sequence, i.e., the path linking different graphlets. Studies in psychological science have shown that both bottom-up and top-down visual features draw the attention of the human eye. However, current models typically fuse multiple types of features in a linear or nonlinear way, where cross-feature information is not well utilized.
Even worse, these integration schemes cannot emphasize the visually/semantically salient regions within a photo. To solve these problems, a sparsity-constrained ranking algorithm jointly discovers visually/semantically important graphlets along the human gaze shifting path, based on which a photo aesthetic model is learned. An overview of our proposed aesthetic model is presented in Fig. 1. By transferring the semantics of image labels into the different graphlets of a photo, we represent each graphlet by a couple of low-level and high-level visual features. Then, a sparsity-constrained framework is proposed to integrate multiple types of features for calculating the saliency of each graphlet. In particular, by constructing matrices containing the visual/semantic features of the graphlets in a photo, the proposed framework seeks the consistently sparse elements from the joint decompositions of the multiple-feature matrices into pairs of low-rank and sparse matrices. Compared with previous methods that linearly/non-linearly combine multiple global aesthetic features, our framework can seamlessly integrate multiple visual/semantic features for salient graphlet discovery. These discovered graphlets are linked into a path, termed the actively viewing path (AVP), to simulate a human gaze shifting path. Finally, we employ a Gaussian mixture model (GMM) to learn the distribution of AVPs from the aesthetically pleasing training photos. The learned GMM can be used as the aesthetic measure, since it quantifies the amount of AVPs that are shared between the aesthetically pleasing training photos and the test image.
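As a rough illustration of the low-rank-plus-sparse idea, the following numpy sketch splits a single feature matrix into a shared low-rank part and a sparse part flagging salient entries. The function name, thresholds, and synthetic data are illustrative assumptions, not the paper's algorithm (which jointly decomposes several feature matrices):

```python
import numpy as np

def lowrank_sparse_split(X, lam=0.1, tau=1.0, n_iter=50):
    """Alternately split X into a low-rank part L (shared structure)
    and a sparse part S (salient deviations), so that X ~ L + S."""
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # singular-value soft-thresholding -> low-rank update
        U, sig, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U * np.maximum(sig - tau, 0.0)) @ Vt
        # entrywise soft-thresholding -> sparse update
        R = X - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, S

rng = np.random.default_rng(0)
# synthetic feature matrix: rank-1 "background" plus a few salient entries
X = np.outer(rng.normal(size=20), rng.normal(size=8))
X[3, 2] += 5.0
X[7, 5] -= 4.0
L, S = lowrank_sparse_split(X)
```

Each step exactly minimizes its convex subproblem, so the combined objective (reconstruction error plus nuclear norm of L plus an l1 penalty on S) decreases monotonically; the sparse part then serves as a saliency indicator.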
Index Terms— Actively viewing, Gaze shifting, Graphlet path, Multimodal, Photo cropping.
I. INTRODUCTION
Photo aesthetic quality evaluation is a useful technique in
multimedia applications. For example, a successful photo
management system should rank photos based on the human
perception of photo aesthetics, so that users can conveniently
select their favorite pictures into albums. Furthermore, an
efficient photo aesthetics prediction algorithm can help
photographers to crop an aesthetically pleasing sub-region
from an original poorly framed photo. However, photo
aesthetics evaluation is still a challenging task due to the
following three reasons.
Semantics is an important cue for describing photo aesthetics, but state-of-the-art models cannot exploit semantics effectively. Typically, a photo aesthetic model employs only a few heuristically defined semantics tailored to a specific data set, determined by whether the photos in that data set are dominated by objects such as sky and water. In addition, the semantics are typically detected using a set of external object detectors, e.g., a human face detector.
1.1. Objectives
The main objectives of this paper are:
1. A sparsity-constrained ranking framework that discovers the visually/semantically important graphlets that draw the attention of the human eye, by seamlessly combining a few low-level and high-level visual features.
2. An actively viewing path (AVP), a new aesthetic descriptor that mimics the way humans actively allocate gazes to visually/semantically important regions in a photo.
This paper proposes semantics-aware photo cropping, which crops a photo by simulating the process of humans sequentially perceiving the semantically important regions of a photo, as shown in the proposed block diagram in Fig. 2. We first project the local features onto the semantic space, which is constructed based on the category information of the training photos.
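The projection onto the semantic space can be sketched as follows, assuming (as a simplification of the weakly supervised paradigm) that each photo category is summarized by the mean of its training descriptors; all names and data here are hypothetical:

```python
import numpy as np

def project_to_semantic_space(local_feats, train_feats, train_labels, n_categories):
    """Map each local descriptor to an n_categories-dim semantic vector:
    its cosine similarity to every category prototype (per-category mean)."""
    protos = np.stack([train_feats[train_labels == c].mean(axis=0)
                       for c in range(n_categories)])
    a = local_feats / np.linalg.norm(local_feats, axis=1, keepdims=True)
    b = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return a @ b.T  # (n_local, n_categories) semantic descriptors

rng = np.random.default_rng(1)
train_feats = rng.normal(size=(60, 16))       # hypothetical training descriptors
train_labels = rng.integers(0, 4, size=60)    # four photo categories
local_feats = rng.normal(size=(5, 16))        # graphlet descriptors of one photo
sem = project_to_semantic_space(local_feats, train_feats, train_labels, 4)
```

The resulting low-dimensional vectors describe each local region by how strongly it resembles each training category, which is the role the semantic space plays in the pipeline.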
Fig. 1 Diagram for photo aesthetic model
Fig.2 Proposed Block Diagram
1.3. Proposed Method
Recently, several photo cropping and photo assessment
approaches have been proposed, which are briefly reviewed in
the rest of this section. To describe the spatial interaction of image patches, probabilistic local patch integration-based cropping models have been proposed. These approaches extract local patches within each candidate cropped photo and then probabilistically integrate them into a quality measure to select the cropped photo.
1.2. Our Approach
The use of aesthetic evaluation has been broadly applied to
various problems other than conventional image cropping,
such as image quality assessment, object rearrangement in
images, and view finding in large scenes. While a generic
aesthetics-based approach is sensible for evaluating the
attractiveness of a cropped image, we argue in this paper that
it is an incomplete measure for determining an ideal crop of a
given input image, as it accounts only for what remains in the
cropped image, and not for what is removed or changed from
the original image. Aesthetics-based methods do not directly weigh the influence of the starting composition on the ending composition, or which of the original image regions are most suitable for a crop boundary to cut through. They also do not
explicitly identify the distracting regions in the input image, or
model the lengths to which a photographer will go to remove
them at the cost of sacrificing compositional quality. Though
some of these factors may be implicitly included in a perfect
aesthetics metric, it remains questionable whether existing
aesthetics measures can effectively capture such
considerations in manual cropping. In this work, we present a
technique that directly accounts for these factors in
determining the crop boundaries of an input image. Proposed
are several features that model what is removed or changed in
an image by a crop. Together with some standard aesthetic
properties, the influence of these features on crop solutions is
learned from training sets composed of 1000 image pairs,
before and after cropping by three expert photographers.
Through analysis of the manual cropping results, the image areas that were cut away, and the compositional relationships between the original and cropped images, our method is able to generate effective crops that are shown to surpass representative attention-based and aesthetics-based techniques.
II. RELATED WORK
2.1. Aesthetics and originality
Aesthetics means: “Concerned with beauty and art and the
understanding of beautiful things”. The originality score given
to some photographs can also be hard to interpret, because
what seems original to some viewers may not be so for others.
Depending on the experiences of the viewers, the originality
scores for the same photo can vary considerably. Thus the
originality score is subjective to a large extent as well.
Fig. 3 Aesthetics scores can be significantly influenced by the semantics
One of the first observations made on the gathered data was
the strong correlation between the aesthetics and originality
ratings for a given image. Aesthetics and originality ratings
have approximately linear correlation with each other. This
can be due to a number of factors. Many users quickly rate a
batch of photos in a given day. They tend not to spend too
much time trying to distinguish between these two parameters
when judging a photo. They more often than not rate
photographs based on a general impression. Typically, a very
original concept leads to good aesthetic value, while beauty
can often be characterized by originality in view angle, color,
lighting, or composition. Also, because the ratings are
averages over a number of people, disparity by individuals
may not be reflected as high in the averages. Hence there is
generally not much disparity in the average ratings.
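The claimed near-linear relation is easy to check directly on rating data; the ratings below are hypothetical per-photo mean scores, not figures from the study:

```python
import numpy as np

# hypothetical mean aesthetics/originality ratings for six photos (1-7 scale)
aesthetics  = np.array([5.1, 3.2, 6.0, 4.4, 2.8, 5.6])
originality = np.array([4.8, 3.0, 5.9, 4.1, 3.1, 5.2])

# Pearson correlation coefficient between the two rating dimensions
r = np.corrcoef(aesthetics, originality)[0, 1]
```

A coefficient close to 1 indicates the approximately linear correlation described above.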
III. LOW-LEVEL AND HIGH-LEVEL LOCAL
AESTHETICS DESCRIPTION
3.1. The Concept of Graphlets

There are usually a number of components (e.g., the human and the red track in Fig. 3) in a photo. Among these components, a few spatially neighboring ones and their spatial interactions capture the local aesthetics of a photo. Since a graph is a powerful tool for describing the relationships among objects, we use it to model the spatial interactions of the components in a photo.
2.2. Photo Aesthetics Quality Evaluation
In recent years, many photo aesthetic quality evaluation methods have been proposed. Roughly, these methods can be divided into two categories: global feature-based approaches and local patch integration-based approaches. Global feature-based approaches design global low-level and high-level visual features that represent photo aesthetics in an implicit manner. Examples include image simplicity, based on the spatial distribution of edges, to imitate the human perception of photo aesthetics; shape convexity, to capture photo aesthetics; a set of high-level attribute-based predictors to evaluate photo aesthetics; and a GMM-based hue distribution combined with a prominent-lines-based texture distribution to represent the global composition of a photo. To capture the local composition of a photo, regional features describing human faces, region clarity, and region complexity were developed. Experiments showed that the two generic descriptors outperform many specifically designed photo aesthetic descriptors. It is worth noting the limitations of the above approaches: 1) they rely on category-dependent regional feature extraction, requiring that photos can be 100% accurately classified into one of seven categories, a prerequisite that is infeasible in practice; 2) the attributes are designed manually and are data-set dependent, so it is difficult to generalize them to different data sets; and 3) all these global low-level and high-level visual features are designed heuristically, and there is no strong indication that they can capture the complicated spatial configurations of different photos. Local patch integration-based approaches extract patches within a photo and then integrate them to measure photo aesthetic quality.
Fig.4. An example of differently sized graphlets extracted from a photo.
Our technique is to segment a photo into a set of atomic regions and then construct graphlets to characterize the local aesthetics of the photo. In particular, a graphlet is a small-sized graph defined as:

G = (V, E),

where V is a set of vertices representing spatially neighboring atomic regions and E is a set of edges, each of which connects pairwise spatially adjacent atomic regions. We call a graphlet with t vertices a t-sized graphlet. It is worth emphasizing that the number of graphlets within a photo increases exponentially with the graphlet size; therefore, only small graphlets are employed. In this work, we characterize each graphlet in both the color and texture
channels. To encode the spatial structure, we employ a t × t adjacency matrix defined as:

Ms(i, j) = θ(Ri, Rj), if Ri and Rj are spatially adjacent; 0, otherwise,

where θ(Ri, Rj) is the horizontal angle of the vector from the center of atomic region Ri to the center of atomic region Rj. Based on the three matrices Mr^c, Mr^t, and Ms, the color and texture channels of a graphlet are described by Mc = [Mr^c, Ms] and Mt = [Mr^t, Ms], respectively.
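A minimal sketch of building Ms from region centers, assuming hypothetical `centers` and `adjacent` inputs and measuring the horizontal angle with atan2:

```python
import math

def horizontal_angle(ci, cj):
    """Angle (in radians) of the vector from region center ci to
    region center cj, measured against the horizontal axis."""
    return math.atan2(cj[1] - ci[1], cj[0] - ci[0])

def spatial_matrix(centers, adjacent):
    """Build the t x t matrix Ms: theta(Ri, Rj) for spatially adjacent
    atomic regions Ri and Rj, and 0 otherwise."""
    t = len(centers)
    Ms = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(t):
            if (i, j) in adjacent or (j, i) in adjacent:
                Ms[i][j] = horizontal_angle(centers[i], centers[j])
    return Ms

# three atomic regions with (x, y) centers; regions 0-1 and 1-2 touch
centers = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
Ms = spatial_matrix(centers, adjacent={(0, 1), (1, 2)})
```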
One model uses the omni-range context, i.e., the spatial distribution of arbitrary pairwise image patches, to model photo composition. The learned omni-range context priors are combined with other cues, such as the patch number, to form a posterior probability that measures the aesthetics of a photo. One limitation of this model is that only the binary correlation between image patches is considered. To describe high-order spatial interactions of image patches, another approach first detects multiple subject regions in a photo, where each subject region is a bounding rectangle containing the salient parts of an object. Then, an SVM classifier is trained for each subject region. Finally, the aesthetics of a test photo is computed by combining the scores of the SVM classifiers corresponding to the photo's internal subject regions.
3.2. Graphlet Extraction and Representation
An image usually contains multiple semantic components,
each spanning several super pixels. Given a super pixel set,
two observations can be made. First, the appearance and
spatial structure of the super pixels collaboratively contribute
to their homogeneity. Second, the more their appearance and
spatial structure correlate with a particular semantic object, the
stronger their homogeneity. Compared with the stripe-distributed yellow super pixels, the stripe-distributed blue super pixels appear more common in semantic objects, such as lakes and rivers, which indicates that they are weakly correlated with any particular semantic object and thus should be assigned a weaker homogeneity.
The training photos and the test image I* are highly correlated through their respective AVPs P and P*; thus, a probabilistic graphical model is utilized to describe this correlation. As shown in Fig. 6, the graphical model contains two types of nodes: observable nodes (blue) and hidden nodes (gray). More specifically, it can be divided into four layers: the first layer represents all the training photos, the second layer denotes the AVPs extracted from the training photos, the third layer represents the AVP of the test photo, and the last layer denotes the test photo. The correlation between the first and second layers is p(P | I1, · · · , IH), the correlation between the second and third layers is p(P* | P), and the correlation between the third and fourth layers is p(I* | P*).
Compared with the stripe-distributed yellow super pixels, the triangularly distributed yellow super pixels are unique to the Egyptian pyramid; thus, they should be assigned a stronger homogeneity. We propose to use graphlets to capture the appearance and spatial structure of super pixels. The graphlets are obtained by extracting connected subgraphs from a region adjacency graph (RAG). The size of a graphlet is defined as the number of its constituent super pixels. In this work, only small-sized graphlets are adopted because: 1) the number of all possible graphlets increases exponentially with the graphlet size; 2) the graphlet embedding implicitly extends the homogeneity beyond single small-sized graphlets; and 3) empirical results show that the segmentation accuracy stops increasing when the graphlet size increases from 5 to 10, so small-sized graphlets are descriptive enough. Let T denote the maximum graphlet size; we extract graphlets of all sizes ranging from 2 to T. The graphlet extraction is based on depth-first search, which is computationally efficient.
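The depth-first extraction can be sketched as follows; the region adjacency graph is given as a plain dict of neighbor sets, and the vertex identifiers are illustrative:

```python
def extract_graphlets(adj, max_size):
    """Enumerate the connected subgraphs (graphlets) of a region
    adjacency graph with 2..max_size vertices via depth-first growth."""
    found = set()

    def grow(current, frontier):
        if len(current) >= 2:
            found.add(frozenset(current))
        if len(current) == max_size:
            return
        for v in sorted(frontier):
            # extend the connected set by one adjacent vertex
            grow(current | {v}, (frontier | adj[v]) - current - {v})

    for start in adj:
        grow({start}, set(adj[start]))
    return found

# toy RAG: atomic regions 1-2-3 in a chain
rag = {1: {2}, 2: {1, 3}, 3: {2}}
graphlets = extract_graphlets(rag, max_size=3)
```

Growing only through the frontier guarantees every emitted vertex set is connected, so disconnected pairs such as {1, 3} are never produced.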
3.3. Evaluate Probabilistic Aesthetic Measure by Perception Guidance

Fig. 6 An illustration of the probabilistic graphical model

According to the formulation above, photo aesthetics can be quantified as the similarity between the AVPs from the test photo and those from the aesthetically pleasing training photos. The similarity is interpreted as the amount of AVPs that can be transferred from the training photos into the test image.
Based on the above discussion, the top-ranked graphlets are the salient ones that draw the attention of the human eye. That is, humans first fixate on the most salient graphlet in a photo, then shift their gazes to the second most salient one, and so on. Inspired by the scan paths used in human eye-tracking experiments, we propose the actively viewing path (AVP) to mimic the sequential manner in which humans perceive a visual scene. The procedure for generating an AVP from a photo is illustrated in Fig. 5. Note that all the AVPs from a data set contain the same number of graphlets K. Typically, we set K to 4; its influence on photo aesthetics prediction is evaluated in our experiments. Given a set of
aesthetically pleasing training photos {I1, · · · , IH}, the aesthetic quality of the test photo I* can be formulated as:

Λ(I*) = p(I* | I1, · · · , IH) = p(I* | P*) · p(P* | P) · p(P | I1, · · · , IH).
The probabilities p(I* | P*), p(P* | P), and p(P | I1, · · · , IH) are computed respectively; in particular,

p(I* | P*) = Σ_{G*∈P*} p(I* | G*) = Σ_{G*∈P*} p(G1*, · · · , GT* | I*) · p(I*) / p(G1*, · · · , GT*) ∝ Σ_{G*∈P*} ∏_{t=1}^{T} ∏_{j=1}^{Yt} p(Gt*(j) | I*),

where T is the maximum graphlet size, Yt is the number of t-sized graphlets in the test photo I*, and Gt*(j) is the j-th t-sized graphlet of the AVP from the test photo; p(Gt*(j) | I*) denotes the probability of extracting graphlet Gt*(j) from the test photo I*.
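The sum-of-products factor above can be evaluated directly once per-graphlet extraction probabilities are available; the probabilities below are hypothetical placeholders:

```python
import math

def path_probability(per_graphlet_probs):
    """Evaluate p(I* | P*) up to proportionality: sum over the graphlets
    G* on the path of the product of the extraction probabilities
    p(Gt*(j) | I*) of their constituent t-sized graphlets."""
    return sum(math.prod(probs) for probs in per_graphlet_probs)

# hypothetical extraction probabilities for a two-graphlet AVP
score = path_probability([[0.5, 0.5], [0.2]])  # 0.5 * 0.5 + 0.2
```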
IV. EXPERIMENTAL AND RESULT ANALYSIS
Fig. 5. An illustration of AVP generation based on the top-ranked graphlets.
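The AVP generation from top-ranked graphlets reduces to ranking by saliency and truncating to K entries, as in this sketch with hypothetical saliency scores:

```python
def generate_avp(saliency, k=4):
    """Link the k most salient graphlets into an actively viewing path:
    the gaze is assumed to visit graphlets in decreasing saliency."""
    ranked = sorted(saliency, key=saliency.get, reverse=True)
    return ranked[:k]

# hypothetical saliency scores from the sparsity-constrained ranking
scores = {'g1': 0.90, 'g2': 0.10, 'g3': 0.70, 'g4': 0.50, 'g5': 0.30}
avp = generate_avp(scores, k=4)
```

The first element plays the role of the initial fixation, and each subsequent element models one gaze shift.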
In this section, we evaluate the effectiveness of the proposed semantics-aware photo cropping based on our experiments. We discuss the influences of the parameters on the cropping results, and we present a qualitative and quantitative comparison between the proposed active graphlet path and human gaze shifting paths. Because there are not yet standard datasets
released for evaluating cropping performance, we compiled our own photo cropping dataset. The total training data contain approximately 6000 highly aesthetic as well as 6000 low-aesthetic photos, which were crawled from two online photo sharing websites, Photosig and Flickr. In our experiment, from the whole 6000 × 2 images, we randomly selected half of the highly aesthetic photos and half of the low-aesthetic ones as training data, and left the rest for testing.
Finally, Gibbs sampling-based parameter inference is adopted to output the most qualified cropped photo. To represent each subject or background region, a 512-dimensional edge histogram and a 256-dimensional color histogram are extracted. Then a region-level SVM classifier is trained on the concatenated 768-dimensional feature vector and further used to score the quality of each subject region. The scores from all subject regions are concatenated into a feature vector, which is used to train an image-level SVM classifier for scoring the quality of each candidate cropped photo.
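A simplified sketch of the 768-dimensional region feature, with a gradient-orientation histogram and an 8×8×4 RGB quantization standing in for the actual edge and color descriptors (both choices are assumptions for illustration):

```python
import numpy as np

def region_descriptor(gray, rgb):
    """Concatenate a 512-bin edge-orientation histogram and a 256-bin
    quantized RGB color histogram into a 768-d region feature."""
    gy, gx = np.gradient(gray.astype(float))
    angles = np.arctan2(gy, gx).ravel()
    edge_hist, _ = np.histogram(angles, bins=512, range=(-np.pi, np.pi))
    r, g, b = (rgb[..., i].astype(int) for i in range(3))
    idx = (r // 32) * 32 + (g // 32) * 4 + (b // 64)  # 8 * 8 * 4 = 256 bins
    color_hist = np.bincount(idx.ravel(), minlength=256)
    return np.concatenate([edge_hist, color_hist]).astype(float)

rng = np.random.default_rng(2)
gray = rng.integers(0, 256, size=(16, 16))
rgb = rng.integers(0, 256, size=(16, 16, 3))
feat = region_descriptor(gray, rgb)
```

Such vectors would then feed the region-level classifier described above.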
4.1. Comparative Study
In this section, we evaluate the proposed active graphlet path-based cropping (AGPC) in comparison with several well-known cropping methods: sparse coding of saliency maps (SCSM), sensation-based photo cropping (SBPC), omni-range context-based cropping (OCBC), personalized photo ranking (PPR), describable attributes for photo cropping (DAPC), and graphlet-transferring based photo cropping (GTPC). SCSM selects the cropped region that can be decoded, with minimum error, by the dictionary learned from training saliency maps. SBPC selects the cropped region with the maximum quality score, which is computed by probabilistically integrating the SVM scores corresponding to each detected subject in a photo. OCBC integrates the prior on the spatial distribution of arbitrary pairwise image patches into a probabilistic model to score each candidate cropped photo, and the highest-scoring one is selected as the cropped photo. GTPC extracts different-sized graphlets from each photo and then embeds them into equal-length feature vectors using linear discriminant analysis (LDA); the post-embedding graphlets are then transferred into the cropped photo. It is noted that the photo quality evaluation methods, i.e., PPR and DAPC, only output a quality score for each photo.
4.2. Performance under Different Parameter Settings
This experiment studies how the free parameters affect the performance of the cropping result.
That is, there are three free-tuning parameters:
1. The maximum graphlet size T
2. The dimensionality of post-embedding graphlets d,
3. The number of actively selected graphlets K
TABLE: TIME CONSUMPTION (IN SECONDS) OF THE COMPARED CROPPING METHODS (DAPC, SCSM, OCBC, AND AGPC) AT RESOLUTIONS 800*600, 1024*768, 1600*1200, 1000*200, AND 2000*400
V. SIMULATION RESULTS

Photo quality evaluation methods such as PPR and DAPC cannot be compared with our approach directly, because our approach outputs the cropped region from each original photo. Fortunately, it is straightforward to convert each of these photo quality evaluation methods into a photo cropping method. In particular, we sequentially sample
a number of candidate cropped photos from the original photo.
Then, we use one of these photo quality evaluation methods to
score each candidate cropped photo, and the best qualified one
is output as the cropped photo. For all these compared
approaches, the source codes of SCSM and PPR are available.
For SCSM, all the parameter settings are the same as those in the original publication. For PPR, we use the released executable program with its parameter settings unchanged. For GTPC and AGPC, the maximum graphlet size T is 5, the dimensionality of the post-embedding graphlets is 20, and the number of actively discovered graphlets is 4 (AGPC only). For OCBC, we use UFC-based segmentation to decompose each training photo into a number of atomic regions. All training atomic regions are then clustered into 1000 centers by k-means. For arbitrary pairwise k-means centers, we use a five-component GMM to model their distribution. Given a candidate cropped photo, the probability of its pairwise atomic regions is computed based on the GMM and further integrated into a probabilistic quality measure.
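The conversion from a quality measure to a cropper can be sketched as exhaustive candidate scoring; the scoring function, step size, and minimum crop fraction below are assumptions for illustration:

```python
def best_crop(score_fn, width, height, step=40, min_frac=0.5):
    """Convert a photo-quality measure into a cropper: slide candidate
    rectangles (x, y, w, h) over the image, score each with score_fn,
    and keep the highest-scoring one."""
    min_w, min_h = int(width * min_frac), int(height * min_frac)
    best, best_score = None, float('-inf')
    for x in range(0, width - min_w + 1, step):
        for y in range(0, height - min_h + 1, step):
            for w in range(min_w, width - x + 1, step):
                for h in range(min_h, height - y + 1, step):
                    s = score_fn((x, y, w, h))
                    if s > best_score:
                        best, best_score = (x, y, w, h), s
    return best, best_score

def score_fn(crop):
    # hypothetical quality measure: prefer crops centered on (300, 200)
    x, y, w, h = crop
    return -abs(x + w / 2 - 300) - abs(y + h / 2 - 200)

crop, score = best_crop(score_fn, width=640, height=480)
```

Any of the compared quality measures could be plugged in as `score_fn`, which is exactly how the scoring-only methods are turned into croppers here.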
5.1. Cropping In Order To Emphasize The Semantic Contents And Evaluating The Photo Aesthetic Quality
5.2. Sequential Order Of Cropping
VI. CONCLUSION
A new method has been proposed to crop a photo by simulating the procedure of humans sequentially perceiving the semantics of a photo. In particular, a so-called active graphlet path is constructed to mimic the process by which humans actively look at the semantically important components in a photo, so that no important elements are lost from the image. Experimental results show that the active graphlet path accurately predicts human gaze shifting and is more indicative of photo aesthetics than conventional saliency maps. The cropped photos produced by our approach outperform those of its competitors in both qualitative and quantitative comparisons. Furthermore, we propose, for the first time, a probabilistic model to maximally transfer the paths from a training set of aesthetically pleasing photos into the cropped photo. Extensive comparative experiments demonstrate the effectiveness of our approach.
5.3. Cropping In Order To Emphasize the Primary Subject