Automatic Discrimination of Text and Non-Text Natural Images
Chengquan Zhang, Cong Yao, Baoguang Shi, Xiang Bai
School of Electronic Information and Communications
Huazhong University of Science and Technology
Wuhan, 430074, P.R. China.
Email: {zchengquan,yaocong2010,shibaoguang}@gmail.com, [email protected]
Abstract—With the rapid growth of image and video data comes an interesting yet challenging problem: how to organize and utilize such a large volume of data? Textual content in images and videos is an important source of information, which can be of great use and assistance. Therefore, we investigate in this paper the problem of text image discrimination, which aims at distinguishing natural images with text from those without text. To tackle this problem, we propose a method that combines three mature techniques in this area, namely MSER, CNN and BoW. To better evaluate the proposed algorithm, we also construct a large benchmark for text image discrimination, which includes natural images from a variety of scenarios. The algorithm has proven to be both effective and efficient, so it can serve as a tool for mining valuable textual information from huge amounts of image and video data.
I. INTRODUCTION
Text present in images and videos, either protogenous or
superimposed, carries rich and precise information, thus it can
serve as an important cue for a wide range of applications,
such as image search [1], human-computer interaction [2] and
assistive system for the blind [3]. Different from conventional
approaches related to textual information understanding, which
are usually concerned with text detection or recognition [4], we
tackle in this paper the problem of text image discrimination.
Given a collection of natural images, the goal of text image
discrimination is to discriminate the images that contain text
from those that do not contain any text, without considering the
location or content of text, as shown in Fig. 1. This problem
is obviously of great importance and practicality, but has been
neglected by the community for a long period of time.
The difficulties in text image discrimination mainly stem
from three aspects: (1) Diversity of text. In stark contrast to characters in document images, which usually have regular fonts, similar colors and horizontal orientations, text in natural images is far more diverse, i.e., it may exhibit significant variations in font, color, orientation, or even language. Moreover, the portion of text in an image can vary tremendously, from a single character in a corner to multiple lines of characters occupying the whole image; (2) Complexity of backgrounds. The backgrounds may contain many non-text elements (such as branches, grass, fences and signs) that closely resemble true text, which might cause confusion; (3) Interference factors. Various factors, such as noise, blur and non-uniform illumination, may give rise to failures.
To address these challenges, we propose an effective and
robust algorithm, which takes advantage of three widely
Fig. 1: Text image discrimination. Given a collection of images, the goal of text image discrimination is to separate images containing text from those not containing text. Text image discrimination can eliminate the majority of irrelevant images at an early stage, thus benefiting subsequent processing, such as text detection, text recognition, and image/video indexing.
used techniques in this field: MSER, CNN and BoW. Character
candidates are extracted using MSER [5]. A CNN acts as a discriminative codebook, computing a bank of responses for
each candidate. For each image, a feature vector is generated
with BoW, by aggregating and pooling the responses of all the
candidates in the image. The final prediction (text or non-text
image) is given by a trained SVM classifier.
Moreover, we generate and release a large image database
for text image discrimination, since there is no public benchmark in this field. This dataset contains a collection of natural
images with and without text, from various scenarios. The
texts in the images are in different languages, colors, scales,
fonts, orientations, and layouts. Due to its diversity and representativeness, we believe this dataset can serve as a standard
benchmark for text image discrimination.
We have evaluated the effectiveness of the proposed algorithm on the proposed dataset. The experiments demonstrate
that the proposed method achieves superior performance, compared with conventional approaches. With a GPU, this algorithm
runs reasonably fast, so it can be used in large-scale non-text
image filtration.
In summary, the contributions of this paper are two-fold:

• We emphasize the significance of text image discrimination and propose an effective algorithm for it.

• We establish a large benchmark¹ for algorithm development, assessment and comparison. It includes 15302 images, exhibiting variations in scene, style, illumination, text appearance and layout, etc.
II. RELATED WORK
Text has been the core research object in the document
analysis and recognition community. However, the majority of
existing works have focused on document images [6], [7],
while those concerned with natural images mainly tackle tasks
like text detection and recognition [8]–[13], rather than the task of
text/non-text distinction.
A small number of works address text/non-text distinction. Indermuhle et al. [14] and Vidya et al. [15] proposed systems that distinguish between text and non-text regions in handwritten documents. However, these methods aim at classifying text/non-text at the region level, rather than the image level.
Most related to our work, the algorithm of Alessi et al. [16]
is able to detect if a digital image contains text or not, but it is
only applicable to document images or simple video frames.
In contrast, the proposed approach is capable of distinguishing
between natural images with text and those without text. In this
sense, this work is the first that formally tackles the problem
of text image discrimination for real-world natural images.
III. TEXT IMAGE DISCRIMINATION METHOD
In this section, we describe in detail the proposed method
for text image discrimination, which is an ingenious combination of Maximally Stable Extremal Region (MSER) [5],
Convolutional Neural Network (CNN) [17] and Bag of Words
(BoW) [18], [19]. Recently, a rich body of works [20]–[22] has demonstrated that CNNs are effective for scene text detection and recognition. The proposed algorithm also utilizes the ability of CNNs to learn discriminative features for accurate text image discrimination. In the training phase, K-means [23] is applied to the text regions extracted by MSER before a CNN model is trained. This converts the binary classification problem (text vs. non-text) into a multi-class classification problem. According to the responses of the regions, a histogram-like descriptor, which represents the characteristics of a natural image, is formed using Bag of Words (BoW).
Fig. 2: Pipeline for text image discrimination.
As shown in Fig. 2, the proposed method is composed
of the following major processing steps: region extraction and
clustering, multi-class CNN model training, feature vector
generation and linear SVM training.
¹ The dataset is available at http://zchengquan.github.io/textDis/ for academic use only.
A. Region Extraction and Clustering
Maximally Stable Extremal Regions [5] (MSER) is widely
used in the fields of text detection and recognition, since it is
able to discover candidate characters effectively and efficiently.
Neumann et al. [24] have also shown that MSER can capture over 95% of true text regions when multiple channels are used.
In our method, we employ MSER to extract regions which
can support the discrimination of text and non-text images.
However, different from text detection or recognition, text image discrimination does not require localizing or recognizing characters. Moreover, the proposed method allows text regions with a few connected characters. Multiple channels (R, G, B and gray) and loose parameter settings are used in our method.
Before training the CNN model, we compute a HOG [25] feature for each text region from the foreground (text components), and K-means is performed to cluster the text regions into multiple categories. The softmax layer of the CNN model will then produce a (K+1)-dimensional response for each region. Note that clustering is not applied to regions from the background.
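As a concrete illustration of this step, the sketch below runs MSER over the R, G, B and gray channels, crops each candidate region to 32 × 32, and clusters the foreground patches into K pseudo-classes with HOG and K-means. It assumes OpenCV and scikit-learn; the MSER settings, HOG parameters and value of K are illustrative choices, not the authors' exact configuration.

```python
# A minimal sketch of region extraction and clustering, assuming OpenCV and
# scikit-learn; MSER parameters, the 32x32 patch size and K are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_regions(image_bgr):
    """Run MSER on the R, G, B and gray channels and return 32x32 patches."""
    mser = cv2.MSER_create()  # loose default settings
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    channels = list(cv2.split(image_bgr)) + [gray]
    patches = []
    for ch in channels:
        _, boxes = mser.detectRegions(ch)
        for (x, y, w, h) in boxes:
            patches.append(cv2.resize(ch[y:y + h, x:x + w], (32, 32)))
    return patches

def cluster_text_regions(text_patches, k=100):
    """Cluster foreground (text) patches into K pseudo-classes via HOG + K-means."""
    hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)
    feats = np.array([hog.compute(p).ravel() for p in text_patches])
    # Background regions are not clustered; they keep one extra class (label K).
    return KMeans(n_clusters=k, n_init=10).fit_predict(feats)
```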
B. Multi-Class CNN Model Training
The whole CNN structure is composed of 4 convolutional layers and 2 fully connected layers. Each convolutional layer is followed by max-pooling and rectified linear units. The configuration of the CNN architecture used in this paper is shown in TABLE I.
TABLE I: The setting of the CNN architecture used in this work. ks means kernel size; ps means padding size; ss means stride size; nMap means the number of feature maps, and nNode means the number of linear layer nodes.

Layer   Settings
Conv-1  ks=5, ps=2, ss=1, nMap=64
Conv-2  ks=5, ps=2, ss=1, nMap=128
Conv-3  ks=3, ps=1, ss=1, nMap=384
Conv-4  ks=3, ps=0, ss=1, nMap=512
Full-1  nNode=1024
Full-2  nNode=(K+1)
The input image region is rescaled to 32 × 32, and by design the feature maps are reduced to size 1 × 1 after the first 4 convolutional (and pooling) stages. The kernel numbers we use are 64, 128, 384 and 512, respectively. After the 4 convolutional stages, 2 fully connected layers with 1024 and (K+1) perception units follow. Since softmax is used in the last layer, we obtain a (K+1)-dimensional vector, where each dimension denotes the probability of one category.
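For readers who prefer code, the following is a minimal sketch of this architecture. The paper's implementation uses Torch7, so the PyTorch rendering, the single-channel grayscale input and the placement of a 2 × 2 max-pooling after every convolution are assumptions consistent with TABLE I (32 → 16 → 8 → 4 → 2 → 1).

```python
# A sketch of the (K+1)-class CNN in TABLE I; input channel count and pooling
# placement are assumptions, not the authors' exact Torch7 model.
import torch.nn as nn

def build_cnn(k_plus_1):
    return nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),    # 32 -> 16
        nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        nn.Conv2d(128, 384, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 8 -> 4
        nn.Conv2d(384, 512, kernel_size=3, padding=0), nn.ReLU(), nn.MaxPool2d(2), # 4 -> 2 -> 1
        nn.Flatten(),                       # 512 x 1 x 1 -> 512
        nn.Linear(512, 1024), nn.ReLU(),    # Full-1
        nn.Linear(1024, k_plus_1),          # Full-2
        nn.LogSoftmax(dim=1),               # pairs with the NLL loss used for training
    )
```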
In the training phase, we use Stochastic Gradient Descent and back-propagation to jointly optimize all parameters, minimizing the classification error as much as possible. Two important measures are taken into consideration in the training process: one is the penalty factor, the other is boosting. Since clustering is unsupervised, it cannot divide the samples into perfectly accurate classes. We therefore use different penalty factors for regions from the foreground and the background, by weighting the Negative Log Likelihood criterion used as the loss function. Low penalty factors are assigned to the K text region clusters, which allows the CNN model to adjust imperfect clustering results, whereas a high value is assigned to regions from the background, because the model's ability to filter non-text regions is crucial for the feature vector generation step.
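A minimal sketch of such a class-weighted loss is given below, assuming the K text clusters carry labels 0..K-1 and the background carries label K; the concrete penalty values are illustrative assumptions, not the values used in the paper.

```python
# Class-weighted Negative Log Likelihood loss (sketch); penalty values 0.5 and
# 2.0 are assumptions for illustration.
import torch
import torch.nn as nn

K = 100
weights = torch.ones(K + 1)
weights[:K] = 0.5   # low penalty for the K text clusters
weights[K] = 2.0    # high penalty for the background class
criterion = nn.NLLLoss(weight=weights)  # expects log-probabilities (LogSoftmax output)
```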
In order to make the CNN model good at filtering non-text regions, we use boosting [26] to optimize our model. First, we divide the regions into two groups: regions from the foreground (text regions) and regions from the background (non-text regions). K-means is adopted to divide the text regions into multiple clusters, while the non-text regions are kept as a single class. Because of the complexity and diversity of backgrounds, we randomly select a subset of samples from the background. Once the CNN model first reaches an optimum, we use it to mine hard non-text examples and add them to the next round of training; this procedure is repeated 3∼5 times.
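The hard-example mining step can be sketched as follows. This is an illustration rather than the authors' code: train_one_round is a hypothetical training routine, and the number of negatives added per round is an assumption.

```python
# Illustrative hard-negative mining for the boosting step (a sketch, not the
# authors' code); background patches the current model scores as most
# text-like are added as negatives for the next round.
import torch

def mine_hard_negatives(model, bg_batch, num_hard=5000):
    """bg_batch: (N, 1, 32, 32) tensor of background patches; returns the most text-like ones."""
    with torch.no_grad():
        log_probs = model(bg_batch)               # (N, K+1) log-probabilities
        text_prob = 1.0 - log_probs[:, -1].exp()  # probability of belonging to any text cluster
    hard_idx = torch.argsort(text_prob, descending=True)[:num_hard]
    return bg_batch[hard_idx]

# Typical usage, repeated for 3~5 rounds:
#   model = train_one_round(text_patches, negatives)  # hypothetical helper
#   negatives = torch.cat([negatives, mine_hard_negatives(model, bg_pool)])
```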
C. Feature Vector Generation and Linear SVM Training
Given an image, N regions are extracted with MSER in multiple channels, followed by color channel conversion and scale normalization. Each region generates a (K+1)-dimensional response, so a feature matrix of size N × (K+1) can be used to represent the whole image. Then, a (K+1)-dimensional feature vector is formed with BoW, by aggregating and pooling the feature matrix. We call this operation CNN Coding, since the softmax output of the CNN model acts as the codebook, which is similar to the sparse coding used in word retrieval [27]. The coding output can be expressed as follows:
\Phi(I) = \sum_{i=1}^{K+1} \alpha_i \phi_i,    (1)
where I denotes an image and φ_i is the i-th class response. α_i denotes the weight of the i-th class. Thus, Φ(I) stands for the CNN Coding result for the whole image.
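One possible reading of Eq. (1), sketched below, is a weighted pooling of the per-region softmax responses; the choice of average pooling over regions and uniform weights α_i = 1 is an assumption made for illustration, since the paper leaves the exact weighting open.

```python
# CNN Coding as weighted pooling of per-region responses (one possible
# reading of Eq. (1); pooling and weights are assumptions).
import numpy as np

def cnn_coding(responses, alphas=None):
    """responses: (N, K+1) array of per-region softmax outputs; returns Phi(I)."""
    pooled = responses.mean(axis=0)        # aggregate the N regions per class
    if alphas is None:
        alphas = np.ones_like(pooled)      # alpha_i = 1 for all classes
    return alphas * pooled                 # (K+1)-dimensional image descriptor
```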
A feasible way to improve upon CNN Coding is to perform region filtering before coding, since regions from the background are complex and diverse. Moreover, given a text image, the number of non-text regions extracted by MSER is usually several orders of magnitude larger than that of the text regions. In order to generate a reliable and effective descriptor for an image, some measure should be adopted to control the ratio between text and non-text regions. The trained CNN model has proven to be excellent at filtering non-text regions, so we not only utilize the CNN model to generate a feature vector of an image, but also use it as a filter. As shown in Fig. 3, the CNN model is able to effectively eliminate non-text regions, which makes the final feature more discriminative. We will demonstrate the advantage of this strategy in Sec. V.
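One simple way to realize this filtering, sketched below, is to rank regions by their text-likeness (one minus the background probability from the CNN) and keep only a small top fraction before coding; the 2% default mirrors the best setting reported later in TABLE III, and the function name is illustrative.

```python
# Region filtering before CNN Coding (sketch); keep only the top-ranked
# fraction of regions by text-likeness.
import numpy as np

def filter_regions(responses, keep_fraction=0.02):
    """responses: (N, K+1) softmax array whose last column is the background class."""
    text_score = 1.0 - responses[:, -1]
    num_keep = max(1, int(len(responses) * keep_fraction))
    keep_idx = np.argsort(-text_score)[:num_keep]
    return responses[keep_idx]
```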
Finally, the coding results are fed to an SVM classifier. In our method, we use LIBLINEAR [28] as the final classifier, which leads to excellent performance.
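As a sketch of this last step, a linear SVM can be trained on the per-image CNN Coding vectors, for instance with scikit-learn's LinearSVC (a wrapper around LIBLINEAR) in place of the original LIBLINEAR tool; the regularization constant C is an assumption.

```python
# Final classification step (sketch): linear SVM on CNN Coding vectors.
from sklearn.svm import LinearSVC

def train_image_classifier(features, labels, c=1.0):
    """features: (num_images, K+1) CNN Coding vectors; labels: 1 = text, 0 = non-text."""
    clf = LinearSVC(C=c)
    return clf.fit(features, labels)
```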
IV. DATASET AND EVALUATION PROTOCOL
In this section, we introduce the proposed dataset for
text image discrimination and the corresponding evaluation
method.
A. Dataset
In order to evaluate the proposed algorithm and compare
it with other methods, we have collected a large dataset from the Internet, which includes 7302 text images and 8000 non-text
Fig. 3: Regions filtered by the CNN model. (a) Text image, (b) Non-text image. Input images are shown in the first row. The red bounding boxes in the second row are the regions extracted by MSER, while the green bounding boxes in the third row are the surviving regions after filtering.
images. The majority of the images are natural images, while a small number of them are born-digital images and scanned document images. If an image contains any text line, it is considered a text image, regardless of the proportion of text areas.
The dataset is challenging and diverse, due to the variations in
scene type, style, illumination, language type and text layout,
as well as the complexity of backgrounds. As shown in Fig. 4,
the text images are manually labelled with bounding boxes for
visible text regions. We randomly select 2000 samples from each class to build the testing set, and all the remaining images are used as the training set.
B. Evaluation Protocol
Similar to evaluation methods for general image classification, precision (P), recall (R) and F-measure (F) are used as the metrics to measure the performance of algorithms for text image discrimination. These three metrics are defined as follows:
P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2 \times P \times R}{P + R},    (2)
where TP is the number of text images that are correctly classified, FP is the number of non-text images that are falsely classified as text, and FN is the number of text images that are falsely classified. In our experiments, precision means the ratio of the number of correctly classified text images to the number of images predicted as positive, and recall means the ratio of the number of correctly classified text images to the number of all text images in the test set.

Fig. 4: Examples from the proposed dataset. (a) Text images along with manually labelled bounding boxes. (b) Non-text images.

TABLE III: Effects of different percentages of regions reserved.

Percentage   Precision   Recall   F-Measure
1%           0.900       0.884    0.892
2%           0.898       0.903    0.901
5%           0.906       0.866    0.885
10%          0.897       0.853    0.874
20%          0.916       0.819    0.864
50%          0.903       0.816    0.857
100%         0.901       0.812    0.854
V. EXPERIMENTS AND DISCUSSIONS
In this section, we evaluate our algorithm on the proposed
benchmark dataset, using the standard evaluation protocol.
In the following experiments, we will discuss the effects of
different numbers of clusters on text regions, and different
percentages of regions reserved after filtering. Moreover, we
will compare our method with other traditional algorithms
for image classification, such as Locality-constrained Linear
Coding (LLC) [29], CNN, and an improved LLC that uses MSER.
Effect of cluster number: As shown in TABLE II, we
compare 7 different numbers of clusters used in training the
CNN model. The precision and recall are measured when the F-measure achieves its optimum value. Comparing different values of K, we find that the best performance is achieved at K = 100, beyond which there is no obvious improvement.
TABLE II: Effects of different values of K.

(K+1)      2      51     101    201    301    401    501
Precision  0.889  0.906  0.898  0.892  0.881  0.894  0.879
Recall     0.878  0.874  0.903  0.892  0.902  0.884  0.908
F-Measure  0.883  0.890  0.901  0.892  0.891  0.888  0.892
Effect of reserved region ratio: As aforementioned, we found that too many non-text regions have a negative effect on the final performance. In our experiments, we have tried different percentages of reserved regions. As can be observed from TABLE III, the best performance is obtained when the top-ranked 2% of regions are reserved, which strongly indicates that too many regions from the background would hurt the classification performance. In this experiment, the precision and recall are also measured when the F-measure achieves its optimum value.
Comparison with other methods: (1) Locality-constrained
Linear Coding [29] is our first baseline method, which extracts
dense SIFT features at 3 different scales. The codebook size in our experiment is 2048. Besides, we replace SPM [30] with global max-pooling, since we find that SPM hardly brings any improvement while spending more time on coding. (2)
CNN is considered as the second baseline method, whose
structure is similar to the model used in our proposed method,
except that the input scale is 224 × 224 and a global max-pooling is performed before the last 2 fully connected layers.
(3) Taking the bounding box information into account, we
also design an improved coding scheme for the traditional
LLC. For each image, HOG, LBP, gradient histogram, and gradient orientation histogram features are extracted for the stable regions after filtering, and integrated with the LLC method. We call this method MSER+Adaboost+LLC. The P-R curves in Fig. 5 show that, with the bounding box information, the proposed method achieves significantly better performance than the baseline methods. Note that the
comparison between the proposed method (CNN Coding)
and MSER+Adaboost+LLC is fair, since both methods used
bounding box labels in the training phase.
Time Cost: We also measured the time cost of our algorithm
on a regular PC (CPU: Intel(R) Xeon(R) CPU E3-1230 V2
@3.30 GHz; GPU: Tesla K40c; RAM: 8GB). As shown in TABLE IV, our method takes around 0.43∼0.49s to accomplish
the classification task on a single CPU and a single GPU. We
analyse the average time taken for each stage of our pipeline
on our dataset, which has an average image size of 720 × 620.
The MSER step is implemented in C++ with OpenCV 2.4.8. The CNN Coding and linear SVM steps are implemented with Torch7 under Linux.
Fig. 5: The precision/recall curves of different methods on our dataset.

The proposed system achieves high classification accuracy and runs reasonably fast, thus it can be used as a powerful tool in large-scale textual information mining tasks.
TABLE IV: Time cost for each stage.

Stage               Time
MSER Extraction     0.18∼0.23 s
CNN Coding          0.25∼0.26 s
SVM Classification  0.24 ms
Total               0.43∼0.49 s

VI. CONCLUSION
We have presented an effective and efficient algorithm for
text image discrimination, an important problem in textual
information understanding. Moreover, a large image dataset,
which includes diverse natural images with and without text,
is generated and released. This dataset can serve as a standard
benchmark in this field. The experiments on the proposed
benchmark show the advantages of the proposed algorithm
over conventional methods.
In this paper, we used the bounding boxes as well as
the image level labels, for training the model for text image
discrimination. In the future, we will investigate methods that
only utilize image level labels [31]. Besides, this method can also narrow the search range of text regions, which would be helpful to text localization and deserves further study.
ACKNOWLEDGEMENTS
This work was primarily supported by the National Natural Science Foundation of China (NSFC) (No. 61222308), and in part by the Program for New Century Excellent Talents in University under Grant NCET-12-0217, and by CCF-Tencent RAGR20140116.
REFERENCES

[1] S. S. Tsai, H. Chen, D. Chen, G. Schroth, R. Grzeszczuk, and B. Girod, "Mobile visual search on printed documents using text and low bit-rate features," in Proc. of ICIP, 2011, pp. 2601–2604.
[2] B. Kisacanin, V. Pavlovic, and T. S. Huang, Real-time vision for human-computer interaction. Springer Science & Business Media, 2005.
[3] B. T. Nassu, R. Minetto, and L. E. S. d. Oliveira, "Text line detection in document images: Towards a support system for the blind," in Proc. of ICDAR, 2013, pp. 638–642.
[4] C. Yao, X. Bai, B. Shi, and W. Liu, "Strokelets: A learned multi-scale representation for scene text recognition," in Proc. of CVPR, 2014.
[5] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, no. 10, pp. 761–767, 2004.
[6] N. Chen and D. Blostein, "A survey of document image classification: problem statement, classifier architecture and performance evaluation," IJDAR, vol. 10, no. 1, pp. 1–16, 2007.
[7] D. Doermann, "The indexing and retrieval of document images: A survey," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 287–298, 1998.
[8] Y. Navon, V. Kluzner, and B. Ophir, "Enhanced active contour method for locating text," in Proc. of ICDAR, 2011, pp. 222–226.
[9] J. Liang, D. Doermann, and H. Li, "Camera-based analysis of text and documents: a survey," IJDAR.
[10] C. Yi, X. Yang, and Y. Tian, "Feature representations for scene text character recognition: A comparative study," in Proc. of ICDAR, 2013.
[11] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. of CVPR, 2012.
[12] C. Yao, X. Bai, and W. Liu, "A unified framework for multioriented text detection and recognition," IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4737–4749, 2014.
[13] Y. Zhu, C. Yao, and X. Bai, "Scene text detection and recognition: Recent advances and future trends," Frontiers of Computer Science, 2015.
[14] E. Indermuhle, H. Bunke, F. Shafait, and T. Breuel, "Text versus non-text distinction in online handwritten documents," in Proc. of SAC, 2010.
[15] V. Vidya, T. R. Indhu, and V. K. Bhadran, "Classification of handwritten document image into text and non-text regions," in Proc. of ICSIP, 2012.
[16] N. G. Alessi, S. Battiato, G. Gallo, M. Mancuso, and F. Stanco, "Automatic discrimination of text images," in Proc. of SPIE, 2003.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[18] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. of CVPR, 2005.
[19] S. Bai, X. Wang, C. Yao, and X. Bai, "Multiple stage residual model for accurate image classification," in Proc. of ACCV, 2014, pp. 430–445.
[20] T. Bluche, H. Ney, and C. Kermorvant, "Feature extraction with convolutional neural networks for handwritten word recognition," in Proc. of ICDAR, 2013, pp. 285–289.
[21] W. Huang, Y. Qiao, and X. Tang, "Robust scene text detection with convolution neural network induced MSER trees," in Proc. of ECCV, 2014, pp. 497–511.
[22] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in Proc. of ECCV. Springer, 2014, pp. 512–528.
[23] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proc. Fifth Berkeley Symp. on Math. Statist. and Prob.
[24] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Proc. of CVPR, 2012, pp. 3538–3545.
[25] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. of CVPR, vol. 1, 2005, pp. 886–893.
[26] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[27] R. Shekhar and C. Jawahar, "Document specific sparse coding for word retrieval," in Proc. of ICDAR, 2013, pp. 643–647.
[28] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[29] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. of CVPR, 2010, pp. 3360–3367.
[30] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. of CVPR, 2009, pp. 1794–1801.
[31] X. Wang, X. Bai, W. Liu, and L. J. Latecki, "Feature context for image classification and object detection," in Proc. of CVPR, 2011.