Discovering Objects in Videos

Karim Sayed Ahmed
Machine learning project - Final Report
1 Problem
The goal of this project is to identify and localize objects in videos. Given a video V
consisting of a set of N frames f_1, ..., f_N, where each frame contains one or more unknown
objects, the objective is to identify and localize the primary objects in each frame f_i of V
that exhibit coherence in both appearance and motion. Figure 1 gives a brief illustration of
the problem. Two sub-problems must be solved. The first is the localization of the primary
objects in each frame (treated as a still image). The second is finding the correlation
between the objects detected across all frames of the video.
[Figure 1 image: frames F1, F2, ..., FN-1, FN, annotated "Primary object?" and "Same objects?"]
Figure 1: The problem of discovering objects in videos is divided into two sub-problems. The first
sub-problem is localization of primary objects in each frame. The second sub-problem is finding the
correlation between the detected objects in all of the video's frames.
2 Related Work
The problem of discovering objects in videos is closely related to video segmentation. The
following are some state-of-the-art approaches in this area:
• Representing the video by a multi-label Markov Random Field model, and
accomplishing segmentation by finding the minimum-energy labeling [7].
• Fragments-based tracking of non-rigid objects using level sets [2].
• Using a layered Directed Acyclic Graph (DAG) based framework for detection and
segmentation of the primary object in a video [10].
3 Approach Overview
To identify and localize coherent objects in videos, our proposed approach is divided into
two main phases, where each phase solves one of the sub-problems illustrated in Figure 1.
The key contribution of this work is the use of deep learning and transductive learning to
discover video objectness. The following subsections give an overview of the proposed
two-phase approach.
3.1 Phase I: Object localization per frame
The objective of this phase is to find the primary object in each frame of the video; this
primary object is represented by a bounding box. To accomplish this task, we used the
self-taught object localization approach (STL) proposed in [1], which relies on deep learning.
STL is a self-taught object localization algorithm for still images that leverages deep
convolutional neural networks trained for whole-image recognition to localize objects
in images without additional human supervision, i.e., without using any ground-truth
bounding boxes for training [1]. STL can generate many bounding boxes containing object
proposals. After extracting the object proposals in each frame, the next step is to re-rank
the top four object proposals using a linear SVM. In our experiments, this step proved
essential: it improves the quality of the detected video objects, as shown in Section 6.
Figure 2 gives an overview of the steps included in phase I.
[Figure 2 image: top four object proposals (using deep learning STL [1]), followed by the best
object proposal (using SVM).]
Figure 2: Overview of phase I.
3.2 Phase II: Classification of the primary objects in video
After identifying and localizing the primary object in each frame (the output of phase I), the
next step is classification of the primary objects across all frames of the video using a
Transductive Support Vector Machine (TSVM). This learning problem can be formulated as
detecting similarity between objects while discarding noise objects.
There are two main issues with learning the similarity between frames of the same video:
1) there are few labeled samples in the training data, especially positive samples; 2) the
testing data are part of the training data.
Using a Transductive Support Vector Machine (TSVM) [4] avoids these problems. The advantage
of TSVM is that it is trained on both labeled and unlabeled data, and the unlabeled data are
part of the test data. In other words, TSVM extends SVM to partially labeled data in
semi-supervised learning by following the principles of transduction.
In addition to the training set $D$, the learner is also given a set
\[ D^{*} = \{ x^{*}_{i} \mid x^{*}_{i} \in \mathbb{R}^{p} \}_{i=1}^{k} \]
of test examples to be classified. The transductive support vector machine is defined by the
following primal optimization problem [4]:
Minimize (over $w$, $b$, $y^{*}$):
\[ \frac{1}{2} \lVert w \rVert^{2} \]
subject to (for any $i = 1, \dots, n$ and any $j = 1, \dots, k$):
\[ y_{i} (w \cdot x_{i} - b) \ge 1, \qquad y^{*}_{j} (w \cdot x^{*}_{j} - b) \ge 1, \qquad y^{*}_{j} \in \{-1, 1\}. \]
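To make the optimization problem above concrete, the following is a minimal sketch that solves it exactly by brute force for one-dimensional data: it enumerates every labeling y* of the test points, solves the hard-margin problem for each labeling, and keeps the one with the smallest objective. This is only an illustration under stated assumptions; the function names and the toy data are hypothetical, and it is not the solver of [4], which avoids the exponential enumeration.

```python
from itertools import product


def hard_margin_1d(xs, ys):
    """Solve the 1-D hard-margin SVM; return (objective, w, b) or None if not separable."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    if not pos or not neg:
        return None
    if min(pos) > max(neg):        # positives lie to the right of the negatives
        lo, hi, sign = max(neg), min(pos), 1.0
    elif max(pos) < min(neg):      # positives lie to the left of the negatives
        lo, hi, sign = max(pos), min(neg), -1.0
    else:
        return None                # classes interleave: constraints are infeasible
    w = sign * 2.0 / (hi - lo)     # closest points sit exactly on the margin
    b = w * (lo + hi) / 2.0        # decision boundary w*x - b = 0 at the midpoint
    return 0.5 * w * w, w, b


def brute_force_tsvm_1d(train_x, train_y, test_x):
    """Enumerate all test labelings y* and keep the one minimizing 0.5 * ||w||^2."""
    best = None
    for labels in product((-1, 1), repeat=len(test_x)):
        sol = hard_margin_1d(list(train_x) + list(test_x),
                             list(train_y) + list(labels))
        if sol is not None and (best is None or sol[0] < best[1][0]):
            best = (labels, sol)
    return best


# Hypothetical toy data: two labeled points and three unlabeled test points.
labels, (objective, w, b) = brute_force_tsvm_1d([0.0, 10.0], [-1, 1], [1.0, 2.0, 9.0])
print(labels, w, b)
```

On this toy input the widest gap between consecutive points is the one between 2 and 9, so the selected labeling places the boundary there.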
4 Dataset and extraction of image features
For testing and evaluation, we used the SegTrack dataset [7, 8]. It consists of six videos
with pixel-level segmentation ground truth for each video. The standard evaluation measure
for this dataset is the average per-frame pixel error. The dataset is widely used by other
approaches, including some recently proposed methods [5, 10, 7, 8], so our results can be
compared directly with theirs. For extracting image features, we used Histograms of
Oriented Gradients (HOG) [3].
Figure 3: Videos in the SegTrack dataset [7].
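As a rough illustration of the kind of feature HOG builds on, the sketch below computes a single gradient-orientation histogram for one grayscale patch. It is a simplified stand-in, not the full descriptor of [3]: cell tiling, block normalization, and bin interpolation are omitted, and the function name, the 9-bin default, and the nested-list image format are assumptions made for this example.

```python
import math


def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations, L2-normalized.

    patch is a 2-D list of grayscale values; border pixels are skipped because
    the centered differences need both neighbors.
    """
    height, width = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]   # horizontal centered difference
            gy = patch[r + 1][c] - patch[r - 1][c]   # vertical centered difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned, in [0, 180)
            index = min(int(angle / bin_width), n_bins - 1)
            hist[index] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0.0 else hist


# A patch whose intensity grows left to right has purely horizontal gradients.
horizontal_ramp = [[float(c) for c in range(5)] for _ in range(5)]
# A patch whose intensity grows top to bottom has purely vertical gradients.
vertical_ramp = [[float(r)] * 5 for r in range(5)]
print(orientation_histogram(horizontal_ramp))
print(orientation_histogram(vertical_ramp))
```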
5 Learning and Inference
In this section, we discuss the methods and techniques used for learning and inference in
the proposed two-phase approach.
5.1 Learning and Inference for Phase I
Before going into the details of the learning and inference process, we first show why an
extra learning step is needed after executing STL. As mentioned earlier, in this part we
applied the STL approach [1] to the SegTrack dataset. The testing code for STL has been
made available by the authors of [1]. However, simply selecting the top-1 bounding box
ranked by STL is not a good option, as the experimental results section will show. For now,
Figure 4 visualizes STL applied to some SegTrack video frames, showing in each frame the
bounding box (object proposal) with the highest confidence score, which is the one assumed
most likely to contain the primary object.
Figure 4: Results of applying STL on Segtrack video frames. Red bounding box in each frame is the
selected Top-1 score object proposal. Average error rate = 64%
It is clear from the results in Figure 4 that the top-1 bounding box often fails to capture
the primary object in the scene. Our experiments on the SegTrack dataset show that the
average error rate of selecting the STL top-1 bounding box is 64%. To get good results, it
is crucial for this phase to select the best possible bounding boxes, as the overall
performance depends heavily on them. To tackle this problem, we investigated the different
object proposals generated by STL. Figure 5 shows the Top-4 bounding boxes generated by STL.
(a) Although the red bounding box (Top-1) has the highest STL confidence score, Top-4 and
Top-2 are much better.
(b) Although the red bounding box (Top-1) has the highest STL confidence score, Top-4 and
Top-3 are much better.
Figure 5: Top four bounding boxes from STL. The red box is the Top-1; the blue boxes are Top-2,
Top-3, and Top-4.
In Figure 5, STL shows good semantic results overall; however, STL clearly tends to give
higher scores to bigger bounding boxes. This is a serious problem that affects the final
results. To overcome it, we run an extra learning process on the top-ranked bounding boxes
generated by STL, as described in the following subsections.
In this phase, we use a linear SVM classifier to re-rank the STL object proposals for each
frame. We propose two different methods for applying the SVM. The first method trains the
SVM classifier on the SegTrack ground truth; the second trains it on the top-1 STL bounding
boxes. We discuss the two methods in detail below.
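For readers unfamiliar with linear SVMs, the following is a minimal sketch of how such a classifier can be trained on (+1)/(-1) feature vectors. It uses plain sub-gradient descent on the L2-regularized hinge loss, with the decision function written as w · x − b to match the notation of Section 3.2. The optimizer, the hyperparameter values, and the toy data are assumptions made for this example; the report does not state which SVM solver was used.

```python
def decision(w, b, x):
    """Linear decision value w . x - b."""
    return sum(wj * xj for wj, xj in zip(w, x)) - b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss (illustrative only)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision(w, b, xi) < 1.0       # margin constraint not met
            w = [wj * (1.0 - lr * lam) for wj in w]        # regularization shrinkage
            if violated:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b -= lr * yi
    return w, b


# Hypothetical 2-D features: two object examples (+1) and two background examples (-1).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print(w, b, [decision(w, b, xi) for xi in X])
```

In practice the feature vectors would be the HOG descriptors of Section 4 rather than 2-D toy points.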
5.1.1 Method(1): Learning with linear SVM on SegTrack ground truth
To refine the STL ranking score, we applied a linear SVM to re-rank the top four object
proposals. Two issues arise in applying this solution. First, SVM is a two-class
discriminative learning algorithm. Second, a ground-truth dataset must be available. We
address them as follows.
First, applying SVM to re-rank the object proposals of every frame can be seen as detecting
the object proposal with the highest objectness value. In other words, it aims at
discriminating objects from non-objects in every frame. So, class (+1) is assigned to the
ground-truth object bounding box; that is, all objects are treated alike, with no
distinction between different objects. Class (-1) is assigned to everything else (any
non-object bounding box). The negative examples are extracted from the background of each
frame.
Second, we need a ground-truth dataset that provides the tight bounding box containing only
the primary object in the frame. Fortunately, the SegTrack dataset includes a segmented
ground-truth image for each frame of its videos. Figure 6 shows the process of extracting
the ground-truth objects and applying SVM on one frame. At training time, the tight
ground-truth bounding box is considered class (+1), and parts of the frame background are
considered negative examples, class (-1).
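The extraction of training boxes described above can be sketched as follows: the tight positive box is read off the binary segmentation mask, and negative boxes are taken from windows that contain no object pixels. The report does not say how background parts were sampled, so the sliding-window scheme, the window size, the stride, and the function names are assumptions made for this example.

```python
def mask_to_bbox(mask):
    """Tight box (top, left, bottom, right), inclusive, around the nonzero mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None                      # the frame contains no object pixels
    return rows[0], cols[0], rows[-1], cols[-1]


def background_boxes(mask, box_h, box_w, stride):
    """Windows that contain no object pixel, used here as class (-1) examples."""
    height, width = len(mask), len(mask[0])
    boxes = []
    for top in range(0, height - box_h + 1, stride):
        for left in range(0, width - box_w + 1, stride):
            touches_object = any(mask[r][c]
                                 for r in range(top, top + box_h)
                                 for c in range(left, left + box_w))
            if not touches_object:
                boxes.append((top, left, top + box_h - 1, left + box_w - 1))
    return boxes


# Hypothetical 6x6 ground-truth mask with a 2x2 object at rows 2-3, columns 3-4.
mask = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        mask[r][c] = 1

positive_box = mask_to_bbox(mask)                 # class (+1)
negative_boxes = background_boxes(mask, 2, 2, 2)  # class (-1)
print(positive_box, negative_boxes)
```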
5.1.2 Method(2): Learning with linear SVM on top-1 STL
This method is an alternative to Method(1). Here we also use a linear SVM to re-rank the
top four STL object proposals; however, we train the SVM on the top-1 STL bounding boxes
extracted from all frames of all videos. Figure 7 shows the process of selecting the top-1
STL bounding box as ground truth and then applying a linear SVM on all boxes. At training
time, the top-1 STL bounding box is considered class (+1), and parts of the frame's
background are considered negative examples, class (-1).
[Figure 6 image: SegTrack ground-truth object labeled +1 and background parts labeled -1.]
Figure 6: Method(1) training. The ground-truth objects are extracted from the SegTrack dataset,
then we train a linear SVM classifier on these objects in all frames of all the videos. At
training time, the ground-truth bounding box is a positive example (class (+1)), and background
parts are negative examples (class (-1)).
[Figure 7 image: top-1 STL bounding box labeled +1 and background parts labeled -1.]
Figure 7: Method(2) training. The top-1 STL bounding boxes, extracted from the SegTrack frames,
are treated as the ground-truth objects. We train a linear SVM classifier on these top-1 objects
in all frames of all the videos. At training time, the top-1 STL bounding box is a positive
example (class (+1)), and background parts are negative examples (class (-1)).
5.1.3 Testing for Phase I
After learning a linear SVM classifier with either Method(1) or Method(2), the next step is
to test the trained model on the different STL bounding boxes in each frame of each video.
Figure 8 shows the SVM model (Method(1) or Method(2)) being tested on the top four STL
bounding boxes. The highest-ranked bounding box is selected as the primary object in the
frame.
[Figure 8 image: test the SVM on the top four STL bounding boxes, then select the highest score.]
Figure 8: Testing for Phase I
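The selection step of Figure 8 can be sketched as scoring each of the top four proposals with the trained linear model and keeping the highest-scoring one. The weights, bias, and feature vectors below are hypothetical placeholders; in the report the features would be HOG descriptors of the proposal boxes.

```python
def decision(w, b, x):
    """Linear SVM decision value w . x - b."""
    return sum(wj * xj for wj, xj in zip(w, x)) - b


def rerank_proposals(proposal_features, w, b):
    """Return the index of the best proposal and the SVM score of every proposal."""
    scores = [decision(w, b, x) for x in proposal_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical features of the STL Top-1 to Top-4 proposals of one frame.
top_four = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.3, 0.3]]
w, b = [1.0, -1.0], 0.0          # placeholder model, not trained values
best_index, scores = rerank_proposals(top_four, w, b)
print(best_index, scores)
```

Here the second proposal (index 1) wins even though STL ranked it below the first, which is the situation illustrated in Figure 5.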
5.2 Learning and Inference for Phase II
After identifying and localizing the primary object in each frame (the output of phase I), the
next step is classification of the primary objects across all frames of the video using a
Transductive Support Vector Machine (TSVM). This learning problem is treated as detecting
similarity between objects. There are two main issues with learning the similarity between
frames of the same video: 1) there are few labeled samples in the training data, especially
positive samples; 2) the testing data are part of the training data. The advantage of
using TSVM is that it is trained on both labeled and unlabeled data, and the unlabeled data
are part of the test data. In other words, TSVM extends SVM to partially labeled data
in semi-supervised learning. Figure 9 shows the learning and inference process using
TSVM. The primary object in the first frame of the video is considered the only positive
example (shown as a red rectangle, class (+1)); the negative examples are extracted from the
background parts of the video's frames; and the primary objects in the other frames (shown
as yellow rectangles) are considered unlabeled data and are assigned the value zero
(unlabeled).
[Figure 9 image: examples labeled -1, +1, and 0 (unlabeled).]
Figure 9: Learning and inference using Transductive Support Vector Machine (TSVM) for phase II.
The primary object in the first frame of the video is the only positive example (shown as a red
rectangle, class (+1)); the negative examples are extracted from the background parts; and the
primary objects in the other frames (shown as yellow rectangles) are the unlabeled data.
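The labeling scheme of Figure 9 can be written down as a small sketch that assembles the TSVM input for one video: one positive example, the remaining primary objects marked 0 (unlabeled), and the background parts marked -1. The function name and the toy feature vectors are hypothetical; only the +1 / 0 / -1 convention comes from the text above.

```python
def build_tsvm_examples(primary_features, background_features):
    """Pair each feature vector with its TSVM label for one video.

    primary_features: one vector per frame (the phase I output), in frame order.
    background_features: vectors extracted from background parts of the frames.
    """
    examples = []
    for frame_index, x in enumerate(primary_features):
        label = 1 if frame_index == 0 else 0     # only the first frame is labeled +1
        examples.append((label, x))
    for x in background_features:
        examples.append((-1, x))                 # background parts are negatives
    return examples


# Hypothetical features: three frames and two background parts.
primary = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7]]
background = [[0.1, 0.9], [0.0, 1.0]]
examples = build_tsvm_examples(primary, background)
print([label for label, _ in examples])
```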
6 Experimental Results
In this section, we discuss the overall performance of our proposed two-phase approach,
including the two proposed methods, “Method(1)” and “Method(2)” (with and without
TSVM), and we also report experimental results for the other possible methods.
Before discussing the details in the next subsections, we first provide a legend for the
names of the different proposed and implemented methods:
• “Method(1) with TSVM”: This method uses a linear SVM classifier to re-rank
the top four STL bounding boxes in phase I. The ground-truth training data
are the segmented objects included in the SegTrack dataset. Then, in phase II, a TSVM
classifier is applied to find similar objects in each video.
• “Method(2) with TSVM”: This method uses a linear SVM classifier to re-rank the
top four STL bounding boxes in phase I. The ground-truth training data are the
top-1 STL bounding boxes according to the ranking of STL. Then, in phase II, a TSVM
classifier is applied to find similar objects in each video.
• “Method(1) without TSVM”: This method uses only a linear SVM classifier to
re-rank the top four STL bounding boxes. The ground-truth training data are the
segmented objects included in the SegTrack dataset. There is no phase II.
• “Method(2) without TSVM”: This method uses only a linear SVM classifier to
re-rank the top four STL bounding boxes. The ground-truth training data are the
top-1 STL bounding boxes according to the ranking of STL. There is no phase II.
• “Method(3)”: This method uses the top-1 ranked bounding box generated by STL
directly, without any extra learning process. We use it to measure the effect of
the linear SVM classifier trained on top-1 STL bounding boxes, as proposed in
“Method(2)”.
6.1 Performance of SVM classifier
In this part, we show some results evaluating the SVM (Method(1) in Phase I) used to re-rank
the STL bounding boxes. Figure 10 shows that applying SVM to re-rank the top four STL
bounding boxes and then selecting the box with the highest score gives much better results
on most of the videos than using the Top-1 STL bounding box alone. The average error rate
is 64% for the Top-1 STL box and 42% for the top box after SVM re-ranking.
It should be noted that there is no common ground truth used by researchers for this case;
the true labels were generated by manually annotating the bounding boxes. Although this
may be inaccurate, and many researchers may be skeptical of this approach, we report
these results only to give an approximate overview of the effect of using SVM to re-rank
the STL bounding boxes. For accurate results, see subsection 6.3, 'Overall performance and
quantitative comparison'.
Figure 10: Testing error rates using the STL Top-1 bounding box vs. the proposed SVM re-ranking
method applied on the top four STL bounding boxes (Method(1)). Error rates are reported for each
video.
6.2 Performance of TSVM classifier
In this part, we show evaluation results for TSVM (used in Phase II). Figures 11-16 show
the Precision-Recall (PR) curves for each video, for both TSVM applied on Method(1) and
TSVM applied on Method(2). A PR curve reflects the relative proportions of positive and
negative samples directly. The PR curves of the two methods are shown side by side so that
they can be compared easily. At the top of each plot we report the Area Under the Curve
(AUC), which summarizes the overall quality of the ranking in terms of precision and
recall. The AUC is obtained by trapezoidal interpolation of the precision. An alternative
and usually almost equivalent metric is the Average Precision (AP), also shown at the
top of each plot. This is the average of the precision obtained every time a new positive
sample is recalled. It is the same as the AUC if the precision is interpolated by constant
segments. Additionally, we show the 11-point interpolated average precision (AP11) at the
top of each plot. This is obtained by taking the average of eleven precision values [9].
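The three summary numbers described above can be sketched as follows. The AP and the trapezoidal AUC follow the definitions given in the text; two details are assumptions of this example and not stated in the report: the curve is started at recall 0 with precision 1, and AP11 uses the common convention of taking, at each recall level 0, 0.1, ..., 1, the maximum precision among points with at least that recall. The toy labels and scores are hypothetical.

```python
def pr_curve(labels, scores):
    """Precision and recall after each ranked sample, plus the ranked hit flags."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for label in labels if label > 0)
    precision, recall, hits, tp = [], [], [], 0
    for rank, i in enumerate(order, start=1):
        hit = labels[i] > 0
        tp += hit
        hits.append(hit)
        precision.append(tp / rank)
        recall.append(tp / n_pos)
    return precision, recall, hits


def average_precision(precision, hits):
    """Mean of the precision values at the ranks where a positive is recalled."""
    values = [p for p, hit in zip(precision, hits) if hit]
    return sum(values) / len(values)


def auc_trapezoid(precision, recall):
    """Trapezoidal area under the PR curve, starting from (recall 0, precision 1)."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for p, r in zip(precision, recall):
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area


def ap11(precision, recall):
    """Average of eleven interpolated precision values at recall 0, 0.1, ..., 1."""
    total = 0.0
    for k in range(11):
        level = k / 10.0
        total += max((p for p, r in zip(precision, recall) if r >= level), default=0.0)
    return total / 11.0


# Hypothetical TSVM outputs: ranked as positive, negative, positive, negative.
labels, scores = [1, -1, 1, -1], [0.9, 0.8, 0.7, 0.1]
precision, recall, hits = pr_curve(labels, scores)
print(average_precision(precision, hits), auc_trapezoid(precision, recall),
      ap11(precision, recall))
```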
It should be noted that comparing Method(1) and Method(2) using the TSVM evaluation alone
is not meaningful. For example, if phase I generates low-quality bounding boxes and TSVM is
then applied to them, the final results will not improve, regardless of how good the TSVM
classifier is. If phase I generates good (high-quality) bounding boxes, we still need a
good TSVM classifier to obtain good final results. In summary, a good TSVM classifier is
essential but does not guarantee good final results, which depend on the combined results
of phase I and phase II.
In this part, we report the TSVM performance results only to show the quality of the
classifier. The overall results in Section 6.3 should be used instead for a reasonable
comparison between the two methods. Additionally, it should be noted that the testing
labels used in this evaluation are generated automatically: a label is considered true if
at least 40% of the ground-truth object overlaps with the detected (test) object. Although
this may not be accurate, it provides an approximate measure of the performance of the
TSVM classifier.
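The automatic labeling rule above can be sketched as a simple coverage test. This example reads the rule as "the detected box covers at least 40% of the ground-truth area" and represents both objects as boxes; the report's ground truth is a pixel-level segmentation, so the box representation, the function name, and the coordinate convention are assumptions.

```python
def covers_ground_truth(gt_box, det_box, min_fraction=0.4):
    """True if det_box covers at least min_fraction of the gt_box area.

    Boxes are (top, left, bottom, right) with inclusive pixel coordinates.
    """
    top = max(gt_box[0], det_box[0])
    left = max(gt_box[1], det_box[1])
    bottom = min(gt_box[2], det_box[2])
    right = min(gt_box[3], det_box[3])
    intersection = max(0, bottom - top + 1) * max(0, right - left + 1)
    gt_area = (gt_box[2] - gt_box[0] + 1) * (gt_box[3] - gt_box[1] + 1)
    return intersection / gt_area >= min_fraction


gt = (0, 0, 9, 9)                                   # 10 x 10 ground-truth box
print(covers_ground_truth(gt, (0, 0, 9, 3)))        # covers 40 of 100 pixels
print(covers_ground_truth(gt, (0, 0, 9, 2)))        # covers 30 of 100 pixels
print(covers_ground_truth(gt, (20, 20, 30, 30)))    # no overlap
```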
In case of video 'Girl': Figure 11 shows that TSVM applied on Method(2) generally performs
better than TSVM applied on Method(1). The reason is that the bounding boxes (primary
objects) selected by Method(2) are very similar to each other in both appearance and size.
Although this is a point in favor of Method(2), it does not guarantee better overall
performance in terms of the average pixel error metric used with the SegTrack dataset,
because this is an evaluation of TSVM performance only, regardless of the quality of the
primary objects selected in Phase I.
[Figure 11 image: two precision-recall plots (x-axis: recall, y-axis: precision; curves: PR and
PR rand.), with panel titles "PR (AUC: 53.95%, AP: 54.11%, AP11: 58.07%)" and
"PR (AUC: 40.48%, AP: 41.47%, AP11: 43.75%)".]
Figure 11: Precision-Recall (PR) curves of video 'Girl'. On the left, PR curve of TSVM testing
results applied on Method(1). On the right, PR curve of TSVM testing results applied on
Method(2).
In case of video 'Birdfall2': Figure 12 shows that TSVM applied on Method(2) also generally
performs better than TSVM applied on Method(1), and here the difference is significant. A
closer look at the final output shows that most of the bounding boxes generated by
Method(2) are very large and fill almost the whole image. The TSVM classification task
thus becomes, in effect, a classification between whole images, which are all similar
because they come from the same input video; this results in the good performance shown
in the PR curve.
[Figure 12 image: two precision-recall plots (x-axis: recall, y-axis: precision; curves: PR and
PR rand.), with panel titles "PR (AUC: 86.93%, AP: 87.21%, AP11: 86.78%)" and
"PR (AUC: 52.07%, AP: 53.80%, AP11: 54.54%)".]
Figure 12: Precision-Recall (PR) curves of video 'Birdfall2'. On the left, PR curve of TSVM
testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on
Method(2).
In case of video 'Cheetah': Figure 13 shows that TSVM applied on Method(1) generally
performs better than TSVM applied on Method(2). Unlike in the other videos, the primary
objects discovered in Phase I for this video are small, contained in tight bounding boxes,
and moving through the scene.
[Figure 13 image: two precision-recall plots (x-axis: recall, y-axis: precision; curves: PR and
PR rand.), with panel titles "PR (AUC: 53.89%, AP: 54.87%, AP11: 62.81%)" and
"PR (AUC: 60.70%, AP: 61.45%, AP11: 64.55%)".]
Figure 13: Precision-Recall (PR) curves of video 'Cheetah'. On the left, PR curve of TSVM
testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on
Method(2).
In case of video ’monkeydog’: Figure 14 shows that TSVM applied on Method(2)
performs better than TSVM applied on Method(1) in general.
[Figure 14 image: two precision-recall plots (x-axis: recall, y-axis: precision; curves: PR and
PR rand.), with panel titles "PR (AUC: 59.28%, AP: 59.76%, AP11: 62.14%)" and
"PR (AUC: 50.13%, AP: 50.87%, AP11: 53.32%)".]
Figure 14: Precision-Recall (PR) curves of video 'Monkeydog'. On the left, PR curve of TSVM
testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on
Method(2).
In case of video ’parachute’: Figure 15 shows that TSVM applied on Method(1) performs
slightly better than TSVM applied on Method(2) in general.
[Figure 15 image: two precision-recall plots (x-axis: recall, y-axis: precision; curves: PR and
PR rand.), with panel titles "PR (AUC: 59.36%, AP: 59.57%, AP11: 61.80%)" and
"PR (AUC: 59.99%, AP: 60.45%, AP11: 60.34%)".]
Figure 15: Precision-Recall (PR) curves of video 'Parachute'. On the left, PR curve of TSVM
testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on
Method(2).
In case of video ’penguin’: Figure 16 shows that TSVM applied on Method(2) performs
better than TSVM applied on Method(1) in general.
[Figure 16 image: two precision-recall plots (x-axis: recall, y-axis: precision; curves: PR and
PR rand.), with panel titles "PR (AUC: 59.36%, AP: 59.57%, AP11: 61.80%)" and
"PR (AUC: 42.60%, AP: 43.39%, AP11: 48.62%)".]
Figure 16: Precision-Recall (PR) curves of video 'Penguin'. On the left, PR curve of TSVM
testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on
Method(2).
6.3 Overall performance and quantitative comparison
For the SegTrack dataset [7] used in this work, the overall performance is measured by the
average per-frame pixel error with respect to the ground truth. Lower error values are
better. The average per-frame pixel error is calculated as:
\[ \mathrm{Error} = \frac{\mathrm{XOR}(GT, f)}{F} \]
where $f$ is the segmentation labeling produced by the method, $GT$ is the ground-truth
labeling of the video, and $F$ is the number of frames in the video. The XOR term counts
the pixels, over all frames, at which the two labelings disagree.
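The error measure can be sketched directly from its definition: count the pixels where the predicted and ground-truth labelings differ in every frame, then divide by the number of frames. The nested-list mask format and the function name are assumptions of this example; since the proposed methods output bounding boxes, the predicted mask of a frame is assumed here to be the filled box.

```python
def average_pixel_error(predicted_masks, ground_truth_masks):
    """Average number of mismatched (XOR) pixels per frame.

    Both arguments are lists of frames; each frame is a 2-D list of 0/1 labels.
    """
    mismatched = 0
    for pred, gt in zip(predicted_masks, ground_truth_masks):
        for pred_row, gt_row in zip(pred, gt):
            mismatched += sum(1 for p, g in zip(pred_row, gt_row) if bool(p) != bool(g))
    return mismatched / len(ground_truth_masks)


# Hypothetical two-frame video of 2x2 pixels: one wrong pixel in the first frame.
predicted = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
ground_truth = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
print(average_pixel_error(predicted, ground_truth))
```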
Table 1 shows the average number of error pixels per frame for “Method(1) with TSVM”,
“Method(2) with TSVM”, and other approaches from the literature. In this comparison, we
used the full implementation of our proposed methods, including all the steps described
earlier for phase I and phase II.

Video        Method(1)    Method(2)    Tsai et al.   Chockalingam    Zhang et al.
             with TSVM    with TSVM    [7]           et al. [2]      [10]
Girl         2113         3452         1304*         1755            1488
Birdfall2    1259         4204         252           454             155*
Cheetah      1013         2442         1142          1217            633*
Monkeydog    1094         2254         563           683             365*
Parachute    1457         7016         235           502             220*
Penguin      1890         3095         1705*         6627            1895

Table 1: Average number of error pixels per frame for different approaches. Lower values are
better. The best value in each row is marked with an asterisk. Some of our results are close to
the best values, and some are better than those of other approaches without being the best.

Comparison between “Method(1) with TSVM” and “Method(1) without TSVM”
In this part, we compare the performance of Method(1) with TSVM (as proposed in the earlier
sections, phase I + phase II) against Method(1) without TSVM (using SVM only, in phase I).
In other words, we measure the benefit of the transductive learning proposed in phase II.
To do so, we use the same average number of error pixels per frame measure; for the variant
without TSVM, it is calculated on the output of the SVM classifier applied to the top four
STL object proposals, as described in Section 5.1.3. Table 2 shows the average number of
error pixels per frame for “Method(1) with TSVM” and “Method(1) without TSVM”. These
results show that “Method(1) with TSVM” outperforms “Method(1) without TSVM” in five out
of six cases (videos: Birdfall2, Cheetah, Monkeydog, Parachute, and Penguin).
Video        Method(1) with TSVM    Method(1) without TSVM
Girl         2113                   1600*
Birdfall2    1259*                  2504
Cheetah      1013*                  1136
Monkeydog    1094*                  1427
Parachute    1457*                  1792
Penguin      1890*                  2495

Table 2: Average number of error pixels per frame of “Method(1) with TSVM” against “Method(1)
without TSVM”. Lower values are better; the better value in each row is marked with an asterisk.
Comparison between “Method(2) with TSVM” and “Method(2) without TSVM”
In this part, we compare the performance of Method(2) with TSVM (as proposed in the earlier
sections, phase I + phase II) against Method(2) without TSVM (using SVM only, in phase I).
Here too we use the average number of error pixels per frame measure; for the variant
without TSVM, it is calculated on the output of the SVM classifier applied to the top four
STL object proposals, with the top-1 bounding box taken as ground truth, as described in
Section 5.1.2. Table 3 shows the average number of error pixels per frame for “Method(2)
with TSVM” and “Method(2) without TSVM”. These results show that “Method(2) with TSVM”
outperforms “Method(2) without TSVM” in all six cases (videos: Girl, Birdfall2, Cheetah,
Monkeydog, Parachute, and Penguin).
Video        Method(2) with TSVM    Method(2) without TSVM
Girl         3452*                  4849
Birdfall2    4204*                  7084
Cheetah      2442*                  5328
Monkeydog    2254*                  2937
Parachute    7016*                  8532
Penguin      3095*                  4017

Table 3: Average number of error pixels per frame of “Method(2) with TSVM” against “Method(2)
without TSVM”. Lower values are better; the better value in each row is marked with an asterisk.
Comparison between “Method(2) without TSVM” and “Method(3) without TSVM”
In this part, we explore the benefit of a linear SVM learned on the Top-1 STL bounding
boxes versus Method(3), which involves no learning and directly takes the top-1 STL
bounding box as the primary object in the current frame.
We compare the performance of Method(2) without TSVM (as proposed in the earlier sections,
phase I) against Method(3). In other words, we measure the benefit of the linear SVM
learned on top-1 STL bounding boxes as proposed in phase I. We again use the average
number of error pixels per frame measure; for Method(2) it is calculated on the output of
the SVM classifier applied to the top four STL object proposals, as described in
Section 5.1.3. Table 4 shows the average number of error pixels per frame for “Method(2)
without TSVM” and “Method(3) without TSVM”. These results show that “Method(2) without
TSVM” outperforms “Method(3) without TSVM” in four out of six cases (videos: Birdfall2,
Cheetah, Monkeydog, and Penguin).
Video        Method(2) without TSVM    Method(3) without TSVM
Girl         4849                      4706*
Birdfall2    7084*                     7982
Cheetah      5328*                     6809
Monkeydog    2937*                     4040
Parachute    8532                      6919*
Penguin      4017*                     6311

Table 4: Average number of error pixels per frame of “Method(2) without TSVM” against
“Method(3) without TSVM”. Lower values are better; the better value in each row is marked with
an asterisk.
6.4 Qualitative Comparison
This part provides some visualizations and analysis of the final output of both “Method(1)
with TSVM” and “Method(2) with TSVM”. Figure 17 shows selected frames for each method
after applying SVM in phase I and TSVM in phase II. Despite the improvement achieved by
“Method(2)” over “Method(3)” (using the top-1 STL bounding box without learning an SVM),
the output still shows that “Method(2)” tends to generate bigger bounding boxes than
“Method(1)”. This is reasonable because “Method(1)” was trained on a much better ground
truth than “Method(2)”, which uses the top-1 STL bounding box as a reference; that box may
be inaccurate in many cases. As a result of selecting bigger bounding boxes, the average
number of error pixels per frame of “Method(2)” is greater than that of “Method(1)” on the
dataset videos, as shown earlier in Table 1.
[Figure 17 image: output frames t and t' for Method(1) with TSVM and for Method(2) with TSVM.]
Figure 17: Selected output frames with the final discovered objects for “Method(1) with TSVM”
(on the left) and “Method(2) with TSVM” (on the right).
7 Conclusions
The experimental results in the previous section show that “Method(1) with TSVM” is the
best of the approaches proposed in this work. Its performance is also close to that of
some state-of-the-art methods on several videos. Additionally, we showed that using TSVM
as proposed in phase II improves performance over using SVM only, as proposed in phase I,
in most cases. We also showed that the learning process conducted with SVM in phase I
(whether in Method(1) or Method(2)) is essential: it improved performance over using only
the top-1 STL bounding box as the primary object in most of the videos.
References
[1] Alessandro Bergamo, Loris Bazzani, Dragomir Anguelov, and Lorenzo Torresani.
Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964, 2014.
[2] Prakash Chockalingam, Nalin Pradeep, and Stan Birchfield. Adaptive fragments-based
tracking of non-rigid objects using level sets. In IEEE International Conference on
Computer Vision (ICCV), pages 1530–1537. IEEE, 2009.
[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
volume 1, pages 886–893. IEEE, 2005.
[4] Thorsten Joachims. Transductive inference for text classification using support vector
machines. In ICML, volume 99, pages 200–209, 1999.
[5] Yong Jae Lee, Jaechul Kim, and Kristen Grauman. Key-segments for video object
segmentation. In IEEE International Conference on Computer Vision (ICCV), pages
1995–2002. IEEE, 2011.
[6] David Tsai, Matthew Flagg, and James M. Rehg. Motion coherent tracking with multi-label
mrf optimization. BMVC, 2010. http://cpl.cc.gatech.edu/projects/SegTrack/.
[7] David Tsai, Matthew Flagg, Atsushi Nakazawa, and James M Rehg. Motion coherent
tracking using multi-label mrf optimization. International journal of computer vision,
100(2):190–202, 2012.
[8] A. Vedaldi and B. Fulkerson. Vlfeat: An open and portable library of computer vision
algorithm. In http://www.vlfeat.org/.
[9] Dong Zhang, Omar Javed, and Mubarak Shah. Video object segmentation through
spatially accurate and temporally dense extraction of primary object regions. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 628–635. IEEE,
2013.