Bag-of-Audio-Words Feature Representation Using GMM Clustering
for Sound Event Classification
Hyungjun Lim, Myung Jong Kim, and Hoirin Kim
Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
{hyungjun.lim, myungjong, hoirkim}@kaist.ac.kr
Abstract
This paper addresses the problem of sound event
classification, focusing on feature representation methods.
Sound events such as screaming and glass breaking show
distinctive temporal and spectral characteristics. Therefore,
extracting appropriate features to properly represent these
characteristics is important for achieving good
performance. In this paper, we employ bag-of-audio-words
feature representation, which is a histogram representation
of frame-based features, to characterize the time-frequency
patterns in the long-range segment of a sound event. In the
method, Gaussian mixture model-based clustering is
adopted to deal with the inconsistent dynamic range among
frame-based features. Test sounds are classified by using a
support vector machine. The proposed method is evaluated
on a database of several hundred audio clips for fifteen
sound events, and the classification results show over 41%
relative improvement compared to the conventional bag-of-audio-words representation method.
Keywords: Bag-of-audio-words, Gaussian mixture model
(GMM) clustering, sound event classification.
1. Introduction
Sound events are good descriptors in recognizing and
understanding circumstances. In an audio surveillance
application, for example, sound events such as screaming or
explosion may indicate a dangerous situation whereas sound
events such as conversation or music may imply a normal
condition. Hence, a sound event classification method that
produces highly accurate results is very useful in various
applications such as audio surveillance [1, 2, 3], health-care
monitoring [4], and military applications [5].
In general, such sound events show distinctive temporal
and spectral characteristics [5]. Therefore, developing a
feature representation method that properly describes the
characteristics of each sound event is very important for
improving classification accuracy.
Sound event classification has conventionally been performed
using general audio features such as MPEG-7 low-level features (LLFs) [1], linear-frequency cepstral
coefficients (LFCCs) [2], Mel-frequency cepstral
coefficients (MFCCs) [4], and their combinations [3, 5].
Kim and Kim [7] proposed segmental two-dimensional
MFCCs which are based on two-dimensional discrete
cosine transform to capture temporal and spectral
characteristics of a sound event. Dennis et al. [6] utilized
image processing based techniques such as pseudo-coloring
and partitioning in a spectrogram to overcome the noise
sensitivity of MFCC. Lee et al. [8] employed angular radial
transform to extract spectrogram shape features within a
birdsong segment.
In recent years, the bag-of-audio-words (BoAW) feature
representation, which is a histogram of frame-based audio
features such as LLFs over a long-term segment rather than
the frame-based features themselves, has been successfully
applied to sound event classification [9, 10], since the
histogram is suitable for describing the global characteristics
of a sound event. In this method, k-means clustering based on
the Euclidean distance measure is generally used to construct
the histogram. However, since the dynamic range of each
frame-based feature is diverse and inconsistent, the clustering
result is subject to bias. To overcome this drawback, this
paper presents a sound event classification method, focusing
particularly on BoAW
feature representation using Gaussian mixture model
(GMM)-based clustering, which considers the dynamic
range of each feature. A support vector machine (SVM)
classifier is used to identify the class of a test sound among
fifteen sound event classes.
The remainder of the paper is organized as follows: The
conventional BoAW feature representation is described in
Section 2. In Section 3, we present the proposed
distribution-based clustering method. Section 4 shows the
experiments and finally, our conclusions are summarized in
Section 5.
2. BoAW feature representation
The block diagram of BoAW feature representation is
shown in Figure 1.

Figure 1: Block diagram of BoAW feature representation (the dotted box is used only in the training phase).

First, the frame-based features are
extracted from each sound clip. For each frame-based feature
vector, the cluster whose centroid is closest to the vector is
selected. Note that the clusters are obtained by k-means
clustering based on the Euclidean distance measure only in
the training phase. Given a set of d-dimensional frame-based
feature vectors $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$, k-means clustering aims
to partition the $N$ feature vectors into $k$ sets
$S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster
sum of squares, i.e.,

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \left\| \mathbf{x} - \boldsymbol{\mu}_i^{\mathrm{kmeans}} \right\|^2 \qquad (1)$$

where $\boldsymbol{\mu}_i^{\mathrm{kmeans}}$ is the centroid of the $i$-th cluster obtained
by k-means clustering. Finally, the BoAW feature vector is
obtained by constructing a histogram of the selected clusters
over the sound clip,

$$\mathbf{F}_{\mathrm{BoAW\_kmeans}} = \sum_{l=1}^{L} \left[ \delta(\mu'_l, 1), \delta(\mu'_l, 2), \ldots, \delta(\mu'_l, k) \right]^{T} \qquad (2)$$

where $\mu'_l$ is the index of the cluster selected for the $l$-th
frame-based feature, $L$ is the total number of frames in the
sound clip, and $\delta(\cdot)$ is the Kronecker delta function. As a
result, the BoAW feature representation accumulates all the
frame-based features in a sound clip, so it is useful for
capturing the global time-frequency characteristics of a
sound event.
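To make the procedure concrete, the sketch below builds the hard-assignment histogram of Eqs. (1)-(2) with scikit-learn's k-means; the function names, the choice of scikit-learn, and the optional normalization are illustrative assumptions rather than details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(train_frames, k=512, seed=0):
    """Training phase: learn k centroids (audio words) from the
    frame-based feature vectors pooled over all training clips."""
    # train_frames: (total_num_frames, d) array of frame-based features
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(train_frames)

def boaw_kmeans(clip_frames, codebook):
    """Eq. (2): count how many frames of one clip fall into each cluster."""
    # clip_frames: (L, d) frame-based features of a single sound clip
    labels = codebook.predict(clip_frames)               # nearest centroid per frame
    hist = np.bincount(labels, minlength=codebook.n_clusters)
    return hist.astype(float)                            # optionally divide by L
```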
3. GMM clustering-based BoAW representation
In the conventional BoAW feature representation
described in Section 2, the Euclidean distance-based k-
means clustering method is generally used. However, since
the frame-based features have diverse dynamic ranges, the
features with wide dynamic ranges dominate the clustering
results. For example, the dynamic range of the short-time
energy is much broader than that of the zero-crossing rate.
Therefore, to tackle this disadvantage of k-means clustering,
we propose a BoAW feature representation based on GMM
clustering, which is one of the widely used distribution-based
clustering methods [11].
GMM clustering is a kind of soft clustering that uses
probabilities instead of the occurrence counts used in k-means
clustering. It can effectively compensate for the various
dynamic ranges of the frame-based features by using the
posterior probability of each Gaussian component. More
specifically, each Gaussian component in the GMM plays the
role of a cluster in k-means clustering, so the distances
between the frame-based features and each centroid are
replaced by the posterior probabilities of each Gaussian.
Let the GMM have $M$ Gaussian components; then the
posterior probability of the $m$-th Gaussian component is
obtained as

$$p(m \mid \mathbf{x}) = \frac{w_m \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_m^{\mathrm{GMM}}, \boldsymbol{\Sigma}_m)}{\sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i^{\mathrm{GMM}}, \boldsymbol{\Sigma}_i)} \qquad (3)$$

where $\mathbf{x}$ is the frame-based feature vector, and
$\boldsymbol{\mu}_m^{\mathrm{GMM}}$, $\boldsymbol{\Sigma}_m$, and $w_m$ are the mean vector, covariance
matrix, and mixture weight of the $m$-th Gaussian component,
respectively. The GMM is trained using the
expectation-maximization (EM) algorithm [12]. Then the
BoAW feature vector $\mathbf{F}_{\mathrm{BoAW\_GMM}}$, which is the histogram of
the frame-based features in the sound clip, is obtained by
summing the posterior probabilities over all frames in the
sound clip as
$$\mathbf{F}_{\mathrm{BoAW\_GMM}} = \sum_{l=1}^{L} \left[ p(m=1 \mid \mathbf{x}_l), \, p(m=2 \mid \mathbf{x}_l), \, \ldots, \, p(m=M \mid \mathbf{x}_l) \right]^{T}. \qquad (4)$$

Consequently, the BoAW feature representation based on
distribution-based clustering may be more appropriate than
the conventional method, since it compensates for the
inconsistent dynamic range of each feature when capturing
the distinct characteristics of sound events.
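A minimal sketch of the soft histogram of Eqs. (3)-(4) follows, assuming scikit-learn's GaussianMixture (fitted with EM) stands in for the GMM clustering; the diagonal covariance choice and the function names are assumptions, not specifications from the paper.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(train_frames, num_components=512, seed=0):
    """Training phase: fit an M-component GMM with EM on pooled frame features."""
    gmm = GaussianMixture(n_components=num_components,
                          covariance_type='diag', random_state=seed)
    return gmm.fit(train_frames)

def boaw_gmm(clip_frames, gmm):
    """Eq. (4): sum the component posteriors p(m | x_l) over all L frames."""
    posteriors = gmm.predict_proba(clip_frames)   # (L, M), rows given by Eq. (3)
    return posteriors.sum(axis=0)                 # soft histogram of length M
```

Unlike the hard assignment of Section 2, every frame contributes a full posterior vector, so frame-based features with narrow dynamic ranges still influence the resulting histogram.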
4. Experiments

4.1 Experimental setup

In order to evaluate the proposed methods, we used fifteen
classes of sound events, consisting of car crashing, crying,
dog barking, explosion, glass breaking, screaming, air
conditioner, bird song, conversation, car horn, motorcycle,
music, raining, ambulance siren, and wind, which were
collected from various sound-effect libraries and the Web.
Since the duration of a target sound varies, sound clips were
made with variable lengths of about 1-8 sec. Table 1
describes the data in terms of the number of clips per sound
event class, the total duration, and the average clip duration.
All sound clips were digitized at 16 bits per sample with a
48 kHz sampling rate in a mono channel.

Table 1: Configurations of the database

| Category | Class | # Clips | Total duration (sec) | Avg. clip duration (Std.) (sec) |
|----------|-------|---------|----------------------|---------------------------------|
| Abnormal | Car crashing | 36 | 154.9 | 4.3 (±2.0) |
| Abnormal | Crying | 66 | 311.4 | 4.7 (±1.0) |
| Abnormal | Dog barking | 81 | 372.6 | 4.6 (±1.6) |
| Abnormal | Explosion | 64 | 280.7 | 4.4 (±1.7) |
| Abnormal | Glass breaking | 103 | 233.3 | 2.3 (±1.3) |
| Abnormal | Screaming | 115 | 228.7 | 2.0 (±0.9) |
| Normal | Air conditioner | 68 | 333.6 | 4.9 (±0.3) |
| Normal | Bird song | 92 | 355.4 | 3.9 (±1.4) |
| Normal | Conversation | 48 | 240.0 | 5.0 (±0.0) |
| Normal | Car horn | 96 | 199.5 | 2.1 (±1.2) |
| Normal | Motorcycle | 58 | 292.0 | 5.0 (±0.5) |
| Normal | Music | 72 | 360.0 | 5.0 (±0.0) |
| Normal | Raining | 65 | 324.2 | 5.0 (±0.1) |
| Normal | Ambulance siren | 68 | 322.1 | 4.7 (±0.5) |
| Normal | Wind | 56 | 350.1 | 4.9 (±0.4) |
To show the effectiveness of the proposed method, we
evaluated the performances of the LLF, LFCC, and MFCC
with the k-means and GMM clustering methods. The LLF
consisted of a short-time energy, zero-crossing rate, spectral
centroid, spectral bandwidth, sub-band energy, sub-band
energy ratio, spectral flux, spectral flatness, and spectral
roll-off. All the features were extracted from a short frame
of 25 msec with 50% overlap. For clustering, we used 128,
256, and 512 clusters and Gaussians for the k-means and
GMM clustering, respectively. A 5-fold cross validation
was performed, with the database split randomly into five
equal-sized subsets, for reliable results. The classifier we
used was a support vector machine (SVM) with a linear
kernel [13].
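The following sketch mirrors this setup, assuming librosa for frame-based MFCC extraction and scikit-learn for the linear-kernel SVM with 5-fold cross-validation; the 13-coefficient MFCC and the exact API choices are assumptions, not details given in the paper.

```python
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

SR = 48000                  # 48 kHz sampling rate, mono
FRAME = int(0.025 * SR)     # 25 msec analysis frame
HOP = FRAME // 2            # 50% overlap

def mfcc_frames(path):
    """Frame-based MFCC features for one clip, frames as rows: (L, n_mfcc)."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13,
                                n_fft=FRAME, hop_length=HOP)
    return mfcc.T

def evaluate(boaw_vectors, labels):
    """Simplified 5-fold cross-validation of a linear-kernel SVM
    on clip-level BoAW vectors."""
    clf = SVC(kernel='linear')
    return cross_val_score(clf, boaw_vectors, labels, cv=5).mean()
```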
From the application point of view, we performed two
additional experiments: distant environments and a
surveillance scenario. First, to generate an additional distant
sound database, each sound clip was re-recorded by playing
the original recording back on a loudspeaker at a distance of
1 m or 10 m in a quiet outdoor environment. Second, to perform the
experiments under the surveillance scenario, the fifteen
classes of sound events were categorized into two classes:
abnormal and normal. The abnormal class consists of car
crashing, crying, dog barking, explosion, glass breaking,
and screaming, and others were mapped into the normal
class as shown in Table 1.
4.2 Experimental results

4.2.1 Effectiveness of GMM clustering
Table 2 shows the performance comparison between the
proposed GMM clustering and k-means clustering-based
BoAW feature representation with the various frame-based
features in terms of the average classification accuracy (CA)
using the original database. Here, the CA was averaged across
5-fold experiments. These results show that the GMM
clustering outperformed the conventional k-means
clustering in most cases, especially obtaining a 55.9%
relative improvement when using the LLF as frame features
and 256 clusters. Note that the relative improvement is
computed by
$$\mathrm{ERR}(\%) = \frac{\mathrm{CER}_{\mathrm{baseline}} - \mathrm{CER}_{\mathrm{proposed}}}{\mathrm{CER}_{\mathrm{baseline}}} \times 100\% \qquad (5)$$

where ERR(%) and CER denote the error reduction rate and
the classification error rate, respectively. This implies that
GMM clustering is more suitable for the BoAW framework
because it effectively deals with the various dynamic ranges
of the frame features. We can also observe that the MFCC is
superior to the LLF and LFCC as frame-based features,
showing 65.0% and 70.1% relative improvements,
respectively, when using 512 clusters with the GMM
clustering method. This indicates that the MFCC is more
effective at expressing the characteristics of sound events in
the BoAW method. Therefore, frame-based MFCC features
and 512 clusters were used as the default setting in the
following experiments.
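As a quick sanity check of Eq. (5), taking the best MFCC results in Table 2 (92.5% for k-means clustering and 95.6% for GMM clustering, both with 512 clusters) gives

$$\mathrm{ERR} = \frac{(100 - 92.5) - (100 - 95.6)}{100 - 92.5} \times 100\% = \frac{7.5 - 4.4}{7.5} \times 100\% \approx 41.3\%,$$

which is consistent with the "over 41% relative improvement" quoted in the abstract.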
Table 2: Average classification accuracies (%) of the various features according to the number of clusters of the k-means and GMM clustering methods

| Frame-based features | k-means, 128 | k-means, 256 | k-means, 512 | GMM, 128 | GMM, 256 | GMM, 512 |
|----------------------|--------------|--------------|--------------|----------|----------|----------|
| LLF  | 67.6 | 67.6 | 65.6 | 85.4 | 85.7 | 87.1 |
| LFCC | 78.7 | 81.7 | 76.2 | 81.5 | 83.3 | 85.9 |
| MFCC | 90.2 | 91.5 | 92.5 | 93.3 | 95.0 | 95.6 |

We analyzed the best classification results (MFCC with the
GMM of 512 clusters) using the confusion matrix shown in
Table 3. As can be seen, most classes show very little
confusion, with the exception of the car crashing class. This
result can be interpreted from two points of view: insufficient
data and/or the complex characteristics of the car crashing
sound. As shown in Table 1, the car crashing class has the
smallest amount of data (about 150 sec of total duration),
which can cause poor modeling in the training phase.
Furthermore, a car crashing event is composed of 'tire skid'
and 'crash' sounds, which are similar to the motorcycle, glass
breaking, and explosion sounds. Therefore, a higher
misclassification rate is observed for the car crashing class
than for the other sound classes.

Table 3: Confusion matrix for fifteen classes of sound event classification (each entry is the percentage of clips of the actual class, given by the row, predicted as the class given by the column)

| Actual \ Predicted | Air cond. | Bird song | Car crash | Conv. | Crying | Dog bark | Explosion | Glass break | Car horn | Motorcycle | Music | Raining | Screaming | Amb. siren | Wind |
|--------------------|-----------|-----------|-----------|-------|--------|----------|-----------|-------------|----------|------------|-------|---------|-----------|------------|------|
| Air conditioner | 98.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Bird song | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Car crashing | 0.0 | 0.0 | 69.4 | 0.0 | 0.0 | 0.0 | 5.6 | 8.3 | 0.0 | 11.1 | 0.0 | 0.0 | 5.6 | 0.0 | 0.0 |
| Conversation | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Crying | 0.0 | 0.0 | 0.0 | 0.0 | 97.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Dog barking | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Explosion | 3.1 | 0.0 | 1.6 | 1.6 | 0.0 | 0.0 | 92.2 | 1.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Glass breaking | 0.0 | 0.0 | 1.9 | 0.0 | 1.0 | 1.0 | 0.0 | 95.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Car horn | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 91.7 | 1.0 | 1.0 | 0.0 | 2.1 | 1.0 | 0.0 |
| Motorcycle | 0.0 | 0.0 | 3.4 | 0.0 | 0.0 | 0.0 | 0.0 | 3.4 | 0.0 | 93.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Music | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.4 | 0.0 | 98.6 | 0.0 | 0.0 | 0.0 | 0.0 |
| Raining | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 |
| Screaming | 0.0 | 2.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.9 | 5.2 | 0.0 | 0.0 | 0.0 | 91.3 | 0.0 | 0.0 |
| Ambulance siren | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.5 | 98.5 | 0.0 |
| Wind | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 |
4.2.2 Evaluation of the proposed method on various distant
environments
In audio surveillance, it is important to measure performance
in distant environments because sounds related to dangerous
situations are likely to reach the system from a distance.
Table 4 shows the CA performance of the k-means and GMM
clustering-based BoAW feature representations in the various
distant environments. In the distance-matched condition, the
acoustic model is trained using only training data whose
recording distance matches that of the test data, whereas in
the multi-condition setting the acoustic model is trained using
all training data regardless of distance. As can be seen, the
proposed method consistently shows better CA performance
than the conventional BoAW method for all distant
environments and training conditions. Although the
time-frequency characteristics of a sound event are distorted
in distant environments because of significantly reduced
power, the proposed method still gives fairly good
performance. This result shows that the proposed method is
more robust to distant environments.

Table 4: Average classification accuracies (%) of the k-means and GMM clustering methods in various distant environments (original, 1 m distance, and 10 m distance) for the distance-matched condition and the multi-condition

| Condition | Distance | k-means clustering | GMM clustering |
|-----------|----------|--------------------|----------------|
| Distance-matched | Original | 92.5 | 95.6 |
| Distance-matched | 1 m | 90.5 | 94.3 |
| Distance-matched | 10 m | 90.6 | 94.5 |
| Multi-condition | Original | 92.3 | 92.9 |
| Multi-condition | 1 m | 90.4 | 91.5 |
| Multi-condition | 10 m | 88.7 | 91.2 |
4.2.3 Evaluation of the proposed method under the surveillance scenario

Under the surveillance scenario, confusions among the
mundane sounds are not of concern because the only interest
is in capturing dangerous situations. From this point of view,
we performed additional experiments by mapping the
classification results into the normal or abnormal classes.
Table 5 presents the classification accuracy for the normal
class and the individual abnormal sound events, i.e., 7-way
classification over car crashing, crying, dog barking,
explosion, glass breaking, screaming, and normal. Table 6
presents the classification accuracy for the normal and
abnormal classes, i.e., 2-way classification. Consistent with
the previous experiments, the proposed method is more
accurate than the conventional BoAW method, which
suggests that it can be successfully applied to surveillance
applications.
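A small sketch of this post-hoc label mapping follows (the class grouping is taken from Table 1; the function names and string labels are illustrative):

```python
# Abnormal sound events according to Table 1; everything else is normal.
ABNORMAL = {'car crashing', 'crying', 'dog barking',
            'explosion', 'glass breaking', 'screaming'}

def to_seven_way(label):
    """7-way mapping: keep the abnormal classes, collapse the rest to 'normal'."""
    return label if label in ABNORMAL else 'normal'

def to_two_way(label):
    """2-way mapping: abnormal vs. normal."""
    return 'abnormal' if label in ABNORMAL else 'normal'
```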
Table 5: Average classification accuracies (%) for seven classes: car crashing, crying, dog barking, explosion, glass breaking, screaming, and normal

| Condition | Distance | k-means clustering | GMM clustering |
|-----------|----------|--------------------|----------------|
| Distance-matched | Original | 92.9 | 96.0 |
| Distance-matched | 1 m | 91.5 | 94.8 |
| Distance-matched | 10 m | 91.4 | 95.3 |
| Multi-condition | Original | 93.3 | 93.6 |
| Multi-condition | 1 m | 91.4 | 92.3 |
| Multi-condition | 10 m | 90.8 | 93.0 |
Table 6: Average classification accuracies (%) for two classes: abnormal and normal

| Condition | Distance | k-means clustering | GMM clustering |
|-----------|----------|--------------------|----------------|
| Distance-matched | Original | 94.8 | 97.3 |
| Distance-matched | 1 m | 94.3 | 96.4 |
| Distance-matched | 10 m | 94.7 | 97.3 |
| Multi-condition | Original | 95.5 | 95.6 |
| Multi-condition | 1 m | 94.1 | 94.0 |
| Multi-condition | 10 m | 93.7 | 95.0 |
5. Conclusion

We proposed a feature representation method that employs
BoAW based on GMM clustering to effectively represent the
distinct time-frequency patterns of sound events. An SVM
with a linear kernel was adopted as the sound event classifier.
In order to evaluate the proposed features, experiments were
performed in terms of the CA across fifteen sound classes.
The experimental results show that the proposed feature
representation method outperformed the conventional BoAW
representation based on k-means clustering, achieving a CA
of 95.6% when using MFCC frame features and 512 clusters
for the GMM clustering. Furthermore, additional experiments
related to audio surveillance were performed. Our work
verifies that the proposed method can be successfully applied
to audio surveillance systems.

6. Acknowledgements

This work was supported by the Technology Innovation
Program of the Ministry of Trade, Industry & Energy
[10047788, Development of Smart Video/Audio Surveillance
SoC & Core Component for Onsite Decision Security System].
References
[1] A. Harma, M. F. McKinney, and J. Skowronek,
"Automatic surveillance of the acoustic activity in our living
environment," in Proc. IEEE Int. Conf. Mult. Expo, Jul.
2005.
[2] P. K. Atrey, N. C. Maddage, and M. S. Kankanhalli,
"Audio based event detection for multimedia surveillance,"
in Proc. IEEE Int. Conf. Acoust. Speech, and Signal
Process., May 2006, pp. 813-816.
[3] C. Clavel, T. Ehrette, and G. Richard, "Events detection
for an audio-based surveillance system," in Proc. IEEE Int.
Conf. Mult. Expo, Jul. 2005, pp. 1306-1309.
[4] Y. T. Peng, C. Y. Lin, M. T. Sun, and K. C. Tsai,
"Healthcare audio event classification using hidden Markov
models and hierarchical hidden Markov models," in Proc.
IEEE Int. Conf. Mult. Expo, Jun. 2009, pp. 1218-1221.
[5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On
acoustic surveillance of hazardous situations," in Proc.
IEEE Int. Conf. Acoust. Speech, and Signal Process., Apr.
2009, pp. 165-168
[6] J. Dennis, H. D. Tran, and H. Li, "Image representation
of the subband power distribution for robust sound
classification," in Proc. Interspeech 2011, Aug. 2011, pp.
2437-2440.
[7] M. J. Kim and H. Kim, "Audio-based objectionable
content detection using discriminative transforms of time-frequency dynamics," IEEE Trans. Multimedia, vol. 14, no.
5, pp. 1390-1400, Oct. 2012.
[8] C. H. Lee, S. B. Hsu, J. L. Shih, and C. H. Chou,
"Continuous birdsong recognition using Gaussian mixture
modeling of image shape features," IEEE Trans.
Multimedia, vol. 15, no. 2, pp. 454-464, Feb. 2013.
[9] S. Pancoast and M. Akbacak, "Bag-of-audio-words
approach for multimedia event classification", in Proc.
Interspeech 2012, Sep. 2012, pp. 2105-2108.
[10] V. Carletti, P. Foggia, G. Percannella, A. Saggese, N.
Strisciuglio, and M. Vento, "Audio surveillance using a bag
of aural words classifier," in Proc. IEEE Int. Conf. Adv.
Video and Signal Based Surveillance, Aug. 2013, pp. 81-86.
[11] C. M. Bishop, Pattern recognition and machine
learning, Springer, 2006.
[12] T. K. Moon, "The expectation-maximization
algorithm," IEEE Signal Process. Magazine, vol. 13, no. 6,
pp. 47-60, Nov. 1996.
[13] C. C. Chang and C. J. Lin, "LIBSVM: a library for
support vector machines," ACM Trans. on Intelligent
Systems and Technology, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.