Project Poster

AUDIO SEGMENTATION &
CLASSIFICATION
Alex Kucherov, Yulia Rabinsky
Supervisor: Ronen Talmon
In association with Memories
Spring 2008
PROBLEM DESCRIPTION:
Segmentation of audio signals into speech, music and silence periods
EXAMPLE:
[Figure: example audio signal segmented into speech, silence and music regions]
ABSTRACT
This project describes a system for segmenting and classifying audio
signals into three basic classes: music, speech and silence. A great deal
of work has been done on classifying known homogeneous segments under
high-SNR conditions. The proposed solution deals with raw audio signals
by dividing them into acoustically similar regions prior to
classification. Three different classifiers were implemented and tested
under various SNR conditions. Both time- and frequency-domain features
were used, and the latter proved more useful in discriminating the
classes.
The best classifier achieves an accuracy of 98% when operating
stand-alone, while the total system's accuracy was about 92%.
IMPLEMENTED APPROACH
Transition detection:
Audio signal → feature extraction → change-point detection
Classification:
Feature extraction → class decision
The audio is first divided into acoustically homogeneous parts
using a segmentation algorithm (described below). Afterwards,
each part is classified into one of three classes: music, speech or
silence. The silence decision is based on a simple energy
threshold, whereas the speech/music decision is based on
features extracted from the signal's spectrogram.
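The energy-threshold silence decision can be sketched as a simple frame test. The threshold value, frame length and sample rate below are illustrative assumptions, not the project's actual settings:

```python
import numpy as np

def is_silence(frame, threshold=1e-4):
    """Return True when the frame's mean energy falls below the threshold."""
    energy = np.mean(np.asarray(frame, dtype=float) ** 2)
    return energy < threshold

# A near-zero frame is flagged as silence; a tone frame is not.
quiet = 0.001 * np.ones(320)
loud = 0.5 * np.sin(2 * np.pi * 440 * np.arange(320) / 16000)
print(is_silence(quiet), is_silence(loud))  # True False
```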
TRANSITION DETECTION PROCEDURE
- Divide the signal into 1-second frames.
- Calculate 50 RMS values (one per 20 ms sub-frame) for each frame:

  RMS = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x^2(n)}

- The RMS histogram of each frame is modeled by a \chi^2-type (gamma) distribution:

  p(x) = \frac{x^{a} e^{-x/b}}{b^{a+1}\,\Gamma(a+1)}, \quad x \ge 0

  where the parameters are a = \frac{\mu^2}{\sigma^2} - 1 and b = \frac{\sigma^2}{\mu}.
- The distance for frame i is defined as

  D(i) = 1 - \rho(p_{i-1}, p_{i+1}), \qquad \rho(p_1, p_2) = \int \sqrt{p_1(x)\,p_2(x)}\,dx

  where \rho is a similarity measure (the Bhattacharyya coefficient).
- Frames with a large D may contain a transition.
- Normalization is applied to reduce false alarms.
- The detection is refined to a 20 ms frame resolution.
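The steps above can be sketched in Python. The sub-frame length, integration grid and the example (a, b) pairs are assumptions for illustration; the similarity integral is approximated numerically:

```python
import numpy as np
from math import lgamma

def rms_values(frame, sub_len=320):
    """RMS of each 20 ms sub-frame (sub_len samples at an assumed 16 kHz rate)."""
    n_sub = len(frame) // sub_len
    subs = np.asarray(frame[: n_sub * sub_len], dtype=float).reshape(n_sub, sub_len)
    return np.sqrt(np.mean(subs ** 2, axis=1))

def fit_params(rms):
    """Method-of-moments estimates: a = mu^2/sigma^2 - 1, b = sigma^2/mu."""
    mu, var = rms.mean(), rms.var()
    return mu ** 2 / var - 1.0, var / mu

def pdf(x, a, b):
    """p(x) = x^a e^(-x/b) / (b^(a+1) Gamma(a+1)), evaluated in log space."""
    return np.exp(a * np.log(x) - x / b - (a + 1) * np.log(b) - lgamma(a + 1))

def distance(params1, params2, x_max=1.0, n=20000):
    """D = 1 - rho, where rho is the Bhattacharyya coefficient between
    the two fitted densities, approximated by a Riemann sum on [0, x_max]."""
    x = np.linspace(1e-6, x_max, n)
    dx = x[1] - x[0]
    rho = np.sum(np.sqrt(pdf(x, *params1) * pdf(x, *params2))) * dx
    return 1.0 - rho

# Hypothetical fitted (a, b) pairs: identical distributions give D near 0,
# clearly different ones give a larger D.
p_quiet = (2.0, 0.01)
p_loud = (2.0, 0.05)
print(distance(p_quiet, p_quiet), distance(p_quiet, p_loud))
```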
CLASSIFICATION
Classification is based on feature extraction. The two main features are "variation
of low band energy ratio" and "continuous frequency activation". Both features
exploit differences between the spectrograms of speech and music. As seen in the
figures below, the music spectrogram has a horizontal structure, the result of the
sustained notes (frequencies) present in music, whereas speech is a combination
of periodic signal, noise-like signal and silence, which results in a more vertical
spectrogram.
VARIATION OF LOW BAND ENERGY RATIO (VLER)
- Based on the different frequency content of silence, voiced and unvoiced speech
- Music's frequency content is more stationary
- The variance is larger for speech
- Calculation algorithm:
Input audio segment → division into frames → LPF (1500 Hz cut-off frequency) →
LER = output (low-band) energy / input energy per frame → vector of LER values →
variance calculation
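The VLER computation can be sketched as follows. As an assumption, the 1500 Hz low-pass filter is approximated in the frequency domain by summing spectral energy below the cut-off; the sample rate and frame length are illustrative:

```python
import numpy as np

def vler(signal, fs=16000, frame_len=512, cutoff=1500.0):
    """Variance of the low-band energy ratio (VLER).

    Assumption: the low-pass step is approximated in the frequency
    domain, taking low-band energy as the spectral energy below the
    cut-off; fs and frame_len are illustrative values.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    low = spec[:, freqs <= cutoff].sum(axis=1)     # "output" (low-band) energy
    total = spec.sum(axis=1) + 1e-12               # "input" energy per frame
    ler = low / total                              # one LER value per frame
    return np.var(ler)

# A steady high tone keeps LER near 0 in every frame, so VLER is tiny;
# content that alternates between low and high bands makes LER swing.
t = np.arange(16000) / 16000.0
steady = np.sin(2 * np.pi * 3000 * t)
alternating = np.concatenate([np.sin(2 * np.pi * 500 * t), steady])
print(vler(steady), vler(alternating))
```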
CONTINUOUS FREQUENCY ACTIVATION (CFA) CALCULATION
- Power spectrum computation: windowing and FFT
- Emphasize local peaks:
  - Divide into frequency bins
  - Subtract the running average from each bin
- Binarization: if a frequency bin is lower than a threshold, set it to zero
- Sum the binarized values over time for each frequency bin
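A rough sketch of the CFA steps; the window length, averaging width and threshold are illustrative assumptions, not the project's values:

```python
import numpy as np

def cfa(signal, frame_len=512, hop=256, avg_win=21, threshold=0.1):
    """Continuous frequency activation: score how strongly some frequency
    bin stays active across frames (sustained notes suggest music).
    All parameter values here are illustrative assumptions."""
    # Power spectrogram: Hann windowing + FFT per frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Emphasize local peaks: subtract a running average along frequency
    kernel = np.ones(avg_win) / avg_win
    running = np.vstack([np.convolve(row, kernel, mode="same") for row in spec])
    emphasized = spec - running

    # Binarize (below-threshold bins become zero) and, for each frequency
    # bin, take the fraction of frames in which it stays active
    binary = emphasized > threshold
    activation = binary.mean(axis=0)
    return activation.max()

# A sustained tone keeps one bin active in every frame; noise does not.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 1000 * t)
noise = np.random.default_rng(0).normal(0.0, 0.1, 16000)
print(cfa(tone), cfa(noise))
```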
CLASSIFIER STRUCTURE: K-NEAREST NEIGHBOR
Audio in → computation of VLER and CFA → KNN → music/speech decision
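The classifier stage can be sketched as a plain k-nearest-neighbor vote over (VLER, CFA) feature vectors. The training points below are hypothetical, chosen only to illustrate the decision:

```python
import numpy as np

def knn_classify(features, train_features, train_labels, k=3):
    """Classify a (VLER, CFA) feature vector by majority vote among its
    k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(train_features - features, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training data: speech shows high VLER and low CFA,
# music the opposite. These numbers are made up for illustration.
train_x = np.array([[0.20, 0.10], [0.18, 0.20], [0.22, 0.15],
                    [0.02, 0.90], [0.03, 0.80], [0.01, 0.85]])
train_y = np.array(["speech", "speech", "speech", "music", "music", "music"])
print(knn_classify(np.array([0.19, 0.12]), train_x, train_y))  # speech
```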
- Classification results:

                     Clean signal   SNR = 20 dB   SNR = 15 dB   SNR = 10 dB   SNR = 5 dB
Misdetected music         2%            2%            3%            5%           7%
Misdetected speech        2%            2%            2%            2%           2%
- Good accuracy is achieved even under low-SNR conditions
- Disadvantage: a long computation time is required
SEGMENTATION + KNN CLASSIFICATION EXAMPLE
FINAL SEGMENTATION RESULTS
The whole system was tested on a ten-minute audio recording containing
multiple transitions between the classes. The accuracy achieved was
92% for clean signals and 85% at low SNR (5 dB).
The error is mainly caused by drum music, which has the non-stationary
nature of speech, and by speech with background music, which produces
a more horizontal, music-like spectrogram.
CONCLUSIONS
- Many features were tested; the spectrogram-based ones achieved the best results
- Using transition detection prior to classification enables both high resolution and high classification accuracy
- Speech with background sounds or music remains an unsolved case