AUDIO SEGMENTATION & CLASSIFICATION
Alex Kucherov, Yulia Rabinsky
Supervisor: Ronen Talmon
In association with Memories
Spring 2008

PROBLEM DESCRIPTION:
Segmentation of audio signals into speech, music and silence periods

EXAMPLE: [waveform figure with labeled SPEECH, SILENCE and MUSIC regions]

ABSTRACT
This project describes a system for segmenting and classifying audio signals into three basic classes: music, speech and silence. A great deal of work has been done on classifying known homogeneous segments under high-SNR conditions. The proposed solution deals with raw audio signals by dividing them into acoustically similar regions prior to classification. Three different classifiers were implemented and tested under various SNR conditions. Both time-domain and frequency-domain features were examined, and the latter proved more useful in discriminating between the classes. The best classifier achieves an accuracy of 98% when operating "stand-alone", while the total system's accuracy is about 92%.

IMPLEMENTED APPROACH
Audio signal -> Transition detection (feature extraction -> change-point detection) -> Classification (feature extraction -> class decision)

The audio is first divided into acoustically homogeneous parts using a segmentation algorithm (described below). Afterwards, each part is classified into one of three classes: music, speech or silence.
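The two-stage decision described above can be sketched as a simple dispatcher. This is a minimal illustration, not the project's code: the -40 dB silence threshold is an assumed value (the poster does not state one), and the speech/music stage is passed in as a callable standing in for the KNN classifier described later.

```python
import numpy as np

def classify_segment(segment, fs, speech_music, silence_db=-40.0):
    """Route one acoustically homogeneous segment to a class label.

    `silence_db` is an illustrative energy threshold (dB re full scale);
    `speech_music` is a placeholder for the spectrogram-feature classifier.
    """
    # Silence decision: mean energy of the segment against a fixed threshold.
    energy_db = 10.0 * np.log10(np.mean(segment ** 2) + 1e-12)
    if energy_db < silence_db:
        return "silence"
    # Otherwise defer to the speech/music classifier (VLER + CFA + KNN).
    return speech_music(segment, fs)
```

A stubbed speech/music callable makes the flow testable without the full feature pipeline.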
The silence decision is based on a simple energy threshold, whereas the speech/music decision is based on features extracted from the signal's spectrogram.

TRANSITION DETECTION PROCEDURE
- Divide the signal into 1 sec frames.
- Calculate 50 RMS values (one per 20 ms) for each frame:
  RMS = sqrt( (1/N) \sum_{n=1}^{N} x(n)^2 )
- The RMS histogram is modeled by a chi-square-type (gamma) distribution:
  p(x) = b^{a+1} x^a e^{-bx} / \Gamma(a+1),  x >= 0
  with parameters matched to the histogram's mean mu and variance sigma^2:
  a = mu^2 / sigma^2 - 1,  b = mu / sigma^2
- The distance for frame i is defined as:
  D(i) = 1 - rho(p_{i-1}, p_{i+1})
  where rho(p1, p2) = \int sqrt( p1(x) p2(x) ) dx is a similarity measure (the Bhattacharyya coefficient).
- Frames with large D may contain a transition.
- Normalization is applied to reduce false alarms.
- The detection is then refined to 20 ms frames.

CLASSIFICATION
Classification is based on feature extraction. The two main features are "variation of low band energy ratio" and "continuous frequency activation". Both features exploit differences between the spectrograms of speech and music. As seen in the spectrogram figures, the spectrogram of music has a horizontal structure, a result of the sustained notes (continuous frequencies) present in music, whereas speech is a combination of periodic signal, noise-like signal and silence, resulting in a vertically structured spectrogram.
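The transition-detection steps above can be sketched directly: moment-match the gamma parameters a and b to each frame's RMS histogram, then compare neighboring frames with the Bhattacharyya coefficient, which has a closed form for two gamma densities. This is a hedged sketch of the stated procedure, not the project's implementation; function names are illustrative.

```python
import numpy as np
from math import lgamma

def frame_rms(x, frame_len):
    """Split x into consecutive frames and return one RMS value per frame."""
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def fit_gamma(rms):
    """Moment-match p(x) = b^(a+1) x^a e^(-bx) / Gamma(a+1):
    a = mu^2/sigma^2 - 1, b = mu/sigma^2."""
    mu, var = rms.mean(), rms.var()
    return mu ** 2 / var - 1.0, mu / var

def bhattacharyya(a1, b1, a2, b2):
    """Closed-form rho = integral sqrt(p1 p2) dx for two gamma densities,
    computed in log space for numerical stability."""
    k = (a1 + a2) / 2.0 + 1.0
    log_num = lgamma(k) - k * np.log((b1 + b2) / 2.0)
    log_den = 0.5 * (lgamma(a1 + 1) - (a1 + 1) * np.log(b1)
                     + lgamma(a2 + 1) - (a2 + 1) * np.log(b2))
    return float(np.exp(log_num - log_den))

def distance(params_prev, params_next):
    """D(i) = 1 - rho(p_{i-1}, p_{i+1}); large D suggests a transition."""
    return 1.0 - bhattacharyya(*params_prev, *params_next)
```

Identical frames give rho = 1 and hence D = 0, so only frames whose neighbors differ acoustically produce a large distance.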
VARIATION OF LOW BAND ENERGY RATIO (VLER)
- Based on the different frequency content of silence, voiced speech and unvoiced speech.
- Music's frequency content is more stationary.
- The variance is larger for speech.
- Calculation algorithm:
  Input audio segment -> division into frames -> LPF with 1500 Hz cut-off frequency -> per-frame ratio of output energy to input energy (vector of LER values) -> variance calculation.

CONTINUOUS FREQUENCY ACTIVATION (CFA) CALCULATION
- Power spectrum computation: windowing and FFT.
- Emphasize local peaks:
  - Divide into frequency bins.
  - Subtract the running average from each bin.
- Binarization: if a bin is lower than the threshold, set it to zero.
- Sum the binarization result for each frequency bin.

CLASSIFIER STRUCTURE: K-NEAREST NEIGHBOR
Audio in -> computation of VLER and CFA -> KNN -> music/speech decision.

Classification results:

                     Clean   SNR=20  SNR=15  SNR=10  SNR=5
Misdetected music      2%      2%      3%      5%      7%
Misdetected speech     2%      2%      2%      2%      2%

- Good accuracy is achieved even under low-SNR conditions.
- Disadvantage: long computation time is required.

SEGMENTATION + KNN CLASSIFICATION EXAMPLE

FINAL SEGMENTATION RESULTS
The whole system was tested on a ten-minute audio recording containing multiple transitions between the classes. The accuracy achieved was 92% for clean signals and 85% for low SNR (5 dB). The errors are mainly caused by drum music, which has the non-stationary nature of speech, and by speech with background music, which produces a more horizontal, music-like spectrogram.

CONCLUSIONS
- Many features were tested; the spectrogram-based ones achieved the best results.
- Using transition detection prior to classification enables both high resolution and high classification accuracy.
- Speech with background sounds or music remains an unsolved case.
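The VLER feature described above can be illustrated compactly. This sketch replaces the poster's 1500 Hz LPF with an idealized brick-wall cut in the frame's FFT spectrum, so the energy ratio is computed per frame directly in the frequency domain; frame length and function name are illustrative assumptions.

```python
import numpy as np

def vler(x, fs, frame_len_ms=20, cutoff_hz=1500.0):
    """Variance of the low-band energy ratio over a segment.

    Per frame: (energy below `cutoff_hz`) / (total energy), an idealized
    stand-in for the 1500 Hz low-pass filter; VLER is the variance of
    these ratios across frames.
    """
    n = int(fs * frame_len_ms / 1000)
    frames = x[: (len(x) // n) * n].reshape(-1, n)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # per-frame power spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    total = spec.sum(axis=1) + 1e-12
    low = spec[:, freqs <= cutoff_hz].sum(axis=1)
    return float(np.var(low / total))
```

As the poster argues, a stationary tone (music-like) keeps the ratio nearly constant, while alternating voiced/unvoiced content (speech-like) makes it fluctuate, yielding a larger variance.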