Short-time Viterbi for online HMM decoding: Evaluation on a real-time phone recognition task
Julien Bloit, Xavier Rodet
ICASSP 2008, IEEE

Outline
● Authors
○ Julien Bloit
○ Xavier Rodet
● Abstract
○ The algorithm derives an online version of the Viterbi algorithm on successive variable-length windows, iteratively storing portions of the optimal state path.
● Introduction

Author: Julien Bloit
I am a research engineer and musician working at the intersection of digital media, new musical experience, and live performance. I'm a hands-on person who likes to build things to help me think, and repurpose everyday objects for digital interaction. I specialize in audio software, but I love taking a break from the headphones to enjoy the smell of solder, burning acrylic on a laser cutter, or learning about animal communication. My most challenging job so far was being a steward on the Eiffel Tower elevators.
Deep skills:
● PhD in statistical modelling of temporal sequences. Real-time sound analysis and synthesis, model selection, sound morphology. More on my research page at Ircam.
● Max/MSP, Matlab, C/C++, Java.

Author: Xavier Rodet
Xavier Rodet's research interests are in the areas of signal and pattern analysis, recognition and synthesis. He has been working particularly on digital signal processing for speech, speech and singing voice synthesis, and automatic speech recognition. Computer music is his other main domain of interest. He has been working on understanding spectro-temporal patterns of musical sounds and on synthesis-by-rules. He has been developing new methods, programs and patents for musical sound signal analysis, synthesis and control. He is also working on physical models of musical instruments and nonlinear dynamical systems applied to sound signal synthesis.
Introduction
● The standard Viterbi algorithm is unsuitable for
○ a streamed input for which there is potentially no end to the sequence.
● Solutions
○ applying the Viterbi algorithm on successive windows
○ tracking partial path hypotheses on an expanding window until they converge towards the same solution
● Contribution
○ combine both approaches by forcing a suboptimal state label output when the latency exceeds a predefined threshold.
○ study the influence of the model's topology on potential path convergence, accuracy and latency.

Outline
● VITERBI DECODING
○ Standard Viterbi
○ Short-time Viterbi
● INFLUENCE OF THE MODEL'S TOPOLOGY
○ Necessary Condition for Fusion Points
○ Influence on Latency

Viterbi decoding: standard
The decoding step of a recognition system consists in retrieving the state sequence that maximizes the a posteriori probability. This sequence is referred to as the optimal path s∗. The Viterbi algorithm produces the optimal path in two main steps:
● a time-synchronous forward pass to update the partial likelihoods δt(i), ∀i ∈ S: the score of the best path ending in state i at time t. The best state predecessors are stored in a matrix of backpointers ψt(i).
● a backtracking pass on ψt(i), starting from the state with the highest score at time T.
The backtracking step implies that the algorithm is not time-synchronous, since the last observation frame at time T is needed in order to decode the globally optimal state path s∗ from T back to the first time index.

Viterbi decoding: short time
Let us define s(a, b, i) as the state sequence obtained by computing the δ and ψ values on an observation window delimited by time indices a and b (such that a < b) and backtracking from an arbitrary state i at time b.
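The definition of s(a, b, i) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `local_path`, the log-domain arrays `log_A` and `log_B`, and the idea of passing in the scores `delta_a` at the left bound of the window are all choices made for this example.

```python
import numpy as np

def local_path(delta_a, log_A, log_B, a, b, i):
    """Sketch of s(a, b, i): forward pass on frames a..b, then backtrack from state i.

    delta_a : (N,) log-scores of the best partial paths at time a (assumed given)
    log_A   : (N, N) log transition matrix, log_A[p, q] = log P(state q | previous state p)
    log_B   : (T, N) log observation likelihoods, one row per frame
    Returns the state sequence for times a..b and the scores delta at time b.
    """
    delta = np.asarray(delta_a, dtype=float)
    psi = []                                    # backpointers for times a+1..b
    for t in range(a + 1, b + 1):
        scores = delta[:, None] + log_A         # scores[p, q]: best path into p, then p -> q
        psi.append(scores.argmax(axis=0))       # best predecessor of each state q
        delta = scores.max(axis=0) + log_B[t]
    path = [i]                                  # start from the arbitrary state i at time b
    for back in reversed(psi):
        path.append(int(back[path[-1]]))
    return path[::-1], delta

# Hypothetical toy model: 4 states, 12 frames, random parameters.
rng = np.random.default_rng(0)
N, T = 4, 12
log_A = np.log(rng.dirichlet(np.ones(N), size=N))
log_B = np.log(rng.uniform(0.1, 1.0, size=(T, N)))
delta_0 = np.log(np.full(N, 1.0 / N)) + log_B[0]

# One local path per possible end state i.
paths = [local_path(delta_0, log_A, log_B, 0, T - 1, i)[0] for i in range(N)]
```

Each entry of `paths` is one local path. Comparing them from the left shows how far back the hypotheses agree, which is the property the short-time algorithm relies on.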
Viterbi decoding: short time
A fusion point τ has the following attractive property: the local paths up to τ are always identical to the global Viterbi path between a and τ.
From this observation, it is straightforward to derive an online short-time Viterbi (STV) algorithm, based on local Viterbi decodings between successive fusion points, which are used as left bounds for variable-size observation windows.

Influence of the model's topology: necessary condition for fusion points
The above algorithm assumes that a fusion point occurs inside a time interval [a, b] smaller than [1, T]. However, this is not always allowed by the topology of the model.
In the example model, local paths cannot converge towards state 4 (which is on the optimal path) because backtracking from states 1 and 2 can only lead to states 1 and 2. The algorithm therefore stays stuck as the observation window expands, until the final observation T.

Influence of the model's topology: influence on latency
● L = b − a: length of the observation window
● earliest state in the window: a + 1
● minimum latency of that state: L − 1
● A fusion point occurs at a + 1 if all states in the considered set are connex according to matrix A
● distance d(i, j) = length of the shortest backward path from j to i
● DI: the set of these lengths for all pairs (i, j) in that set
● lower bound on the latency = max(DI).

Influence of the model's topology: latency bound
For example, in the model of Figure 1.b), the two largest distances are d(1, 4) = 3 and d(3, 2) = 3. So, in the general case, optimal decoding cannot be achieved with a latency better than 3 frames.
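The short-time Viterbi idea described in this part (decode locally, output the path up to each fusion point, then restart the window there) can be sketched as follows. This is a minimal Python illustration under assumptions, not the algorithm listing from the paper: the function names, the log-domain inputs and the set-based fusion test are choices made for this example, and the forced output on a latency threshold is deliberately left out.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Standard offline Viterbi, used here only as a reference."""
    delta = log_pi + log_B[0]
    psi = []
    for t in range(1, len(log_B)):
        scores = delta[:, None] + log_A            # scores[p, q]: best path into p, then p -> q
        psi.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for back in reversed(psi):
        path.append(int(back[path[-1]]))
    return path[::-1]

def short_time_viterbi(log_pi, log_A, frames):
    """Online sketch: emit states as soon as all local paths share a fusion point."""
    out = []       # states decoded so far; len(out) is the left bound a of the window
    psi = []       # psi[k]: backpointers for time len(out) + 1 + k
    delta = None
    for log_b in frames:                           # frames arrive one at a time
        if delta is None:
            delta = log_pi + log_b
        else:
            scores = delta[:, None] + log_A
            psi.append(scores.argmax(axis=0))
            delta = scores.max(axis=0) + log_b
        # Backtrack from every state at once until the local paths merge.
        states, k, fused = set(range(len(delta))), len(psi), False
        while k > 0 and not fused:
            k -= 1
            states = {int(psi[k][s]) for s in states}
            fused = len(states) == 1
        if fused:                                  # fusion point at time len(out) + k
            segment = [states.pop()]
            for back in reversed(psi[:k]):
                segment.append(int(back[segment[-1]]))
            out.extend(reversed(segment))
            psi = psi[k + 1:]                      # the window now starts after the fusion point
    if delta is not None:                          # end of stream: flush the last window
        segment = [int(delta.argmax())]
        for back in reversed(psi):
            segment.append(int(back[segment[-1]]))
        out.extend(reversed(segment))
    return out

# Hypothetical fully connected toy model, so fusion points can always occur.
rng = np.random.default_rng(1)
N, T = 5, 200
log_pi = np.log(rng.dirichlet(np.ones(N)))
log_A = np.log(rng.dirichlet(np.ones(N), size=N))
log_B = np.log(rng.uniform(0.05, 1.0, size=(T, N)))
online = short_time_viterbi(log_pi, log_A, log_B)
```

Because the forward recursion is the same as in the offline pass and only the backtracking is done early, `online` should match `viterbi(log_pi, log_A, log_B)` on this toy model. On a topology where local paths cannot merge, the `psi` buffer simply keeps growing until the end of the stream.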
Experiments: database and models
● The performance of STV was evaluated on a real-time phoneme recognition task
○ database: 3794 sentences, split into 3640 sentences for the training set (131967 phonemes) and 154 sentences (3134 phonemes) for the test set
○ sound files: sliced into 25 ms windows every 5 ms
○ HTK tools: 37 monophone models
○ Each monophone was modeled by a five-state left-to-right HMM
○ The observation densities are modeled using Gaussian mixture models with diagonal covariance and J components per state, yielding a total of eight sets of models for J = 1, 2, 4, ..., 128.

Experiments: topology modifications
● Two modifications of the reference model λref ensure connexity among the model states
○ one that guarantees connexity by adding transitions at the phone level (denoted λP, with max(DS) = 5)
○ another (λS) that adds transitions at the state level (max(DS) = 1)

Experiments: recognition performance
● N: number of phonemes
● D: number of deletion errors
● S: number of substitution errors
● I: number of insertion errors

Experiments: latency performance
● J = 8; utterance lengths in the test set range from 249 to 1470 frames, with a mean of 524 frames
● maximum latency with λP: 236, which is below the shortest utterance length. This means that all online decodings managed to find a fusion point
● minimum latency with λP: 7, 32-128
● minimum latency with λS: 2, all

Experiments: hard constraint on latency
As the latency constraint C gets shorter, accuracy decreases because of the suboptimal decoding procedure employed when forcing the output.

Conclusion and perspective
● It is possible to modify a model so that it obeys these constraints
● As the connexity lengths are reduced, the lower bound on the best possible latency decreases

Comment
● It is a nice concept for finding an optimal path.
● The model topology matters
● Cost definition for Dijkstra