Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

Vocal-tract computation: how to make it more robust and faster
Lin, Q.

journal: STL-QPSR
volume: 33
number: 4
year: 1992
pages: 029-042

http://www.speech.kth.se/qpsr

STL-QPSR 4/1992

VOCAL-TRACT COMPUTATION: HOW TO MAKE IT MORE ROBUST AND FASTER
Qiguang Lin

Abstract
Numerical methods are proposed to speed up the frequency-domain calculation of the vocal-tract response and to determine the associated pole/zero patterns effectively. These methods include algorithms to properly decompose the numerator and the denominator of the transfer function, to estimate formant patterns of a lossy vocal-tract system from its lossless counterpart by means of linear interpolation, and to utilize the information of residues at the system poles to ensure that no pole/zero is missed. It is found that the proposed methods reduce the computation time by at least a factor of 2 while maintaining the accuracy. At the same time the risk of missing a pole or a zero is minimised. These algorithms have been implemented in a computer program, TRACTTALK, an articulation-based speech synthesizer. A brief description of TRACTTALK is presented.

1. INTRODUCTION
Vocal-tract computation, that is, calculating the acoustic output of a given area function, has been one of the important themes in the acoustic theory of speech production (Fant, 1960; Flanagan, 1965; Sondhi, 1974; Atal & al., 1978). The techniques are nowadays well established for vowel-like sounds (all-pole systems) and follow two major approaches: a time-domain method and a frequency-domain method (see for instance Lin (1990) for a brief overview). In both approaches, plane wave propagation inside the vocal tract is usually assumed and the tract is divided into a number of short sections, each of which is approximated by a cylindrical tube.

This paper sets out to study how to make the vocal-tract computation more robust and faster.
Emphasis is placed on a more complex system which is no longer of all-pole type. The vocal tract is simulated in the frequency domain so that the frequency dependency of various elements can be accurately and effectively modelled and the overall tract length may vary continuously.

In Section 2 we discuss how to decompose the transfer function into its zero part and its pole part. This decomposition is necessary; otherwise the determination of a pole will be contaminated by the presence of a zero in the vicinity, and vice versa. If the system is an all-pole one, or if formant and antiformant patterns are not to be determined, there is no need for such an explicit decomposition. The calculation of mode patterns from a given transfer function is discussed in Section 3, where different root-searching methods are also compared.

In Section 4 we demonstrate how to expedite the vocal-tract computation. The lossy vocal-tract system is initially treated as lossless. That is, the resistive part of all elements is set to zero. The computation of the lossless system is much faster than that of a lossy one. The pole/zero patterns of the original lossy system can, as will be shown, be linearly interpolated from those of the lossless one. A lossless simulation possesses the additional advantage of bringing out heavily damped formants which would be hard to detect by a direct lossy simulation. However, a heavily damped formant does present a problem during the linear interpolation, especially when two formants are closely spaced. In this case, an adjacent and known formant is eliminated from the true vocal-tract response to recover the obscured formant peak before applying the interpolation scheme. This is exemplified in Section 5.

In addition to the center frequencies and bandwidths of poles/zeros, residues at the poles are also calculated.
The residues are mainly used as control parameters to a recently proposed synthesis scheme (Lin, 1990; Lin & Fant, 1990; 1992), but they are also used to check whether a pole or zero has been missed, e.g., as a result of a too large frequency scanning step (Section 6).

We have implemented the above robust and fast algorithms in a computer model of the vocal tract: TRACTTALK. The computation time is at least halved while the results remain sufficiently accurate. The chance of missing poles is also greatly reduced. These properties are of vital importance to TRACTTALK, which is an articulation-based speech synthesizer using a parallel formant structure (derived directly from the vocal-tract computation). A brief description of TRACTTALK is presented in Section 7.

2. DECOMPOSITION OF THE TRANSFER FUNCTION
Figure 1 shows a short cylindrical tube and its network representation. The inputs and outputs of the section are related by:

$$P_i = P_{i-1}\cosh\Gamma_i + U_{i-1} Z_i \sinh\Gamma_i, \qquad U_i = \frac{P_{i-1}}{Z_i}\sinh\Gamma_i + U_{i-1}\cosh\Gamma_i, \tag{1}$$

where $P_i$ and $P_{i-1}$ are the pressures at the upstream and downstream ends of the section, respectively, and $U_i$ and $U_{i-1}$ are the corresponding volume flows. $\Gamma_i$ is the transfer constant and $Z_i$ the characteristic impedance of section $i$. They are normally complex numbers. In a lossless simulation, however, $\Gamma_i$ reduces to a pure imaginary quantity, and $Z_i$ to a real number,

$$\Gamma_i = j\omega l_i / c, \tag{2a}$$

$$Z_i = \rho c / A_i, \tag{2b}$$

where $c$ is the sound velocity and $\rho$ the air density. The hyperbolic functions in Eq. (1) also reduce to circular functions under the lossless assumption. It should be noted that even in a lossy simulation, $Z_i$ is independent of the section length. The real and imaginary parts of $\Gamma_i$ are functions of the section area and its length, except for the lossless case, Eq. (2a).

From Eq. (1) we derive the transfer function of the individual section, $U_{i-1}/U_i$. Its reciprocal $U_i/U_{i-1}$ is given by:

$$\frac{U_i}{U_{i-1}} = \cosh\Gamma_i + \frac{Z_B}{Z_i}\sinh\Gamma_i, \tag{3}$$

where $Z_B = P_{i-1}/U_{i-1}$ represents the downstream impedance, see Fig. 1. It is updated section-wise. The updated impedance in Fig. 1 is:

$$\frac{P_i}{U_i} = Z_i\,\frac{Z_B\cosh\Gamma_i + Z_i\sinh\Gamma_i}{Z_i\cosh\Gamma_i + Z_B\sinh\Gamma_i}. \tag{4}$$

It is well known that Eq.
(3) contains no poles, that is, there are no zeros in the individual transfer function $U_{i-1}/U_i$.

Fig. 1. (a): A cylindrical section of length $l_i$ and area $A_i$, and (b): its equivalent network representation under the assumption of a hard wall. $Z_B = P_{i-1}/U_{i-1}$ is the downstream impedance to the section, seen toward the lip end. $Z_a = Z_i\tanh(\Gamma_i/2)$, $Z_b = Z_i/\sinh(\Gamma_i)$, see text.

By abutting together the sections shown in Fig. 1, the network representation of the entire vocal tract is obtained, except for boundary conditions such as the radiation impedance and wall impedance. The computation begins at the lip end, where $Z_B$ is equal to the radiation impedance, and progressively iterates to the back end. For voiced sounds, the transfer function of the system is defined as the ratio of the lip volume flow $U_o$ to the glottal volume flow $U_g$. The iteration terminates at the glottal end. The system transfer function is the product of the transfer functions of the individual sections.

If the vocal tract consists only of cascaded sections of the type in Fig. 1, it is an all-pole system (assuming an infinite glottal impedance). The poles can be determined by a root-searching method. Zeros arise when shunting branches, for instance the wall impedance arms, are introduced. A direct determination of poles would then be contaminated by the presence of adjacent zeros, and vice versa. A prerequisite to the calculation is to properly decompose the transfer function into its zero part (numerator) and pole part (denominator).

In Fig. 2, a shunt of lumped impedance $Z_s$ is inserted between two sections. The transfer function between the flows before and after the shunt is:

$$\frac{U_{i-1}}{U'_{i-1}} = \frac{Z_s}{Z_s + Z_B}, \tag{5}$$

where $U'_{i-1}$ is the flow entering the junction from the upstream side, $U_{i-1}$ the flow leaving it toward the lips, and $Z_B$ the impedance seen downstream of the shunt. From conventional circuit theory, it is known that a zero is brought into the transfer function at the frequency where a shunt has zero impedance (refer to the ladder filter system in Fig. 2) or where a serial element has infinite impedance. Therefore, $Z_s$ is a factor of the zero function, see Eq. (5).
If there are more such shunts, the final zero function will be the product of these elements, provided that there is only one output port (Lin, 1990). For the case of simultaneous radiation, the volume flows at the radiation ports are linearly summed up, disregarding spatial phase difference and radiation interference. The final zero function is then a sum of the zero parts of the transfer functions pertinent to each radiation port, and the zeros will be redistributed.

Fig. 2. Network representation (a ladder filter) of two contiguous sections with an extra shunt in between.

The serial element $Z_a$ in Fig. 1 contains the factor $\tanh(x) = \sinh(x)/\cosh(x)$. Although $Z_a$, and with it the transfer function, has singularities at frequencies where $\cosh(x) = 0$, these singularities are exactly cancelled when the circuit in Fig. 1 is considered as a whole.

Shunting elements encountered in vocal-tract modelling may result from the following extra shunts:
a) Sinuses, such as the sinus piriformis and the nasal sinuses;
b) Wall impedance arms (at low frequencies);
c) Lateral coupling, as in the /l/ sound, and nasal/oral coupling, as in nasal murmurs and nasalized vowels.

A nasalized sound represents a most complex case. The oral cavity constitutes a shunt to the pharyngeal + nasal passage, while the nasal cavity constitutes a shunt to the pharyngeal + oral passage.

It is clear that, to decompose the transfer function, the numerator and denominator of Eq. (5) should be treated separately when a shunt is met during the recursive computation. The final $U_o/U_g$ contains two components. One has exclusively poles while the other has exclusively zeros, if there are any.

For unvoiced fricative sounds, the transfer function is defined as the ratio of the lip flow to a constant pressure source.
This ratio is determined by two iterations, one from the lip end to the glottal end ($U_o/U_g$) and the other from the position where the pressure source is located to the glottal end ($U_g/E_c$) (Badin & Fant, 1984; Lin, 1990). If there is no other shunt, the pole part of the transfer function $U_o/E_c$ equals $U_g/U_o$, and the zero part $U_g/E_c$. Zeros of the transmission occur at frequencies where the impedance of the back cavity system as seen from the source is infinite (a serial element). If additional shunts occur, care should be taken to distinguish the shunts anterior and posterior to the pressure source. The shunts posterior to the pressure source are cancelled from the numerator in the product $(U_o/U_g)(U_g/E_c)$.

3. FROM TRANSFER FUNCTION TO MODE PATTERN
In a lossless system, the distinction between the resonance frequency, the oscillatory frequency, and the spectral peak frequency vanishes. In a lossy system these frequencies differ, but the difference is small and negligible for a high Q-factor and/or for a multi-mode system with well separated poles. Accordingly, we shall simply refer to the frequency of a pole or a zero.

In the following, we shall only illustrate how to determine poles from a decomposed transfer function. Zeros can be treated in a similar way, see Lin (1990, pp. 17-21). Our criterion for a successful detection of mode patterns is that the true vocal-tract transfer function can be regenerated by a parallel formant synthesizer (Lin, 1987; 1990; Lin & Fant, 1990).

Let us consider an all-pole system and let the transfer function be:

$$H(f) = \frac{H_z(f)}{H_p(f)}, \tag{6}$$

where the numerator $H_z(f)$ is in this case unity. The denominator $H_p(f)$ contains an infinite number of roots and is normally a complex number:

$$H_p(f) = N_b(f) + jN_a(f). \tag{7}$$

For a lossless vocal tract $N_a(f) = 0$, and for a lossy system $N_a(f)$ is small compared with $N_b(f)$. Consequently, the roots of the complex function $H_p(f)$ must be located in the neighbourhood of the roots of $N_b(f)$.
The procedure to determine poles of the transmission is as follows. At a given frequency, the pole function is first computed and the polarity of the $N_b(f)$ term is noted. The frequency is next increased by a certain amount (the so-called scanning step) and the corresponding pole function at the new frequency is computed. If $N_b(f)$ changes sign within this interval, there is a root to be detected. Refer to Fant (1960; 1985), Badin & Fant (1984), and Lin (1990) for more details.

The effect of a finite $N_a$ term is accounted for by means of a first-order linear interpolation:

$$\Delta\omega_n = -\sigma_n N_a' / N_b', \tag{8}$$

where ' denotes taking derivatives with respect to frequency. The bandwidth $B_n$ is given as a by-product of the interpolation:

The final pole frequency is:

$$F_n = F_{n0} + \Delta\omega_n / 2\pi, \tag{10}$$

where $F_{n0}$ is the frequency at which $N_b = 0$. This method is henceforth referred to as the Nb method.

Lin (1988; 1990) proposed an alternative approach for searching the roots of $H_p(f)$. Instead of using $N_b(f)$, the following function is defined:

where $\Delta f$ is a small increment in frequency. Poles of the transmission occur at frequencies where $D(f)$ crosses the frequency axis in the upward direction. The method has been called the peak method for simplicity.

The peak method requires more computation time than the Nb method. It has, however, several advantages. First, there is no need to decide which part of $H_p(f)$ is the dominating term. For instance, when a lumped wall impedance arm $Z_w = R_w + j\omega L_w$ is inserted in the pharyngeal region, $H_p(f)$ after the decomposition is dominated by its imaginary part instead of the real part. In other words, poles of the transmission are located in the vicinity of the roots of the imaginary part. The Nb method requires that one first find how many extra shunts are present in the simulation and how their impedance is characterized, in order to determine which part is predominating.
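The scanning step of the Nb method described above can be sketched in a few lines of Python. The sketch walks over a frequency grid, looks for a sign change of the dominating term, and refines each bracketed root by linear interpolation. The function name `scan_sign_changes` and its arguments are hypothetical; `n_b` stands for any callable that returns the dominating (real-valued) part of the pole function, and the loss correction of the Nb method is not included.

```python
import math


def scan_sign_changes(n_b, f_start, f_stop, step):
    """Return the frequencies at which n_b(f) changes sign on a fixed grid.

    Each bracketed root is refined by linear interpolation between the two
    grid points. A root lying exactly on f_start is not reported.
    """
    roots = []
    f_prev, v_prev = f_start, n_b(f_start)
    n_steps = int(round((f_stop - f_start) / step))
    for k in range(1, n_steps + 1):
        f = f_start + k * step          # avoids accumulating rounding errors
        v = n_b(f)
        if v_prev != 0.0 and v_prev * v <= 0.0:
            # straight line through (f_prev, v_prev) and (f, v)
            roots.append(f_prev - v_prev * (f - f_prev) / (v - v_prev))
        f_prev, v_prev = f, v
    return roots


# Example: lossless uniform tube, 17.5 cm long, sound velocity 3.5e4 cm/s
# (assumed values), for which the pole function is cos(omega * L / c).
uniform_tube = lambda f: math.cos(2.0 * math.pi * f * 17.5 / 3.5e4)
print(scan_sign_changes(uniform_tube, 0.0, 4000.0, 40.0))
```

A sign test of this kind only detects an odd number of roots per interval. If two roots fall inside the same scanning step, the sign is unchanged and both are skipped, which is why the choice of the step matters.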
The second advantage of the peak method lies in the fact that it also makes use of information of the dominating term in the linear interpolation. The corresponding frequency correction term is:

and the real part of the pole is now given by:

Note that $\Delta\omega_n$ is by definition equal to 0 at the frequency where $D(f) = 0$. If one has considered a lossy vocal tract from the outset, it is unnecessary to compute this correction term. However, the above interpolation procedure, Eq. (12), is found useful in deriving mode patterns of a lossy system from those of the lossless variant. This is because $D(f)$ of Eq. (11) will take a different value after the losses are put back. Consequently $\Delta\omega_n$ is not equal to 0. It can be seen that Eqs. (13) and (9) are identical when $N_b = 0$.

4. A FAST ALGORITHM
There are two major iterative operations in the vocal-tract computation. At a given frequency, as we have seen already, the computation for a vowel sound iterates from the lip end to the glottal end to compute the transfer function. The frequency is next increased by a certain step and the transfer function is computed at the new frequency. The procedure repeats until the whole frequency range of interest is scanned.

The quantities involved in the lossy simulation are mainly complex numbers, except for a few such as frequency, cross-sectional areas, and section lengths. The computation is accordingly time-consuming. We seek algorithms that can reduce the amount of computation. A possible solution is to replace all complex numbers by their real part or pure imaginary part, by removing the dissipative part of the vocal-tract elements. Recall Eqs. (2a) and (2b). In this way a complex product, which takes four multiplications and two additions of real numbers, is approximated by a single multiplication of real numbers, and thus the computation can be expedited.

Table I lists vowel formants calculated under different simulation conditions. The area functions of these vowels have been taken from Fant (1960).
(A) pertains to a lossy vocal-tract simulation with a closed glottis. The radiation impedance is approximated by the SKF model (Stevens, Kasowski, & Fant, 1953). The wall impedance is based on that of Fant, Nord, & Branderud (1976). The surface losses are twice as large as those predicted by the classical formulas (Fant, 1960; Flanagan, 1965) to account for the deviation of the vocal-tract cross-sectional shape from a perfect circle.
(B) is the same as (A) but the vocal-tract losses are all removed, i.e., a lossless simulation.
(C) is the same as (A) but with a small open glottis. The glottal impedance is $R_g = 3.947\,\rho c$ g/cm⁴/s and $L_g = 0.013$ g/cm⁴. The corresponding lung pressure is 8 cm H2O and the glottal opening area is 0.03 cm².
(D) is the same as (C) but again the loss components are all removed.

It is clear from Table I that when the glottal impedance is infinite, the dissipative part has only a small effect on the mode parameters. F5 of the vowel /e/ has the largest absolute difference, 146 Hz. See also Badin & Fant (1984) and Lin (1990, p. 35). With a finite glottal coupling, the difference becomes larger. This conforms well with the derivation of Flanagan (1965, p. 64). The extent to which formant frequencies differ depends on the particular modes. Appreciable errors are found mainly for the lower formants; apart from F5 of /e/ (261 Hz), the absolute difference is less than 100 Hz.

Based on the result of Table I, a fast algorithm is suggested to infer, by means of linear interpolation, the mode pattern of a lossy vocal tract from a lossless simulation. The fast algorithm carries out the vocal-tract computation in two stages (Lin, 1992). First, it determines pole frequencies for the lossless vocal tract. At each pole, all losses are then restored and Eq. (12) is applied to determine the pole frequency corresponding to the lossy vocal tract. The pole frequency is updated, $f_n' = f_n + \Delta\omega_n/2\pi$, and Eq.
(12) is then again used to compute $\Delta\omega_n$ at the new frequency. The recursive calculation ends when $|\Delta\omega_n| < 1$. The bandwidth of the pole is calculated according to Eq. (13) at the final pole frequency.

The formant patterns obtained this way are presented in Table II, for the same vowels as in Table I. Rows (A) and (C) in Table I have been duplicated in Table II for ease of comparison. In Table II an excellent agreement can be seen between (A) and (B) and between (C) and (D), while the computation time has been greatly reduced. A comparison of normalized computation time for different pole-searching methods is given in Table III. The Nb method requires twice as much computation time as the fast method does. The peak method is 650h times slower than the Nb method. The difference depends of course on the number of sections of the area function and also on the scanning step. The more sections and/or the smaller the step, the greater the difference is.

An example of nasalized vowels is presented in Fig. 3. The formants and antiformants from the lossy and lossless simulations are tabulated in Table IV. Fig. 3 is one of the typical displays of TRACTTALK. In the top left panel the area functions of the vocal tract and nasal tract are plotted, with the left end denoting the glottal termination. (The area functions have not been optimally prepared in this example. They serve to illustrate the proposed algorithms.) The calculated poles and zeros are tabulated in the panel beneath the area function. Frequency and bandwidth are both in Hertz. The simulation conditions are recorded in the bottom right corner. Two sinuses have been considered. They are modelled as Helmholtz resonators with resonance frequencies tuned to 499 Hz and 1400 Hz, respectively. The sinus with the lower resonance frequency is inserted at the position 6 cm from the nostrils, while the other is inserted 8 cm from the nostrils. They are the origin of the zeros of the coupled system at 445 Hz and 1426 Hz.
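To illustrate how a sinus modelled as a Helmholtz resonator introduces a zero, the following Python sketch evaluates the flow ratio across a single shunt, using the circuit-theory statement of Section 2 that a zero appears where the shunt impedance vanishes. It is only an illustration: the lumped form R + j(ωL - 1/(ωC)), the function names, and all element values are assumptions, and the sketch considers a single output port, so it does not reproduce the redistribution of zeros that occurs when lip and nostril flows are summed.

```python
import math


def helmholtz_impedance(freq, f_res, inertance, resistance=0.0):
    """Lumped acoustic impedance R + j(omega*L - 1/(omega*C)) of a resonator.

    The compliance C is chosen so that the reactance vanishes at f_res.
    The inertance and resistance values are hypothetical.
    """
    omega = 2.0 * math.pi * freq
    compliance = 1.0 / ((2.0 * math.pi * f_res) ** 2 * inertance)
    return complex(resistance, omega * inertance - 1.0 / (omega * compliance))


def shunt_factor(z_s, z_b):
    """Flow ratio across a shunt: downstream flow / upstream flow."""
    return z_s / (z_s + z_b)


# A lossless sinus tuned to 499 Hz, loaded by an arbitrary downstream
# impedance z_b (assumed value): the transmission vanishes at resonance.
z_b = complex(5.0, 20.0)
for f in (300.0, 499.0, 800.0):
    print(f, abs(shunt_factor(helmholtz_impedance(f, 499.0, 0.01), z_b)))
```

With a non-zero `resistance` the zero moves off the frequency axis and the dip in the transmission becomes finite, which corresponds to a zero with a non-zero bandwidth.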
The zeros have been redistributed when the volume flows at the lips and at the nostrils are superposed. Recall the discussion in Section 2. The transfer function is shown in the top right panel. In Fig. 3, the true transfer functions, H(f), from a lossy (Curve A) and a lossless simulation (Curve B) are superimposed.

Table I. Formant frequency and bandwidth (in Hz) of vowels /a/, /i/, /u/, and /e/. Four different simulation conditions have been considered. (A) and (C) pertain to a lossy simulation of the vocal tract and (B) and (D) to a lossless simulation. The bandwidths are all zero for the simulations of (B) and (D) and hence are not tabulated here. In (A) and (B), the glottis is closed while in (C) and (D), a glottal opening of 0.03 cm² is assumed. See text.

Table II. Formant frequency and bandwidth of vowels /a/, /i/, /u/, and /e/, in Hz. They have been calculated either by a direct lossy simulation of the vocal tract (A and C) or by the present fast algorithm (B and D). A closed glottis is assumed in (A) and (B) while in (C) and (D) the glottal opening area is 0.03 cm².

Table III. Comparison of computation time (columns: approach, time used; rows: the Nb method, the peak method, the present fast method). For each method, the time used is estimated on the basis of a total of 207 runs and is normalized.

Curve C is the resynthesized H(f), based on the mode data calculated by the fast algorithm. A good fit between Curve A and Curve C is observed. Note that the amplitude would go to infinity in a lossless simulation if it were evaluated exactly at a pole frequency. Otherwise only a sharp peak is shown. Occasionally, such sharp peaks are not visible because of a too large scanning step. This is the case for the pole at 1417 Hz in Fig. 3. From Table IV it is evident that the mode patterns obtained by the fast algorithm are the same as those calculated from a direct lossy simulation, within the accuracy of a couple of Hertz.
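The two-stage idea of the fast algorithm can be sketched as follows. Stage 1 brackets the poles of the lossless system by sign changes of the (real) pole function; stage 2 restores the losses and refines each pole. Note that the refinement below uses a plain Newton iteration in the complex frequency plane as a stand-in for the paper's interpolation formula, which is an assumption of this sketch, as are the function names, the constant attenuation `alpha`, the form of the lip load `z_rad`, and the physical constants. Only the stopping rule, a correction smaller than 1 in angular frequency, follows the text.

```python
import cmath
import math

RHO, C = 1.14e-3, 3.5e4   # assumed air density (g/cm^3), sound velocity (cm/s)


def h_p(omega, areas, lengths, alpha, z_rad):
    """Pole function Ug/Uo for sections listed from lips to glottis.

    omega may be complex. alpha = 0 together with a purely reactive z_rad
    gives the lossless system, for which the result is real.
    """
    z_b = z_rad(omega)
    prod = 1.0 + 0.0j
    for area, length in zip(areas, lengths):
        gamma = (alpha + 1j * omega / C) * length
        z_i = RHO * C / area
        ch, sh = cmath.cosh(gamma), cmath.sinh(gamma)
        prod *= ch + (z_b / z_i) * sh
        z_b = z_i * (z_b * ch + z_i * sh) / (z_i * ch + z_b * sh)
    return prod


def fast_poles(areas, lengths, alpha, z_rad_lossy, z_rad_lossless,
               f_max=5000.0, step=40.0):
    """Return a list of (pole frequency in Hz, bandwidth in Hz)."""
    def lossless(f):
        return h_p(2.0 * math.pi * f, areas, lengths, 0.0, z_rad_lossless).real

    def lossy(w):
        return h_p(w, areas, lengths, alpha, z_rad_lossy)

    poles = []
    f_prev, v_prev = step, lossless(step)
    for k in range(2, int(f_max / step) + 1):
        f = k * step
        v = lossless(f)
        if v_prev != 0.0 and v_prev * v <= 0.0:
            # Stage 1: lossless pole by linear interpolation
            f0 = f_prev - v_prev * (f - f_prev) / (v - v_prev)
            # Stage 2: losses restored, Newton refinement of the complex root
            w, d = complex(2.0 * math.pi * f0), 1.0
            for _ in range(20):
                dw = -lossy(w) * 2.0 * d / (lossy(w + d) - lossy(w - d))
                w += dw
                if abs(dw) < 1.0:        # stop when the correction is < 1 rad/s
                    break
            poles.append((w.real / (2.0 * math.pi), w.imag / math.pi))
        f_prev, v_prev = f, v
    return poles
```

For a uniform 17.5 cm tube with an ideal open lip end and a constant attenuation, the exact poles stay at 500 Hz, 1500 Hz, and so on, with a bandwidth of `alpha * C / pi`, which gives a simple check of the sketch. A more realistic test would pass a lossy radiation load as `z_rad_lossy` and its reactive part as `z_rad_lossless`.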
Note that in a lossless simulation, $H_p(f)$ in Eq. (7) reduces to a real number. One may utilize the Nb method without the need to decide which part is dominating. It should also be pointed out in this connection that a lossless vocal-tract computation not only requires less time but can also find those poles which would have been missed in a direct lossy simulation due to their large damping.

Table IV. Formant and antiformant patterns of a nasalized sound, in Hz. (A): obtained from a direct lossy simulation; (B): obtained by the fast algorithm. FZ and BZ denote frequency and bandwidth of zeros, respectively.

Fn (A)   Fn (B)   Bn (A)   Bn (B)
377      378      34       34
769      769      45       45
1047     1047     47       47
1162     1163     54       53
1414     1417     123      118
2162     2162     61       61
3020     3021     90       90
3573     3575     202      202
4048     4049     110      110
4343     4344     109      108

FZ (A)   FZ (B)   BZ (A)   BZ (B)
445      445      42       42
1116     1117     53       53
1426     1426     101      102
3287     3288     137      140

5. SPECIAL CARE FOR "HIDDEN" POLES
A heavily damped pole does, however, present problems to the linear interpolation procedure when all losses have been put back. This is exemplified in Fig. 4. Curve A is the calculated transfer function of the vowel /a/. The simulation condition is the same as (A) in Table I, but now an open glottis of 0.08 cm² is assumed. (The subglottal coupling is not considered in the present work.) As illustrated in Tables I and V, the glottal coupling has resulted in an upward shift in the frequencies of all formants and in larger bandwidths. The second formant in Fig. 4 is most apparently damped, a strong indication that the formant is mainly affiliated with the back cavities. Curve B is the lossless version of Curve A. The computation indicates that the lossless system has a pole around 1370 Hz. If we put back all losses and begin to calculate, by linearly interpolating from this frequency, the F2 of the corresponding lossy system, Eq. (12) does not converge. This is because F2 of Curve B in Fig.
4 does not possess a well defined peak, which is related to the close proximity of F1 and F2 and to the fact that F2 is heavily damped. However, the true peak shape may be recovered by eliminating the F1 response from the true transfer function, as illustrated in Fig. 5. A peak at 1180 Hz is now clearly seen. Therefore, special care is needed to recover the peak shape for a heavily damped pole before the interpolation. In the example of Fig. 4, this means that the F1 response, whose frequency and bandwidth are already known, should first be removed. A more complicated case is when the response of a higher formant has to be left out, because its frequency and bandwidth are not known yet. We shall return to this problem later.

The calculated formant patterns for Curves A and B in Fig. 4 are given in Table V. To this table is added the re-calculated B1 when the F2 component has been eliminated, after the determination of F2. Curve C in Fig. 4 is the resynthesized transfer function. From this figure it is found that an increased B1 tends to improve the fit of the F1 peak. Although a 10 Hz difference in a B1 of 157 Hz is not of significance, it is expected that such a re-calculation would be necessary when two heavily damped formants are closely located and their amplitudes are of the same order of magnitude. In Atal & al. (1978), responses of lower (and known) formants were all removed. They did not re-calculate lower formants afterwards.

Fig. 4. Calculated and resynthesized transfer functions of the vowel /a/.

Table V. Formant frequency and bandwidth of the vowel /a/, in Hz. The glottal area is 0.08 cm² and the lung pressure 8 cm H2O. The glottal resistance R = 1.32 ρc (g/cm⁴/s) and the glottal inductance L = 6.005 (g/cm⁴). The other conditions are the same as in Table I. (a): lossless simulation without the interpolation; (b): with the interpolation; (c): with elimination of the F1 component and then the interpolation; (d): re-calculation of B1 after the determination of F2. See text.

6. FURTHER CONSIDERATION
A lossless simulation of the vocal tract has been shown capable of saving computation time and detecting all formants within the frequency range of interest. We shall, however, discuss some exceptions in this section and propose methods to overcome the difficulty.

First consider the choice of the frequency scanning step. The larger the step, the less computation is needed to complete the scanning over a given frequency interval. However, if this step is chosen too large, there is a risk that more than one pole (or zero, if the numerator is concerned) falls in the same increment interval. This is more likely the case for nasalized sounds, which have more densely spaced formants. In TRACTTALK, the choice of the scanning step is made adaptive to partially alleviate this problem. A greater step is used for vowel-like sounds while for nasalized sounds a smaller step is used.

[...] been missed. The method outlined above has been implemented in TRACTTALK and has been found to work satisfactorily.

At present, the hyperbolic functions, and for a lossless system the circular functions, in Eqs. (3) and (4) are directly evaluated. The computation can be made still faster if one resorts to a table look-up. We have two alternatives: either a table for hyperbolic functions and a table for circular ones, or, by rewriting Eqs. (3) and (4), a table for exponential functions and a table for circular functions.

7. BRIEF DESCRIPTION OF TRACTTALK
TRACTTALK is a computer model of the vocal tract, implemented both in Fortran and in C. It is a comprehensive research tool for studies of the production mechanism and an integral part of a synthesizer. The vocal tract is simulated in the frequency domain. All important components of the vocal system are represented in the model, such as the subglottal system, wall impedance, and nasal sinuses.
Different radiation impedance models, proposed by Stevens & al. (1953), by Fant (1960), by Flanagan (1965), and by Wakita & Fant (1978), and different wall impedance models, proposed by Fant (1972), by Ishizaka & al. (1975), and by Fant & al. (1976), are implemented. When the cross-sectional areas change abruptly, an inner length correction is introduced to cope with inner radiation. If the area of a section is sufficiently small, a cascaded aerodynamic resistance at the upstream entry is accommodated.

TRACTTALK can be used to model various categories of speech sounds: vowels, fricatives, voiced occlusion before release, liquids, nasal murmurs, and nasalized sounds (Lin, 1990). There are basically two input formats. One is a direct input of the vocal-tract area function (including the area function of the nasal tract for nasal sounds), and the other is an input of area-function parameters which, in terms of some parameterization model, specify the vocal-tract configuration. Two parameterization models have been implemented, one based on cosine functions (Lin, 1990) and one based on polynomial functions (Fant, 1992). The latter is at present adopted for vowels only, while the former has been used for modelling of nasal sounds and apical articulation (Lin & Fant, 1992).

TRACTTALK can also be used to infer, from the first three formant frequencies, the underlying area-function parameters and the vocal-tract configuration, with a parameterization model as constraint. This procedure is known as the inverse vocal-tract transform. Nomograms of formant frequency and formant bandwidth can easily be generated as a function of the location of the constriction, the constriction area, and the lip parameter $l_0/A_0$. Formant-cavity affiliations may effectively be examined. Finally, one can study the effects of the acoustic interaction between the glottal source and the vocal-tract acoustic load. The reader is referred to Lin (1990) for details.
(In Lin (1990), several independent programs were presented instead of a single program package.)

TRACTTALK offers the facilities of on-line graphic display and playback of the synthesized speech. It also offers, after each run, an interactive mode in which a number of simulation conditions, including the area function itself, can be altered, after which the program recomputes the new transfer function.

8. CONCLUDING REMARKS
We have in the above sections described several fast and robust algorithms for vocal-tract computations in the frequency domain. They are:
1) algorithms for properly separating the zero part of the transfer function from the pole part, so that the determination of a pole frequency will not be influenced by its neighbouring zeros, and vice versa;
2) algorithms for efficiently and effectively calculating the transfer function and the associated resonance modes, for which interpolation schemes are proposed;
3) algorithms for recovering the peak shape of a heavily damped formant so that the proposed interpolation schemes can work satisfactorily;
4) algorithms for deciding whether a pole/zero has been missed.

In Section 2 we have also briefly addressed the iterative algorithm for the vocal-tract computation in the frequency domain and analysed the origin of the zero part of the transfer function. A brief presentation of a computer model of the vocal tract, TRACTTALK, is given in Section 7. We have not discussed the acoustic realization. Readers are referred to the works listed in the references.

Vocal-tract computation techniques have played an important role in the acoustic theory of speech production. They have long been used to study the relationship between articulation and acoustics. They have also been used in the inverse vocal-tract transform (see, for instance, Atal & al., 1978; Lin & Fant, 1989). In both cases, however, studies have mainly concentrated on vowels, and the computation speed is often not of concern.
Recently, there has been a growing interest in articulatory speech synthesis based on simulation in the frequency domain (Sondhi & Schroeter, 1987; Lin, 1990). To meet this new application, more robust and faster algorithms for the vocal-tract computation are needed. The fast algorithm described in Section 4 makes use of the fact that the resonance modes of the lossy vocal tract can be linearly interpolated from those of the lossless variant. The computation time is remarkably reduced while the accuracy is preserved, see Table II.

The proposed algorithm can also be adopted in the system of Sondhi & Schroeter (1987). They first sampled the transfer function and then applied the inverse discrete Fourier transform to derive the impulse response of the vocal tract. If the transfer function calculation is based on the lossless version of the vocal tract, and if only the samples around spectrum peaks and sharp valleys are replaced by those calculated from the lossy counterpart, the amount of computation is reduced.

In our synthesis algorithm (Lin, 1990), the transfer function is expressed as a sum of a few formant responses. It is therefore of vital importance that all formants in the frequency range of interest are successfully detected. Various factors have been considered in this paper so that the risk of missing a pole is minimized.

ACKNOWLEDGEMENTS
This work was performed as a part of an ESPRIT Basic Research Project, "SPEECH MAPS." The author would like to thank Professor Gunnar Fant for valuable comments and discussions.