Dept. for Speech, Music and Hearing
Quarterly Progress and
Status Report
Vocal-tract computation: how
to make it more robust and
faster
Lin, Q.
journal: STL-QPSR
volume: 33
number: 4
year: 1992
pages: 029-042
http://www.speech.kth.se/qpsr
STL-QPSR 4/ 1992
VOCAL-TRACT COMPUTATION: HOW TO MAKE IT MORE ROBUST
AND FASTER
Qiguang Lin
Abstract
Numerical methods are proposed to speed up the frequency-domain calculation of
the vocal-tract response and effectively determine the associated pole/zero patterns.
These methods include algorithms to properly decompose the numerator and the
denominator of the transfer function, to estimate formant patterns of a lossy vocal-tract system from its lossless counterpart by means of linear interpolation, and to
utilize the information of residues at the system poles to ensure that no pole/zero is
missed. It is found that the proposed methods reduce the computation time at least
by a factor of 2 while maintaining the accuracy. At the same time the risk of
missing a pole or a zero is minimised. These algorithms have been implemented in
a computer program, TRACTTALK, an articulation-based speech synthesizer. A
brief description of TRACTTALK is presented.
1. INTRODUCTION
Vocal-tract computation, to calculate the acoustic output of a given area function, has
been one of the important themes in acoustic theory of speech production (Fant,
1960; Flanagan, 1965; Sondhi, 1974; Atal, & al., 1978). The techniques are nowadays
well established for vowel-like sounds (all-pole systems) and follow two major approaches: a time-domain method and a frequency-domain method (see for instance
Lin (1990) for a brief overview). In both approaches, a plane wave propagation inside the vocal tract is usually assumed and the tract is divided into a number of short
sections, each of which is approximated by a cylindrical tube.
This paper sets out to study how to make the vocal-tract computation more robust
and faster. Emphasis is placed on a more complex system which is no longer of all-pole type. The vocal tract is simulated in the frequency domain so that the frequency-dependency of various elements can be accurately and effectively modelled
and the overall tract length may vary continuously.
We shall in Section 2 discuss how to decompose the zero part and pole part of the
transfer function. This decomposition is necessary, otherwise the determination of a
pole will be contaminated by the presence of a zero in the vicinity, and vice versa. If
the system is an all-pole one or if formant- and antiformant patterns are not to be
determined, there is then no need for such an explicit decomposition. The calculation
of mode patterns from a given transfer function is discussed in Section 3. We shall
also compare different root-searching methods.
In Section 4 we shall demonstrate how to expedite the vocal-tract computation.
The lossy vocal-tract system is initially treated as lossless. That is, the resistive part
of all elements is set to zero. The computation of the lossless system is much faster
than that of a lossy one. The pole/zero patterns of the original lossy system can, as
will be shown, be linearly interpolated from those of the lossless one.
A lossless simulation possesses the additional advantage of bringing out heavily
damped formants which would be hard to detect by a direct lossy simulation. However, a heavily damped formant does indeed present a problem during the linear
interpolation, especially when two formants are closely spaced. In this case, an adjacent and known formant is eliminated from the true vocal-tract response to recover
the obscured formant peak before applying the interpolation scheme. This is exemplified in Section 5.
In addition to the center frequencies and bandwidths of poles/zeros, residues at
the poles are also calculated. The residues are mainly used as control parameters to a
recently proposed synthesis scheme (Lin, 1990; Lin & Fant, 1990; 1992), but they are
also used to check whether there is a missing pole or zero, e.g., resulting from a too
large frequency scanning step, Section 6.
We have implemented the above robust and fast algorithms in a computer model
of the vocal tract: TRACTTALK. The computation time is at least halved while the
results remain sufficiently accurate. The chance of missing poles is also greatly
minimized. These are of vital importance to TRACTTALK, which is an articulation-based speech synthesizer using a parallel formant structure (derived directly from
the vocal-tract computation). A brief description of TRACTTALK is presented in Section 7.
2. DECOMPOSITION OF THE TRANSFER FUNCTION
Figure 1 shows a short cylindrical tube and its network representation. The inputs
and outputs of the section are related by:
$$P_i = P_{i-1}\cosh T_i + U_{i-1} Z_i \sinh T_i,$$
$$U_i = \frac{P_{i-1}}{Z_i}\sinh T_i + U_{i-1}\cosh T_i, \tag{1}$$

where $P_i$ and $P_{i-1}$ are the pressure at the upstream and downstream ends of the section, respectively. $U_i$ and $U_{i-1}$ are the corresponding volume flow. $T_i$ is the transfer
constant and $Z_i$ the characteristic impedance of the section $i$. They are normally
complex numbers. In a lossless simulation, however, $T_i$ reduces to a pure imaginary
quantity,

$$T_i = j\omega l_i / c, \tag{2a}$$

and $Z_i$ to a real number,

$$Z_i = \rho c / A_i, \tag{2b}$$

where $c$ is the sound velocity, $\rho$ the air density, $\omega$ the angular frequency, and $l_i$ and $A_i$ the length and area of the section. The hyperbolic functions in Eq.
(1) also reduce to circular functions under the lossless assumption. It should be noted
that even in a lossy simulation, $Z_i$ is independent of the section length. The real and
imaginary parts of $T_i$ are a function of the section area and its length, except for the
lossless case, Eq. (2a).
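From Eq. (1) we derive the transfer function of the individual section $U_{i-1}/U_i$. Its
reciprocal $U_i/U_{i-1}$ is given by:

$$\frac{U_i}{U_{i-1}} = \cosh T_i + \frac{Z_B}{Z_i}\sinh T_i, \tag{3}$$

where $Z_B = P_{i-1}/U_{i-1}$ represents the downstream impedance, see Fig. 1. It is updated
section-wise. The updated impedance in Fig. 1 is:

$$Z_B' = \frac{P_i}{U_i} = Z_i\,\frac{Z_B\cosh T_i + Z_i\sinh T_i}{Z_i\cosh T_i + Z_B\sinh T_i}. \tag{4}$$

It is well known that Eq. (3) contains no poles, that is, the individual transfer
function $U_{i-1}/U_i$ has no zeros.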
Fig. 1. (a): A cylindrical section of length $l_i$ and area $A_i$, and (b): its equivalent network representation under the assumption of a hard wall. $Z_B = P_{i-1}/U_{i-1}$ is the downstream impedance to the section,
seen toward the lip end. $Z_a = Z_i\tanh(T_i/2)$, $Z_b = Z_i/\sinh(T_i)$, see text.
By abutting together the sections shown in Fig. 1, the network representation of
the entire vocal tract is obtained, except for boundary conditions such as the radiation impedance and wall impedance. The computation begins at the lip end, where
ZB is equal to the radiation impedance, and progressively iterates to the back end.
For voiced sounds, the transfer function of the system is defined as the ratio of the
lip volume flow Uo to the glottal volume flow Ug. The iteration terminates at the
glottal end. The system transfer function is a product of the transfer function of individual sections. If the vocal tract consists of only cascading sections of Fig. 1, it is
then an all-pole system (assuming an infinite glottal impedance). The poles can be
determined by a root-searching method.
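The lip-to-glottis iteration described above can be sketched in a few lines. This is only a minimal illustration, assuming a lossless hard-walled tract (transfer constant $j\omega l_i/c$, characteristic impedance $\rho c/A_i$) and, by default, an ideal zero radiation impedance. The name `pole_function`, the constants, and the uniform-tube example are hypothetical and are not taken from TRACTTALK.

```python
import cmath
import math

C = 35000.0    # assumed sound velocity, cm/s
RHO = 1.14e-3  # assumed air density, g/cm^3


def pole_function(f, areas, lengths, z_rad=0j):
    """Return U_g/U_o at frequency f (Hz) for sections ordered from lips to glottis.

    Each section applies the two-port relation of Eq. (1) with the lossless
    (assumed) transfer constant and characteristic impedance.
    """
    omega = 2.0 * math.pi * f
    u = 1.0 + 0j      # lip volume flow, normalised to 1
    p = z_rad * u     # lip pressure from the radiation impedance
    for area, length in zip(areas, lengths):
        t = 1j * omega * length / C   # lossless transfer constant
        z = RHO * C / area            # lossless characteristic impedance
        p, u = (p * cmath.cosh(t) + u * z * cmath.sinh(t),
                p * cmath.sinh(t) / z + u * cmath.cosh(t))
    return u


# Uniform tube, 17.5 cm long, split into 10 sections of 5 cm^2.
areas = [5.0] * 10
lengths = [1.75] * 10
first_pole_value = pole_function(500.0, areas, lengths)
```

For the uniform tube the result reduces to $\cos(\omega L/c)$, so the value at 500 Hz is numerically zero, which marks the first pole of the all-pole system.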
Zeros arise when shunting branches, for instance, the wall impedance arms, are
introduced. A direct determination of poles would then be contaminated by the
presence of adjacent zeros, and vice versa. A prerequisite to calculation is to properly
decompose the transfer function into its zero part (numerator) and pole part
(denominator).
In Fig. 2, a shunt of lumped impedance $Z_s$ is inserted between two sections. With $U$ the flow entering the junction from the glottal side, $U'$ the flow continuing toward the lips, and $Z_B$ the downstream impedance seen from the junction, the
transfer function between the flows before and after the shunt is:

$$\frac{U'}{U} = \frac{Z_s}{Z_s + Z_B}. \tag{5}$$

From conventional circuit theory, it is known that a zero is brought into the transfer function at the frequency where a shunt has a zero impedance (refer to the ladder
filter system in Fig. 2) or where a serial element has an infinite impedance. Therefore, $Z_s$ is a factor of the zero function, see Eq. (5). If there are more such shunts, the
final zero function will be the product of these elements, provided that there is only
one output port (Lin, 1990). For the case of the simultaneous radiation, the volume
flows at the radiation ports are linearly summed up, disregarding spatial phase difference and radiation interference. The final zero function
is a sum of the zero part of the transfer function pertinent to each radiation port and zeros will be redistributed.

Fig. 2. Network representation (a ladder filter) of two contiguous sections with an extra shunt in between.
$Z_a = \tanh(x) = \sinh(x)/\cosh(x)$ is a serial element in Fig. 1. Although $Z_a$ has poles at
frequencies where $\cosh(x) = 0$, and so does the transfer function, these singularities are
exactly cancelled when the circuit in Fig. 1 is considered as a whole.
Shunting elements encountered in the vocal-tract modelling may result from the
following extra shunts:
a) Sinuses, such as the sinus piriformis and nasal sinuses;
b) Wall impedance arms (at low frequencies);
c) Lateral coupling, as in the /l/ sound and nasal/oral coupling, as in nasal
murmurs and nasalized vowels.
A nasalized sound represents a most complex case. The oral cavity constitutes a
shunt to the pharyngeal + nasal passage, while the nasal cavity a shunt to the pharyngeal + oral passage.
It is clear that to decompose the transfer function the numerator and denominator
of Eq. (5) should be treated separately when a shunt is met during the recursive
computation. The final Uo/Ug contains two components. One has exclusively poles
while the other has exclusively zeros, if there are any.
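The separate bookkeeping of numerator and denominator at a shunt can be illustrated with a small sketch. It assumes that the shunt acts as a simple flow divider, so that its impedance is the numerator factor and the sum of shunt and downstream impedance is the denominator factor, and it models the shunt as a lumped Helmholtz resonator. The function names and the element values are hypothetical.

```python
import math


def helmholtz_impedance(f, inertance, compliance, resistance=0.0):
    """Lumped series R-L-C impedance of a side branch at frequency f (Hz)."""
    omega = 2.0 * math.pi * f
    return complex(resistance, omega * inertance - 1.0 / (omega * compliance))


def shunt_factors(z_shunt, z_downstream):
    """Return (zero factor, pole factor) contributed by one shunt.

    The two factors are kept apart instead of forming their ratio, so that
    zeros and poles can later be searched for independently.
    """
    return z_shunt, z_shunt + z_downstream


# Assumed example: a lossless side branch tuned to 499 Hz.
F_RES = 499.0
INERTANCE = 0.01
COMPLIANCE = 1.0 / ((2.0 * math.pi * F_RES) ** 2 * INERTANCE)

zero_part, pole_part = shunt_factors(
    helmholtz_impedance(F_RES, INERTANCE, COMPLIANCE), complex(10.0, 5.0))
```

At the resonance of the side branch the zero factor vanishes while the pole factor stays finite, which is why searching the two parts separately avoids the mutual contamination mentioned in the text.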
For unvoiced fricative sounds, the transfer function is defined as the ratio of the
lip flow to a constant pressure source. This ratio is determined by two iterations, one
from the lip end to the glottal end (Uo/Ug) and the other from the position where the
pressure source is located to the glottal end (Ug/Ec) (Badin & Fant, 1984; Lin, 1990).
If there is no other shunt, the pole part of the transfer function Uo/Ec equals Ug/Uo,
and the zero part Ug/Ec. Zeros of the transmission occur at frequencies where the
impedance of the back cavity system as seen from the source is infinite (a serial element). If additional shunts occur, care should be taken to distinguish the shunts
anterior and posterior to the pressure source. The shunts posterior to the pressure
source are cancelled from the numerator in the product of (Uo/Ug)(Ug/Ec).
3. FROM TRANSFER FUNCTION TO MODE PATTERN
In a lossless system, the distinction between the resonance frequency, the oscillatory
frequency, and the spectral peak frequency vanishes. In a lossy system the difference between
these frequencies is, however, small and negligible for a high-valued Q-factor
and/or for a multi-mode system with well separated poles. Accordingly, we shall
simply refer to the frequency of a pole or a zero. In the following, we shall only
illustrate how to determine poles from a decomposed transfer function. Zeros can be
treated in a similar way, see Lin (1990, pp. 17-21). Our criterion of a successful detection of mode patterns is that the true vocal-tract transfer function can be regenerated by a parallel formant synthesizer (Lin, 1987; 1990; Lin & Fant, 1990).
Let us consider an all-pole system and let the transfer function be:

$$H(f) = \frac{H_z(f)}{H_p(f)}, \tag{6}$$

where the numerator $H_z(f)$ is in this case unity. The denominator $H_p(f)$ contains an
infinite number of roots and is normally a complex number whose two parts are denoted $N_a(f)$ and $N_b(f)$, Eq. (7).
For a lossless vocal tract $N_a(f) = 0$ and for a lossy system $N_a(f)$ is small compared
with $N_b(f)$. Consequently, the roots of the complex function $H_p(f)$ must be located in
the neighbourhood of the roots of $N_b(f)$. The procedure to determine poles of the
transmission is as follows.
At a given frequency, the pole function is first computed and the polarity of the
$N_b(f)$ term is noted. The frequency is next increased by a certain amount (the so-called scanning step) and the corresponding pole function at the new frequency is
computed. If $N_b(f)$ changes sign within this interval there is a root to be detected.
Refer to Fant (1960; 1985), Badin & Fant (1984), and Lin (1990) for more details.
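The scanning procedure can be sketched as follows. The sketch assumes that the dominating term is available as a real-valued function of frequency and refines each detected sign change by plain bisection; bisection is a choice made here for simplicity and is not stated in the text. The name `scan_roots` and the uniform-tube test function are hypothetical.

```python
import math


def scan_roots(func, f_start, f_stop, step, tol=1e-6):
    """Scan [f_start, f_stop] with a fixed step and bisect every sign change."""
    roots = []
    f0, y0 = f_start, func(f_start)
    while f0 < f_stop:
        f1 = min(f0 + step, f_stop)
        y1 = func(f1)
        if y0 * y1 < 0:                     # polarity changed: one root inside
            lo, hi, ylo = f0, f1, y0
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                ymid = func(mid)
                if ylo * ymid <= 0:
                    hi = mid
                else:
                    lo, ylo = mid, ymid
            roots.append(0.5 * (lo + hi))
        f0, y0 = f1, y1
    return roots


def uniform_tube(f, length=17.5, c=35000.0):
    """Dominating term of a lossless uniform tube with an ideal open end."""
    return math.cos(2.0 * math.pi * f * length / c)


poles = scan_roots(uniform_tube, 50.0, 5000.0, 100.0)
```

With a 100 Hz scanning step the five roots of the 17.5 cm tube below 5 kHz are each bracketed once and then refined, giving the familiar odd multiples of 500 Hz.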
The effect of a finite $N_a$ term is accounted for by means of a first-order linear interpolation:

$$\Delta\omega_n = -\sigma_n N_a' / N_b', \tag{8}$$

where $'$ denotes taking derivatives with respect to frequency.
The bandwidth $B_n$ is given as a by-product of the interpolation, Eq. (9).
The final pole frequency, Eq. (10), is obtained by adding the correction $\Delta\omega_n/2\pi$ to the frequency at which $N_b = 0$. This method is henceforth referred to as
the Nb method.
Lin (1988; 1990) proposed an alternative approach for searching the roots of Hp(f).
Instead of using Nb(f), the following function is defined:
where $\Delta f$ is a small increment in frequency. Poles of the transmission occur at frequencies where $D(f)$ upwards intercepts the frequency axis. The method has been
called the peak method for simplicity.
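A rough sketch of the idea follows. The definition of $D(f)$ is not reproduced here, so the sketch assumes a forward difference of the magnitude of the pole function, $|H_p(f+\Delta f)| - |H_p(f)|$, which crosses zero upward at a minimum of $|H_p|$, that is, at a spectral peak. This choice, the function names, and the lightly damped test function are all assumptions for illustration.

```python
import math


def peak_method_brackets(hp, f_start, f_stop, step, df=1.0):
    """Return intervals (f0, f1) in which the assumed D(f) crosses zero upward."""
    def d(f):
        # Assumed form of D(f): forward difference of |H_p|.
        return abs(hp(f + df)) - abs(hp(f))

    brackets = []
    f0, d0 = f_start, d(f_start)
    while f0 < f_stop:
        f1 = f0 + step
        d1 = d(f1)
        if d0 < 0 <= d1:            # upward crossing: minimum of |H_p| inside
            brackets.append((f0, f1))
        f0, d0 = f1, d1
    return brackets


def damped_tube(f):
    """Hypothetical lightly damped pole function with minima at 500, 1500, ... Hz."""
    theta = math.pi * f / 1000.0
    return complex(math.cos(theta), 0.1 * math.sin(theta))


brackets = peak_method_brackets(damped_tube, 50.0, 5000.0, 100.0)
```

Because only the magnitude is used, no decision is needed about whether the real or the imaginary part dominates, at the cost of an extra function evaluation per scanning step.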
The peak method requires more computation time than the Nb method. It
has, however, several advantages. First, there is no need to decide which part of
$H_p(f)$ is the dominating term. For instance, when a lumped wall impedance arm
$Z_w = R_w + j\omega L_w$ is inserted in the pharyngeal region, $H_p(f)$ after the decomposition is
dominated by its imaginary part, instead of the real part. In other words, poles of the
transmission are located in the vicinity of the roots of the imaginary part. The Nb
method requires first establishing how many extra shunts are present in the simulation
and how their impedance is characterized, in order to determine which part is predominating.
The second advantage of the peak method lies in the fact that it also makes use of
information of the dominating term in the linear interpolation. The corresponding
frequency correction term is:
and the real part of the pole is now given by:
Note that $\Delta\omega_n$ is by definition equal to 0 at the frequency where $D(f) = 0$. If one has
from the outset considered a lossy vocal tract, it is unnecessary to compute this correction
term. However, the above interpolation procedure, Eq. (12), is found useful in deriving mode patterns of a lossy system from those of the lossless variant. This is because
$D(f)$ of Eq. (11) will take a different value after the losses are put back. Consequently
$\Delta\omega_n$ is not equal to 0.
It can be seen that Eqs. (13) and (9) are identical when $N_b = 0$.
4. A FAST ALGORITHM
There are two major iterative operations in the vocal-tract computation. At a given
frequency, as we have seen already, the computation for a vowel sound iterates from
the lip end to the glottal end to compute the transfer function. The frequency is next
increased by a certain step and the transfer function is computed at the new frequency. The procedure repeats until the whole frequency range of interest is
scanned. The components involved in the lossy simulation are mainly complex
numbers, except a few such as frequency, cross-sectional areas, and section lengths.
The computation is accordingly time-consuming. We seek algorithms that can reduce the computation amount. A possible solution is to replace all complex numbers
by their real part or pure imaginary part, by removing the dissipative part of the vocal-tract elements. Recall Eqs. (2a) and (2b). In this way a complex product, four
multiplications and two additions of real numbers, is approximated by a single multiplication of real numbers and thus the computation can be expedited.
Table I lists vowel formants calculated under different simulation conditions. The
area functions of these vowels have been taken from Fant (1960). (A) pertains to a
lossy vocal-tract simulation with a closed glottis. The radiation impedance is approximated by the SKF model (Stevens, Kasowski, & Fant, 1953). The wall impedance is based on that of Fant, Nord, & Branderud (1976). The surface losses are twice
as large as those predicted by the classical formulas (Fant, 1960; Flanagan, 1965) to
account for the deviation of the vocal-tract cross-sectional shape from a perfect circle.
(B) is the same as (A) but the vocal-tract losses are all removed, i.e., a lossless simulation. (C) is the same as (A) but with a small open glottis. The glottal impedance is
$R_g = 3.947\,\rho c$ (g/cm⁴/s) and $L_g = 0.013$ (g/cm⁴). The corresponding lung pressure is 8 cm
H2O and the glottal opening area is 0.03 cm². (D) is the same as (C) but again the
loss components are all removed.
It is clear from Table I that when the glottal impedance is infinite, the dissipative
part has only a little effect on the mode parameters. F5 of the vowel /e/ has the largest absolute difference, 146 Hz. See also Badin & Fant (1984) and Lin (1990, p. 35).
With a finite glottal coupling, the difference becomes larger. This conforms well with
the derivation of Flanagan (1965, p. 64). The extent to which formant frequencies differ depends on particular modes. Appreciable errors are found for lower formants,
except for F5 of /e/ (261 Hz). The absolute difference is otherwise less than 100 Hz.
Based on the result of Table I, a fast algorithm is suggested to infer, by means of
linear interpolation, the mode pattern of a lossy vocal tract from a lossless simulation. The fast algorithm carries out the vocal-tract computation in two stages (Lin,
1992). First, it determines pole frequencies for the lossless vocal tract. At each pole,
all losses are then restored and Eq. (12) is applied to determine the pole frequency
corresponding to the lossy vocal tract. The pole frequency is updated, $f_n' = f_n + \Delta\omega_n/2\pi$,
and Eq. (12) is then again used to compute $\Delta\omega_n$ at the new frequency. The recursive
calculation ends when $|\Delta\omega_n| < 1$. The bandwidth of the pole is calculated according
to Eq. (13) at the final pole frequency.
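The two-stage idea can be sketched as below. Since the interpolation formulas themselves are not reproduced here, the sketch substitutes a generic first-order (Newton-like) step in the complex frequency plane, evaluated on the real frequency axis: its imaginary part plays the role of the frequency correction and its real part gives the bandwidth. This is an assumed stand-in, not the paper's Eq. (12) or (13), and the names `refine_pole` and `lossy_resonator` are hypothetical.

```python
import math


def refine_pole(hp, f_lossless, df=1.0, tol_hz=0.01, max_iter=50):
    """Refine a lossless pole frequency using the lossy pole function hp(f).

    Returns (pole frequency, bandwidth), both in Hz.
    """
    def step(f):
        # dH/ds from a central difference along the axis s = j*2*pi*f.
        dh_ds = (hp(f + df) - hp(f - df)) / (2.0 * df) / (2j * math.pi)
        return -hp(f) / dh_ds          # first-order estimate of (pole - s)

    f = f_lossless
    for _ in range(max_iter):
        d_omega = step(f).imag         # frequency correction in rad/s
        f += d_omega / (2.0 * math.pi)
        if abs(d_omega) < 2.0 * math.pi * tol_hz:
            break
    sigma = step(f).real               # real part of the pole
    return f, -sigma / math.pi


def lossy_resonator(f, f0=1000.0, zeta=0.03):
    """Hypothetical single-mode pole function; its lossless root is f0."""
    x = f / f0
    return complex(1.0 - x * x, 2.0 * zeta * x)


freq, bandwidth = refine_pole(lossy_resonator, 1000.0)
```

For this test resonator the true pole lies at about 999.55 Hz with a 60 Hz bandwidth; the sketch starts from the lossless root at 1000 Hz and settles within a fraction of a hertz of that frequency after a few iterations, each of which costs only three evaluations of the lossy function.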
The formant patterns obtained this way are presented in Table II, for the same
vowels as in Table I. Rows (A) and (C) in Table I have been duplicated in Table II for
the sake of an easy comparison.
In Table II an excellent agreement can be seen between (A) and (B) and between
(C) and (D). But the computation time has been greatly reduced. A comparison of
normalized computation time for different pole-searching methods is given in Table
III. The Nb method requires twice as much computation time as the fast method
does. The peak method is 65% slower than the Nb method. The difference depends of course on the number of sections of the area function and also on the scanning
step. The more sections and/or the smaller the step, the greater the difference is.
An example of nasalized vowels is presented in Fig. 3. The formants and antiformants from the lossy and lossless simulations are tabulated in Table IV. Fig. 3 is
one of the typical displays of TRACTTALK. In the top left panel the area functions of
the vocal-tract and nasal tract are plotted, with the left end denoting the glottal
termination. (The area functions have not been optimally prepared in this example.
They serve to illustrate the proposed algorithms.) The calculated poles and zeros are
tabulated in the panel beneath the area function. Frequency and bandwidth are both
in Hertz.
The simulation conditions are recorded in the bottom right corner. Two sinuses
have been considered. They are modelled as Helmholtz resonators with resonance
frequencies tuned to 499 Hz and 1400 Hz, respectively. The sinus with the lower
resonance frequency is inserted at the position 6 cm from the nostrils, while the other
is inserted 8 cm from the nostrils. They are the origin of the zeros of the coupled system
at 445 Hz and 1426 Hz. The zeros have been re-distributed when the volume flows at
the lips and at the nostrils are superposed. Recall the discussion in Section 2. The
transfer function is shown in the top right panel. In Fig. 3, the true transfer functions,
H(f), from a lossy (Curve A) and a lossless simulation (Curve B) are superimposed.
Table I. Formant frequency and bandwidth (in Hz) of vowels /a/, /i/, /u/, and /e/. Four different simulation conditions have been considered. (A) and (C) pertain to a lossy simulation of the vocal tract and (B) and (D) to a lossless simulation. The bandwidths are all zero for the simulation of (B) and (D) and hence are not tabulated here. In (A) and (B), the glottis is closed while in (C) and (D), a glottal opening of 0.03 cm² is assumed. See text.

Table II. Formant frequency and bandwidth of vowels /a/, /i/, /u/, and /e/, in Hz. They have been calculated either by a direct lossy simulation of the vocal tract (A and C) or by the present fast algorithm (B and D). A closed glottis is assumed in (A) and (B) while in (C) and (D) the glottal opening area is 0.03 cm².

Table III. Comparison of computation time (columns: approach, time used) for the Nb method, the peak method, and the present fast method. For each method, the used time is estimated on a basis of a total of 207 runs and is normalized.
Curve C is the resynthesized H(f), based on the mode data calculated by the fast
algorithm. A good fit between Curve A and Curve C is observed. Note that the
amplitude would go to infinity in a lossless simulation if it is evaluated exactly at a
pole frequency. Otherwise only a sharp peak is shown. Occasionally, such sharp
peaks are not visible because of a too large scanning step. This is the case for the pole
at 1417 Hz in Fig. 3.
From Table IV it is evident that the mode patterns obtained by the fast algorithm
are the same as those calculated from a direct lossy simulation, within the accuracy
of a couple of Hertz.
Note that in a lossless simulation, $H_p(f)$ in Eq. (7) reduces to a real number. One may utilize the Nb method without the need to decide which part is dominating. It should also be pointed out in this connection that a lossless vocal-tract computation not only requires less time but can also find those poles which would have been missed in a direct lossy simulation due to their large damping.

Table IV. Formant and antiformant patterns of a nasalized sound, in Hz. (A): obtained from a direct lossy simulation; (B): obtained by the fast algorithm. FZ and BZ denote frequency and bandwidth of zeros, respectively.

    Fn (A)   Fn (B)   Bn (A)   Bn (B)
     377      378       34       34
     769      769       45       45
    1047     1047       47       47
    1162     1163       54       53
    1414     1417      123      118
    2162     2162       61       61
    3020     3021       90       90
    3573     3575      202      202
    4049     4048      110      110
    4343     4344      109      108

    FZ (A)   FZ (B)   BZ (A)   BZ (B)
     445      445       42       42
    1116     1117       53       53
    1426     1426      101      102
    3287     3288      137      140

5. SPECIAL CARE FOR "HIDDEN" POLES
A heavily damped pole does, however, present problems to the linear interpolation procedure when all losses have been put back. This is exemplified in Fig. 4. Curve A is the calculated transfer function of the vowel /a/. The simulation condition is the same as (A) in Table I, but now an open glottis of 0.08 cm² is assumed. (The subglottal coupling is not considered in the present work.) As illustrated in Tables I and V, the glottal coupling has resulted in an upward shift in frequencies of all formants and resulted in larger bandwidths. The second formant in Fig. 4 is most apparently damped, a strong indication that the formant is mainly affiliated with the back cavities. Curve B is the lossless version of Curve A.
The computation indicates that the lossless system has a pole around 1370 Hz. If
we put back all losses and begin to calculate, by linearly interpolating from this frequency, the F2 of the corresponding lossy system, Eq. (12) does not converge. This is
because F2 of Curve B in Fig. 4 does not possess a well defined peak, which is related
to the close proximity of F1 and F2 and that F2 is heavily damped.
However, the true peak shape may be recovered by eliminating the F1 response
from the true transfer function, as illustrated in Fig. 5. A peak at 1180 Hz is now
clearly seen. Therefore, special care is needed to recover the peak shape for a heavily
damped pole before the interpolation. In the example of Fig. 4, this means that the F1
response, whose frequency and bandwidth are already known, should first be removed. A more complicated case is when the response of a higher formant has to be
left out, because its frequency and bandwidth are not known yet. We shall return to
this problem later.
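The elimination step can be illustrated with a small sketch. It assumes that each formant contributes a standard two-pole resonance normalised to unity at zero frequency, so that a known formant can be removed by dividing the transfer function by its response. The function names and the numerical values (F1 = 700 Hz, B1 = 150 Hz, F2 = 1180 Hz, B2 = 300 Hz) are made up for the illustration and are not the values of Fig. 4 or Table V.

```python
import math


def formant_response(f, fn, bn):
    """Two-pole resonance with centre frequency fn and bandwidth bn (Hz)."""
    s = 2j * math.pi * f
    sn = complex(-math.pi * bn, 2.0 * math.pi * fn)
    return (sn * sn.conjugate()) / ((s - sn) * (s - sn.conjugate()))


def eliminate_formant(h_total, f, fn, bn):
    """Remove the response of a known formant from a transfer-function sample."""
    return h_total / formant_response(f, fn, bn)


def peak_frequency(values, freqs):
    """Frequency at which the magnitude of the sampled function is largest."""
    best = max(range(len(freqs)), key=lambda k: abs(values[k]))
    return freqs[best]


freqs = [float(f) for f in range(900, 1501)]
total = [formant_response(f, 700.0, 150.0) * formant_response(f, 1180.0, 300.0)
         for f in freqs]
residual = [eliminate_formant(h, f, 700.0, 150.0) for h, f in zip(total, freqs)]
```

In this constructed case the combined response falls monotonically over the band because the skirt of the first formant masks the second one, whereas the residual shows a clear maximum close to the second formant, which is the shape the interpolation needs.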
The calculated formant patterns for Curves A and B in Fig. 4 are given in Table V. To this table is added the re-calculated B1 when the F2 component has been eliminated, after the determination of F2. Curve C in Fig. 4 is the resynthesized transfer function. From this figure it is found that an increased B1 tends to improve the fit of the F1 peak. Although a 10 Hz difference in B1 of 157 Hz is not of significance, it is expected that such a re-calculation would be necessary when two heavily damped formants are closely located and their amplitudes are of the same order of magnitude. In Atal, & al. (1978), responses of lower (and known) formants were all removed. They did not re-calculate lower formants afterwards.

Table V. Formant frequency and bandwidth of the vowel /a/, in Hz. The glottal area is 0.08 cm² and the lung pressure 8 cm H2O. The glottal resistance $R = 1.32\,\rho c$ (g/cm⁴/s) and the glottal inductance $L = 0.005$ (g/cm⁴). The other conditions are the same as in Table I. (a): lossless simulation without the interpolation; (b): with the interpolation; (c): with elimination of F1 component and then the interpolation; (d): re-calculation of B1 after the determination of F2. See text.
Fig. 4. Calculated and resynthesized transfer functions of the vowel /a/.

6. FURTHER CONSIDERATION
A lossless simulation of the vocal tract has been shown capable of saving the computation time and detecting all formants within the frequency range of interest. We shall, however, discuss some exceptions in this section and propose methods to overcome the difficulty.
First consider the choice of the frequency scanning step. The larger the step the less computation is needed to complete the scanning over a given frequency interval.
However, if this step is chosen too large, there is a risk that more than one pole (or
zero, if the numerator is concerned) falls in the same increment interval. This is more
likely the case for nasalized sounds that have more densely spaced formants. In
TRACTTALK, the choice of the scanning step is made adaptive to partially alleviate
this problem. A greater step is used for vowel-like sounds while for nasalized
sounds, a smaller step is used.
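The risk can be made concrete with a toy example. The sketch below counts sign changes of a real function for two scanning steps; the test function with roots at 1000 Hz and 1040 Hz is purely hypothetical and stands in for two closely spaced formants.

```python
def count_sign_changes(func, f_start, f_stop, step):
    """Count the scanning intervals in which func changes sign."""
    count = 0
    f0, y0 = f_start, func(f_start)
    while f0 < f_stop:
        f1 = min(f0 + step, f_stop)
        y1 = func(f1)
        if y0 * y1 < 0:
            count += 1
        f0, y0 = f1, y1
    return count


def two_close_roots(f):
    """Hypothetical dominating term with roots at 1000 Hz and 1040 Hz."""
    return (f - 1000.0) * (f - 1040.0)


coarse = count_sign_changes(two_close_roots, 950.0, 1150.0, 100.0)
fine = count_sign_changes(two_close_roots, 950.0, 1150.0, 20.0)
```

With the 100 Hz step both roots fall inside one interval, the sign is the same at both ends, and neither root is detected; the 20 Hz step separates them. This is the situation that a smaller step for nasalized sounds is meant to alleviate.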
been missed. The method outlined above has been implemented in TRACTTALK and
has been found to work satisfactorily.
At present, the hyperbolic functions and for a lossless system the circular functions in Eqs. (3) and (4) are directly evaluated. The computation can be made still
faster if one resorts to a table look-up. We have two alternatives: either a table for
hyperbolic functions and a table for circular ones, or, by rewriting Eqs. (3) and (4), a
table for exponential functions and a table for circular functions.
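A table look-up for a circular function might be organised as in the following sketch. The table size of 4096 entries and the use of linear interpolation between entries are assumptions made here; the text does not specify either.

```python
import math

N_ENTRIES = 4096                        # assumed table size
STEP = 2.0 * math.pi / N_ENTRIES
COS_TABLE = [math.cos(k * STEP) for k in range(N_ENTRIES + 1)]


def table_cos(x):
    """Approximate cos(x) from the precomputed table with linear interpolation."""
    pos = (x % (2.0 * math.pi)) / STEP
    k = min(int(pos), N_ENTRIES - 1)    # guard against rounding at the upper end
    frac = pos - k
    return COS_TABLE[k] + frac * (COS_TABLE[k + 1] - COS_TABLE[k])


approx = table_cos(1.0)
```

With this table size the interpolation error stays below about $3\times10^{-7}$, since it is bounded by the squared table spacing divided by eight. Whether such a table is actually faster than a direct evaluation depends on the machine and the language.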
7. BRIEF DESCRIPTION OF TRACTTALK
TRACTTALK is a computer model of the vocal tract, implemented both in Fortran
and in C-code. It is a comprehensive research tool for studies of the production
mechanism and an integral part of a synthesizer.
The vocal tract is simulated in the frequency domain. All important components
of the vocal system are represented in the model, such as the subglottal system, wall
impedance, and nasal sinuses. Different radiation impedance models, proposed by
Stevens, & al. (1953), by Fant (1960), by Flanagan (1965), and by Wakita & Fant
(1978), and different wall impedance models, proposed by Fant (1972), by Ishizaka,
& al. (1975), and by Fant, & al. (1976) are implemented. When the cross-sectional
areas change abruptly, an inner length correction is introduced to cope with inner
radiation. If the area of a section is sufficiently small, a cascaded aerodynamic resistance at the upstream entry is accommodated.
TRACTTALK can be used to model various categories of speech sounds, ranging
from vowels, fricatives, voiced occlusion before release, liquids, nasal murmurs, to
nasalized sounds (Lin, 1990). There are basically two input formats. One is a direct
input of the vocal-tract area function (including the area function of the nasal tracts
for nasal sounds), and the other is an input of area-function parameters which in
terms of some parameterization models specify the vocal-tract configuration. Two
parameterization models have been implemented, a cosine-function based (Lin,
1990) and a polynomial function based (Fant, 1992). The latter is at present adopted
for vowels only, while the former has been used for modelling of nasal sounds and
apical articulation (Lin & Fant, 1992).
TRACTTALK can also be used to infer from the first three formant frequencies the
underlying area-function parameters and the vocal-tract configuration with a
parameterization model as constraint. This procedure is known as the inverse vocal-tract transform. Nomograms of formant frequency and formant bandwidth can be
easily generated as a function of the location of the constriction $X_c$, the constriction
area $A_c$, and the lip parameter $l_0/A_0$. Formant-cavity affiliations may effectively be
examined. Finally, one can study the effects of the acoustic interaction between the
glottal source and the vocal-tract acoustic load. The reader is referred to Lin (1990)
for details. (In Lin (1990), several independent programs were presented instead of a
single program package.)
TRACTTALK offers the facilities of on-line graphic display and playback of the
synthesized speech. It also offers, after each run, an interactive mode in which a
number of simulation conditions including the area function itself can be altered and
the program will then recompute the new transfer function.
8. CONCLUDING REMARKS
We have in the above sections described several fast and robust algorithms for vocal-tract computations in the frequency domain. They are 1) algorithms for properly
separating the zero part of the transfer function from the pole part. The determination of a pole frequency will not be influenced by its neighbouring zeros, and vice
versa; 2) algorithms for efficiently and effectively calculating the transfer function
and the associated resonance modes. Interpolation schemes are proposed; 3) algorithms for recovering a peak shape of a heavily damped formant so that the proposed interpolation schemes can work satisfactorily; 4) algorithms for deciding
whether a pole/zero has been missed. In Section 2 we have also briefly addressed
the iterative algorithm for the vocal-tract computation in the frequency domain and
analysed the origin of the zero part of the transfer function. A brief presentation of a
computer model of the vocal tract, TRACTTALK, is given in Section 7. We have not
discussed the acoustic realization. Readers are referred to the works listed in the
references.
Vocal-tract computation techniques have played an important role in acoustic theory of speech production. They have long been used to study the relationship between articulation and acoustics. They have also been used in the inverse vocal-tract
transform (see, for instance, Atal, & al., 1978; Lin & Fant, 1989). In both cases, however, studies have mainly been concentrated on vowels and the computation speed is
often not of concern.
Recently, there is a growing interest in the articulatory speech synthesis based on
the simulation in the frequency domain (Sondhi & Schroeter, 1987; Lin, 1990). To
meet this new application, more robust and faster algorithms of the vocal-tract computation are needed. The present fast algorithm described in Section 4 makes use of
the fact that the resonance modes of the lossy vocal tract can be linearly interpolated
from those of the lossless variant. The computation time is remarkably reduced
while the accuracy is preserved, see Table II. The proposed algorithm can also be
adopted in the system of Sondhi & Schroeter (1987). They first sampled the transfer
function and then applied the inverse discrete Fourier transform to derive the impulse response of the vocal tract. If the transfer function calculation is based on the
lossless version of the vocal tract and if only the samples around spectrum peaks and
sharp valleys are replaced by those calculated from the lossy counterpart, the computation amount is reduced.
In our synthesis algorithm (Lin, 1990), the transfer function is expressed as a sum
of a few formant responses. It is therefore of vital importance that all formants in the
frequency range of interest are successfully detected. Various factors have been considered in this paper so that the risk of missing a pole is minimized.
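Why every formant matters in a parallel structure can be seen from a small sketch. It assumes an all-pole transfer function with simple poles, normalised to unity at zero frequency, and expands it into a sum of first-order terms weighted by the residues at the poles. The function names and the two example formants are hypothetical and do not represent the synthesis scheme of Lin (1990) in detail.

```python
import math


def conjugate_pairs(formants):
    """Turn (frequency, bandwidth) pairs in Hz into a list of complex poles."""
    poles = []
    for fn, bn in formants:
        q = complex(-math.pi * bn, 2.0 * math.pi * fn)
        poles.extend([q, q.conjugate()])
    return poles


def cascade_response(s, poles):
    """All-pole transfer function as a product, with H(0) = 1."""
    h = 1.0 + 0j
    for q in poles:
        h *= -q / (s - q)
    return h


def residues(poles):
    """Residue of the cascade response at each (simple) pole."""
    result = []
    for k, qk in enumerate(poles):
        r = -qk
        for m, qm in enumerate(poles):
            if m != k:
                r *= -qm / (qk - qm)
        result.append(r)
    return result


def parallel_response(s, poles, res):
    """Same transfer function written as a sum of first-order terms."""
    return sum(r / (s - q) for q, r in zip(poles, res))


poles = conjugate_pairs([(500.0, 60.0), (1500.0, 90.0)])
res = residues(poles)
s = 2j * math.pi * 800.0
difference = abs(cascade_response(s, poles) - parallel_response(s, poles, res))
```

The product and the sum agree to rounding error only when all poles are included. Dropping one term from the sum changes the response over the whole frequency range, which illustrates why a missed pole is more harmful in a parallel formant structure than a small error in one formant frequency.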
ACKNOWLEDGEMENTS
This work was performed as a part of an ESPRIT Basic Research Project, "SPEECH
MAPS".
The author would like to thank Professor Gunnar Fant for valuable comments and
discussions.