Vocal-Tract Inversion
VT shape estimation from vocalized acoustics

Using MRI measurements of vocal-tract shapes, we developed an inversion model (implemented as a real-time demo) that calculates and displays the normally invisible shape of the human vocal tract during vocalization from standard acoustic parameters.


BACKGROUND For decades, speech scientists have been fascinated with the so-called "inverse problem" of mapping backwards from the acoustic speech signal to estimate the positions of vocal-tract (VT) articulators, or even the entire shape of the VT airway from glottis to lips (e.g., Schroeder, 1967; Atal et al., 1978; Wakita, 1979). Potential applications include pronunciation training for foreign-language learners, rehabilitation training for the speech-impaired, physiologically informed representations for automatic speech recognition and synthesis, or simply teaching and demonstration of acoustic-phonetic principles.

VT-SHAPE MODELLING As the VT shape is normally invisible to the unaided eye, special instruments are required to obtain an initial set of reference shapes with which to train a system to learn the relations between acoustics and physiology. Our approach was to take advantage of magnetic resonance imaging (MRI), a relatively safe and non-invasive method able to capture the VT shape not only during sustained phonations but also, thanks to a then newly developed synchronized sampling method (Masaki et al., 1999), during repeated articulatory movements representing phonetic sequences. Our data consisted of a male Japanese speaker's MRI-measured VT shapes for each of the 5 Japanese vowels, as well as a dynamic sequence of the 5 vowels.

Consistent with several earlier studies on American English (Shirai & Honda, 1976; Harshman et al., 1977; Story & Titze, 1998), we found in our Japanese data that at least 94% of the total variance in the VT shapes was accounted for by just 2 principal components (PCs): the first contrasting a reciprocal constriction/dilation in the pharynx vs. the oral cavity (see upper row in the figure below), and the second representing simultaneous constrictions in the labial and velar regions (see lower row in the figure below). Our results supported the cross-linguistic validity of these two underlying components of vowel production, and were also compatible with Perrier et al.'s (2000) hypothesis of biomechanical dependence.

In the figure below, note that in addition to 44 VT cross-dimensions, we included VT length as a parameter in the analysis; thus, the panels on the right show how the first two principal components also captured the natural phonetic variations in VT length.
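
As a rough illustration of this kind of analysis, the sketch below runs a principal component analysis on shape vectors made of 44 section values plus VT length (45 parameters). The data are synthetic stand-ins, not the MRI measurements, and the function name fit_vt_pca, the array layout, and the absence of any per-parameter scaling are assumptions made for illustration only.

```python
import numpy as np

def fit_vt_pca(shapes, n_components=2):
    """PCA of VT shape vectors via SVD.

    shapes: (n_frames, n_params) array. In the layout assumed here, each row
    holds 44 section values followed by the VT length.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Rows of vt are orthonormal principal directions; s are singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # fraction of total variance per PC
    components = vt[:n_components]           # (n_components, n_params)
    weights = centered @ components.T        # PC weights of every frame
    return mean, components, weights, explained[:n_components]

# Synthetic stand-in data: two hidden modes plus a little noise (hypothetical).
rng = np.random.default_rng(0)
n_frames, n_params = 60, 45
true_modes = rng.normal(size=(2, n_params))
latent = rng.normal(size=(n_frames, 2)) * np.array([3.0, 1.5])
shapes = 2.0 + latent @ true_modes + 0.05 * rng.normal(size=(n_frames, n_params))

mean, components, weights, explained = fit_vt_pca(shapes)
print("variance explained by 2 PCs:", explained.sum())
```

Because VT length sits in the same vector as the section values, each principal component carries a length term alongside its shape pattern, which is what allows the components to capture length variation as well.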

[Figure: VT Area Function PCs]

INVERSION FROM CEPSTRUM Cepstrum parameters capture the entire spectral shape and are the most successful in automatic speech recognition, whereas formants, although more difficult to measure, are more closely related to physiology because they are the resonances of the vocal tract. We trained multiple linear regression models to estimate the first two PCs of the MRI-measured VT shapes from various combinations of either cepstrum or formant parameters.
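
For readers unfamiliar with the cepstrum, the following is a minimal sketch of how low-order cepstral coefficients can be computed from one speech frame as the inverse Fourier transform of the log magnitude spectrum. The Hann window, the 8 kHz sampling rate, the exclusion of the zeroth coefficient, and the function name real_cepstrum are assumptions for illustration and are not taken from the original system.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=24):
    """Return cepstral coefficients c1..c_n of one speech frame.

    The zeroth coefficient (overall level) is dropped here by assumption.
    """
    windowed = frame * np.hanning(len(frame))
    # Small constant avoids log(0) for empty spectral bins.
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag, n=len(frame))
    return cepstrum[1:n_coeffs + 1]

# Synthetic frame standing in for a vowel segment (hypothetical signal).
fs = 8000  # assumed sampling rate, giving a 0-4 kHz analysis band
t = np.arange(256) / fs
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 120 * t)
         + 0.5 * np.sin(2 * np.pi * 700 * t)
         + 0.01 * rng.normal(size=t.size))

cep = real_cepstrum(frame)
print(cep.shape)
```

A change in overall gain adds a constant to the log spectrum and therefore affects only the zeroth coefficient, which is one reason to leave it out when the target is VT shape rather than loudness.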

The best formant-based model used all 4 formant frequencies and resulted in a mean adjusted correlation of 0.93 and mean absolute errors of 0.187 cm² in VT area and 0.131 cm in VT length. The best cepstrum-based model used 24 cepstral coefficients defined in the frequency band 0-4 kHz and resulted in a mean adjusted correlation of 0.92 and mean absolute errors of 0.102 cm² in VT area and 0.082 cm in VT length.
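
The regression step and its error measures can be sketched as follows, using ordinary least squares on synthetic data. The adjusted R² computed here is only a stand-in for the "adjusted correlation" reported above, whose exact definition is not given in this text; the helper names and the random data are hypothetical.

```python
import numpy as np

def fit_linear_map(features, targets):
    """Least-squares fit of targets = [1, features] @ coef."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return coef

def predict(coef, features):
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef

def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 per target column (standard textbook formula)."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Synthetic stand-in: 24 cepstral coefficients -> 2 PC weights (hypothetical).
rng = np.random.default_rng(1)
n_frames, n_cep = 200, 24
cepstra = rng.normal(size=(n_frames, n_cep))
true_coef = rng.normal(size=(n_cep + 1, 2))
pc_weights = predict(true_coef, cepstra) + 0.01 * rng.normal(size=(n_frames, 2))

coef = fit_linear_map(cepstra, pc_weights)
estimate = predict(coef, cepstra)
mae = np.mean(np.abs(estimate - pc_weights))       # mean absolute error
adj = adjusted_r2(pc_weights, estimate, n_cep)     # one value per PC
print(mae, adj)
```

Note that the errors quoted in the text are expressed in VT area and length, so in practice the estimated PC weights would first be converted back to a shape vector before computing the mean absolute error.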

Thus we showed that vowel production features (the PCs of VT shapes) can be mapped with high accuracy from acoustic parameters (either the formants, or the more easily measured cepstra).

The figure below shows a snapshot of our real-time demo. It illustrates the entire chain of transformations from the acoustic speech signal, to the power spectrum, to the cepstrum coefficients, to the first two PC weights (here showing the 5 Japanese vowels of 2 speakers), to the VT area function, which can then be visualized more intuitively in terms of a simplified midsagittal cross-section of the vocal tract.
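
The chain described above can be sketched end to end as below: one frame is converted to a log power spectrum, then to cepstral coefficients, then to two PC weights through a linear map, and finally to a shape vector by adding the weighted components to the mean shape. All model parameters here are random placeholders rather than trained values, and the function names, the clipping of negative section values to zero, and the vector layout (44 section values followed by length) are assumptions.

```python
import numpy as np

N_SECTIONS = 44   # section count from the text; VT length is stored last
N_CEP = 9         # number of cepstral coefficients used in the demo

def frame_to_cepstrum(frame, n_coeffs=N_CEP):
    """Frame -> power spectrum -> log -> cepstral coefficients c1..c_n."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    cepstrum = np.fft.irfft(np.log(power + 1e-12), n=len(frame))
    return cepstrum[1:n_coeffs + 1]

def invert_frame(frame, reg_coef, mean_shape, components):
    """Estimate PC weights, section values, and VT length for one frame."""
    cep = frame_to_cepstrum(frame)
    pc_weights = np.concatenate(([1.0], cep)) @ reg_coef   # intercept + cepstra
    shape = mean_shape + pc_weights @ components            # back to 45 values
    sections = np.maximum(shape[:N_SECTIONS], 0.0)          # assumed: no negatives
    length = float(shape[N_SECTIONS])
    return pc_weights, sections, length

# Placeholder model parameters (hypothetical, not trained on any data).
rng = np.random.default_rng(2)
reg_coef = rng.normal(size=(N_CEP + 1, 2))
mean_shape = np.concatenate((np.full(N_SECTIONS, 3.0), [17.0]))
components = 0.1 * rng.normal(size=(2, N_SECTIONS + 1))

frame = rng.normal(size=256)
pc_weights, sections, length = invert_frame(frame, reg_coef, mean_shape, components)
print(pc_weights, length)
```

Every step after the logarithm is linear, so the per-frame cost is a few small matrix products, which is consistent with running the estimation in real time.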

[Figure: Inversion Demo snapshot]

LIMITATIONS Our initial study was limited to the vowels produced by just one (Japanese male) speaker. As shown in the figure above, we subsequently developed a real-time demonstration using MRI data of two male speakers and just 9 cepstral coefficients. While estimation accuracy would undoubtedly suffer for voices unfamiliar to the system, we have been greatly encouraged by numerous informal tests in which physiologically plausible and phonetically appropriate VT shapes were estimated from many different voices, even a female singing voice!

Our current system is in principle limited to vocalic sounds; application to unrestricted continuous speech would require investigating the relations between acoustic parameters and VT shapes for other phonetic categories such as consonants, both voiced and voiceless. In such an extended phonetic context, the cepstrum would offer a significant advantage, as it models the spectral shape even for speech sounds that do not have a clear formant structure.

For wider applicability, our approach would eventually need to be freed from having to measure every individual's VT shape using MRI.

REFERENCES (chronological)

M. R. Schroeder (1967)
Determination of the geometry of the human vocal tract by acoustic measurements
J. Acoust. Soc. Am., 41, 1002-1010.

K. Shirai & M. Honda (1976)
An articulatory model and the estimation of articulatory parameters by nonlinear regression method
Electronics and Communications in Japan, 59-A(8), 35-43.

R. Harshman, P. Ladefoged & L. Goldstein (1977)
Factor analysis of tongue shapes
J. Acoust. Soc. Am., 62(3), 693-707.

B. S. Atal, J. J. Chang, M. V. Mathews & J. W. Tukey (1978)
Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique
J. Acoust. Soc. Am., 63, 1535-1555.

H. Wakita (1979)
Estimation of vocal-tract shapes from acoustical analysis of the speech wave: the state of the art
IEEE Trans. on Acoust. Speech & Sig. Process., 27, 281-285.

B. H. Story & I. R. Titze (1998)
Parameterization of vocal tract area functions by empirical orthogonal modes
J. Phonetics, 26(3), 223-260.

S. Masaki, M. K. Tiede, K. Honda, Y. Shimada, I. Fujimoto, Y. Nakamura & N. Ninomiya (1999)
MRI-based speech production study using a synchronized sampling method
J. Acoust. Soc. Japan (E), 20(5), 375-379.

P. Perrier, J. Perkell, Y. Payan, M. Zandipour, F. Guenther & A. Khalighi (2000)
Degrees of freedom of tongue movements in speech may be constrained by biomechanics
in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Beijing, China, Vol.II, 162-165.

P. Mokhtari, T. Kitamura, H. Takemoto & K. Honda (2005)
Vocal tract area function inversion by linear regression of cepstrum
in Proc. Interspeech, Lisbon, Portugal, 3201-3204.

P. Mokhtari, T. Kitamura, H. Takemoto & K. Honda (2007)
Principal components of vocal-tract area functions and inversion of vowels by linear regression of cepstrum coefficients
J. Phonetics, 35(1), 20-39.



Copyright ©Parham Mokhtari 2000-2019 Updated: 04 October 2016