
Learning New Articulator Trajectories for a Speech Production Model using Artificial Neural Networks

C. S. Blackburn and S. J. Young
Cambridge University Engineering Department (CUED), England
email: csb@eng.cam.ac.uk

ABSTRACT

We present a novel method for generating additional pseudo-articulator trajectories suitable for use within the framework of a stochastically trained speech production system recently developed at CUED. The system is initialised by inverting a codebook of (articulator, spectral vector) pairs, and the target positions for a set of pseudo-articulators and the mapping from these to speech spectral vectors are then jointly optimised using linearised Kalman filtering and an assembly of neural networks. A separate network is then used to hypothesise a new articulator trajectory as a function of the existing articulators and the output error of the system. The techniques used to initialise and train the system are described, and preliminary results for the generation of new pseudo-articulatory inputs are presented.

1. Introduction

Articulatory speech synthesis from text requires the specification of a set of articulator trajectories corresponding to a time-aligned phoneme string, together with a mapping from these trajectories to output speech. This mapping is frequently an explicit model of the human vocal tract [6, 8, 10], which theoretically provides the ability to produce very high quality speech waveforms incorporating time-domain modelling of co-articulation. In practice, however, the performance of such systems is limited by model inaccuracies, and in this paper we propose an alternative system in which a stochastically trained model learns the mapping from articulatory to acoustic space [1].
We therefore relax the constraint that the system exactly mimic human physiology, and instead use a set of pseudo-articulators [7] which fulfil roles similar to those of human articulators but whose positions are iteratively re-estimated from the training data. Initial articulator trajectory specification is achieved using an inverse model to map parametrised speech into articulator positions or vocal tract areas. We use a Kelly-Lochbaum synthesiser [5, 8] to generate a codebook of (articulator vector, spectral vector) pairs [9], which we invert using dynamic programming (DP) incorporating both acoustic and geometrical constraints on the articulator trajectories. Target positions for the pseudo-articulators for each phoneme are estimated from the initial trajectories obtained from the DP algorithm, and are used to reconstruct trajectories corresponding to the training speech, incorporating an explicit model of co-articulation. These target positions are then iteratively re-estimated using linearised Kalman filtering and an assembly of neural networks which map from articulator positions to output speech. Since the system is not constrained to the use of physiologically plausible articulators, it is possible to improve modelling accuracy by adding new articulators during the training process. We use a novel extension of the back-propagation algorithm to allow an artificial neural network to learn a new input signal, which when combined with the original pseudo-articulator inputs provides a significant reduction in training error. While several architectures have previously been proposed for the addition of hidden-layer units to a network [4], the generation of a new input signal in this way appears to be novel. A brief overview of the basic speech production system is given below, followed by the details of the generation of new articulators.
2. Speech production system

Five pseudo-articulators as used in [7] were sampled at regular intervals and used to determine a set of vocal tract area functions suitable for use in a Kelly-Lochbaum synthesiser which incorporates a transmission-loss model and separate oral and nasal tracts. A sampling frequency of 16 kHz was used, and in all 1488 speech waveforms were generated, each of which was parametrised as a liftered cepstral vector to give a codebook of 1488 (articulator vector, spectral vector) pairs. A training speech database comprising 600 sentences of one adult male from the speaker-dependent training portion of the Defence Advanced Research Projects Agency Resource Management corpus was also coded into liftered cepstral vectors, and dynamic programming was used to find the best pseudo-articulator trajectory corresponding to each vector sequence. The cost function used incorporates both the acoustical mismatch between the parametrised training speech vector and the codebook acoustic vectors, and the geometrical mismatch between successive articulatory vectors. To reduce the computational load, a sub-optimal search was used in which only the 5 codebook vectors with the best acoustic match were considered at each step.
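The codebook inversion just described can be sketched as follows. The Euclidean mismatch terms, the weighting `lam` on the geometric continuity cost, and the array shapes are illustrative assumptions, not the paper's exact cost function.

```python
import numpy as np

def invert_codebook(speech, cb_acoustic, cb_artic, beam=5, lam=1.0):
    """DP inversion of an (articulator, spectral) codebook (illustrative sketch).

    speech      : (T, d_ac)  parametrised training speech vectors
    cb_acoustic : (N, d_ac)  codebook spectral vectors
    cb_artic    : (N, d_ar)  corresponding articulator vectors
    beam        : number of best acoustic matches kept per frame
    lam         : assumed weight on the geometric continuity cost
    """
    T = len(speech)
    # Acoustic mismatch of every codebook entry against every frame.
    ac_cost = np.linalg.norm(speech[:, None, :] - cb_acoustic[None, :, :], axis=2)
    # Sub-optimal search: keep only the `beam` best acoustic matches per frame.
    cand = np.argsort(ac_cost, axis=1)[:, :beam]                  # (T, beam)

    total = ac_cost[0, cand[0]]                 # best cost to reach frame 0
    back = np.zeros((T, beam), dtype=int)
    for t in range(1, T):
        # Geometric mismatch between successive candidate articulator vectors.
        geo = np.linalg.norm(cb_artic[cand[t], None, :] -
                             cb_artic[None, cand[t - 1], :], axis=2)  # (beam, beam)
        step = total[None, :] + lam * geo
        back[t] = np.argmin(step, axis=1)
        total = ac_cost[t, cand[t]] + step[np.arange(beam), back[t]]

    # Trace back the best pseudo-articulator trajectory.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmin(total)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return cb_artic[cand[np.arange(T), path]]
```

Keeping only the best `beam` acoustic matches per frame reduces the DP search from O(N^2) to O(beam^2) transitions per frame, at the cost of possibly discarding entries that match poorly acoustically but would have given a smoother articulator path.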

The result of this process is a set of pseudo-articulator trajectories corresponding to the parametrised training speech vector sequences. Statistics describing the observed position of each of the pseudo-articulators during the production of each phoneme are determined by sampling the values of the pseudo-articulator trajectories at the midpoint of each occurrence of each phoneme, to give initial estimates of target mean positions and covariance matrices. Although the word "target" is used here, we are in fact measuring the achieved position of each pseudo-articulator at the phonemic midpoints; the underlying target towards which an articulator was heading may never be reached in practice. The pseudo-articulator trajectory corresponding to any arbitrary time-aligned phoneme string can then be determined by applying an explicit co-articulation model to the phonemic target means and using piece-wise linear interpolation constrained to pass through the average of two adjacent target means at the phonemic boundary [1, 2].

2.1. System training

The system is trained using the following iterative re-estimation process. Repeat:

1. Train a separate neural network to approximate the function from the pseudo-articulator trajectories of each phoneme to the output speech.
2. Re-estimate the position of each pseudo-articulator at the phonemic midpoints using the linearised Jacobian matrices of the networks and linearised Kalman filtering.
3. Compute the statistics of the new articulator positions for each phoneme, and generate new articulator trajectories corresponding to the training speech from these new statistics.

The performance and architecture of the networks used are not crucial to the training process, since their purpose is only to approximate the function from articulatory to acoustic space so that the linearised Jacobian matrix can be used to re-estimate the phonemic targets; once the re-estimation is completed, however, their performance is optimised as far as possible.
We trained feed-forward multi-layer perceptrons (MLPs) with 5 inputs, 3 hidden units, 24 outputs and sigmoid non-linearities at the hidden units, using batch-update resilient back-propagation (RPROP), giving mean errors in estimated spectral coefficients of around 1%. The training set output vectors were 24-dimensional mel-scaled log spectral coefficients. The global error covariance matrix for each network mapping is estimated from its performance on an unseen test set, and the Jacobian matrix is found by extending the usual error back-propagation formulae to evaluate the derivative of each output with respect to each input:

\[ \frac{\partial y_k}{\partial x_i} = \sum_j w^{(2)}_{kj}\, h_j \left( 1 - h_j \right) w^{(1)}_{ji} \]

where \(x_i\), \(h_j\) and \(y_k\) are the outputs of nodes in the input, hidden and output layers respectively, and \(w^{(1)}\) and \(w^{(2)}\) are the input-hidden and hidden-output weights respectively. If the initial estimate of a phoneme's articulatory target mean vector is denoted \(\mu\), with associated covariance matrix \(\Sigma\) and corresponding parametrised speech vector \(y\), and if the neural mapping is denoted \(f\), with Jacobian matrix \(J\) at the target estimate and output error covariance matrix \(R\), the target vector can be re-estimated using linearised Kalman filtering as:

\[ \hat{\mu} = \mu + \Sigma J^{T} \left( J \Sigma J^{T} + R \right)^{-1} \left( y - f(\mu) \right) \]

This gives a re-estimated target vector for each occurrence of each phoneme, from which new target mean and covariance statistics are computed. Updated pseudo-articulator trajectories are then derived and the networks retrained. This process is iterated to obtain an optimum set of phoneme targets from which speech is synthesised.

3. Generation of new inputs

In the speech production model described above, a partitioning of the input space into a discrete set of sub-spaces corresponding to 47 different phoneme classes is known a priori, allowing us to divide the problem of determining the mapping from articulator space to acoustic space into 47 sub-tasks, each of which is approximated by a separate neural network.
We shall show that this knowledge of a partitioning of the input space can also be exploited to generate new input trajectories for the networks which lead to an overall increase in model accuracy. If a neural network is trained using mean squared error (MSE) as a cost function to approximate a mapping from smooth functions at its inputs to smooth functions at its outputs, we expect the error at each output to be roughly zero-mean over the entire training set. We trained a single network to approximate the mapping from pseudo-articulator trajectories to output speech vectors for all phonemes, and a typical plot of the output error signals during a single sentence is shown in Figure 1, where phonemic boundaries are marked as vertical lines.

Figure 1: Variation in the error at each of the 24 network outputs (error magnitude against speech frame index, 10 ms units) over the course of the sentence "clear windows", with the phoneme sequence /k l ih r w ih n d ow z/ marked at the phonemic boundaries.

The mean error for each output over the course of the sentence is approximately zero; however, within each phoneme there are clearly systematic variations in the error signal. It should therefore be possible to derive a new input for the network which is correlated with these systematic variations, in which case we would expect the overall network training error to decrease if we re-train the network on a data set augmented by this input. By examining many error plots such as the above, we find that different instances of the same phoneme have similar error signals (for example the two occurrences of /ih/ in Figure 1) which are in general not constant over the duration of a phoneme, but follow trajectories which are affected by the context in which the phoneme occurs. Therefore, while some reduction in the error magnitude could be achieved either by subtracting the mean error for each phoneme from the output signal according to the phonemic class of the current input, or by providing an additional input which identifies this phonemic class, a preferable solution would be to generate a new trajectory which incorporates this contextual variation. If a suitable set of means for such an input were determined for each phoneme, a context-sensitive trajectory could be defined using piece-wise linear interpolation as described above, allowing a new input trajectory to be generated for an arbitrary input phoneme string.
We use a single neural network trained on all the speech data to learn this new input, since to do so with 47 different networks would result in a highly discontinuous solution.

3.1. System architecture

The architecture shown in Figure 2 was used, in which a conventional feed-forward MLP, represented by the solid nodes and connecting links, is trained to approximate a mapping between the known inputs and the outputs. The parameters of this network (weights and biases) are then frozen, and the additional structure indicated by hollow node symbols and dashed lines is added. A number of new hidden nodes are provided, which are connected to both the original inputs and the single new input, as well as to the output nodes of the original network.

Figure 2: Network architecture for the generation of a new input, showing the output nodes, the original hidden nodes and network inputs, and the new hidden nodes and new network input.

(Footnote: an effect implicitly achieved when using 47 separate networks.)

The parameters of the new structure are then initialised to small random values, with the connections from the original inputs to the new hidden nodes setting an initial operating point in the weight space of the new hidden layer. The error signal at the output of the original (fixed) network is then back-propagated through the new network structure

to the new input values, which are initialised to zero for all training frames. The partial derivative of the error with respect to the new input node value is derived in a similar way to the expression for the derivative of each output with respect to each input, and is:

\[ \frac{\partial E}{\partial x_u} = \sum_j w^{(1)}_{ju}\, h_j \left( 1 - h_j \right) \sum_k w^{(2)}_{kj} \left( y_k - t_k \right) \]

where \(x_u\), \(h_j\) and \(y_k\) are the outputs of nodes in the input, hidden and output layers respectively, \(w^{(1)}\) and \(w^{(2)}\) are the input-hidden and hidden-output weights respectively, \(E = \frac{1}{2} \sum_k (y_k - t_k)^2\) is the sum squared error at the outputs and \(t_k\) is the target output. The new inputs are then updated using:

\[ x_u \leftarrow x_u - \eta\, \frac{\partial E}{\partial x_u} \]

where \(\eta\) is analogous to the learning rate used in standard error back-propagation. Once a new input value has been computed in this way for every training frame, the parameters of the new network structure are optimised to produce a signal at the output nodes which approximates the negative of the original error. After some number of iterations of this optimisation, the new (reduced) output error is once again back-propagated through to the new network inputs, which are updated once again. This process, similar to the Expectation-Maximisation (EM) algorithm [3], is continued until an optimum set of new input values has been determined. Due to the noisy nature of most output error signals, this new input will itself in general be noisy, so that some smoothing is required before extracting its systematic characteristics for use in generating the new input signal for new data. Unlike standard back-propagation, this technique is not sensitive to changes in the value of \(\eta\), which simply affects the magnitude of the new input signal, but is sensitive to the number of epochs of parameter optimisation performed per input update. After each update of the input values, the parameters of the new network structure are optimised to reduce the output error.
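The alternating update just described can be condensed into a small numpy sketch. The auxiliary-network sizes, plain gradient descent in place of RPROP, the quadratic cost, and all step sizes are illustrative assumptions.

```python
import numpy as np

def learn_new_input(X, E, n_hidden=3, eta=0.5, iters=50, opt_epochs=2, lr=0.05, seed=0):
    """Generalised-EM-style learning of a new input signal (simplified sketch).

    X : (T, d) original network inputs; E : (T, m) output errors of the frozen net.
    A small auxiliary MLP maps [X, u] to a correction approximating -E; we
    alternate a gradient step on the input values u with a few epochs of
    weight optimisation.  Returns u and the per-iteration MSE history.
    """
    rng = np.random.default_rng(seed)
    T, d = X.shape
    m = E.shape[1]
    W1 = rng.normal(0.0, 0.5, (n_hidden, d + 1)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (m, n_hidden)); b2 = np.zeros(m)
    u = np.zeros(T)                          # new input, initialised to zero
    mse = []

    def forward(u):
        z = np.c_[X, u]
        H = 1.0 / (1.0 + np.exp(-(z @ W1.T + b1)))
        return z, H, H @ W2.T + b2           # output should approximate -E

    for _ in range(iters):
        # E-like step: back-propagate the residual error to the input values.
        z, H, out = forward(u)
        resid = out + E                      # zero residual means out == -E
        dpre = (resid @ W2) * H * (1.0 - H)
        u = u - eta * (dpre @ W1[:, -1])     # gradient w.r.t. the new input column
        # M-like step: a small number of weight-optimisation epochs.
        for _ in range(opt_epochs):
            z, H, out = forward(u)
            resid = out + E
            dpre = (resid @ W2) * H * (1.0 - H)
            W2 -= lr * resid.T @ H / T
            b2 -= lr * resid.mean(axis=0)
            W1 -= lr * dpre.T @ z / T
            b1 -= lr * dpre.mean(axis=0)
        mse.append(float((resid ** 2).mean()))
    return u, mse
```

Keeping `opt_epochs` small mirrors the paper's observation: if the weights are over-optimised for the current input values, the next input update produces a large mismatch and convergence is no longer smooth.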
If this optimisation is allowed to continue for a large number of iterations, the parameters become highly tuned to the current input values, so that when the input values are next updated there is likely to be a large mismatch between these and the highly optimised parameters. The solution to this problem for difficult learning tasks is to use only a small number of optimisation epochs per input update, so that the new input values and the parameters of the new network structure jointly converge to a solution in a smooth sense, a process analogous to generalised EM. The new input learned will in general be a non-linear function of both the output signals of the original network and the original network inputs.

3.2. Example for an artificial system

To investigate the properties of the algorithm just described, an artificial data set was generated by taking two non-linear combinations \(y_1\) and \(y_2\) of three basic functions \(x_1\), \(x_2\) and \(x_3\), evaluated over a fixed range of the independent variable in equal steps, to which we added zero-mean white noise of maximum absolute value 0.2. An MLP with 2 inputs, 4 hidden nodes and two outputs was trained using batch-update RPROP on a data set consisting of the two inputs \(x_1\) and \(x_2\) and the noisy outputs \(y_1\) and \(y_2\). The input \(x_3\) was chosen to be a sinusoid to produce systematic variations in the error signal, and was not supplied to the network. Two output functions were used, since training with a single output results in the trivial solution of the error signal being reproduced as the new input. The technique described above was used to generate a new input for the network by using 3 new hidden nodes, a learning rate of 1.0, and training for 5 input re-estimation iterations, each of which incorporated a small number of epochs of parameter optimisation for the new network. The noisy new input signal generated was then smoothed using a third-order Butterworth low-pass filter with cutoff frequency one tenth of the Nyquist frequency.
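The smoothing step can be reproduced with SciPy's Butterworth filter design; applying it in zero-phase form with `filtfilt` is our assumption, since the paper does not say whether the filtering introduced delay.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_learned_input(u, cutoff_frac=0.1, order=3):
    """Low-pass smooth a noisy learned input trajectory.

    cutoff_frac : cutoff as a fraction of the Nyquist frequency
                  (0.1 = one tenth, as used for the artificial example).
    """
    b, a = butter(order, cutoff_frac)    # normalised cutoff, in Nyquist units
    return filtfilt(b, a, u)             # zero-phase filtering adds no delay
```

A zero-phase filter is convenient here because the smoothed trajectory is later sampled at specific time points (the phoneme midpoints), where a group delay would misalign the samples.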
A separate neural network was then re-trained using the two original inputs \(x_1\) and \(x_2\) and the new input, again with 4 hidden nodes and the two output targets \(y_1\) and \(y_2\). Due to its extra input node, the latter network has more parameters than the original system: 3 inputs, 4 hidden nodes and 2 outputs gives 26 parameters, whereas (2,4,2) gives only 22. Hence we trained an additional network for comparison on the two inputs \(x_1\) and \(x_2\), this time with 5 hidden nodes, where a (2,5,2) structure gives 27 parameters, which is one more than the network with 3 inputs. Table 1 shows the results for the two networks, where the network incorporating the new input has resulted in a reduction in MSE of approximately 56%. These results are shown graphically in Figure 3. The first sub-figure shows the target output functions before and after adding noise, while the second shows both noisy and smoothed generated network inputs together with the

original missing input (dashed), where a strong correlation can be seen between the two. The final sub-figure shows the target and actual network outputs. The dotted plot is the noisy target output, while the dashed plot corresponds to the original 2-input network, and the solid plot to the new 3-input network. Clearly the 3-input network has learned a greatly superior mapping despite having one less parameter.

Network structure   Number of parameters   MSE      Improvement (%)
(2,5,2)             27                     0.119    -
(3,4,2)             26                     0.0526   55.8

Table 1: Performance of networks on artificial data.

Figure 3: (i) The two network target outputs, both before and after adding zero-mean white noise; (ii) the new (learned) network input signal both before (dotted) and after (solid) smoothing, together with the original missing input (dashed); (iii) the noisy network target outputs, and the outputs of both the original network and of that trained with the new input.

3.3. Application to speech production

In applying the input generation technique to our speech production system, we need to show not only that it is possible to generate a new input which results in a reduction in the training error, as in the previous example, but also that this new input can be characterised in a general way, such that it can be generated for any arbitrary vector of ordinary network inputs given the information as to the input phoneme class at any point in time. An MLP with 5 inputs, 30 hidden nodes and 24 outputs was trained, once again using batch-update RPROP, to learn the mapping from co-articulated pseudo-articulator trajectories to speech spectral vectors for 4 sentences of training data comprising a total of 11 vectors.
We then added new hidden nodes and trained the new network structure to learn a new input for this system. During this stage of the training the constant \(\eta\) was set to 1.0 and the input re-estimation iterations were performed with 1 epoch of RPROP optimisation of the new network parameters for each update of the input values. This minimal amount of parameter optimisation per iteration was necessary to ensure smooth convergence to an optimal set of input values, and the MSE decreased during training from 1.57 to 0.5497. The new input trajectory obtained was smoothed in individual sections corresponding to input training sentences, using a third-order Butterworth low-pass filter with cutoff frequency one fifth of the Nyquist frequency, and then sampled at the phoneme midpoints to obtain statistics for the mean position of the new input for each phoneme. New input trajectories were constructed from these means using the same piece-wise continuous interpolation used for the original inputs, and the new trajectory so formed was added to the original data set. A network with 6 inputs, 31 hidden nodes and 24 outputs was then trained using batch-update RPROP to learn the mapping from the augmented input set to the original outputs. This (6,31,24) network has 985 parameters, so to ensure a fair comparison a separate network with 32 hidden nodes was trained on the original input set, giving a (5,32,24) structure comprising 984 parameters.

Network structure   Number of parameters   MSE      Improvement (%)
(5,32,24)           984                    1.524    -
(6,31,24)           985                    1.275    16.3

Table 2: Performance of networks on speech data.

The results are given in Table 2, where we see that with a comparable number of parameters the system which uses the new input has 16.3% less MSE than the system trained on the original inputs. We emphasise that the new input trajectory used was not that learned directly by the augmented network structure, but was generated from the statistics of this trajectory.
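Turning the smoothed learned input into per-phoneme statistics, and regenerating a context-sensitive trajectory for an arbitrary phoneme string, might look like this. The `(phone, start, end)` alignment format and scalar means are assumed representations, and the explicit co-articulation model applied to the means is omitted here.

```python
import numpy as np

def phoneme_midpoint_means(u, segments):
    """Mean of the learned input at the midpoints of each phoneme's occurrences.

    u        : smoothed learned input trajectory, one value per frame
    segments : list of (phone, start_frame, end_frame) alignments
    """
    samples = {}
    for phone, start, end in segments:
        samples.setdefault(phone, []).append(u[(start + end) // 2])
    return {phone: float(np.mean(vals)) for phone, vals in samples.items()}

def build_trajectory(means, phones, durations):
    """Piece-wise linear trajectory through the per-phoneme means.

    Passes through each phoneme's mean at its midpoint, and through the
    average of adjacent means at each phonemic boundary.
    """
    knots_t, knots_x = [], []
    t0 = 0
    for i, (phone, dur) in enumerate(zip(phones, durations)):
        if i > 0:  # boundary knot: average of the two adjacent means
            knots_t.append(t0)
            knots_x.append(0.5 * (means[phones[i - 1]] + means[phone]))
        knots_t.append(t0 + dur / 2.0)   # midpoint knot: the mean itself
        knots_x.append(means[phone])
        t0 += dur
    return np.interp(np.arange(t0), knots_t, knots_x)
```

Because only the per-phoneme means are retained, the same two functions can generate a new input trajectory for any time-aligned phoneme string, not just for sentences seen in training.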
Hence a new input trajectory such as this can be generated for an arbitrary input phoneme sequence. Figure 4 shows the MSE for both the (5,32,24) network trained on the original data and the

(6,31,24) network trained on the augmented data set. The plots exclude the first few epochs of training to provide reasonable scaling in the y-axis.

Figure 4: Network error curves (MSE against training epochs) for the original and augmented data sets.

4. Conclusions

This paper has presented a novel technique for generating a new input for an artificial neural network which has been trained to learn the mapping from a set of smooth input functions to a corresponding set of smooth output functions, under the condition that a subdivision of the input space into distinct classes is known a priori at each time step. If the error of such a system shows systematic variations which are correlated with changes in the input class and dependent upon the input context, statistics describing the form of the new input can be computed which allow such an input to be generated given any set of original trajectories. The technique has been demonstrated both on an artificial system and in the context of a pseudo-articulatory speech production model recently developed at CUED, and in both cases was seen to provide a significant reduction in output error. Other fields where this technique may have applications include slowly parameter-varying control systems, in which an interpolation is performed between a number of linear models which approximate a non-linear mapping. If the output error of the system has systematic variations which are correlated with the particular linear model being used, a new signal could be derived as a function of the output error and the slowly-varying parameter so as to reduce the overall system error. This system is still under development, and many questions have yet to be resolved. The convergence and stability criteria of the re-estimation technique for generating new inputs need to be investigated, as does the sensitivity of the system to the initialisation conditions.
The viability of the model as applied to our speech production system seems excellent, however, and the initial results obtained are extremely encouraging.

5. References

[1] C. S. Blackburn and S. J. Young. A novel self-organising speech production system using pseudo-articulators. Int. Congr. Phon. Sc., 1995. Accepted for publication.
[2] C. S. Blackburn and S. J. Young. Towards improved speech recognition using a speech production model. Europ. Conf. Sp. Comm. Tech., 1995. Accepted for publication.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39(1):1-38, 1977.
[4] S. E. Fahlman and C. Lebiere. The Cascade-Correlation learning architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University, 1990.
[5] J. L. Kelly Jr. and C. Lochbaum. Speech synthesis. In Sp. Comm. Sem., Stockholm, 1962.
[6] P. Mermelstein. Articulatory model for the study of speech production. J. Acoust. Soc. Am., 53(4):1070-1082, 1973.
[7] P. Meyer, R. Wilhelms, and H. W. Strube. A quasiarticulatory speech synthesizer for German language running in real time. J. Acoust. Soc. Am., 86(2):523-539, 1989.
[8] P. Rubin, T. Baer, and P. Mermelstein. An articulatory synthesizer for perceptual research. J. Acoust. Soc. Am., 70(2):321-328, Aug. 1981.
[9] J. Schroeter and M. M. Sondhi. Techniques for estimating vocal-tract shapes from the speech signal. IEEE Trans. Sp. Aud. Proc., 2(1):133-150, Jan. 1994.
[10] M. M. Sondhi and J. Schroeter. A hybrid time-frequency domain articulatory speech synthesizer. IEEE Trans. Acoust. Sp. Sig. Proc., ASSP-35(7):955-967, July 1987.