Phoneme Recognizer Based Verification of the Voice Segment Type Determination
OLDŘICH HORÁK
Faculty of Economics and Administration, Institute of System Engineering and Informatics
University of Pardubice
Studentská 84, Pardubice
CZECH REPUBLIC

Abstract: - This paper describes the verification of the results of voice segment type determination. The verification is performed using a phoneme recognizer. The voice segment type determination is based on the presence of the fundamental frequency in the voice segment signal, and several methods are used to demonstrate its presence. The fundamental frequency of the speaker's voice, which can be extracted from the voiced segments of the speech signal, is one of the basic characteristic features used in the speaker recognition process.

Key-Words: - fundamental frequency, signal processing, speaker recognition, voice signal, phoneme recognizer

1 Introduction
The identification of an information system's user is a very important task in common information system security. One of the less common biometric methods is speaker recognition. This method extracts parameters of the speaker's vocal tract anatomy from the voice signal; the timbre of the speaker's voice depends on and is given by these parameters. In voice signal processing, the signal is divided into blocks: short segments of speech with durations of tens of milliseconds. The voice characteristic is based on features extracted from these segments. Some features can be extracted from voiced segments only, others from surd (unvoiced) segments. Therefore, the determination of the voice segment type is important for these tasks [1, 2].
The fundamental frequency is one of the basic features of the voice. It is the common tone level of the speaker's voice. This characteristic feature can be extracted from voiced segments only.
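The segmentation described above — cutting the signal into short blocks with durations of tens of milliseconds — can be sketched as follows. This is a minimal illustration; the function name and the 30 ms default are assumptions, not taken from the paper:

```python
def segment_signal(samples, fs, seg_ms=30):
    """Split a sampled signal into consecutive, non-overlapping segments
    of seg_ms milliseconds each; a trailing remainder shorter than one
    full segment is discarded."""
    seg_len = int(fs * seg_ms / 1000)  # samples per segment
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]
```

With the assumed 30 ms segments and an 8 kHz sampling rate, each segment holds 240 samples, so the 70 segments per word used later in the experiment would correspond to roughly 2.1 s of signal.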
The presence of the fundamental frequency in a voice segment can also be used to determine the type of the given segment. The detection of the fundamental frequency is sufficient for determining a voiced segment; the exact value of the frequency is not important in this case [11].

2 Principles
The voice segment type depends on the part of the voice signal it is cut from. Simplified, a segment from a vowel is voiced, and a part of a consonant signal is surd. This is only an approximation; the exact determination of the segment type has to be evaluated by a signal processing method.

2.1 Segment Type Determination Methods
There are several methods to distinguish the voice segment type:

2.1.1 Fundamental Frequency Presence
This is the method used, verified, and published by several authors [1, 2, 4, 5, 7, 10, 14]. The dependency is simple: if the fundamental frequency is detected in the voice segment, the segment is voiced; otherwise, the segment is surd. The presence of the fundamental frequency can be verified by several signal processing methods. It can be found as a peak in the real cepstrum of the signal [1, 2, 11]. The calculation of the cepstrum is slow; it can be substituted by faster methods, e.g. the autocorrelation [1, 2, 6, 11, 14-18].

2.1.2 Comparison of the Energy Spread
Three or more frequency ranges are defined, and the spread of energy over all of these ranges leads to the proper segment type determination. Each segment type has a typical spread of energy in the given frequency sub-ranges. Preprocessing of the typical spreads is needed before this method can be used, which leads to higher time consumption [1, 2, 7, 10].
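The fundamental-frequency presence test described above — here in its faster autocorrelation variant rather than the cepstral one — can be sketched in pure Python. The lag range bounds (70–400 Hz) and the 0.3 peak threshold are illustrative assumptions that would have to be tuned:

```python
def is_voiced_autocorr(segment, fs, fmin=70.0, fmax=400.0, threshold=0.3):
    """Voiced decision: the normalized autocorrelation of the (DC-removed)
    segment must show a peak above `threshold` at some lag inside the
    plausible fundamental-period range fs/fmax .. fs/fmin."""
    n = len(segment)
    mean = sum(segment) / n
    x = [s - mean for s in segment]
    r0 = sum(v * v for v in x)      # energy = autocorrelation at lag 0
    if r0 == 0.0:
        return False                # silent segment: nothing periodic
    lag_lo = int(fs / fmax)         # shortest fundamental period considered
    lag_hi = min(int(fs / fmin), n - 1)
    peak = max(sum(x[i] * x[i + lag] for i in range(n - lag)) / r0
               for lag in range(lag_lo, lag_hi + 1))
    return peak >= threshold
```

A strongly periodic segment (e.g. a sustained vowel) yields a normalized autocorrelation peak close to 1 at the lag of the fundamental period, while silence or noise stays well below the threshold.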
2.1.3 ZCR to Short-Time Energy Relation
Another method uses the relation of the mean value of the zero-crossing rate (ZCR) to the short-time energy of the voice signal segment. Voiced segments have a higher short-time energy and a lower mean zero-crossing rate. Both of these characteristics are relative values and are defined without units [1, 2, 3, 10].

2.1.4 Statistical Methods
Other methods use statistical processing as well [8, 9, 14, 15]. The autocorrelation function applied to the signal segment provides an option to detect the presence of the fundamental frequency and to estimate its value [2, 3].

2.2 Phoneme Recognizer
A phoneme recognizer based on long temporal context was developed at the Brno University of Technology, Faculty of Information Technology [12, 13]. It was successfully applied to tasks including language identification, indexing and search of audio records, and keyword spotting. Outputs from this phoneme recognizer can be used as a baseline for subsequent processing. It is an open source tool: source codes and binaries can be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License or any later version. The model files can be used for research and educational purposes [12]. There are model files for the Czech, Hungarian, Russian, and English languages.
The tool is able to recognize the phonemes in an audio signal (human voice, given language) and provides a text output. The output consists of the information about the recognized phonemes and their positions in the processed sample of the voice signal. It is usable for the extraction of particular phonemes from the given signal.

3 Experimental Verification
The voice segment type determination experiment was carried out using two methods, and both result sets were compared and evaluated. Six recordings of cardinal numerals of the Czech language were made and used.
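The zero-crossing-rate to short-time-energy relation described earlier can be sketched as follows. Both quantities are unit-less, as the text notes; the decision thresholds in `is_voiced_zcr_energy` are illustrative assumptions only:

```python
def zero_crossing_rate(segment):
    """Fraction of adjacent sample pairs whose signs differ."""
    pairs = list(zip(segment, segment[1:]))
    return sum(1 for a, b in pairs if a * b < 0) / len(pairs)

def short_time_energy(segment):
    """Mean squared amplitude of the segment."""
    return sum(s * s for s in segment) / len(segment)

def is_voiced_zcr_energy(segment, zcr_max=0.1, energy_min=0.1):
    """Voiced segments show high short-time energy and a low
    zero-crossing rate; both thresholds are illustrative and
    would be tuned on real data."""
    return (short_time_energy(segment) >= energy_min
            and zero_crossing_rate(segment) <= zcr_max)
```

A low-frequency tone crosses zero only a few times per segment (low ZCR, high energy), whereas a rapidly alternating, noise-like signal crosses zero at almost every sample and is rejected.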
These words sound different from each other, which provides sufficient variability of the signal; see [11] for more details and additional information.

3.1 Process of Verification
Fig.1 shows the process of verification. Each voice sample has the same duration, and the signal is divided into 70 segments. The voice segment type determination is performed by the autocorrelation and by the cepstral method. The same sample of the signal is processed by the Phoneme Recognizer [12] to recognize the phonemes of the word and their positions in the given sample. Each phoneme is then mapped to the segments so that the segment type determination can be verified.

3.2 Results of Verification
The results of both determination methods and of the phoneme recognition are verified by mapping (see Fig.2 to Fig.7 for all the used words). The first row in these tables gives the number of the segment N. The next two rows show the voiced segments (marked as X) determined by the autocorrelation (A) and by the cepstral method (C).

Fig.1 Schema of the Verification Process (blocks: Signal, Segmentation, Autocorrelation, Cepstral Method, Phoneme Recognizer, Verification)
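The mapping step described above — labelling each fixed-length segment with the phoneme whose recognized time span covers it — might look like this. The interval format and the midpoint rule are illustrative assumptions about the recognizer's output, not the paper's exact procedure:

```python
def map_phonemes_to_segments(phoneme_spans, n_segments, seg_len):
    """phoneme_spans: list of (phoneme, start_sample, end_sample) tuples
    as produced by a phoneme recognizer. Each segment is labelled by the
    phoneme covering its midpoint; segments covered by no span are
    labelled '<sil>' (silence)."""
    labels = []
    for i in range(n_segments):
        mid = i * seg_len + seg_len // 2
        label = "<sil>"
        for phoneme, start, end in phoneme_spans:
            if start <= mid < end:
                label = phoneme
                break
        labels.append(label)
    return labels
```

The resulting label row can then be compared segment by segment against the X marks of the autocorrelation and cepstral determinations.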
Fig.2 Verification for word jedna (phonemes: <sil> j e d n a <sil>)
Fig.3 Verification for word dvě (phonemes: <sil> d v ě [je] <sil>)
Fig.4 Verification for word tři (phonemes: <sil> t ř i <sil>)
Fig.5 Verification for word čtyři (phonemes: <sil> č t y ř i <sil>)
Fig.6 Verification for word pět (phonemes: <sil> p ě [je] t <sil>)
Fig.7 Verification for word šest (phonemes: <sil> š e s t <sil>)

In the last row of each figure, the positions of the recognized phonemes are mapped. The <sil> symbol means that silence was recognized at that position. The Czech character ě is phonetically identical to the phoneme couple je in many words; this is corrected in the verification tables.
The verification shows that the results of the segment type determination and the phoneme recognition match at first sight. The voiced segments are determined predominantly at the positions of vowels, which conforms to the expectation stated above. On closer inspection, some segments (e.g. Fig.6, segment no. 62) were determined as voiced, although the phoneme recognition indicates silence there. This means the recorded voice signal had a higher noise level at that time. However, the rate of falsely determined segments caused by this type of error is low: there are only four such cases among the 420 processed segments.
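The error count quoted above (four falsely voiced segments among 6 × 70 = 420) corresponds to a rate under one percent; the bookkeeping can be sketched as follows. The per-word counts in the example call are an illustrative breakdown; only their total of four cases in 420 segments is stated in the text:

```python
def false_voiced_rate(false_counts_per_word, segments_per_word=70):
    """Total count and relative rate of segments determined as voiced
    where the phoneme recognizer reported silence, across all words."""
    total_false = sum(false_counts_per_word)
    total_segments = segments_per_word * len(false_counts_per_word)
    return total_false, total_false / total_segments

# Illustrative per-word counts (six words, 70 segments each):
count, rate = false_voiced_rate([0, 2, 1, 0, 1, 0])
```

For four false segments out of 420, the rate evaluates to roughly 0.95 %.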
To recap the previous experiments [11]: the comparison of the voice segment type determination methods shows a difference of only units of segments. The corresponding error rate is less than 10% (see Tab.1, cited from [11]). This means the difference between the two methods can be neglected for the verification using the phoneme recognition. The next task is to analyze the verification results.

Tab.1 Comparison Results [11] — columns: Compared word; Total segments; Autocorrelation (Voiced, Surd); Cepstral method (Voiced, Surd); Difference (Absolute, Relative); rows: 1 jedna, 2 dvě, 3 tři, 4 čtyři, 5 pět, 6 šest.

The analyses of the verification results are provided in Tab.2 and Tab.3. They list the counts of false determinations of the voiced segment type in the signal parts recognized as silence and as surd phonemes. The absolute and the corresponding relative values are given in the tables, as well as the total values. The results that differ between the two methods (the word pět) can be seen by comparing the two tables.

Tab.2 Verification Results (Autocorrelation)

Verified word   Total segments   Silence  Perc.   Surd  Perc.   Total  Perc.
1 jedna               70             0     0.0%     3    4.3%     3    4.3%
2 dvě                 70             2     2.9%     3    4.3%     5    7.1%
3 tři                 70             1     1.4%     1    1.4%     2    2.9%
4 čtyři               70             0     0.0%     1    1.4%     1    1.4%
5 pět                 70             1     1.4%     2    2.9%     3    4.3%
6 šest                70             0     0.0%     0    0.0%     0    0.0%
Total                420             4     1.0%    10    2.4%    14    3.3%

Tab.3 Verification Results (Cepstral method)

Verified word   Total segments   Silence  Perc.   Surd  Perc.   Total  Perc.
1 jedna               70             0     0.0%     3    4.3%     3    4.3%
2 dvě                 70             2     2.9%     3    4.3%     5    7.1%
3 tři                 70             1     1.4%     1    1.4%     2    2.9%
4 čtyři               70             0     0.0%     1    1.4%     1    1.4%
5 pět                 70             1     1.4%     1    1.4%     2    2.9%
6 šest                70             0     0.0%     0    0.0%     0    0.0%
Total                420             4     1.0%     9    2.1%    13    3.1%
4 Conclusion and Future Work
The results of the verification show that the false determination rate found by the phoneme-recognition check is below the error rate of the two determination methods themselves [11]. This means that the determination of the voice segment type by both methods corresponds to the phoneme recognition with an error rate lower than 10%, and both methods can be used reliably.
This result will be used to support the theoretical base for the identification of information system users by the speaker's voice. The next steps will lie in the extraction of a sufficient set of voice features from the speaker's speech. Several methods provide feature extraction from both types of voice segments (voiced and surd); therefore, the type determination is an important part of the preprocessing for the extraction.
This type of user identification is simple for the users: it is not necessary to remember passwords, standard multimedia inputs can be used for the identification, and no expensive equipment is needed. Communication using multimedia is very popular nowadays; multimedia applications as interactive systems of digital media are favored by many users [19]. The multimedia information can be selected by the users themselves according to their individual needs, and the system can recognize the users by their voices. This can be the goal of this method.

5 Acknowledgement
This work was supported by the project No. CZ.1.07/2.2.00/ Innovation and Support of Doctoral Study Program (INDOP), financed from EU and Czech Republic funds.

References:
[1] H. Atassi, Metody detekce základního tónu řeči [Methods of Speech Fundamental Tone Detection], Elektrorevue, Vol.4, 2008.
[2] J. Psutka, et al., Mluvíme s počítačem česky [We Speak Czech with a Computer], Praha, Academia, 2006.
[3] Y.
Tadokoro, et al., Pitch Estimation for Musical Sound Including Percussion Sound Using Comb Filters and Autocorrelation Function, Proceedings of the 8th WSEAS International Conference on Acoustics & Music: Theory & Applications, Vancouver, Canada, June 19-21, 2007.
[4] C. Moisa, H. Silaghi, A. Silaghi, Speech and Speaker Recognition for the Command of an Industrial Robot, Proceedings of the 12th WSEAS International Conference on Mathematical Methods and Computational Techniques in Electrical Engineering, Stevens Point, Wisconsin, USA, 2010.
[5] M. Vondra, Kepstrální analýza řečového signálu [Cepstral Analysis of the Speech Signal], Elektrorevue, Vol.48, 2001.
[6] M. E. Torres, et al., A Multiresolution Information Measure Approach to Speech Recognition, Proceedings of the 6th WSEAS International Conference on Signal, Speech and Image Processing, Lisbon, Portugal, September 22-24, 2006.
[7] E. Marchetto, F. Avanzini, and F. Flego, An Automatic Speaker Recognition System for Intelligence Applications, Proceedings of the 17th European Signal Processing Conference (EUSIPCO 2009), Glasgow, Scotland, August 24-28, 2009.
[8] J. Sohn, N. S. Kim, and W. Sung, A Statistical Model-Based Voice Activity Detection, IEEE Signal Processing Letters, Vol. 6, No. 1, January 1999.
[9] A. Stolcke, S. Kajarekar, and L. Ferrer, Nonparametric Feature Normalization for SVM-based Speaker Verification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008).
[10] J. P. Campbell, Jr., Speaker Recognition: A Tutorial, Proceedings of the IEEE, Vol. 85, 1997.
[11] O. Horák, The Voice Segment Type Determination Using the Autocorrelation Compared to Cepstral Method, WSEAS Transactions on Signal Processing, Vol. 8, Issue 1, January 2012.
[12] P. Schwarz, P. Matejka, L. Burget, O. Glembek, Description: Phoneme Recognizer Based on Long Temporal Context, Brno, 2012, online:
[13] P. Schwarz, Phoneme Recognition Based on Long Temporal Context, PhD Thesis, Brno University of Technology.
[14] A. Kabir, Sh. Md. M.
Ahsan, Vector Quantization in Text Dependent Automatic Speaker Recognition Using Mel-frequency Cepstrum Coefficient, 6th WSEAS International Conference on Circuits, Systems,
Electronics, Control & Signal Processing, Cairo, Egypt, Dec 29-31, 2007.
[15] A. Petry, S. da S. Soares, G. F. Marchioro, A. S. M. de Franceschi, A Distributed Speaker Authentication System, Applied Computing Conference (ACC '08), Istanbul, Turkey, May 27-30, 2008.
[16] W. Al-Sawalmeh, Kh. Daqrouq, and A.-R. Al-Qawasmi, Multistage Speaker Feature Tracking Identification System Based on Continuous and Discrete Wavelet Transform, Proceedings of the 9th WSEAS International Conference on Multimedia Systems & Signal Processing, Hangzhou, China, May 20-22, 2009.
[17] W. H. Abdulla, Auditory Based Feature Vectors for Speech Recognition Systems, Advances in Communications and Software Technologies, WSEAS Press, 2002.
[18] J. S. Jung, J. K. Kim, and M. J. Bae, Speaker Recognition System Using the Prosodic Information, WSEAS Transactions on Systems, Vol. 3, Issue 3, May 2004.
[19] E. Milková, What Can Multimedia Add to the Optimization of Teaching and Learning at Universities?, 7th WSEAS International Conference on Advances on Applied Computer & Applied Computational Science (ACACOS '08), Hangzhou, China, April 6-8, 2008.
More informationPreeti Rao 2 nd CompMusicWorkshop, Istanbul 2012
Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o
More informationAutomatic Text-Independent. Speaker. Recognition Approaches Using Binaural Inputs
Automatic Text-Independent Speaker Recognition Approaches Using Binaural Inputs Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader 1 Outline Automatic speaker recognition: introduction Designed systems
More informationVOICE COMMAND RECOGNITION SYSTEM BASED ON MFCC AND DTW
VOICE COMMAND RECOGNITION SYSTEM BASED ON MFCC AND DTW ANJALI BALA * Kurukshetra University, Department of Instrumentation & Control Engineering., H.E.C* Jagadhri, Haryana, 135003, India sachdevaanjali26@gmail.com
More informationFundamental Frequency Detection
Fundamental Frequency Detection Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz DCGM FIT BUT Brno Fundamental Frequency Detection Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/37
More informationOptimization of FSS Filters
Optimization of FSS Filters P. Tomasek Abstract This work aims at description of the optimization process of frequency selective surfaces. The method of moments is used to analyze the planar periodic structure
More informationDWT and LPC based feature extraction methods for isolated word recognition
RESEARCH Open Access DWT and LPC based feature extraction methods for isolated word recognition Navnath S Nehe 1* and Raghunath S Holambe 2 Abstract In this article, new feature extraction methods, which
More informationDWT BASED AUDIO WATERMARKING USING ENERGY COMPARISON
DWT BASED AUDIO WATERMARKING USING ENERGY COMPARISON K.Thamizhazhakan #1, S.Maheswari *2 # PG Scholar,Department of Electrical and Electronics Engineering, Kongu Engineering College,Erode-638052,India.
More informationCommunications Theory and Engineering
Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 Speech and telephone speech Based on a voice production model Parametric representation
More informationA Wavelet Based Approach for Speaker Identification from Degraded Speech
International Journal of Communication Networks and Information Security (IJCNIS) Vol., No. 3, December A Wavelet Based Approach for Speaker Identification from Degraded Speech A. Shafik, S. M. Elhalafawy,
More informationSpeech Synthesis; Pitch Detection and Vocoders
Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University May. 29, 2008 Speech Synthesis Basic components of the text-to-speech
More informationTime-Frequency Distributions for Automatic Speech Recognition
196 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 3, MARCH 2001 Time-Frequency Distributions for Automatic Speech Recognition Alexandros Potamianos, Member, IEEE, and Petros Maragos, Fellow,
More informationKeywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding.
Volume 5, Issue 2, February 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Speech Enhancement
More informationChapter 4 SPEECH ENHANCEMENT
44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or
More informationVoice Recognition Technology Using Neural Networks
Journal of New Technology and Materials JNTM Vol. 05, N 01 (2015)27-31 OEB Univ. Publish. Co. Voice Recognition Technology Using Neural Networks Abdelouahab Zaatri 1, Norelhouda Azzizi 2 and Fouad Lazhar
More informationDimension Reduction of the Modulation Spectrogram for Speaker Verification
Dimension Reduction of the Modulation Spectrogram for Speaker Verification Tomi Kinnunen Speech and Image Processing Unit Department of Computer Science University of Joensuu, Finland tkinnu@cs.joensuu.fi
More informationA CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION
17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 A CONSTRUCTION OF COMPACT MFCC-TYPE FEATURES USING SHORT-TIME STATISTICS FOR APPLICATIONS IN AUDIO SEGMENTATION
More informationVocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA
Vocal Command Recognition Using Parallel Processing of Multiple Confidence-Weighted Algorithms in an FPGA ECE-492/3 Senior Design Project Spring 2015 Electrical and Computer Engineering Department Volgenau
More informationInternational Journal of Engineering and Techniques - Volume 1 Issue 6, Nov Dec 2015
RESEARCH ARTICLE OPEN ACCESS A Comparative Study on Feature Extraction Technique for Isolated Word Speech Recognition Easwari.N 1, Ponmuthuramalingam.P 2 1,2 (PG & Research Department of Computer Science,
More informationAn Approach to Very Low Bit Rate Speech Coding
Computing For Nation Development, February 26 27, 2009 Bharati Vidyapeeth s Institute of Computer Applications and Management, New Delhi An Approach to Very Low Bit Rate Speech Coding Hari Kumar Singh
More informationON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN. 1 Introduction. Zied Mnasri 1, Hamid Amiri 1
ON THE RELATIONSHIP BETWEEN INSTANTANEOUS FREQUENCY AND PITCH IN SPEECH SIGNALS Zied Mnasri 1, Hamid Amiri 1 1 Electrical engineering dept, National School of Engineering in Tunis, University Tunis El
More informationSpeech/Music Change Point Detection using Sonogram and AANN
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 6, Number 1 (2016), pp. 45-49 International Research Publications House http://www. irphouse.com Speech/Music Change
More informationIntelligent Identification System Research
2016 International Conference on Manufacturing Construction and Energy Engineering (MCEE) ISBN: 978-1-60595-374-8 Intelligent Identification System Research Zi-Min Wang and Bai-Qing He Abstract: From the
More informationNon-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License
Title Non-intrusive intelligibility prediction for Mandarin speech in noise Author(s) Chen, F; Guan, T Citation The 213 IEEE Region 1 Conference (TENCON 213), Xi'an, China, 22-25 October 213. In Conference
More informationDEEP LEARNING BASED AUTOMATIC VOLUME CONTROL AND LIMITER SYSTEM. Jun Yang (IEEE Senior Member), Philip Hilmes, Brian Adair, David W.
DEEP LEARNING BASED AUTOMATIC VOLUME CONTROL AND LIMITER SYSTEM Jun Yang (IEEE Senior Member), Philip Hilmes, Brian Adair, David W. Krueger Amazon Lab126, Sunnyvale, CA 94089, USA Email: {junyang, philmes,
More informationReading: Johnson Ch , Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday.
L105/205 Phonetics Scarborough Handout 7 10/18/05 Reading: Johnson Ch.2.3.3-2.3.6, Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday Spectral Analysis 1. There are
More informationText and Language Independent Speaker Identification By Using Short-Time Low Quality Signals
Text and Language Independent Speaker Identification By Using Short-Time Low Quality Signals Maurizio Bocca*, Reino Virrankoski**, Heikki Koivo* * Control Engineering Group Faculty of Electronics, Communications
More informationAn Optimization of Audio Classification and Segmentation using GASOM Algorithm
An Optimization of Audio Classification and Segmentation using GASOM Algorithm Dabbabi Karim, Cherif Adnen Research Unity of Processing and Analysis of Electrical and Energetic Systems Faculty of Sciences
More informationBasic Characteristics of Speech Signal Analysis
www.ijird.com March, 2016 Vol 5 Issue 4 ISSN 2278 0211 (Online) Basic Characteristics of Speech Signal Analysis S. Poornima Assistant Professor, VlbJanakiammal College of Arts and Science, Coimbatore,
More informationCOMP 546, Winter 2017 lecture 20 - sound 2
Today we will examine two types of sounds that are of great interest: music and speech. We will see how a frequency domain analysis is fundamental to both. Musical sounds Let s begin by briefly considering
More informationDesign and Implementation on a Sub-band based Acoustic Echo Cancellation Approach
Vol., No. 6, 0 Design and Implementation on a Sub-band based Acoustic Echo Cancellation Approach Zhixin Chen ILX Lightwave Corporation Bozeman, Montana, USA chen.zhixin.mt@gmail.com Abstract This paper
More informationCO-CHANNEL SPEECH DETECTION APPROACHES USING CYCLOSTATIONARITY OR WAVELET TRANSFORM
CO-CHANNEL SPEECH DETECTION APPROACHES USING CYCLOSTATIONARITY OR WAVELET TRANSFORM Arvind Raman Kizhanatham, Nishant Chandra, Robert E. Yantorno Temple University/ECE Dept. 2 th & Norris Streets, Philadelphia,
More informationspeech signal S(n). This involves a transformation of S(n) into another signal or a set of signals
16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract
More informationCepstrum alanysis of speech signals
Cepstrum alanysis of speech signals ELEC-E5520 Speech and language processing methods Spring 2016 Mikko Kurimo 1 /48 Contents Literature and other material Idea and history of cepstrum Cepstrum and LP
More informationBook Chapters. Refereed Journal Publications J11
Book Chapters B2 B1 A. Mouchtaris and P. Tsakalides, Low Bitrate Coding of Spot Audio Signals for Interactive and Immersive Audio Applications, in New Directions in Intelligent Interactive Multimedia,
More informationParticipant Identification in Haptic Systems Using Hidden Markov Models
HAVE 25 IEEE International Workshop on Haptic Audio Visual Environments and their Applications Ottawa, Ontario, Canada, 1-2 October 25 Participant Identification in Haptic Systems Using Hidden Markov Models
More information