Optical Channel Access Security based on Automatic Speaker Recognition


L. Zão (1), A. Alcaim (2) and R. Coelho (1)

(1) Laboratory of Research on Communications and Optical Systems, Electrical Engineering Department, Instituto Militar de Engenharia (IME), {lzao,coelho}@ime.eb.br.
(2) Center for Telecommunication Studies (CETUC), Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), alcaim@cetuc.puc-rio.br.

Abstract

A robust optical channel access system based on speaker identification is proposed and demonstrated in this paper. The solution also enables optical access with remote speaker identification. A set of speech features and classifiers was defined to achieve the best recognition rates for local and remote optical access. The experiments showed the feasibility and importance of using biometric technology for optical communications security.

Index Terms: optical communications, optical channel access, optical security, speaker identification.

I. INTRODUCTION

In the last decades, communications security has become a very important issue for private and public organizations [1]. Moreover, home communication systems provided by broadband access technologies also have security requirements. Fiber-to-the-home (FTTH) technology has been widely deployed in recent years. The number of FTTH users is expected to increase from 2.8 million in May 2005 to 30 million by the end of 2010 [2] [3] [4]. The FTTH market growth is a reality worldwide. Currently, in 14 countries more than 1% of households have FTTH broadband access [5]. Thus, optical access systems must be provisioned in a way that guarantees communications security.

One major challenge of communications access systems is to prevent unauthorized access by intruders. Access solutions based on password identification have proven inappropriate for communications with strict security requirements. On the other hand, biometric systems have been considered a promising solution for communications access security applications [6] [7] [8]. In these solutions, identity recognition is based on human features such as fingerprints, face, signature, retina and voice.

Therefore, this paper proposes a robust channel access scheme for optical communications based on text-independent automatic speaker identification, for applications in a closed-user group with strict access requirements, e.g., forensic and private groups. The optical channel access request is authorized only after the speaker identification. An intruder speaker was included in the tests to guard against potential attackers and possible threats. If the intruder speaker is identified, the optical channel access request is not authorized. Another important contribution of this work is to show the feasibility of using automatic speaker recognition technology for remote identification.

The voice biometric feature was selected in this work because its extraction is considered simple and non-invasive, and it can be obtained with available technology. In a speaker identification process, a speech utterance has to be attributed to one of the registered speakers. Generally, physiological features such as the Mel frequency cepstral coefficients (MFCC) are not robust to the acoustic distortion of the channel, and their extraction from the speech signal requires a high computational load. This is because these features model the spectral characteristics of the human vocal mechanism. The statistical feature (pH) consists of a vector of Hurst parameters proposed for speaker identification systems [9]. Unlike the physiological features, the pH feature tends to be robust to channel distortions, since it models the stochastic behavior of the speech signal. The pH feature is not related to the transfer function of the vocal tract and needs less complex extraction/estimation methods. Additionally, it can be obtained in real time, i.e., during the speaker's activity. The optical access authentication experiments considered the MFCC, the vector of Hurst parameters (pH) and also the fusion of these speech features (pH+MFCC).
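To make the idea of a Hurst parameter estimate concrete, the following is a minimal sketch of a wavelet-based estimate of a single H value from a sequence. It is an illustration under stated assumptions, not the pH extraction of [9]: it uses the Haar wavelet rather than Daubechies filters, an unweighted least-squares fit, and returns one scalar instead of a vector of Hurst parameters. The function names and the relation H = (slope + 1) / 2, which applies to stationary long-range-dependent sequences, are choices made for this example.

```python
import math
import random


def haar_details(signal, levels):
    """Return the Haar detail coefficients of each octave (orthonormal DWT).

    Assumption: Haar is used for brevity; it is not the wavelet of the paper.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = len(approx) // 2
        if pairs == 0:
            raise ValueError("signal too short for the requested levels")
        detail = [(approx[2 * k] - approx[2 * k + 1]) / math.sqrt(2)
                  for k in range(pairs)]
        approx = [(approx[2 * k] + approx[2 * k + 1]) / math.sqrt(2)
                  for k in range(pairs)]
        details.append(detail)
    return details


def hurst_from_details(details, first=1, last=None):
    """Fit log2(mean d_j^2) against octave j and map the slope to H.

    Assumption: unweighted regression and H = (slope + 1) / 2, valid for a
    stationary long-range-dependent sequence (white noise gives H near 0.5).
    """
    last = last or len(details)
    octaves = list(range(first, last + 1))
    if len(octaves) < 2:
        raise ValueError("need at least two octaves for the regression")
    energies = []
    for j in octaves:
        coeffs = details[j - 1]
        mean_sq = sum(c * c for c in coeffs) / len(coeffs)
        if mean_sq <= 0.0:
            raise ValueError("zero detail energy; cannot take a logarithm")
        energies.append(math.log2(mean_sq))
    j_mean = sum(octaves) / len(octaves)
    e_mean = sum(energies) / len(energies)
    slope = (sum((j - j_mean) * (e - e_mean) for j, e in zip(octaves, energies))
             / sum((j - j_mean) ** 2 for j in octaves))
    return (slope + 1.0) / 2.0


def white_noise(n, seed=0):
    """Hypothetical helper: n samples of unit-variance Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

For example, `hurst_from_details(haar_details(white_noise(8192), 6))` should return a value close to 0.5, since white noise has no long-range dependence.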
For the local and remote speaker identification tasks, the Gaussian Mixture Model (GMM) [10] and the Multidimensional fractional Brownian Motion (M-dim-fBm) [9] classifiers were investigated. These classifiers present the best recognition rates for the pH and MFCC speech features in the recent speaker recognition literature. The M-dim-fBm classifier is based on the fractional Brownian motion (fBm) stochastic process. However, the speech signal is not assumed to be a fractal or self-similar process. These classifiers can be applied to any feature matrix. The M-dim-fBm exploits the relationship and the evolution of the matrix elements to derive a speaker model.

The experiments made it possible to determine the speaker identification system, i.e., the speech features and classifier, that achieves the best local identification rate (LIR, training and test phases placed at the same optical access point or device) and remote identification rate (RIR, training and test phases at different access points). The biometric identification results are presented for speech segments, or test durations (TD), of 10, 5 and 1 seconds and a 95% confidence level.

The experiments used two access point (AP) devices implemented on field-programmable gate array (FPGA) boards placed at two different PC host stations. For the local tests, the identification is performed at the AP where access to the communication system was requested (AP of origin). For access with remote identification, the speech features are bit-serially encoded at the AP at which the access is requested and then transmitted over a 1.5 km single-mode optical fiber between the AP/PC stations. After the photodetection, the encoded bit sequences are demodulated and reconverted to the electronic domain. The speech feature matrix is then recomposed to perform the speaker identification at the remote AP.

[Fig. 1. Local and remote access authentication for a 1.5 km optical fiber transmission between 2 APs. Block labels in the figure: AP 1 / FPGA 1, AP 2 / FPGA 2, feature extraction, bit encoding, local and remote speaker ID, FSK switch (600 MHz / 300 MHz), VCO, PLL, APC driver, laser, isolator, coupler, noise generator, 1.5 km fiber, photodetector, 150 MHz clock, counter, decision circuit, bit sequence, feature matrix composition, BER.]

A Gaussian noise generator was included in the experiments in order to evaluate its impact on the remote speaker identification system. The RIR results are also presented for different bit error rate (BER) measures and noise power levels. In a preliminary study [11], the proposed system was evaluated by simulation using the RSOFT/OptSim simulator version 4.6, considering speech feature transmission over a 25 km optical fiber channel. In this paper, an experimental setup was developed to demonstrate the optical access proposal and to examine its performance in practical conditions.

This paper is organized as follows. Section II describes the optical authentication access system proposed in this paper and the experimental setup. Section III presents the main characteristics of the speaker identification schemes considered in this work. The experimental results obtained for the evaluated authentication systems are reported and discussed in Section IV. Finally, Section V presents the main conclusions of this work.

II. OPTICAL ACCESS AUTHENTICATION WITH SPEAKER IDENTIFICATION AND EXPERIMENT SETUP

The proposed optical channel access system is based on a speaker modeling function located at the AP devices of the optical communication system. The solution involves three steps: speech acquisition, feature extraction and speaker identification [12]. The intruder speaker model was constructed similarly to the Universal Background Model (UBM) [13] generally applied in speaker verification systems.
It uses speech material of 20 speakers that do not belong to the set of 70 speakers used for the testing experiments. If the intruder speaker is identified, the access request is not authorized. For local recognition, these three steps are performed at the AP device at which the access is requested. For remote recognition, on the other hand, the speaker identification step is performed at a remote point of the communication system. The speaker model is obtained during the training phase. In the remote recognition system, the speaker model is stored at a point other than the access point (where the speech features are extracted). Access to the optical communication system is authorized only after the speaker has been identified as one of the registered speakers. The experimental setup of the proposed optical access system is illustrated in Fig. 1. For the remote identification tests, the elements of the speech feature matrix are stored and bit-serially encoded at the FPGA (Altera Stratix EP1S25 Development Kits using the Quartus II suite for Linux) of AP1.
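The remote path of Fig. 1 (bit encoding at AP1, FSK tone selection at a 150 MHz bit clock with 600 MHz and 300 MHz tones, and edge counting at AP2) can be sketched as an idealized, noise-free model. The word length, the fixed-point scaling and the decision threshold of 3 rising edges are assumptions made for this illustration; the text does not state them, and all function names are hypothetical.

```python
BIT_RATE_HZ = 150_000_000                      # bit-slot clock shown in Fig. 1
TONE_HZ = {1: 600_000_000, 0: 300_000_000}     # FSK tones shown in Fig. 1
WORD_BITS = 16                                 # assumption: word length not given
FRAC_BITS = 8                                  # assumption: fixed-point scaling


def to_bits(value):
    """Quantize one feature value to a two's-complement word, MSB first."""
    q = round(value * (1 << FRAC_BITS))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    q = max(lo, min(hi, q)) & ((1 << WORD_BITS) - 1)
    return [(q >> i) & 1 for i in range(WORD_BITS - 1, -1, -1)]


def from_bits(bits):
    """Rebuild the feature value from one MSB-first two's-complement word."""
    q = 0
    for b in bits:
        q = (q << 1) | b
    if q >= 1 << (WORD_BITS - 1):
        q -= 1 << WORD_BITS
    return q / (1 << FRAC_BITS)


def rising_edges_per_slot(bit):
    """Ideal number of tone cycles (rising edges) inside one bit slot."""
    return TONE_HZ[bit] // BIT_RATE_HZ


def decide(edge_count, threshold=3):
    """Counter-based decision; the threshold is an assumed midpoint."""
    return 1 if edge_count >= threshold else 0


def transmit(features):
    """Encode features bit-serially, 'send' them as edge counts, decode."""
    bits = [b for value in features for b in to_bits(value)]
    counts = [rising_edges_per_slot(b) for b in bits]
    received = [decide(c) for c in counts]
    return [from_bits(received[i:i + WORD_BITS])
            for i in range(0, len(received), WORD_BITS)]
```

In this ideal model a bit 1 produces 4 rising edges per slot and a bit 0 produces 2, so any threshold between them separates the two cases; noise in a real link would perturb the counts and cause bit errors.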

Optical pulses (26 ps pulsewidth) are generated by a laser source, operating at nm, at a repetition rate (150 MHz) equal to the encoded bit rate. The optical pulses are externally modulated by the encoded bit sequences of the speech features using the frequency-shift keying (FSK) technique. At each bit slot, FSK selects the 600 MHz frequency for bit 1 and the 300 MHz frequency for bit 0. A laser modulation driver with automatic power control (APC) for high-speed, low-voltage operation (single +3.3 V supply and 30 mA) is used in the experiment. The isolator is necessary to preserve the laser integrity. FSK modulation was adopted since it has shown robustness against fiber nonlinear effects in experiments using short fiber distances ( 100 km) and data rates 10 Gb/s [14].

After the photodetection (receiver sensitivity of -28 dBm and 70 mW power dissipation at 3.3 V), demodulation is done by a simple decision and clock recovery circuit implemented at the FPGA (AP2). During each clock period, the counter counts the number of rising edges and, according to that count, sets the input of the D flip-flop to bit 1 or, otherwise, to bit 0. The bit sequences are then reconverted to the electronic domain and the speech feature matrix is recomposed to perform the remote speaker identification.

The 0 dB value of the Gaussian noise generator is taken as the reference noise (BER ) and corresponds to a 15 dBm power level. Four other values were added to this 0 dB noise reference (+1.0, +2.0, +3.0 and +4.5 dB) as additive noise. This is important for evaluating the remote identification under noisy conditions. BER measurements were also collected for these noise levels.

III. CHARACTERISTICS OF THE SPEAKER IDENTIFICATION SCHEMES

This section presents the main characteristics of the speaker identification schemes considered for the optical access authentication system.

A. pH and MFCC Features

For the pH extraction [9], Daubechies wavelet filters [15] with 12 coefficients, 6 decomposition scales and a coefficient range from 3 to 5 were considered. The speech feature matrix is composed of 7 elements for the pH vectors and 15 MFCC coefficients obtained from each speech frame. The estimation of the pH feature demands less computational complexity (O(n)) than the extraction of the MFCC (the computational complexity of the fast Fourier transform (FFT) is O(n log(n))). In order to define the best identification system, a feature matrix obtained from the fusion of the pH vectors and the MFCC (pH+MFCC) was also considered. The feature extraction is done at AP1 (Fig. 1). The resulting coefficients are then bit encoded and modulate the laser source in the remote tests.

B. GMM Classifier

A mixture of Gaussian probability densities is a weighted sum of M densities, and is given by

$$p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\vec{x}) \tag{1}$$

where $\vec{x}$ is a random vector of dimension $D$, $b_i(\vec{x})$, $i = 1, \ldots, M$, are the component densities, and $p_i$, $i = 1, \ldots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function of the form

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} |K_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\vec{x} - \vec{\mu}_i)^{T} K_i^{-1} (\vec{x} - \vec{\mu}_i) \right\} \tag{2}$$

with mean vector $\vec{\mu}_i$ and covariance matrix $K_i$, where the superscript $T$ denotes the transpose operation and $|\cdot|$ is the determinant. The Gaussian mixture model, $\lambda$, is parametrized by mean vectors, covariance matrices, and mixture weights. These parameters are jointly represented by the following notation:

$$\lambda = \{p_i, \vec{\mu}_i, K_i\}, \quad i = 1, \ldots, M. \tag{3}$$

The model parameters are estimated from a set of training data as those that maximize the likelihood of the GMM. In this paper, the parameter estimates were obtained by using a special case of the expectation-maximization (EM) algorithm [10]. For a sequence of $T$ independent training vectors $X = \{\vec{x}_1, \ldots, \vec{x}_T\}$, the normalized log-likelihood of the GMM is given by

$$\log p(X \mid \lambda) = \frac{1}{T} \sum_{t=1}^{T} \log p(\vec{x}_t \mid \lambda) \tag{4}$$

The decision rule for the speaker identification system chooses the speaker model for which this value is maximum.

C. M-dim-fBm Classifier

The M-dim-fBm classification scheme [9] is also based on models of the input features. The speaker model is generated according to the following steps:

1) Pre-processing: the feature matrix formed from the input speech is split into r regions. This matrix contains c rows, where c is the number of feature coefficients per frame, and N columns, where N is the number of frames.
2) Decomposition: for each row of the feature matrix in a certain region, the wavelet decomposition is applied in order to obtain the detail sequences.
3) Parameters Extraction/Estimation: from each set of detail sequences obtained from each row in step 2, estimate the mean, the variance and the H parameter of the features being used by the identification system.
For the H parameter estimation, the reader can use the wavelet-based estimator proposed in [16].

4) Generation of fBm Processes: using the Random Midpoint Displacement (RMD) algorithm [17] and the three parameters computed in step 3, generate the fBm processes. Therefore, c fBm processes are obtained for a given region.
5) Determining the Histogram and Generating the Models: compute the histogram of each fBm process of the given region. The set of all histograms defines the speaker c-dimensional model for that region.
6) Speaker Model: the process is repeated for all of the r regions. This means that an r·c-dimensional fBm process is obtained, which defines the speaker M-dim-fBm model.

In this work, r = 1 was used in the tests. In the test phase, the histograms of the speaker, obtained from the M-dim-fBm model, are used to compute the probability that a certain c-dimensional feature vector $\vec{x}$ belongs to that speaker. This is performed for the N feature vectors, resulting in N probability values: $p_1, p_2, \ldots, p_N$. Adding these values gives the likelihood measure that the set of feature vectors under analysis belongs to that particular speaker. Note that the M-dim-fBm is characterized by only 3 scalar parameters ($m$, $\sigma^2$ and $H$), while the GMM needs 32 Gaussian functions, each one characterized by 1 scalar parameter, 1 mean vector and 1 covariance matrix, to achieve comparable performance results. Thus, the M-dim-fBm classifier achieves a lower computational load for the speaker modeling. This is the first time that the GMM and M-dim-fBm classifiers are evaluated for remote recognition.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, the main results of the proposed optical access system are presented and discussed for the local and remote experiments. The local and remote results are presented in terms of identification (recognition) accuracy.

A. The Speech Database

The speech database used in the experiments is composed of a subset of 70 speakers (male and female, 2 : 1) from 27 Brazilian regions who read 2 different texts (for training and tests). The speakers called a free automatic communication center using fixed phones to record the speech signals. This was also important to obtain a challenging text-independent speaker recognition experimental setup. The intruder speaker model was defined by the speech material of 20 speakers that do not belong to the set of 70 speakers used for the identification testing experiments. If the intruder speaker is identified, the access request is not authorized. An intruder identification was counted as a speaker recognition error. A separate speech segment of 1 minute duration was used to train each speaker model. The average speech duration is 196 seconds for the test phase. The experiments were applied to speech segments of 10, 5 and 1 seconds, referred to as test durations (TD). The number of tests was 1470, 2950 and for TD of 10, 5 and 1 seconds, respectively. For these TD values, and considering the Chebyshev inequality test, the accuracy of the identification rates is 0.057, and for a confidence degree of 95%. The speech signal was split into N frames of 25 ms with 50% overlap. The pH vectors and MFCC features were extracted from the resulting frames.
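The framing step described above (25 ms frames with 50% overlap) can be sketched as follows. The 8 kHz sampling rate is an assumption typical of fixed-phone speech and is not stated in the text; the function name is hypothetical.

```python
def frame_signal(samples, sample_rate_hz=8000, frame_ms=25, overlap=0.5):
    """Split a sample sequence into fixed-length frames with fractional overlap.

    Assumption: 8 kHz sampling; trailing samples that do not fill a whole
    frame are discarded.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000))
    hop = int(round(frame_len * (1.0 - overlap)))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]
```

Under the 8 kHz assumption each frame has 200 samples and consecutive frames start 100 samples apart, so the number of frames N grows roughly linearly with the test duration.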

[Table I. Local identification accuracy, LIR (%). Columns: M-dim-fBm (pH, MFCC, pH+MFCC) and GMM (pH, MFCC, pH+MFCC); rows: TD = 10 s, 5 s and 1 s.]

[Table II. Remote identification accuracy, RIR (%), after the 1.5 km optical transmission. Columns: M-dim-fBm (pH, MFCC, pH+MFCC) and GMM (pH, MFCC, pH+MFCC); rows: TD = 10 s, 5 s and 1 s.]

The bit encoding of the feature matrix elements (implemented at the FPGAs) generated bits for pH, MFCC and bits for the fusion of the pH and MFCC features.

B. LIR and RIR Results

Table I shows the LIR accuracy results obtained for the M-dim-fBm and GMM classifiers considering the different speech features and test durations. It is important to remark that the pH feature uses only 7 elements per speech frame. This implies a lower classifier complexity compared with systems operating on 15 MFCC coefficients per speech frame. The classifiers presented quite similar performance for the different speech features.

Table II presents the RIR accuracy results for the M-dim-fBm and GMM classifiers considering the different speech features and test durations. Here, the steps of the access system were performed at different APs. These RIR results were obtained after the 1.5 km optical transmission of the encoded bits (i.e., at the remote point of the communications system) and the recomposition of the speech feature matrix. The results are presented here for the BER measure of 1.0 x (0 dB), which is a typical value for optical communications. Note that for TD=10s and TD=5s the RIR results were slightly lower than the LIR results. It can also be seen that the recognition rates for TD=1s were significantly affected by the short duration of the speech signal. From the LIR and RIR results it can be verified that the best performance was achieved by the joint use of the pH and MFCC speech features. The GMM and M-dim-fBm classifiers provided similar identification results, with a slight superiority of the latter.
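The identification rates above rest on the maximum-likelihood decision rule of Section III together with the intruder model. The following is a minimal sketch of GMM scoring in the sense of Eqs. (1), (2) and (4), with an access decision that rejects the request when the intruder model wins. Diagonal covariance matrices, the model layout (a list of weight, mean, variance triples) and all names are assumptions for this illustration; training with EM is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a D-variate Gaussian density, Eq. (2), with diagonal covariance."""
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    log_det = sum(math.log(vi) for vi in var)
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + quad)


def log_mixture(x, model):
    """log p(x | lambda), Eq. (1), for model = [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in model]
    peak = max(terms)
    # log-sum-exp keeps the sum stable when the densities are very small
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def normalized_log_likelihood(frames, model):
    """Average frame log-likelihood, Eq. (4)."""
    return sum(log_mixture(x, model) for x in frames) / len(frames)


def identify(frames, models, intruder_key="intruder"):
    """Pick the best-scoring model; access is granted unless it is the intruder."""
    best = max(models,
               key=lambda name: normalized_log_likelihood(frames, models[name]))
    return best, best != intruder_key
```

With toy two-dimensional models, for instance, `identify([[4.8, 5.1]], models)` returns the name of the model whose components lie closest to the test vectors, plus a flag that is `False` only when that model is the intruder.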

[Fig. 2. RIR (%) versus optical noise level (dB) for the pH, MFCC and pH+MFCC features with the M-dim-fBm and GMM classifiers: (a) TD=10s, (b) TD=5s and (c) TD=1s.]

C. RIR Results versus Optical Noise Level

The BER measures collected for the transmission of the pH, MFCC and pH+MFCC encoded bits over the 1.5 km optical fiber at each noise power level are shown in Tab. III.

TABLE III
BER X NOISE POWER LEVEL FOR THE pH, MFCC AND pH+MFCC ENCODED BITS

Noise     pH        MFCC      pH + MFCC
0 dB      1.0 x     x         x
dB        1.6 x     x         x
dB        2.3 x     x         x
dB        2.0 x     x         x
dB        2.2 x     x         x 10 6

Although optical communications achieve very low information losses ( ), bit errors can occur due to errors at the photodetection or receiver devices. This photodetection sensitivity problem is also referred to as quantum noise or shot noise. Generally, most optical receivers operate at higher values than the accepted quantum limit of 20 dB [18]. However, other noise sources or devices can also be present in practical optical communication systems. These tests also made it possible to find the noise limits for achieving a useful authentication rate without needing optical amplification devices to reduce the BER.

The RIR versus noise power level curves for the pH, MFCC and pH+MFCC features and the different TD values are shown in Fig. 2. The 0 dB value corresponds to RIR values with no or very low noise inserted by the optical communication channel and devices. It can be seen that the RIR results were strongly affected by noise levels greater than +2 dB.
The MFCC coefficients were the most affected by these noise levels (i.e., greater than +2 dB), leading to an important decrease of the RIR values, especially for TD values of 5s and 1s. For the pH feature, the RIR values decreased 16% (TD=5s) and 12% (TD=1s) from noise values +1 dB to +4.5 dB. For the MFCC features, the RIR values decreased 20% (TD=5s) and 15% (TD=1s) considering the same noise values. However, for the fusion of the pH and MFCC features, the RIR values decreased 12.5% (TD=5s) and 16% (TD=1s). Physiological features such as MFCC coefficients model the spectral characteristics of the vocal tract mechanism and are generally not robust to the acoustic distortion caused by channels. This could explain the RIR results obtained with the MFCC coefficients for noise levels greater than +2 dB. Once more, the best RIR results for both M-dim-fBm and GMM classifiers were obtained with the fusion of the pH and the MFCC speech features.

V. CONCLUSION

This paper presented a robust access scheme for optical communications based on speaker identification. The experiments demonstrated the feasibility and importance of using biometric technology for optical communications access security. The results showed that the best local and remote speaker recognition rates were achieved with the fusion of the pH and MFCC features, for both the M-dim-fBm and GMM classifiers. They also showed that the MFCC feature can be affected by noisy channels, while the pH feature seems to be more robust for short test durations (5s and 1s). The best results under noisy conditions were obtained by the fusion of the pH and MFCC speech features. The proposed access system can also be used with other feature recognition schemes, thereby composing a multimodal biometric device in order to improve the identity recognition rates and thus the optical channel access security.

REFERENCES

[1] R. Kuhn, M. Tracy, and S. Frankel, Security for telecommuting and broadband communications, NIST Recommendations, vol , pp , August
[2] H. Shinohara, Broadband access in Japan: Rapidly growing FTTH market, IEEE Communications Magazine, vol. 43, pp , September
[3] N. Cheung, Fiber to 30 million homes, IEEE Communications Magazine, vol. 43, pp , September
[4] E. Desurvire, Optical communications in 2025, 31st European Conference on Optical Communication (ECOC 2005), vol. 1, pp. 5-6, September
[5] FTTH, Fiber to the home deployment spreads as more economies show market growth, FTTH Council, Available at:
[6] S. Kartalopoulos, Communications security: Biometrics over communications networks, Proceedings of the Globecom, pp. 1-5, December
[7] A. Jain, A. Ross, and S. Prabhakar, An introduction to biometric recognition, IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 4-20,
[8] A. Jain, K. Nandakumar, and A. Nagar, Biometric template security, EURASIP Journal on Advances in Signal Processing, vol. 2008, pp. 1-18,
[9] R. Sant'Ana, R. Coelho, and A. Alcaim, Text-independent speaker recognition based on the Hurst parameter and the multi-dimensional fractional Brownian motion, IEEE Transactions on Audio, Speech and Language Processing, vol. 14, pp , May 2006.

[10] D. Reynolds and R. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing, vol. 3, pp , January
[11] L. Zão, A. Alcaim, and R. Coelho, Optical communications security with robust channel access based on speaker identification, 16th International Conference on Digital Signal Processing (DSP 2009), pp. 1-5, July
[12] D. O'Shaughnessy, Speech Communication: Human and Machine, vol. 2 Ed. IEEE Press,
[13] D. Reynolds, R. Rose, and E. Hofstetter, Integrated models of signal and background with application to speaker identification in noise, IEEE Transactions on Speech and Audio Processing, vol. 2, pp , April
[14] J. Prat and J. Gené, Reduction of laser modulation bandwidth requirement in FSK systems using duobinary coding and differential detection, Electronics Letters, vol. 42, pp , May
[15] I. Daubechies, Ten Lectures on Wavelets. Philadelphia: SIAM,
[16] D. Veitch and P. Abry, A wavelet-based joint estimator of the parameters of long-range dependence, IEEE Trans. on Information Theory, vol. 45, no. 3, pp ,
[17] M. Barnsley et al, The Science of Fractal Images. USA: Springer-Verlag New York Inc.,
[18] G. Agrawal, Fiber-Optic Communication Systems. USA: John-Wiley, 2002.
