
Friedrich-Alexander-Universität Erlangen-Nürnberg
Lab Course: Pitch Estimation
International Audio Laboratories Erlangen
Prof. Dr.-Ing. Bernd Edler
Friedrich-Alexander Universität Erlangen-Nürnberg, International Audio Laboratories Erlangen, Lehrstuhl Semantic Audio Processing, Am Wolfsmantel 33, 91058 Erlangen
bernd.edler@audiolabs-erlangen.de
The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander Universität Erlangen-Nürnberg (FAU) and the Fraunhofer-Institut für Integrierte Schaltungen IIS.

Authors: Stefan Bayer, Nils Werner
Tutors: Nils Werner, Christian Helmrich
Contact: Nils Werner, Christian Helmrich, Friedrich-Alexander Universität Erlangen-Nürnberg, International Audio Laboratories Erlangen, Lehrstuhl Semantic Audio Processing, Am Wolfsmantel 33, 91058 Erlangen
nils.werner@audiolabs-erlangen.de, christian.helmrich@audiolabs-erlangen.de
This handout is not supposed to be redistributed.
Pitch Estimation, © 2015

Lab Course Pitch Estimation

Abstract

When looking at audio signals, one possible signal model is to distinguish between harmonic components and noise-like components. The harmonic components exhibit a periodic structure in time, and it is of course of interest to express this periodicity via the fundamental frequency F0, i.e. the frequency of the first sinusoidal component of the harmonic source. This fundamental frequency is closely related to the so-called pitch of the source. The pitch is defined as how low or high a harmonic or tone-like source is perceived. Although strictly speaking this is a perceptual property, and is not necessarily equal to the fundamental frequency, it is often used as a synonym for the fundamental frequency. We will use the term pitch in this way in the remaining text. It is also of interest how the harmonic and noise-like components of an audio signal relate in terms of energy. One feature expressing this relationship is the Harmonic to Noise Ratio (HNR). The estimates of the pitch and the HNR can then be used, e.g., for efficiently coding the signal, or for generating a synthetic signal based on this and other information gained from analysing the signal. In this laboratory we will concentrate on a single audio source, and we will restrict ourselves to speech, which is the primary mode of human interaction. We will use these signals to develop simple estimators for both features and compare the results to state-of-the-art solutions for estimating the pitch and the HNR.

1 Pitch Estimation

As stated above, we model an audio signal, or to be more specific, a speech signal, as a mixture of a harmonic signal and a noise signal:

    s(t) = h(t) + n(t)    (1)

where s(t) is the speech signal, h(t) is the harmonic component, and n(t) is the noise component. For time-discrete signals (and in digital signal processing we of course deal with such time-discrete signals) the equation becomes:

    s[k] = h[k] + n[k]    (2)

with k being the sample index.
In this section we will have a closer look at the harmonic component h(t), which can be expressed as the sum of its partial tones. These are sinusoids whose frequencies are integer multiples of the fundamental frequency F0:

    h(t) = Σ_{n=1..N} a_n · sin(2π·n·F0·t + φ_n)    (3)

where a_n are the individual amplitudes and φ_n are additional phases of the individual partial tones. Unfortunately, in real-world signals like speech, typically neither the amplitudes nor the fundamental frequency stay constant over the whole duration of the signal. But when looking closer at, e.g., speech, we see that these parameters normally change only slowly over time. This behaviour gives us the possibility to assume that the parameters stay constant if we partition the signal into small enough sections in time. Such signals are called quasi-stationary. So the first step towards a pitch estimation is to divide the signal into small enough blocks. The length of the blocks is determined by the lowest pitch we would like to detect; for most algorithms, at least two periods of the signal should be contained within one block to give a reliable estimate. Table 1 gives a rough overview of the pitch ranges in human speech.
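As an illustration of the harmonic model of equation (3), the following self-contained Python sketch synthesizes one quasi-stationary harmonic block with constant parameters. The amplitudes, phases, F0 and block length used here are arbitrary example values, not taken from the lab material:

```python
import math

def harmonic_block(f0, amps, phases, fs, num_samples):
    """Synthesize h[k] = sum_n a_n * sin(2*pi*n*f0*k/fs + phi_n), n = 1..N."""
    block = []
    for k in range(num_samples):
        t = k / fs
        block.append(sum(a * math.sin(2 * math.pi * (n + 1) * f0 * t + phi)
                         for n, (a, phi) in enumerate(zip(amps, phases))))
    return block

# Example: three partials of a 100 Hz tone at fs = 16 kHz.
h = harmonic_block(100.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0], 16000, 320)
# One fundamental period is fs/f0 = 160 samples, so h repeats after 160 samples.
```

Since all partial frequencies are integer multiples of F0, the whole block is periodic with the fundamental period, which is exactly the structure the autocorrelation-based estimator developed below exploits.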

             lower limit    upper limit
    male      75 Hz         150 Hz
    female   125 Hz         250 Hz
    child    160 Hz         —

Table 1: Typical fundamental frequencies in human speech.

Figure 1: Comparison of the biased and unbiased autocorrelation sequences for a periodic signal (part of a vowel of a male speaker).

The simplest way would now be to just use the zero crossings of the signal. But although this method is very efficient, it is not well suited if higher partials have high amplitudes or if the noise component is very strong. Most pitch algorithms are therefore based on other methods; for a simple overview see [1]. In this laboratory we will develop an estimation algorithm based on the autocorrelation [2]. For discrete-time signals the autocorrelation is defined as:

    R_xx[l] = lim_{N→∞} 1/(2N+1) · Σ_{k=−N..N} x[k]·x[k−l]    (4)

where l is the so-called lag. Of course this is the definition for signals of infinite length, but we have already divided our signal into blocks of length N each, so the autocorrelation becomes (in its biased form):

    R_xx[l] = 1/N · Σ_{k=l..N−1} x[k]·x[k−l]    (5)

We only consider positive lags, since the resulting autocorrelation sequence is symmetric around l = 0. Another form of the autocorrelation is the so-called unbiased autocorrelation sequence:

    R_xx[l] = 1/(N−l) · Σ_{k=l..N−1} x[k]·x[k−l]    (6)

The difference is that the unbiased autocorrelation takes the decreasing number of samples involved in the summation into account. When looking at figure 1 we observe the difference between the biased and the unbiased autocorrelation: the biased one tapers off towards high lags. When we compare the autocorrelation equations with our assumption that the signal is periodic with a periodicity T0 = 1/F0 (T0 measured in samples in the discrete-time case):

    x[k] ≈ x[k + m·T0],  m ∈ ℤ    (7)

we see that for such a signal we can expect local maxima of the autocorrelation sequence at lags that are multiples of T0. By finding the maximum of the autocorrelation we thus get an estimate of the fundamental frequency. Note that the autocorrelation function always has a maximum at l = 0, so in order not to erroneously detect the zero lag as the maximum, it is wise to restrict the search to the lags that correspond to the upper and lower limits of the fundamental frequency range under consideration. Also, the global maximum found might not be at the lag corresponding to the true fundamental frequency, but possibly at an integer multiple of it. Furthermore, note that due to this, the estimate can jump between lags in consecutive frames, leading to jumps in the F0 estimate. For a more robust estimation this must be taken into account.

Homework Exercise 1 — Pitch Estimation: Theory

1. Given is the time sequence x[k] = {4, 2, 3,, 5, }. Calculate both the biased and unbiased autocorrelation sequences using pen and paper. Sketch the time sequence and the autocorrelation sequences.
2. Calculate the necessary window length (both in ms and in samples for a sampling frequency of f_s = 16 kHz) for an autocorrelation-based pitch estimator that should detect the typical pitches of human speech as given in table 1.
3. Calculate the minimum and maximum lag in the autocorrelation domain for said estimator for the desired F0 range.
4. What is the relationship between the autocorrelation and the power spectral density (PSD)?
5. Think about strategies to avoid octave jumps and errors in the autocorrelation-based pitch estimation.
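The procedure described above can be made concrete with a plain-Python sketch of the biased and unbiased autocorrelations of equations (5) and (6), together with a bare-bones pitch estimator that searches for the autocorrelation maximum only within the lags allowed by the expected F0 range. This is an illustration, not the lab's reference implementation; the search limits and the test signal are example values:

```python
import math

def autocorr(x, biased=True):
    """Autocorrelation for positive lags, per eq. (5) (biased) / eq. (6) (unbiased)."""
    n = len(x)
    r = []
    for lag in range(n):
        s = sum(x[k] * x[k - lag] for k in range(lag, n))
        r.append(s / n if biased else s / (n - lag))
    return r

def estimate_f0(x, fs, f0_min=75.0, f0_max=500.0):
    """Pick the lag with the largest autocorrelation inside the allowed F0 range."""
    r = autocorr(x, biased=True)
    lag_min = int(fs / f0_max)   # smallest lag <-> highest F0
    lag_max = int(fs / f0_min)   # largest lag <-> lowest F0
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: r[lag])
    return fs / best_lag

# Example: a 200 Hz sine at fs = 16 kHz; the block of 640 samples holds 8 periods.
fs = 16000
x = [math.sin(2 * math.pi * 200.0 * k / fs) for k in range(640)]
print(estimate_f0(x, fs))  # 200.0 (autocorrelation peak at lag 80 = fs/200)
```

Note how excluding lags below lag_min avoids the trivial maximum at l = 0; the octave errors mentioned above would show up here as best_lag landing on 160 or 240 instead of 80 for less clean signals.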

Figure 2: Example of a signal consisting of a harmonic part and a noise part (time sequences and Fourier transforms of the harmonic part, the noise part, and the mixture H+N).

2 Harmonic to Noise Ratio Estimation

We now go back to our signal model of equation (2). As already said, the relationship between the harmonic component h[k] and the noise component n[k] is of interest here. One possible characterization of that relationship is the ratio of the energies of both components, which we will call the harmonic to noise ratio. The same assumption as above applies: the energies of the components, and therefore the HNR, will vary over time, but slowly enough that the HNR can also be assumed constant over a small enough amount of time. If we knew the exact nature of both h[k] and n[k], the HNR would be:

    HNR = ( Σ_{k=0..N−1} h[k]² ) / ( Σ_{k=0..N−1} n[k]² )    (8)

Unfortunately, for a real-world signal neither h[k] nor n[k] is known, so one has to think about how to estimate the HNR. See for example figure 2, where in the mixture of harmonic and noise components no clear distinction between the harmonic and noise parts can be seen, neither in the time-sequence representation nor in the Fourier-transformed representation. Let's start with some basic assumptions that will make life a little bit easier. We assume that h[k] and n[k] are uncorrelated; furthermore, we assume that we already know F0 of h[k] and that n[k] is white Gaussian noise with zero mean. Now we will have a closer look at the autocorrelation and insert equation (2) into (6):

    R_xx[l] = 1/(N−l) · Σ_{k=l..N−1} (h[k] + n[k])·(h[k−l] + n[k−l])    (9)

For l = 0 we would get the energy of the combined signal. Now we look at what happens for l = T0 (by expanding the equation above):

    R_xx[T0] = 1/(N−T0) · ( Σ_{k=T0..N−1} h[k]·h[k−T0] + Σ_{k=T0..N−1} h[k]·n[k−T0]
                          + Σ_{k=T0..N−1} h[k−T0]·n[k] + Σ_{k=T0..N−1} n[k]·n[k−T0] )    (10)

Under the assumptions from above (no correlation, white noise), the last three sums will be approximately zero, which leaves:

    R_xx[T0] ≈ 1/(N−T0) · Σ_{k=T0..N−1} h[k]·h[k−T0]    (11)

We now insert the approximation of equation (7):

    R_xx[T0] ≈ 1/(N−T0) · Σ_{k=T0..N−1} h[k]·h[k]    (12)

and see that the autocorrelation at lag l = T0 is approximately the energy of the harmonic component. Together with R_xx[0] we can now calculate an estimate of the HNR:

    HNR = R_xx[T0] / (R_xx[0] − R_xx[T0])    (13)

We have now found a nice estimate of the HNR that can be implemented very straightforwardly. In the literature many other approaches can be found; feel free to search for different algorithms and get some of their ideas, be they time-domain, time/frequency-transform based, or methods using the cepstrum [3].

Homework Exercise 2 — Harmonic to Noise Ratio: Theory

1. Why can we assume that the last three sums in equation (10) are approximately zero under the stated conditions that the noise is white and that the noise and the harmonic component are uncorrelated?
2. Which autocorrelation should be used for the HNR estimation, the biased or the unbiased? Why?
3. Estimate the HNR for the sequence given in homework part 1 using the calculated autocorrelation and the estimate of equation (13) (Hint: take the position of the first maximum of the autocorrelation as T0). If the result does not seem to be in line with the theory, find an explanation.
4. Search for or think about other possibilities to estimate the HNR.

3 The Experiment

3.1 Matlab based estimation

The Matlab directory contains stubs for the autocorrelation function, the F0-estimation function, and the HNR-estimation function, called autcorr.m, f_estimation.m, and hnr_estimation.m.
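Before turning to the Matlab stubs, the estimate of equation (13) can be seen in executable form in the following Python sketch (illustrative only, not the reference implementation). The test signal and its parameters are made-up example values: a sinusoid with mean power 0.5 plus white Gaussian noise with variance 0.05, so the true HNR is 10:

```python
import math
import random

def unbiased_autocorr(x, lag):
    """R_xx[lag] = 1/(N - lag) * sum_{k=lag}^{N-1} x[k]*x[k-lag], per eq. (6)."""
    n = len(x)
    return sum(x[k] * x[k - lag] for k in range(lag, n)) / (n - lag)

def hnr_estimate(x, period):
    """HNR = R_xx[T0] / (R_xx[0] - R_xx[T0]), with the period T0 given in samples."""
    r0 = unbiased_autocorr(x, 0)
    rt = unbiased_autocorr(x, period)
    return rt / (r0 - rt)

# Example: 100 Hz sinusoid (mean power 0.5) + white noise (variance 0.05)
# at fs = 16 kHz, so the true HNR is 0.5 / 0.05 = 10.
random.seed(0)
fs, f0 = 16000, 100.0
period = round(fs / f0)  # T0 = 160 samples
x = [math.sin(2 * math.pi * f0 * k / fs) + random.gauss(0.0, math.sqrt(0.05))
     for k in range(10 * period)]
print(hnr_estimate(x, period))  # close to 10; the exact value varies with the noise
```

R_xx[T0] picks up the harmonic energy, while R_xx[0] contains harmonic plus noise energy, so their difference estimates the noise energy; the residual cross terms of equation (10) are what makes the estimate fluctuate around the true value.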

Figure 3: Screenshot of the Matlab GUI for comparing the implemented pitch estimation against the given reference.

Furthermore, for the evaluation of the pitch estimation against a given reference, a GUI called APLab_pitch.m exists. A screenshot of the GUI can be seen in figure 3. A similar GUI exists for the HNR estimation, called APLab_hnr.m. The subdirectory audiofiles contains several example audio files, and you can bring your own files. Additionally, the GUIs allow making recordings on the fly.

3.2 Exercises

Lab Experiment 1 — Pitch Estimation: Instructions

1. Implement the autocorrelation of equations (5) and (6) in Matlab and compare the results for different signals to those of the Matlab function xcorr(). If the results differ, find an explanation for the difference.
2. Open f_estimate.m in the Matlab editor.
3. Implement a first version of the F0 estimator based on the comments in f_estimate.m.
4. Compare the results using the APLab_pitch GUI to the results of the reference F0 estimator.
5. Implement a refinement to reduce octave errors and jumps.
6. Compare the results again using the APLab_pitch GUI to the results of the reference F0 estimator.
7. Explain your solution.

Lab Experiment 2 — Harmonic to Noise Ratio Estimation: Instructions

1. Implement the HNR estimation derived in section 2 as a Matlab function. For this, use the already implemented functions for the pitch estimation.
2. Load the files synth_vowel.wav and synth_vowel_2.wav into the Matlab workspace. Both files contain synthetic vowels with the same HNR and the same F0. Calculate the HNR estimates for both signals on the complete items using your implemented HNR estimation (Fs = 16000); if they differ, find an explanation. Note that for this exercise you should not use the APLab_hnr tool. (Hint: plotting the signals for inspection is always a good idea.)
3. Compare the estimate to the reference estimate using the APLab_hnr tool.
4. Both the F0 and HNR estimates are not reliable for certain signal portions, i.e. large variations can be observed. Why is this? And what solutions might be found to overcome this problem? Implement your solution.

References

[1] Wikipedia, "Pitch detection algorithm". [Online]. Available: https://en.wikipedia.org/wiki/Pitch_detection_algorithm
[2] Wikipedia, "Autocorrelation". [Online]. Available: https://en.wikipedia.org/wiki/Autocorrelation
[3] Wikipedia, "Cepstrum". [Online]. Available: https://en.wikipedia.org/wiki/Cepstrum