
Analysis and Synthesis of Expressive Guitar Performance

A Thesis Submitted to the Faculty of Drexel University by Raymond Vincent Migneco in partial fulfillment of the requirements for the degree of Doctor of Philosophy, May 2012

© Copyright 2012 Raymond Vincent Migneco. All Rights Reserved.

Table of Contents

List of Tables
List of Figures
Abstract

1 INTRODUCTION
    1.1 Contributions
    1.2 Overview

2 COMPUTATIONAL GUITAR MODELING
    2.1 Sound Modeling and Synthesis Techniques
        Wavetable Synthesis
        FM Synthesis
        Additive Synthesis
        Source-Filter Modeling
        Physical Modeling
    2.2 Summary and Model Recommendation
    2.3 Synthesis Applications
        Synthesis Engines
        Description and Transmission
        New Music Interfaces

3 PHYSICALLY INSPIRED GUITAR MODELING
    3.1 Overview
    3.2 Waveguide Modeling
        Solution for the Ideal, Plucked-String
        Digital Implementation of the Wave Solution
        Lossy Waveguide Model
        Waveguide Boundary Conditions
        Extensions to the Waveguide Model
    3.3 Analysis and Synthesis Using Source-Filter Approximations
        Relation to the Karplus-Strong Model
        Plucked String Synthesis as a Source-Filter Interaction
        SDL Components
        Excitation and Body Modeling via Commuted Synthesis
        SDL Loop Filter Estimation
        Extensions to the SDL Model

4 SOURCE-FILTER PARAMETER ESTIMATION
    Overview
    Background on Expressive Guitar Modeling
    Excitation Analysis
    Experiment: Expressive Variation on a Single Note
    Physicality of the SDL Excitation Signal
    Parametric Excitation Model
    Joint Source-Filter Estimation
    Error Minimization
    Convex Optimization

5 SYSTEM FOR PARAMETER ESTIMATION
    Onset Localization
    Coarse Onset Detection
    Pitch Estimation
    Pitch Synchronous Onset Detection
    Locating the Incident and Reflected Pulse
    Experiment 1
        Formulation
        Problem Solution
        Results
    Experiment 2
        Formulation
        Problem Solution
        Results
    Discussion

6 EXCITATION MODELING
    Overview
    Previous Work on Guitar Source Signal Modeling
    Data Collection
    Overview
    Approach
    Excitation Signal Recovery
    Pitch Estimation and Resampling
    Residual Extraction
    Spectral Bias from Plucking Point Location
    Estimating the Plucking Point Location
    Equalization: Removing the Spectral Bias
    Residual Alignment
    Component-based Analysis of Excitation Signals
    Analysis of Recovered Excitation Signals
    Towards an Excitation Codebook
    Application of Principal Components Analysis
    Analysis of PC Weights and Basis Vectors
    Codebook Design
    Codebook Evaluation and Synthesis
    Nonlinear PCA for Expressive Guitar Synthesis
    Nonlinear Dimensionality Reduction
    Application to Guitar Data
    Expressive Control Interface
    Discussion

7 CONCLUSIONS
    7.1 Expressive Limitations
    7.2 Physical Limitations
    7.3 Future Directions

Appendix A Overview of Fractional Delay Filters
    A.1 Overview
    A.2 The Ideal Fractional Delay Filter
    A.3 Approximation Using FIR Filters
        A.3.1 Delay Approximation using Lagrange Interpolation Filters
    A.4 Further Considerations

Appendix B Pitch Glide Modeling
    B.1 Overview
    B.2 Pitch Glide Model
    B.3 Pitch Glide Measurement
    B.4 Nonlinear Modeling and Data Fitting
        B.4.1 Nonlinear Least Squares Formulation
        B.4.2 Fitting and Results
    B.5 Implementation

Bibliography
VITA

List of Tables

2.1 Summary of sound synthesis models including their modeling domain and applicable audio signals. Adopted from Vercoe et al. [93].
2.2 Evaluating the attributes of various sound modeling techniques. The boldface tags indicate the optimal evaluation for a particular category.
5.1 Mean and standard deviation of the SNR. The joint source-filter estimation approach was used to obtain parameters for synthesizing the guitar tones based on an IIR loop filter.
5.2 Mean and standard deviation of the SNR. The joint source-filter estimation approach was used to obtain parameters for synthesizing the guitar tones using a FIR loop filter with length N = …
B.1 Pitch glide parameters of Equation B.3 for plucked guitar tones for each guitar string. p, mf and f indicate strings excited with piano, mezzo-forte and forte dynamics, respectively.

List of Figures

3.1 Traveling wave solution of an ideal string plucked at time t = t_1 and its displacement at subsequent time instances t_2, t_3. The string's displacement (solid) at any position is the summation of the two disturbances (dashed) at that position.
3.2 Waveguide model showing the discretized solution of an ideal, plucked string. The upper (y⁺) and lower (y⁻) signal paths represent the right and left traveling disturbances, respectively. The string's displacement is obtained by summing y⁺ and y⁻ at a desired spatial sample.
3.3 Waveguide model incorporating losses due to propagation at the spatial sampling instances. The dashed lines outline a section where M gain and delay blocks are consolidated using a linear time-invariant assumption.
3.4 Plucked-string waveguide model as it correlates to the physical layout of the guitar. Propagation losses and boundary conditions are lumped into digital filters located at the bridge and nut positions. The delay lines are initialized with the string's initial displacement.
3.5 Single delay-loop model (right) obtained by concatenating the two delay lines from a bidirectional waveguide model (left) at the nut position. Losses from the bridge and nut filters are consolidated into a single filter in the feedback loop.
3.6 Plucked string synthesis using the single delay-loop (SDL) model specified by S(z). C(z) and U(z) are comb filters simulating the effects of the plucking point and pickup positions along the string, respectively.
3.7 Components for guitar synthesis including excitation, string and body filters. The excitation and body filters may be consolidated for commuted synthesis.
3.8 Overview of the loop filter design algorithm using short-time Fourier transform analysis on the signal.
4.1 Top: Plucked guitar tones representing various string articulations by the guitarist on the open 1st string (pitch E4). Bottom: Excitation signals for the SDL model associated with each plucking style.
4.2 The output of a waveguide model observed over one period of oscillation. The top figure in each subplot shows the position of the traveling acceleration waves at different time instances. The bottom plot traces out the measured acceleration at the bridge (noted by the x in the top plots) over time.
5.1 Proposed system for jointly estimating the source-filter parameters for plucked guitar tones.
5.2 Pitch estimation using the autocorrelation function. The lag corresponding to the global maximum indicates the fundamental frequency for a signal with f_0 = 330 Hz.
5.3 Overview of residual onset localization in the plucked-string signal. (a): Coarse onset localization using a threshold based on spectral flux with a large frame size. (b): Pitch-synchronous onset detection utilizing a spectral flux threshold computed with a frame size proportional to the fundamental frequency of the string. (c): Plucked-string signal with coarse and pitch-synchronous onsets overlaid.
5.4 Detail view of the attack portion of the plucked-tone signal in Figure 5.3. The pitch-synchronous onset is marked, as well as the incident and reflected pulses from the first period of oscillation.
5.5 Pole-zero and magnitude plots of a string filter S(z) with f_0 = 330 Hz and a loop filter pole located at 0.3. The pole-zero and magnitude plots of the system are shown in (a) and (c); the corresponding plots using an all-pole approximation of S(z) are shown in (b) and (d).
5.6 Analysis and resynthesis of the guitar's 1st string in the open position (E4). Top: Original plucked-guitar tone, residual signal and estimated excitation boundaries. Middle: Resynthesized pluck and excitation using estimated source-filter parameters. Bottom: Modeling error.
5.7 Comparing the amplitude envelopes of synthetic plucked-string tones produced with the parameters obtained from the joint source-filter algorithm against their analyzed counterparts. The tones under analysis were produced by plucking the 1st string at the 2nd fret position (F#4, f_0 = 370 Hz) at piano, mezzo-forte and forte dynamics.
5.8 Comparing the amplitude envelopes of synthetic plucked-string tones produced with the parameters obtained from the joint source-filter algorithm against their analyzed counterparts. The tones under analysis were produced by plucking the 5th string at the 5th fret position (D3) at piano, mezzo-forte and forte dynamics.
6.1 Source-filter model for plucked-guitar synthesis. C(z) is the feed-forward comb filter simulating the effect of the player's plucking position. S(z) models the string's pitch and decay characteristics.
6.2 Front orthographic projection of the bridge-mounted piezoelectric pickup used to record plucked tones. A piezoelectric crystal is mounted on each saddle, which measures pressure during vibration. Guitar diagram obtained from …
6.3 Diagram outlining the residual equalization process for excitation signals.
6.4 Comb filter effect resulting from plucking a guitar string (open E, f_0 = 331 Hz) 8.4 cm from the bridge. (a) Residual obtained from the single delay-loop model. (b) Residual spectrum. Using Equation 6.2, the notch frequencies are approximately located at multiples of 382 Hz.
6.5 Plucked-guitar tone measured using a piezoelectric bridge pickup. Vertical dashed lines indicate the impulses arriving at the bridge pickup. Δt indicates the arrival time between impulses.
6.6 (a) One period extracted from the plucked-guitar tone in Figure 6.5. (b) Autocorrelation of the extracted period. The minimum is marked and denotes the time lag, Δt, between arriving pulses at the bridge pickup.
6.7 Comb filter structures for simulating the plucking point location. (a) Basic structure. (b) Basic structure with a fractional delay filter added to the feedforward path to implement non-integer delay.
6.8 Spectral equalization on a residual signal obtained from plucking a guitar string 8.4 cm from the bridge (open E, f_0 = 331 Hz).
6.9 Excitation signals corresponding to strings excited using a pick (a) and finger (b).
6.10 Average magnitude spectra of signals produced with pick (a) and finger (b).
6.11 Application of principal components analysis to a synthetic data set. The vector v_1 explains the greatest variance in the data while v_2 explains the remaining greatest variance.
6.12 Explained variance of the principal components computed for the set of (a) unwound and (b) wound strings.
6.13 Selected basis vectors extracted from plucked-guitar recordings produced on the 1st, 2nd and 3rd strings.
6.14 Selected basis vectors extracted from plucked-guitar recordings produced on the 4th, 5th and 6th strings.
6.15 Projection of guitar excitation signals into the principal component space. Excitations from strings 1-3 (a) and 4-6 (b).
6.16 Histogram of basis vector occurrences generated with M_top = …
6.17 Excitation synthesis by varying the number of codebook entries: (a) 1 entry, (b) 10 entries, (c) 50 entries.
6.18 Computed signal-to-noise ratio when increasing the number of codebook entries used to reconstruct the excitation signals.
6.19 Architecture for an autoassociative neural network.
6.20 Top: Projection of excitation signals into the space defined by the first two linear principal components. Bottom: Projection of the linear PCA weights along the axis defined by the bottleneck layer of the trained ANN.
6.21 Guitar data projected along orthogonal principal axes defined by the ANN (center). Example excitation pulses resulting from sampling this space are also shown.
6.22 Tabletop guitar interface for the component-based excitation synthesis. The articulation is applied in the gradient rectangle, while the colored squares allow the performer to key in specific pitches.
A.1 Impulse responses of an ideal shifting filter when the sample delay assumes an integer (top) and non-integer (bottom) number of samples.
A.2 Lagrange interpolation filters with order N = 3 (top) and N = 7 (bottom) to provide a fractional delay d_F = 0.3. As the order of the filter is increased, the Lagrange filter coefficients near the values of the ideal function.
A.3 Frequency response characteristics of Lagrange interpolation filters with order N = 3, 5, 7 to provide a fractional delay d_F = 0.3. Magnitude (top) and group delay (bottom) characteristics are plotted.
B.1 Measured and modeled pitch glide for forte plucks.
B.2 Measured and modeled pitch glide for piano, mezzo-forte and forte plucks.
B.3 Single delay-loop waveguide filter with variable fractional delay filter, H_F(z).

Abstract
Analysis and Synthesis of Expressive Guitar Performance
Raymond Vincent Migneco
Advisor: Youngmoo Edmund Kim, Ph.D.

The guitar is one of the most popular and versatile instruments used in Western music cultures. Dating back to the Renaissance era, the guitar can be heard in nearly every genre of Western music, and is arguably the most widely used instrument in present-day rock music. Over the span of 500 years, the guitar has developed a multitude of performance and compositional styles associated with nearly every musical genre, such as classical, jazz, blues and rock. This versatility can be largely attributed to the relatively simplistic nature of the instrument, which can be built from a variety of materials and optionally amplified. Furthermore, the flexibility of the instrument allows performers to develop unique playing styles, which reflect how they articulate the guitar to convey certain musical expressions.

Over the last three decades, physical- and physically-inspired models of musical instruments have emerged as a popular methodology for modeling and synthesizing various instruments, including the guitar. These models are popular since their components relate to the actual mechanisms involved with sound production on a particular instrument, such as the vibration of a guitar string. Since the control parameters are physically relevant, they have a variety of applications, including control and manipulation of virtual instruments. The focus of much of the literature on physical modeling for guitars is concerned with calibrating the models from recorded tones to ensure that the behavior of real strings is captured. However, far less emphasis is placed on extracting parameters that pertain to the expressive styles of the guitarist.

This research presents techniques for the analysis and synthesis of plucked guitar tones that are capable of modeling the expressive intentions applied through the guitarist's articulation during performance. A joint source-filter estimation approach is developed to account for the performer's articulation and the corresponding resonant string response. A data-driven, statistical approach for modeling the source signals is also presented in order to capture the nuances of particular playing styles. This research has several pertinent applications, including the development of expressive synthesizers for new musical interfaces and the characterization of performance through audio analysis.


CHAPTER 1: INTRODUCTION

The guitar is one of the most popular and versatile instruments used in Western music cultures. Dating back to the Renaissance period, it has been incorporated into nearly every genre of Western music and, hence, has a rich tradition of design and performance techniques pertaining to each genre. From a cultural standpoint, musicians and non-musicians alike are captivated by the performances of virtuoso guitarists past and present, who introduced innovative techniques that defined or redefined the way the instrument was played. This deep appreciation is no doubt related to the instrument's adaptability, as it is recognized as a primary instrument in many genres, such as blues, jazz, folk, country and rock.

The guitar's versatility is inherent in its simple design, which has allowed it to be adapted for use in multiple musical genres. The basic components of any guitar consist of a set of strings mounted across a fingerboard and a resonant body to amplify the vibration of the strings. The tension on each string is adjusted to achieve a desired pitch when the string is played. Particular pitches are produced by clamping down each string at a specific location along the fingerboard, which changes the effective length of the string and, thus, the associated pitch when it is plucked. Frets, which are metallic strips spanning the width of the fingerboard, are usually installed on the fingerboard to exactly specify the location of notes in accordance with an equal-tempered division of the octave.

The basic design of the guitar has been augmented in a multitude of ways to satisfy the demands of different performers and musical genres. For example, classical guitars are strung with nylon strings, which can be played with the fingers or nails, and have a wide fingerboard to permit playing scales and chords with minimal interference from adjacent strings. Often a solo instrument, the classical guitar requires a resonant body for amplification, where the size and materials of the body are chosen to achieve a specific timbre. On the other hand, country and folk guitarists prefer steel strings, which generally produce brighter tones. Electric guitars are designed to accommodate the demands of guitarists performing rock, blues and jazz music. These guitars are outfitted with electromagnetic pickups, where string vibration induces an electrical current that can be processed to apply certain effects (e.g. distortion, reverberation) and eventually amplified. The role of the body is less important for electric guitars (although guitarists argue that it affects the instrument's timbre), and the body is generally thinner to increase comfort during performance.

When the electric guitar is outfitted with light gauge strings, it facilitates certain techniques such as pitch-bending and legato, which are more difficult to perform on acoustic instruments.

Though the guitar can be designed and played in different ways to achieve a vast tonal palette, the underlying physical principles of vibrating strings are constant for each variation of the instrument. Consequently, a popular topic among musicians and researchers is the development of quantitative guitar models that simulate this behavior. Physical- and physically-inspired models of musical instruments have emerged as a popular methodology for this task. The lure of these models is that they simulate the physical phenomena responsible for sound production in instruments, such as a vibrating string or air in a column, and produce high-quality synthetic tones. Properly calibrating these models, however, remains a difficult task and is an on-going topic in the literature. Several guitar synthesizers have been developed using physically-inspired models, such as waveguide synthesis and the Karplus-Strong algorithm.

In the last decade, there has been considerable interest in digitally modeling analog guitar components and effects using digital signal processing (DSP) techniques. This work is highly relevant to the consumer electronics industry since it promises low-cost, digital clones of vintage, analog equipment. The promise of these devices is to help musicians consolidate their analog equipment into a single device or acquire the specific tones and capabilities of expensive and/or discontinued equipment at lower cost. Examples of products designed using this technology include Line6 modeling guitars and amplifiers, where DSP is used to replicate the sounds of well-known guitars and tube-based amplifiers [45, 46].

Despite the large amount of research focused on digitally modeling the physics of the guitar and its associated effects, there has been relatively little research that analyzes the expressive attributes of guitar performance. The current research is mainly concerned with implementing specific performance techniques into physical models based on detailed physical analysis of the performer-instrument interaction. However, there is a void in the research for guitar modeling and synthesis that is concerned with measuring physical and expressive data from recordings. Obtaining such data is essential for developing an expressive guitar synthesizer; that is, a system that not only faithfully replicates guitar timbres, but is also capable of simulating expressive intentions used by many guitarists.

1.1 Contributions

This dissertation proposes analysis and synthesis techniques for plucked guitar tones that are capable of modeling the expressive intentions applied through the guitarist's articulation during performance. Specifically, the expression analyzed through recorded performance focuses on how the articulation was applied through plucking mechanism and strength. The main contributions of this research are summarized as follows:

- Generated a data set of plucked guitar tones comprising variations of the performer's articulation, including the plucking mechanism and strength, which spans all of the guitar's strings and several fretting positions.
- Developed a framework for jointly estimating the source and filter parameters for plucked-guitar tones based on a physically-inspired model.
- Proposed and demonstrated a novel application of principal components analysis to model the source signal for plucked guitar tones in order to encapsulate characteristics of various string articulations.
- Utilized nonlinear principal components analysis to derive an expressive control space for synthesizing excitation signals corresponding to guitar articulations.

The analysis and synthesis techniques proposed here are based on physically inspired models of plucked-guitar tones. These types of models are chosen because they have great potential for analyzing and synthesizing expressive performance: their operation has a strong physical analog to the process of exciting a string; that is, an impulsive force excites a resonant string response. These advantages are in contrast to other modeling techniques, such as frequency modulation (FM), additive and spectral modeling synthesis, which are often used for music synthesis tasks but lack easily controlled parameters that relate to how an instrument is excited (e.g. bowing, picking). Physical models, on the other hand, relate to the initial conditions of a plucked string and possible variations which produce unique tones when applied to the model. This is intuitive, considering guitarists affect the same physical variables when plucking a string.

The proposed method for deriving the parameters relating to expressive guitar performance is based on a joint source-filter estimation framework. The motivation to implement the estimation in a joint source-filter framework is two-fold. Foremost, musical expression results from an interaction between the performer and the instrument, and estimating the expressive attributes of performance requires accounting for the simultaneous variation of source and filter parameters.

For the specific case of the guitar, the performer can be seen as imparting an articulation (i.e. excitation) on the string (i.e. filter), which has a resonant response to the performance input. The second reason for this approach is to facilitate the estimation of the source and filter parameters, which is typically accomplished in two separate tasks.

Building off the joint parameter estimation scheme, component-based analysis is applied to the source (i.e. excitation) signals obtained from recorded performance. Existing modeling techniques treat the excitation signal as a separate entity saved off-line to model a specific articulation, but in doing so provide no mechanism to quantify or manipulate the excitation signal. The application of component analysis is a data-driven, statistical approach used to represent the nuances of specific articulations through linear combinations of component vectors or functions. Using this representation, the articulations can be visualized in the component space, and dimensionality reduction is applied to yield an expressive synthesis space that offers control over specific characteristics of the data set.

The proposed guitar modeling techniques presented in this dissertation have many potential applications for music analysis and synthesis tasks. Analyzing the source-filter parameters derived from the recordings of many guitarists could lead to the development of quantitative models of guitar expression and a deeper understanding of expression during performance. The application of the estimated parameters using the proposed techniques can expand upon the sonic and expressive capabilities of current synthesizers, which often rely on MIDI or wavetable samples to replicate the tone with little or no expressive control. During the advent of computer music, limited computational power was a major constraint when implementing synthesis algorithms, but this is now much less of a concern given the capabilities of present-day computers and mobile devices. These advances in technology have provided new avenues for interacting with audio through gesture-based technologies. The guitar analysis and synthesis techniques presented in this dissertation can be harnessed along with these technologies to create new experiences for musical interaction.

1.2 Overview

As computational modeling for plucked guitars is the basis of this thesis, Chapter 2 overviews various approaches for modeling and synthesizing musical sounds.

These approaches include wavetable synthesis, spectral modeling, FM synthesis, physical modeling and source-filter modeling. The strengths and weaknesses of each model are evaluated and, based on our assessment, a recommendation is made to base the techniques proposed in this dissertation on a source-filter approximation of physical guitar models.

Physical and source-filter models, which digitally implement the behavior of a vibrating string due to an external input, are discussed in detail in Chapter 3. The so-called waveguide model, which is based on a digital implementation of the d'Alembert solution for describing traveling waves on a string, is introduced, as well as a source-filter approximation of this model.

Chapter 4 presents an approach for capturing the expression contained in specific string articulations via the source signal from a source-filter model. The physical relation of this source signal to the waveguide model is highlighted, and it is suggested that a parametric model can be used to capture the nuances of the articulations. The joint estimation of the source and filter models is proposed by finding parameters that minimize the error between the analyzed recording and the synthetic signal. This constrained least squares problem is solved using convex optimization. The implementation of this approach and results are discussed in Chapter 5.

In Chapter 6, principal components analysis (PCA) is applied to a corpus of excitation signals derived from recorded performance. In this application, PCA models each excitation signal as a linear combination of basis functions, where each function contributes to the expressive attributes of the data. We show that a codebook of relevant basis functions can be extracted which describes particular articulations where the plucking device and strength are varied. Furthermore, using the components as features, we show that nonlinear PCA (NLPCA) can be applied for dimensionality reduction, which helps visualize the expressive attributes of the data set. This mapping is reversible, so the reduced-dimensional space can be used as an expressive synthesizer, using the linear basis functions to reconstruct the excitation signals. This chapter also deals with the pre-processing steps required to remove biases from the recovered signals, including the effect of the guitarist's plucking position along the string.

The conclusions from this dissertation are presented in Chapter 7, which includes the limitations and future avenues to explore.

CHAPTER 2: COMPUTATIONAL GUITAR MODELING

A number of techniques are available for the computational modeling and synthesis of guitar tones, each with entirely different approaches for capturing its acoustic attributes. This chapter will provide an overview of the sound models most commonly applied to guitar tones, including their computational basis, strengths and weaknesses. For detailed treatment of these techniques, the reader is referred to extensive overviews provided by [1] and [89]. The analysis of each synthesis technique will also be used to justify the source-filter modeling approach used throughout this dissertation. Finally, this chapter will discuss pertinent applications of computational synthesis of guitar tones.

2.1 Sound Modeling and Synthesis Techniques

Wavetable Synthesis

In many computer music applications, wavetable synthesis is a viable means for synthetically generating musical sounds with low computational overhead. A wavetable is simply a buffer that stores the periodic component of a recorded sound, which can be looped repeatedly. As musical sounds vary in pitch and duration, signal processing techniques are required to modify the synthetic tones from a wavetable sample. Pitch shifting is achieved by interpolating the samples in the wavetable, where a decrease or increase in pitch is achieved by interpolating the wavetable samples down or up, respectively. A problem with interpolation in wavetable synthesis is that excessive interpolation of a particular wavetable sample can result in synthetic tones that sound unnatural, since interpolation alters the length of the synthetic signal. To overcome this limitation, multi-sampling is used, where several samples of an instrument are used and these samples span the pitch range of the instrument. Interpolation can now be used between the reference samples without excessive degradation to the synthetic tone, which is preferred to storing every possible pitch the instrument can produce. Multi-sampling can also be used to incorporate different levels of dynamics, or relative loudness, into the system as well. Beyond interpolation, digital filters can be used to adjust the spectral properties (e.g. brightness) of the wavetable samples.
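The lookup-and-interpolate mechanism can be made concrete with a minimal Python sketch; the function name, the sawtooth table and all parameter values below are illustrative choices, not material from the original text.

    import numpy as np

    def wavetable_oscillator(table, f0, fs, duration):
        """Loop a single-cycle wavetable at pitch f0, reading it with
        linear interpolation between adjacent stored samples."""
        n = np.arange(int(duration * fs))
        # The fractional read position advances by len(table) * f0 / fs per sample.
        phase = (n * len(table) * f0 / fs) % len(table)
        i0 = phase.astype(int)
        i1 = (i0 + 1) % len(table)      # wrap around the end of the table
        frac = phase - i0
        return (1 - frac) * table[i0] + frac * table[i1]

    # One cycle of a sawtooth stands in for a stored instrument sample.
    table = np.linspace(-1.0, 1.0, 512, endpoint=False)
    tone = wavetable_oscillator(table, f0=330.0, fs=44100, duration=0.5)

Changing f0 re-pitches the same stored cycle, which is exactly the operation that degrades quality when pushed too far and motivates the multi-sampling described above.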

The computational costs of wavetable synthesis are fairly low, and the main restriction is the amount of memory available to store samples. The sound quality in these systems can be quite good as long as there is not excessive degradation from modification. However, wavetable synthesis has no true modeling basis (i.e. sinusoidal, source-filter) and is rather ad-hoc in its approach. Also, its flexibility in modeling and synthesis is restricted by the samples available to the synthesizer.

FM Synthesis

Frequency modulation (FM) synthesis is often used to simulate characteristics of sounds that cannot be produced with linear time-invariant (LTI) models. A FM oscillator is one such way of achieving these sounds, and it operates by modulating the base frequency of a signal with another signal. A simple FM oscillator is given by

    y(t) = A_c sin(2πtf_c + Δf_c cos(2πtf_m))    (2.1)

where A_c and f_c are the amplitude and frequency of the carrier signal, respectively, f_m is the modulating frequency and Δf_c is the maximum difference between f_c and f_m. The spectrum of the resulting signal y(t) contains a peak located at the carrier frequency and sideband frequencies located at plus and minus integer multiples of f_m. When the ratio of the carrier to the modulating frequency is non-integer, FM synthesis creates an inharmonic spectrum where the frequency spacing between the partials is not constant. This is useful for modeling the spectra of certain musical sounds, such as strings and drums, which exhibit inharmonic behavior.

FM synthesis is a fairly computationally efficient technique and can be easily implemented on a microprocessor, which makes it attractive for commercially available synthesizers. Due to the nonlinearity of the FM oscillator, it is capable of producing timbres not possible with other synthesis methods. However, there is no automated approach for matching the synthesis parameters to an acoustic recording [8]. Rather, the parameters must be tweaked by trial and error and/or using perceptual evaluation.
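As a quick illustration of Equation 2.1, the sketch below renders a short FM tone; the parameter values are arbitrary, with a non-integer f_c/f_m ratio chosen so the resulting spectrum is inharmonic, as described above.

    import numpy as np

    fs = 44100
    t = np.arange(int(0.5 * fs)) / fs       # 0.5 s time base
    A_c, f_c = 1.0, 330.0                   # carrier amplitude and frequency
    f_m, delta_fc = 140.0, 220.0            # modulator frequency and deviation term
    # Equation 2.1: the carrier phase is modulated by a cosine at f_m.
    y = A_c * np.sin(2 * np.pi * t * f_c + delta_fc * np.cos(2 * np.pi * t * f_m))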

Additive Synthesis

Additive, or spectral modeling, synthesis is a sound modeling and synthesis approach based on characterizing the spectra of musical sounds and modeling them appropriately. Sound spectra categories typically consist of harmonic, inharmonic, noise or mixed spectra. Analysis via the additive synthesis approach typically entails performing a short-time analysis on the signal to divide it into relatively short frames, where the signal is assumed to be stationary within each frame. In the spectral modeling synthesis technique proposed by Serra and Smith, the sinusoidal, or deterministic, parts of the spectrum within each frame are identified and modeled using amplitude, frequency and phase. The sound can be re-synthesized by interpolating between the deterministic components of each frame to generate a sum of smooth, time-varying sinusoids. The noise-like, or stochastic, parts of the spectrum can be obtained by subtracting the synthesized, deterministic component from the original signal [68].

There are several benefits to synthesizing musical sounds via additive synthesis. Foremost, the model is very general and can be applied to a wide range of signals, including polyphonic audio and speech [5, 68]. Also, the separation of the deterministic and stochastic components permits flexible modification of signals, since the sinusoidal parameters are isolated within the spectrum. For example, pitch and time-scale modification can be achieved independently or simultaneously by shifting the frequencies of the sinusoids and altering the interpolation time between successive frames. This leads to synthetic tones that sound more natural and can be extended indefinitely, unlike wavetable interpolation. A problem with additive synthesis is that transient events present in an analyzed signal are often too short to be adequately modeled by sinusoids and must be accounted for separately. This is problematic especially for signals with a percussive attack, such as plucked strings. It is also unclear how to modify the sinusoids in order to achieve certain effects related to the perceived dynamics of a musical tone.
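A minimal sketch of the deterministic (sinusoidal) resynthesis just described, assuming each analysis frame supplies (amplitude, frequency) pairs per partial; the data layout and function name are illustrative, not from the original text.

    import numpy as np

    def additive_resynth(frames, hop, fs):
        """Resynthesize partials from per-frame (amplitude, freq_hz) pairs,
        interpolating parameters linearly between analysis frames."""
        amps = np.array([[a for a, _ in fr] for fr in frames])   # (n_frames, n_partials)
        freqs = np.array([[f for _, f in fr] for fr in frames])
        t_frame = np.arange(len(frames)) * hop
        t_out = np.arange(hop * (len(frames) - 1))
        y = np.zeros(len(t_out))
        for p in range(amps.shape[1]):
            a = np.interp(t_out, t_frame, amps[:, p])
            f = np.interp(t_out, t_frame, freqs[:, p])
            phase = 2 * np.pi * np.cumsum(f) / fs   # integrate frequency -> smooth phase
            y += a * np.sin(phase)
        return y

    # Two partials gliding upward roughly a whole tone over one second.
    frames = [[(0.8, 220.0), (0.4, 440.0)],
              [(0.8, 247.0), (0.4, 494.0)]]
    y = additive_resynth(frames, hop=44100, fs=44100)

Because each partial's frequency track is integrated into a continuous phase, the tone can be stretched simply by increasing the hop, which is the time-scale flexibility noted above.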

Source-Filter Modeling

Analysis and synthesis via source-filter models involves using a complex sound source, such as an impulse or periodic impulse train, to excite a resonant filter. The filter includes the important perceptual characteristics of the sound, such as the overall spectral tilt and the formants, or resonances, characteristic to the sound. When such a filter is excited by an impulse train, for example, the resonant filter is sampled at regular intervals in the spectrum as defined by the frequency of the impulse train.

Source-filter models are attractive because they permit the automated analysis of the resonant characteristics through either time or frequency domain based techniques. One of the most well-known examples of this is linear prediction, which entails predicting a sample of a signal based on a linear combination of past samples of that signal:

    x̂(n) = Σ_{p=1}^{P} α_p x(n − p)    (2.2)

where α_1, α_2, …, α_P are the prediction coefficients to be estimated from the recording [6]. When a fairly low prediction order P is used, the prediction coefficients yield an all-pole filter that approximates the spectral shape, including resonances, of the analyzed sound. Computationally efficient techniques, such as the autocorrelation and covariance methods, are available for estimating the filter parameters as well.

A significant advantage of source-filter models is that they approximate musical sounds as the output of a linear time-invariant (LTI) system. Therefore, using the estimated resonant filter, the source signal for the model can be recovered through an inverse filtering operation. Analysis of the recovered source signals provides insight into the expression used to produce the sound for the case of musical instruments. Also, source signals derived from certain signals can be used to excite the resonant filters from others, thus permitting cross-synthesis for generating new and interesting sounds. As will be discussed in Chapter 3, source-filter models have a close relation to physical models of musical instruments.

Despite the advantages of source-filter models, they have certain limitations. Namely, as they are based on LTI models, they cannot model the inherent nonlinearities found in real musical instruments. For example, tension modulation in real strings alters the spectral characteristics in a time-varying manner, while source-filter models have fixed fundamental frequencies.
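A minimal sketch of the autocorrelation method mentioned above, assuming SciPy is available; the Toeplitz normal equations yield the coefficients of Equation 2.2, and inverse filtering with A(z) = 1 − Σ α_p z^(−p) recovers the source (residual). The white-noise input merely stands in for a recorded tone.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc_autocorr(x, order):
        """Solve the autocorrelation normal equations for the prediction
        coefficients alpha_1 ... alpha_P of Equation 2.2."""
        r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[0], r[1], ...
        return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

    x = np.random.randn(2048)                  # stand-in for a recorded signal
    alpha = lpc_autocorr(x, order=12)
    # Inverse filtering with A(z) recovers the model's source signal.
    residual = lfilter(np.concatenate(([1.0], -alpha)), [1.0], x)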

Physical Modeling

Physical modeling systems aim to model the behavior of systems using physical variables such as force, displacement, velocity and acceleration. Physical systems describing sound can range from musical interactions, such as striking a drum or string, to natural sounds, such as wind and rolling objects. An example physical system for a musical interaction consists of releasing a string from an initial displacement. The solution to this system is discussed extensively in Chapter 3, but involves computing the infinitesimal forces acting on the string as it is released, which results in a set of differential equations describing the motion of the string with respect to time and space. The digital implementation of physical models for sound can be achieved in a number of ways, including modal decomposition, digital waveguides and wave digital filters, to name a few [89].

While physical models are capable of high quality synthesis of acoustic instruments, developing models of these systems is often a difficult task. Taking the plucked string as an example, a complete physical description requires knowledge of the string, including its material composition and how it interacts with the boundary conditions at its termination points, which includes fricative forces acting on the string as it travels. Furthermore, there may be coupling forces acting between the string and the excitation mechanism (e.g. the player's finger), which should be included as well. For these reasons, the physical system must be known a priori, and it cannot be calibrated directly through audio analysis.

2.2 Summary and Model Recommendation

Table 2.1 summarizes the sound modeling techniques presented above by comparing their modeling domains and the range of musical signals that can be produced using each method. The vertical ordering is indicative of the underlying basis and/or structure of the model types. For example, wavetable synthesis is a rather ad-hoc approach without a true computational basis, while FM synthesis is based on modulating sinusoids. Additive synthesis and source-filter models have a strict modeling basis using sinusoids plus noise and source-filter parameters, respectively. Physical models are most closely related to musical instruments since they deal with related physical quantities and interactions. As a model's parameter domain becomes more general, a greater range of sounds can be synthesized with more control over their properties (i.e. pitch, timbre, articulation).

Based on the discussion in Section 2.1, the strengths and weaknesses of each model are evaluated on a scale (Low, Moderate, High) as they pertain to four categories:

1. Computational complexity required for implementation
2. The resulting sound quality when the model is used for sound synthesis of guitar tones
3. The difficulty required to calibrate the model in accordance with acoustic samples
4. The degree of expressive control afforded by the model

Table 2.1: Summary of sound synthesis models including their modeling domain and applicable audio signals. Adopted from Vercoe et al. [93].

Sound Model   | Parameter Domain                                            | Acoustic Range
Wavetable     | sound samples, manipulation filters                         | discrete pitches, isolated sound events
FM            | carrier and modulating frequencies                          | sounds with harmonic and inharmonic spectra
Additive      | noise sources, time-varying amplitude, frequency and phase | sounds with harmonic, inharmonic, noisy or mixed spectra
Source-Filter | excitation signal, filter parameters                        | voice (speech, singing), plucked-string or struck instruments
Physical      | physical quantities (length, stiffness, position, etc.)    | plucked, struck, bowed or blown instruments

Table 2.2: Evaluating the attributes of various sound modeling techniques. The boldface tags indicate the optimal evaluation for a particular category.

Sound Model   | Computational Complexity | Sound Quality | Calibration Difficulty | Expressive Control
Wavetable     | Low      | High     | High     | Low
FM            | Low      | Moderate | High     | Low
Additive      | Moderate | High     | Moderate | Moderate
Source-Filter | Moderate | High     | Moderate | High
Physical      | High     | High     | High     | Moderate

Table 2.2 shows the results of this evaluation in accordance with the four categories presented above. The model(s) earning the best evaluation for each category are highlighted in boldface font for emphasis. It should be noticed that, in general, the computational complexity of the models increases in accordance with the associated model parameter domain in Table 2.1. That is, as the parameters become more general, they are more difficult to implement and harder to calibrate. For truly flexible and expressive algorithmic synthesis, the additive, source-filter and physical models offer the best of all categories. While the additive model provides good sound quality and flexible synthesis (especially with regard to pitch and time shifting), the sinusoidal basis does not allow the performer's input to be separated from the instrument's response. Physical models provide this separation, but are difficult to calibrate, especially from a recording, since the physical configuration of the instrument's components and the performer's interaction are generally not known a priori.

Of the remaining models, the source-filter model provides the greatest appeal due to its inherent simplicity, especially as it pertains to modeling the performer's articulation, its relative ease of calibration and the expressive control it affords.

2.3 Synthesis Applications

The techniques for modeling plucked-guitar tones presented in this thesis are applicable to a number of sound synthesis tasks. This section will highlight a few such tasks to provide a larger perspective on the benefits of computational guitar modeling.

Synthesis Engines

There are numerous systems available which encompass a variety of computational sound models for the creation of synthetic audio. One such system is CSound, an audio programming language created by Vercoe et al. based on the C language [92]. CSound offers the implementation of several synthesis algorithms, including general filtering operations, additive synthesis and linear prediction. The Synthesis ToolKit (STK) is another system, created by Cook and Scavone, which adopts a hierarchical approach to sound modeling and synthesis using an open-source application programming interface based on C++ [11]. STK handles low-level, core sound synthesis via unit generators, which include envelopes, oscillators and filters. High-level synthesis routines encapsulate physical modeling algorithms for specific musical instruments, FM synthesis, additive synthesis and other routines.

Description and Transmission

Computational modeling of musical instruments, especially the guitar, is highly applicable in systems requiring generalized audio description and transmission. The MPEG-4 standard is perhaps the most well-known codec (compressor-decompressor) for transmission of multimedia data. However, the compression of raw audio, even using the perceptual codec found in mp3, leaves little or no control over the sound at the decoder. To expand the parametric control of compressed audio, the MPEG-4 standard includes a descriptor for so-called Structured Audio, which permits the encoding, transmission and decoding of audio using highly structured descriptions of sound [21, 66, 93].

The audio descriptors can include high-level performance information for musical sounds, such as pitch, duration, articulation and timbre, and low-level descriptions based on the models (e.g. source-filter, additive synthesis) used to generate the sounds. It should be noted that the structured audio descriptor does not attempt to standardize the model used to parameterize the audio, but provides a means for describing the synthesis method(s), which keeps the standard flexible. The level of description provided by structured audio differentiates it from other formats such as pulse-code modulated audio or mp3, which do not provide contextual descriptions, and MIDI (musical instrument digital interface), which provides contextual description but lacks timbral or expressive descriptors. In essence, structured audio provides a flexible and descriptive language for communicating with synthesis engines.

New Music Interfaces

Computer music researchers have long sought to develop new interfaces for musical interaction. Often, these interfaces deviate from the traditional notion in which an instrument is played in order to appeal to non-musicians or enable entirely new ways of interacting with sound. For the guitar, Karjalainen et al. developed a virtual air guitar where the performer's hands are tracked using motion-sensing gloves [26]. The guitar tones are produced algorithmically using waveguide models in response to gestures made by the performer. More recently, commercially available gesture and multitouch technologies have been used for music creation. The limitation of these systems, however, is that their audio engines utilize sample-based synthesizers and provide little or no parametric control over the resulting sound [2, 55].

The plucked-guitar modeling techniques presented in this dissertation are applicable to each of the sound synthesis areas outlined above. The source and filter parameters extracted from recordings can be used for low bit-rate transmission of audio and are based on algorithms (source-filter) that are either available in many synthesis packages or easily implemented on present-day hardware. Given the computational power available in present-day computers and mobile devices, the analysis techniques and algorithms presented here can be harnessed into applications for new musical interfaces as well.

CHAPTER 3: PHYSICALLY INSPIRED GUITAR MODELING

3.1 Overview

For the past two decades, physically-inspired modeling systems have emerged as a popular method for simulating plucked-string instruments, since they are capable of producing high-quality tones with computationally efficient implementations. The emergence of these techniques was due, in part, to the innovations of the Karplus-Strong algorithm, which simulated plucked-string sounds using a simple and efficient model that was later shown to approximate the physical phenomena of traveling waves on a string [22, 3, 31, 72, 89]. Direct physical modeling of a musical instrument aims to simulate, with a digital model, the behavior of the particular elements responsible for sound production (e.g. a vibrating string or resonant air column) due to the musician's interaction with the instrument (e.g. plucking or breath excitation) [89]. This chapter will briefly overview waveguide techniques for guitar synthesis, which directly model the traveling wave solution resulting from a plucked string. A related model, known as the single delay-loop, is also discussed, which is utilized for the analysis and synthesis tasks presented in this thesis.

3.2 Waveguide Modeling

Directly modeling the complex vibration of guitar strings due to the performer-instrument interaction is a difficult problem. However, by using simplified models of plucked strings, waveguide models offer an intuitive understanding of string behavior and lead to practical and efficient implementations [72]. In this section, the well-known traveling wave solution for ideal, plucked strings is presented [33]. This general solution is then discretized and digitally implemented, as shown by Smith, to constitute a digital waveguide model [72]. Common extensions to the waveguide model are also presented, which correspond to non-ideal string conditions.
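For reference, the Karplus-Strong algorithm mentioned above can be stated in a few lines: a delay line initialized with noise, with a two-point average and a loss factor in the feedback path. This is a generic textbook sketch, not code from the original work.

    import numpy as np

    def karplus_strong(f0, fs, duration, loss=0.996):
        """Plucked-string tone from a noise-initialized delay line whose
        feedback path applies a two-point averaging (lowpass) filter."""
        D = int(round(fs / f0))                # loop delay in samples (integer only)
        buf = np.random.uniform(-1.0, 1.0, D)  # 'pluck': random initial string state
        out = np.empty(int(duration * fs))
        for n in range(len(out)):
            out[n] = buf[n % D]
            buf[n % D] = loss * 0.5 * (buf[n % D] + buf[(n + 1) % D])
        return out

    tone = karplus_strong(f0=330.0, fs=44100, duration=1.0)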

Solution for the Ideal, Plucked-String

The behavior of a vibrating string is understood by deriving and solving the well-known wave equation for an ideal, lossless string. The full derivation of the wave equation is documented in several physics texts [33, 52] and is obtained by computing the tension differential across a curved section of string with infinitesimal length. This tension is balanced at all times by an inertial restoring force due to the string's transverse acceleration. The wave equation is expressed as [33]

    K_t y″ = εÿ    (3.1)

where K_t and ε are the string's tension and linear mass density, respectively, and y = y(t, x) is the string's transverse displacement at a particular time instant, t, and location along the string, x. The curvature of the string is indicated by y″ ≜ ∂²y(t, x)/∂x² and its transverse acceleration is given by ÿ ≜ ∂²y(t, x)/∂t². The general solution to the wave equation is given by [33]

    y(t, x) = y_r(t − x/c) + y_l(t + x/c),    (3.2)

where y_r and y_l are functions that describe the right and left traveling components of the wave, respectively, and c is the wave speed, which is a constant determined by √(K_t/ε). It should be noted that y_r and y_l are arbitrary functions of arguments (ct − x) and (ct + x), and it can be verified that substituting any twice-differentiable function with these arguments for y(t, x) will satisfy Equation 3.1 [33, 72].

Equation 3.2 indicates that the wave solution can be represented by two functions, each depending on a time and a spatial variable. This notion becomes clear by analyzing an ideal, plucked string at a few instances after its initial displacement, as shown in Figure 3.1. After the string is released, its total displacement is obtained by summing the amplitudes of the right- and left-traveling wave shapes, which propagate away from the plucking position, along the entire length of the string.
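A quick verification of this claim for the right-traveling term: differentiating twice gives

    ∂²/∂x² [y_r(t − x/c)] = (1/c²) y_r″(t − x/c),    ∂²/∂t² [y_r(t − x/c)] = y_r″(t − x/c),

so Equation 3.1 reduces to (K_t/c²) y_r″ = ε y_r″, which holds exactly when c² = K_t/ε, i.e. c = √(K_t/ε). The same argument applies to the left-traveling term.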

Figure 3.1: Traveling wave solution of an ideal string plucked at time t = t_1 and its displacement at subsequent time instances t_2, t_3. The string's displacement (solid) at any position is the summation of the two disturbances (dashed) at that position.

Digital Implementation of the Wave Solution

As demonstrated in Figure 3.1, the traveling wave solution has both time and spatial dependencies, which must be discretized to digitally implement Equation 3.2. Temporal sampling is achieved by employing a change of variable in Equation 3.2 such that t_n = nT_s, where T_s is the audio sampling interval. The wave's position is discretized by setting x_m = mX, where X = cT_s, such that the waves are sampled at a fixed spatial interval along the string. Substituting t and x with t_n and x_m in Equation 3.2 yields [72]:

    y(t_n, x_m) = y_r(t_n − x_m/c) + y_l(t_n + x_m/c)    (3.3)
                = y_r(nT_s − mX/c) + y_l(nT_s + mX/c)    (3.4)
                = y_r((n − m)T_s) + y_l((n + m)T_s)      (3.5)

Since all arguments are multiplied by T_s, it is suppressed, and the terms corresponding to the right and left traveling waves can be simplified to [72, 89]:

    y⁺(n) ≜ y_r(nT_s),    y⁻(n) ≜ y_l(nT_s)    (3.6)

Smith showed that Equation 3.5 could be schematically realized as the so-called digital waveguide model shown in Figure 3.2 [7, 71, 72]. When the upper and lower signal paths, or rails, of Figure 3.2 are initialized with the values of the string's left and right wave shapes, the traveling wave phenomena in Figure 3.1 and Equation 3.2 are achieved by shifting the transverse displacement values for the wave shapes in the upper and lower rails. For example, during one temporal sampling instance, the right-traveling wave shifts by the amount cT_s along the string, which is equivalent to delaying y⁺ by one sample in Figure 3.2. The waveguide model also provides an intuitive understanding of how the traveling waves relate to the string's total displacement, which is obtained by summing the values of y⁺ and y⁻ at a desired spatial sample x = mcT_s.
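A literal (if inefficient) rendering of this two-rail structure is sketched below: each rail is shifted by one sample per time step and the displacement is read by summing the rails at one spatial sample. Sign-inverting reflections at the ends are assumed here, anticipating the rigid-termination boundary conditions discussed later in this section; the function and values are illustrative only.

    import numpy as np

    def waveguide_displacement(shape, m, n_steps):
        """Advance the two traveling-wave rails of a digital waveguide and
        record the displacement y(n) = y_plus[m] + y_minus[m] at sample m."""
        y_plus = shape.copy() / 2.0            # right-traveling rail
        y_minus = shape.copy() / 2.0           # left-traveling rail
        out = np.empty(n_steps)
        for n in range(n_steps):
            out[n] = y_plus[m] + y_minus[m]
            right_end, left_end = y_plus[-1], y_minus[0]
            y_plus[1:] = y_plus[:-1]           # shift right by one sample
            y_minus[:-1] = y_minus[1:]         # shift left by one sample
            y_plus[0] = -left_end              # inverting reflection at x = 0
            y_minus[-1] = -right_end           # inverting reflection at x = L
        return out

    # Triangular 'pluck' shape on a 64-sample string, observed near one end.
    y = waveguide_displacement(np.bartlett(64), m=10, n_steps=512)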

Figure 3.2: Waveguide model showing the discretized solution of an ideal, plucked string. The upper (y⁺) and lower (y⁻) signal paths represent the right and left traveling disturbances, respectively. The string's displacement is obtained by summing y⁺ and y⁻ at a desired spatial sample.

It should be noted that the values obtained at the sampling instants in the waveguide model are exact, although band-limited interpolation can be used to obtain the displacement between spatial sampling instants if desired [89].

Lossy Waveguide Model

The lossless waveguide model in Figure 3.2 clearly represents the phenomena of the traveling wave solution for a plucked string under ideal conditions. However, this model does not incorporate the characteristics of real strings, which are subject to a number of non-ideal conditions, such as internal friction and losses due to boundary collisions. In the context of sound synthesis, incorporating these properties is essential for modeling tones that behave naturally both from a physical and perceptual standpoint.

Non-ideal string propagation is hindered by energy losses from internal friction and drag imposed by the surrounding air. If these losses can be modeled as a constant, µ, proportional to the wave's transverse velocity, ẏ, Equation 3.1 can be modified as [72]

    K_t y″ = εÿ + µẏ    (3.7)

where the additional term, µẏ, incorporates the fricative losses applied to the string in the transverse direction. The solution to Equation 3.7 is the same as Equation 3.1, but with an exponential term that attenuates the right- and left-traveling waves as a function of propagation distance.

The solution is given by [72]:

    y(t, x) = e^(−(µ/2ε)x/c) y_r(t − x/c) + e^((µ/2ε)x/c) y_l(t + x/c)    (3.8)

To obtain the lossy waveguide model, Equation 3.8 is discretized by applying the same change of variables that was used to discretize Equation 3.1. This yields a waveguide model with a gain factor, g = e^(−µT_s/2ε), inserted after each delay element in the waveguide, as shown in Figure 3.3. Thus, a particular point along the right- or left-traveling wave shape is subject to an amplitude attenuation by the amount g as it advances one spatial sample through the waveguide.

Figure 3.3: Waveguide model incorporating losses due to propagation at the spatial sampling instances. The dashed lines outline a section where M gain and delay blocks are consolidated using a linear time-invariant assumption.

By using a linear time-invariant (LTI) assumption, Figure 3.3 can be simplified to reduce the number of delay and gain elements required for the model. For example, if the output of the waveguide is observed at x = (M + 1)X, then the previous M delay and gain elements can be consolidated into a single delay, z^(−M), and loss factor, g^M. This greatly reduces the complexity of the waveguide model, which is desirable for practical implementations.

Waveguide Boundary Conditions

In practice, the behavior of a vibrating string is determined by boundary conditions due to the string's termination points. In the case of the guitar, each string is terminated at the nut and the bridge, where the former is located near the guitar's headstock and the latter is mounted on the guitar's saddle. The behavior of the string at these locations depends on several factors, including the string's tensile properties, how it is fastened and the construction of the bridge and nut.

For simplistic modeling, however, it suffices to assume that guitar strings are rigidly terminated, such that there is no displacement at these positions. By assuming rigid terminations for a string with length L, a set of boundary conditions is obtained for solving the wave equation [33]

    y(t, 0) = y(t, L) = 0.    (3.9)

By substituting these conditions into Equation 3.2 and discretizing, the following relations between y⁺ and y⁻ are obtained [72]:

    y⁺(n) = −y⁻(n)    (3.10)
    y⁺(n − D/2) = −y⁻(n + D/2)    (3.11)

In Equation 3.11, D = 2L/X is often referred to as the loop delay, since it indicates the delay time, in samples, for a point on the right wave shape, for example, to travel from x = 0 to x = L and back along the string. Thus, points located at the same spatial sample on the right and left wave shapes will have the same amplitude displacement every D/2 samples. Viewed another way, D can be calculated as a ratio of the sampling frequency and the string's pitch, which is determined by the string's length,

    D = 2L/X = 2L/(cT_s) = 2Lf_s/c = f_s/f_0    (3.12)

where the fundamental frequency, f_0, was substituted based on the wave relationship f_0 = c/2L, where 2L is the wavelength and c is the wave speed. For example, at f_s = 44.1 kHz, a string tuned to f_0 = 330 Hz requires a loop delay of D = 44100/330 ≈ 133.6 samples.

Figure 3.4 shows the lossy waveguide model with boundary conditions superimposed on a guitar body to illustrate the physical relationship between the model and the instrument. The loss factors due to wave propagation and rigid boundary conditions are consolidated into two filters located at x = 0 and x = L, which correspond to the guitar's bridge and nut positions, respectively. The individual delay elements are merged into two bulk delay lines, each having a length of D/2 samples, which store the shapes of the left- and right-traveling waves at any time during the simulation. Furthermore, this model allows the string's initial conditions to be specified relative to a spatial sample in the delay line that represents the plucking point position. Initializing the waveguide in this way removes the need to explicitly model the coupling effects arising from the interaction between the string and excitation mechanism [72].

Figure 3.4: Plucked-string waveguide model as it correlates to the physical layout of the guitar. Propagation losses and boundary conditions are lumped into digital filters located at the bridge and nut positions. The delay lines are initialized with the string's initial displacement.

Initializing the waveguide in this way removes the need to explicitly model the coupling effects arising from the interaction between the string and excitation mechanism [72]. The guitar's output is observed at the pickup location by summing the values of the upper and lower delay lines at a desired spatial sample.

The simplistic nature of the waveguide model in Figure 3.4 leads to computationally efficient hardware and software implementations of realistic plucked guitar sounds. Memory requirements are minimal, since only two buffers are required to store the string's initial conditions, and the lossy boundaries can be implemented with simple digital filters. Furthermore, as Smith showed, the contents of the delay lines can be shifted via pointer manipulation to reduce the load on the processor [1, 72]. Karjalainen showed that using such techniques enables several string models to be implemented on a single DSP chip, with computational capabilities that are eclipsed by present day (2012) microprocessors [25].
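The structure of Figure 3.4 can be sketched in a few lines of Python: two D/2-sample rails holding the traveling waves, sign-inverting lossy reflections at the bridge (x = 0) and nut (x = L), and an output summing both rails at a pickup sample. The triangular initial displacement, pickup location and loss value below are illustrative assumptions, not measurements from the thesis.

```python
import numpy as np

def waveguide_pluck(f_s=44100, f0=330.0, dur=0.5, pluck_pos=0.25, g=0.999):
    """Minimal bidirectional waveguide (Figure 3.4): two D/2-sample rails
    hold the right- and left-traveling displacement waves, with rigid,
    sign-inverting reflections at the bridge (x = 0) and nut (x = L)."""
    half = int(round(f_s / f0 / 2))          # D/2 spatial samples per rail
    # Triangular initial displacement centered at the plucking point,
    # split equally between the two rails
    peak = int(pluck_pos * half)
    shape = np.concatenate([np.linspace(0, 1, peak, endpoint=False),
                            np.linspace(1, 0, half - peak)])
    right = shape / 2                        # y+ rail
    left = shape.copy() / 2                  # y- rail
    out = np.zeros(int(dur * f_s))
    pickup = half // 8                       # observe near the bridge
    for n in range(len(out)):
        out[n] = right[pickup] + left[pickup]
        bridge_in = -g * left[0]             # lossy reflection at x = 0
        nut_in = -g * right[-1]              # lossy reflection at x = L
        right = np.concatenate(([bridge_in], right[:-1]))
        left = np.concatenate((left[1:], [nut_in]))
    return out
```

The per-sample array shifts make the traveling-wave picture explicit; the pointer-manipulation technique noted above replaces these shifts with index arithmetic in practical implementations.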

3.2.5 Extensions to the Waveguide Model

An important extension is providing fractional delay for the waveguide model, since strings are often tuned to pitches that may not be obtainable by taking the ratio of the sampling frequency over an integer delay line length. While certain hardware and software configurations support multiple sampling rates, it is generally undesirable to vary the sampling rate to achieve a particular tuning, especially when synthesizing multiple string tones with different pitches. Instead, Karjalainen proposed adding fractional delay into the waveguide loop via a Lagrange interpolation filter. Thus, an FIR filter is computed to add the required fractional delay to precisely tune the waveguide [25].
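The coefficients of such a Lagrange interpolation filter follow from the standard product formula (see Appendix A for the design background); the sketch below is a minimal illustration, with the example delay value chosen arbitrarily.

```python
import numpy as np

def lagrange_fd(delay: float, order: int = 3) -> np.ndarray:
    """FIR Lagrange interpolation coefficients approximating a
    fractional delay of `delay` samples:
    h[n] = prod_{k != n} (delay - k) / (n - k), for n = 0..order."""
    n = np.arange(order + 1)
    h = np.ones(order + 1)
    for k in range(order + 1):
        mask = n != k
        h[mask] *= (delay - k) / (n[mask] - k)
    return h

# Example: the 0.636-sample fractional part of a 330 Hz loop delay,
# offset toward order/2, where Lagrange designs are most accurate
print(lagrange_fd(1.636, order=3))
```

Note that a causal design of order N also contributes an integer delay near N/2, which must be deducted from the bulk delay line to keep the string in tune (this bookkeeping is revisited in Section 5.2).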

Smith proposed using all-pass filters to simulate the effects of dispersion in strings, where the string's internal stiffness causes higher frequency components of the wave to travel faster than lower ones. This has the effect of constantly altering the shape of the string. All-pass filters introduce frequency-dependent group delay to simulate this effect [72].

Tolonen et al. incorporate the effects of pitch glide, or tension modulation, exhibited by real strings using a non-linear waveguide model [79, 8, 91]. At rest, a string exhibits a nominal length and tension. However, as the string is displaced from its equilibrium position, it undergoes elongation, which increases its tension. After release, the tension and, thus, the wave speed constantly fluctuate as the string oscillates about its nominal position. This constant fluctuation does not allow a fixed spatial sampling scheme to suffice, and the wave must be resampled at each time instance to account for the elongation.

3.3 Analysis and Synthesis Using Source-Filter Approximations

The waveguide model discussed in the previous section provides an intuitive methodology for implementing the traveling wave solution and simulating plucked-string tones. However, accurate re-synthesis of plucked-guitar tones using the waveguide model requires knowledge of the string's initial conditions and loss filters that are correctly calibrated to simulate naturally decaying tones. The former requirement is a significant limitation, since the exact initial conditions of the string are not available from a recorded signal and must be measured during performance, which is often impractical. Therefore, when performance and physical data are unavailable, the utility of the waveguide model is limited for analysis-synthesis tasks, such as characterizing recorded performance.

An alternative model, known as the single delay-loop (SDL), was developed to simplify the waveguide model from a computational standpoint by consolidating the delay lines and loss filters. The SDL model is also widely used in the literature because it permits the analysis of plucked-guitar tones from a source-filter perspective; that is, an external signal excites a filter to simulate the resonant behavior of a plucked string. Thus, the physical specifications for the guitar and its strings are generally not required to calibrate the SDL model, since linear time-invariant methods can be applied for this task. A number of guitar synthesis systems are based on SDL models [26, 56, 74, 75, 9].

3.3.1 Relation to the Karplus-Strong Model

For a more streamlined structure, the bidirectional waveguide model from Figure 3.4 can be reduced to a single, D-length delay line and a loop filter that consolidates the losses incurred from the bridge and nut [7, 72]. This reduction is shown in Figure 3.5, where the lower delay line is concatenated with the upper delay line at the nut position. The wave shape contained in the lower delay line is inverted to incorporate the reflection at the rigid nut, which has been removed.

Figure 3.5: Single delay-loop model (right) obtained by concatenating the two delay lines from a bidirectional waveguide model (left) at the nut position. Losses from the bridge and nut filters are consolidated into a single filter in the feedback loop.

The new structure in Figure 3.5 (right) demonstrates the basic SDL model and is identical to the well-known Karplus-Strong (KS) plucked-string model, whose discovery pre-dated waveguide synthesis techniques [22, 31]. Unlike waveguide techniques, where the excitation is based on wave variables, the KS model works by initializing a D-length delay line with random values and circularly shifting the samples through a loss filter. The random initialization of the delay line simulates the transient noise burst perceived during the attack of plucked-string instruments, though this excitation signal has no physical relation to the string, while the feedback loop acts as a comb filter so that only the harmonically related frequencies are passed. The loss filter, $H_l(z)$, employs low-pass filtering to implement the frequency-dependent decay characteristics of real strings, so that high-frequency energy dissipates faster than the lower frequencies.
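A minimal sketch of the KS loop with the classic two-tap averaging loss filter follows; the circular-buffer handling and parameter values are illustrative assumptions.

```python
import numpy as np

def karplus_strong(f_s=44100, f0=330.0, dur=1.0, seed=0):
    """Basic Karplus-Strong string: a D-length delay line initialized
    with noise, recirculated through a two-tap averaging loss filter."""
    rng = np.random.default_rng(seed)
    D = int(round(f_s / f0))              # integer loop delay
    buf = rng.uniform(-1, 1, D)           # random initial "pluck"
    out = np.zeros(int(dur * f_s))
    for n in range(len(out)):
        out[n] = buf[n % D]
        # two-tap averager: lossy, low-pass feedback path
        buf[n % D] = 0.5 * (buf[n % D] + buf[(n + 1) % D])
    return out
```

The averager is the simplest realization of $H_l(z)$; the one-pole IIR loop filter introduced in Section 3.3.3 replaces it when closer matches to measured decay rates are required.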

3.3.2 Plucked String Synthesis as a Source-Filter Interaction

By modeling plucked-guitar tones with the single delay-loop (SDL), the physical interpretation of traveling wave shapes on a string is no longer as clear as it was for the bidirectional waveguide. However, Valimaki et al. show that the SDL can be derived from the bidirectional waveguide model by computing a transfer function between the spatial samples representing the plucking position and output samples [3, 89]. This derivation is still physically valid, though the model's excitation signal is treated as an external input rather than a set of initial conditions describing the string's displacement.

Figure 3.6 shows a complete source-filter model for plucked guitar synthesis based on waveguide modeling principles. The SDL model is contained in the block labeled S(z), which is equivalent to the single delay line structure shown in Figure 3.5, except that the model is driven by an external excitation signal rather than a random initialization as in the Karplus-Strong model. S(z) alone cannot simulate the complete behavior of plucked strings found in the waveguide model. Notably missing is the ability to manipulate the plucking point and pickup positions, both of which are achieved in the waveguide model by selecting a desired spatial sample corresponding to the location where the string is displaced and where the vibration is observed as the output. Valimaki showed that this functionality could be achieved by adding comb filters before and after the SDL to simulate the effects of plucking point and pickup positions present in the waveguide model.

Figure 3.6: Plucked string synthesis using the single delay-loop (SDL) model specified by S(z). C(z) and U(z) are comb filters simulating the effects of the plucking point and pickup positions along the string, respectively.

Figure 3.6 shows a comb filter C(z) preceding S(z) to simulate the effect of the plucking point position. For simplicity, the input p(n) can be an ideal impulse. The comb filter delay determines when p(n) is reflected, which is analogous to a sample in the digital waveguide model encountering a rigid boundary. The number of samples between the initial and reflected impulses is specified as a fraction of the loop delay, where D indicates the number of samples corresponding to one period of string vibration. Similarly, the comb filter U(z) following S(z) simulates the position of the pickup seen on electric guitars. In this filter, the comb filter delay specifies the delay between arriving pulses associated with a relative position along the string. It should be noted that, since each of the blocks in Figure 3.6 is a linear time-invariant (LTI) system, they may be freely interchanged as desired.

3.3.3 SDL Components

Whereas the comb filters in Figure 3.6 specify initial and output observation conditions for the plucked guitar tone, the SDL filter S(z) is responsible for modeling the string vibration, including its fundamental frequency and decay. As in the case of the bidirectional waveguide, the total loop delay, D, of the SDL determines the pitch of the resulting guitar tone, as given by Equation 3.12. Since D is typically a non-integer, the fractional delay filter, $H_F(z)$, is used to add the required fractional group delay, while $z^{-D_I}$ provides the bulk, integer delay component of D. All-pass and Lagrange interpolation filters are commonly used for $H_F(z)$, with the latter being

especially popular in synthesis systems since it can achieve variable delay for pitch modification without significant transient effects [26, 3]. Additional information pertaining to fractional delay filters is provided in Appendix A.

$H_l(z)$ is the so-called loop filter and is responsible for implementing the non-ideal characteristics of real strings, including losses due to wave propagation and terminations at the nut and bridge positions. In the early developments of waveguide synthesis, $H_l(z)$ was chosen as a two-tap averaging filter for simplicity and efficiency [31], but such a low-order FIR filter is often too simplistic to match the magnitude decay characteristics of plucked-guitar tones. In the literature, a first-order IIR filter is often used for $H_l(z)$ and has the form

$$H_l(z) = \frac{g}{1 - \alpha z^{-1}} \qquad (3.13)$$

where $\alpha$ and g must be determined for proper calibration [29, 62, 86, 9].

It is useful to analyze the total delay, D, in the SDL as a sum of the delays contributed by each component in the feedback loop,

$$D = \lambda_l + D_F + D_I \qquad (3.14)$$

where $\lambda_l$, $D_F$ and $D_I$ are the group delays associated with $H_l(z)$, $H_F(z)$ and $z^{-D_I}$, respectively. Thus, the bulk and fractional delay components should be chosen to compensate for the group delay introduced by the loop filter, which varies as a function of frequency.

For spectral-based analysis, the transfer function of the SDL model between input, p(n), and output, y(n), can be expressed in the z-transform domain as

$$S(z) = \frac{1}{1 - H_l(z) H_F(z) z^{-D_I}}. \qquad (3.15)$$

Equation 3.15 can be thought of as a modified linear predictor where the prediction occurs over $D_I$ samples due to the periodic nature of plucked-guitar tones. The prediction coefficients are determined by the coefficients of the loop and fractional delay filters in the feedback loop of S(z).

The SDL model in Figure 3.6 is attractive from an analysis-synthesis perspective since, unlike the bidirectional waveguide model, it does not require specific data about the string during performance (e.g. initial conditions, instrument materials, plucking technique) to faithfully replicate plucked-guitar tones. Rather, the problem becomes properly calibrating the filters from recorded tones via model-based analysis. A significant portion of the literature on plucked-guitar synthesis is dedicated to developing calibration schemes for extracting optimal SDL components [26, 29, 62, 69, 86, 9].
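To make Equation 3.15 concrete, the sketch below runs an excitation through the SDL recursion with the one-pole loop filter of Equation 3.13. For brevity this assumes $H_F(z) = 1$ (no fractional delay), so the string is tuned only to the nearest integer delay; the parameter values are illustrative.

```python
import numpy as np

def sdl_synthesize(p, f_s=44100, f0=330.0, g=0.98, alpha=0.01):
    """SDL string filter of Equation 3.15 with the one-pole loop filter
    H_l(z) = g / (1 - alpha z^-1) of Equation 3.13 and H_F(z) = 1.
    Stability requires the DC loop gain g / (1 - alpha) to stay < 1."""
    D_I = int(round(f_s / f0))
    out = np.zeros(len(p))
    lp = 0.0                                  # loop filter state
    for n in range(len(out)):
        fb = out[n - D_I] if n >= D_I else 0.0
        lp = g * fb + alpha * lp              # one-pole loop filter
        out[n] = p[n] + lp
    return out

# Example: drive the string with a short noise burst
rng = np.random.default_rng(0)
burst = np.zeros(44100)
burst[:134] = rng.uniform(-1, 1, 134)
tone = sdl_synthesize(burst)
```

Driving this same recursion with the structured excitation signals recovered in Chapter 4, rather than noise, is what distinguishes the SDL from the Karplus-Strong model.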

3.3.4 Excitation and Body Modeling via Commuted Synthesis

When using the SDL model for guitar synthesis, the output signal is assumed to be strictly the result of the string's vibration, where the only external forces acting on the string are due to fricative losses. This assumption is not necessarily true when dealing with real guitars, since the instrument's body incorporates a resonant filter, which affects its timbre, and interacts with the strings via nonlinear coupling. Valimaki et al. describe the acoustic guitar body as a multidimensional resonator, which requires computationally expensive modeling techniques to implement [89]. While an exhaustive review of acoustic body modeling techniques is beyond the current scope, several attempts have been made to reduce the complexity of this task [7, 28, 57].

Measurement of the acoustic guitar body response is typically achieved by striking the resonant body of the instrument with a hammer while the strings are muted. The acoustic radiation is recorded to capture the resonant body modes. In some cases, electro-mechanical actuators are used to excite and measure the resonant body in a controlled manner [63]. Digital implementation of the acoustic body involves designing a filter that captures the resonant modes. This can be achieved using FIR or IIR filters, though precise modeling requires very high order filters. Karjalainen et al. proposed using warped filter models for computationally efficient modeling and synthesis of acoustic guitar bodies. The warped filter is advantageous since the frequency resolution of the filter can favor the lower, resonant frequency modes, which are perceptually important to capture for re-synthesis, while keeping the required filter orders low enough for efficient synthesis [24]. For cross-synthesis applications, Karjalainen et al. introduced a technique to morph electric guitar sounds into acoustic tones through equalization of the magnetic pickups found on electric guitars. A filter, which encapsulates the body effects of the acoustic guitar, was then applied to a digital waveguide model of the instrument [27].

Figure 3.7: Components for guitar synthesis including excitation, string and body filters. The excitation and body filters may be consolidated for commuted synthesis.

A popular method for dealing with the absent resonant body effects in the SDL model involves so-called commuted synthesis, which was independently developed by Smith and Karjalainen [29, 73]. This technique exploits the commutative property of linear time-invariant (LTI) systems in order to extract an aggregate signal that encapsulates the effects of the resonant body filter and the string excitation, p(n), of the SDL model when the loop filter parameters are known. This approach avoids the computational cost incurred with explicitly modeling the body with a high-order filter. Figure 3.7 shows the SDL model augmented by inserting excitation and body filters before and after the SDL loop, respectively. The excitation filter is a general LTI block that encapsulates several aspects of synthesis, including pluck-shaping filters to model certain dynamics in the articulation and the comb filtering effects from the plucking point and/or pickup locations as shown in Figure 3.6. Assuming that S(z) and y(n) are known, the LTI system can be rearranged:

$$Y(z) = E(z)\, S(z)\, B(z) \qquad (3.16)$$

$$= E(z)\, B(z)\, S(z) \qquad (3.17)$$

$$= A(z)\, S(z) \qquad (3.18)$$

where A(z) is an aggregation of the body and excitation filters. By inverse filtering y(n) in the

frequency domain with S(z), the impulse response for A(z) is obtained. Thus, by making an LTI assumption on the model, this residual signal contains the additional model components which are unaccounted for by the SDL alone. For practical considerations, Valimaki notes that several hundred milliseconds of the residual signal may be required to capture the perceptually relevant resonances of the acoustic body during resynthesis [9], but for many applications the tradeoff of storing this signal outweighs the cost of explicit body modeling.

It should be noted that even when plucked-guitar tones do not exhibit prominent effects from the resonant body, commuted synthesis is still a valid technique for obtaining the SDL excitation signal, p(n). This is often the case for electric guitar tones, where the output is measured by a transducer and is relatively dry compared to an acoustic guitar signal. Also, any excitation signal extracted via commuted synthesis will contain biases from the plucking point and pickup locations unless these phenomena are specifically accounted for in the excitation filter block of Figure 3.7. If the plucking point and pickup locations are known with respect to the SDL model, the excitation signal can be equalized to remove the biases. There are several techniques utilized in the literature to estimate the plucking point location directly from recordings of plucked guitar tones. Traube and Smith developed frequency domain techniques for acoustic guitars [81, 82, 83, 84], while Penttinen et al. employed time-domain analysis to determine the relative plucking position along the string [58, 59].
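Anticipating the all-pole approximation of S(z) developed in Section 5.2 (and again assuming $H_F(z) = 1$ for brevity), the inverse filter becomes FIR and commuted synthesis reduces to a single linear filtering operation. The sketch below illustrates this, with assumed parameter values; it is not the thesis's calibrated implementation.

```python
import numpy as np
from scipy.signal import lfilter

def extract_excitation(y, f_s=44100, f0=330.0, g=0.98, alpha=0.01):
    """Commuted-synthesis residual: inverse filter a recorded tone y(n)
    with the all-pole form of S(z), i.e. apply
    1 - alpha z^{-1} - g z^{-D_I}, which is a single FIR operation."""
    D_I = int(round(f_s / f0))
    b = np.zeros(D_I + 1)
    b[0], b[1], b[D_I] = 1.0, -alpha, -g
    return lfilter(b, [1.0], y)
```

With calibrated loop filter parameters, the output of this operation is the aggregate signal A(z) discussed above (or, for dry electric guitar tones, essentially the excitation itself).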

3.3.5 SDL Loop Filter Estimation

Before the SDL excitation signal can be extracted via commuted synthesis, the loop filter, $H_l(z)$, needs to be calibrated from the recorded tone. This task has been the primary focus of much of the literature, since the loop filter provides the synthesized tones with natural decay characteristics [14, 29, 39, 62, 69, 86, 9]. This section overviews some of the techniques used in the literature.

Early attempts at modeling the loop filter for the violin involved using deconvolution in the frequency domain to obtain an estimate of the loop filter's magnitude response. Smith employed various filter design techniques, including autoregressive methods, in order to model the contours of the spectra; however, the measured spectra were subject to amplified noise due to the deconvolution process [69].

Karjalainen introduced a more robust algorithm that extracts magnitude response specifications for the loop filter by analyzing the recorded tone with short-time Fourier transform (STFT) analysis [29]. Phase characteristics of the STFT are not considered in the loop filter design, since the magnitude response is considered to be perceptually more important for plucked-guitar modeling [29, 86]. Lee et al. expand on Karjalainen's STFT-based approach by adapting the so-called Energy Decay Relief (EDR) [4, 64] to model the frequency-dependent attenuation of the waveguide. The EDR was adapted from Jot [23] in order to de-emphasize the effects of beating in the string, so that the resulting magnitude trajectories for each partial are strictly monotonic. The EDR at time t and frequency f is computed by summing all the remaining energy at that frequency from t to infinity. Due to the decaying nature of plucked-guitar tones, this leads to a set of monotonically decreasing curves for each partial analyzed.

Example algorithm for Loop Filter Estimation

An example of Karjalainen's calibration scheme is shown in Figure 3.8 and can be summarized with the following steps:

1. Determine the pitch, $f_0$, of the recorded tone, y(n).
2. Compute the STFT of the plucked tone y(n).
3. For each frame in the STFT, estimate the magnitudes of the harmonically related partials.
4. Estimate the slope of each partial's magnitude trajectory across all frames in the STFT.
5. Compute a gain profile, $G(f_k)$, based on the magnitude trajectories of the harmonically related partials.
6. Apply filter design techniques (e.g. least-squares) to determine the parameters of $H_l(z)$ that satisfy the gain profile.

The details of each step in Karjalainen's calibration scheme vary depending on the specific implementation. For example, the number of partials chosen for analysis is typically between 10 and 20. Also, partial tracking across each frame can be achieved by bandpass filtering techniques when the pitch is known [9]. The gain profile, $G(f_k)$, extracted from the STFT analysis is computed as [29]

$$G(f_k) = 10^{\frac{\sigma_k D}{20 f_{Hop}}} \qquad (3.19)$$

where $\sigma_k$ is the slope of the kth partial's magnitude trajectory, D is the loop delay in samples and $f_{Hop}$ is the hop size of the STFT analysis. The physical meaning of Equation 3.19 is to determine the amount of attenuation a particular partial of the plucked tone incurs for each pass through the SDL. Thus, Equation 3.19 provides a gain specification for each partial in the STFT that can be used to design a loop filter, $H_l(z)$, with similar magnitude response characteristics.

Filter Design Techniques

Least-squares filter design techniques are typically employed to derive coefficients for the loop filter that satisfy the estimated gain profile [29, 86, 9]. Valimaki et al. utilized a weighted least squares algorithm to estimate the gain, g, and pole, $\alpha$, of $H_l(z)$ with the transfer function described by Equation 3.13. Since a low-order filter generally cannot match the gain specifications of every partial, the weighted minimization ensures that the magnitudes of the lower, perceptually important partials are more accurately matched to the gain profile [86, 9]. These techniques must ensure that the filter coefficients are constrained for stability, which, for example, requires $-1 < \alpha < 1$ and $0 < g < 1$ when using the loop filter form of Equation 3.13. Rather than design a filter based on desired magnitude characteristics, Bank et al. propose a filter design technique that minimizes the error of the decay times for the partials in the synthetic tone [3], since these are found to be perceptually significant.

Erkut and Laurson used Karjalainen's calibration method as a foundation for an iterative scheme based on nonlinear optimization to extract loop filter parameters that best match the amplitude envelope of a recorded tone [14, 39]. The calibration scheme in Figure 3.8 is used to obtain an initial set of loop filter parameters, which are used to resynthesize the plucked signal, and an error signal is computed between the amplitude envelopes of the recorded and synthesized signals. The loop filter parameters are adjusted by a small amount and the process is repeated until a global minimum in the error function is found. While this method has the potential to extract precise model parameters, convergence is not guaranteed and its success depends on the accuracy of the initial parameter estimates.
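A compact sketch of the calibration pipeline of Figure 3.8 follows: STFT partial tracking, dB-slope fitting, the gain profile of Equation 3.19, and a weighted fit of the one-pole loop filter. The grid search over the pole is a simplification of the weighted least-squares designs cited above, and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def estimate_loop_filter(y, f_s, f0, n_partials=10, nfft=2048, hop=512):
    """Karjalainen-style loop filter calibration (Section 3.3.5 sketch):
    fit a dB-per-frame slope to each partial's magnitude trajectory,
    convert to a per-pass gain profile (Equation 3.19), then fit the
    one-pole loop filter magnitude g / |1 - alpha e^{-jw}| to it."""
    f, t, Y = stft(y, fs=f_s, nperseg=nfft, noverlap=nfft - hop)
    D = f_s / f0                              # loop delay in samples
    freqs, gains = [], []
    for k in range(1, n_partials + 1):
        bin_k = int(round(k * f0 * nfft / f_s))
        traj = 20 * np.log10(np.abs(Y[bin_k, :]) + 1e-12)
        sigma = np.polyfit(np.arange(len(t)), traj, 1)[0]  # dB / frame
        freqs.append(k * f0)
        gains.append(10 ** (sigma * D / (20 * hop)))       # Eq. 3.19
    freqs, gains = np.array(freqs), np.array(gains)
    w = 1.0 / np.arange(1, n_partials + 1)    # weight low partials more
    best = (np.inf, 0.0, 0.0)
    for alpha in np.linspace(0.0, 0.99, 100):
        H = 1.0 / np.abs(1 - alpha * np.exp(-2j * np.pi * freqs / f_s))
        g = np.sum(w * gains * H) / np.sum(w * H * H)      # LS gain
        err = np.sum(w * (gains - g * H) ** 2)
        if err < best[0]:
            best = (err, g, alpha)
    return best[1], best[2]                   # (g, alpha)
```

The inverse weighting emphasizes the lower partials, in keeping with the perceptual argument made above; other weightings are equally admissible.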

Figure 3.8: Overview of the loop filter design algorithm outlined in Section 3.3.5, using short-time Fourier transform analysis on the signal. The panels trace the stages of the algorithm: pitch estimation from the plucked guitar tone y(n); peak detection on the STFT magnitude trajectories of the partials, with fitted slopes; and the loop filter gain specifications (gain profile and designed filter magnitude) yielding g and α.

3.3.6 Extensions to the SDL Model

The SDL model discussed in this chapter simulates plucked strings that vibrate in only the transverse (parallel to the guitar's top plate) direction and behave in accordance with linear time-invariant assumptions. These simplifications prevent modeling additional physical behavior exhibited by guitar strings, which is described in this section.

Real guitar strings vibrate along the axes parallel and perpendicular to the guitar's sound board. The frequency of vibration along each axis is slightly different due to slight differences in the string's length at the bridge and nut terminations. The differences in the frequency of vibration along each axis cause the beating phenomenon, where the sum and difference frequencies are perceived [9]. Furthermore, these vibrations may be coupled at the guitar's bridge termination, which causes a two-stage decay due to the in- and out-of-phase vibration along each axis [43]. In practice, the beating phenomenon is incorporated into synthesis systems by driving two SDL models in parallel, which represent string vibration along the transverse and perpendicular axes [3, 26, 86]. From an analysis perspective, it is difficult to simultaneously estimate parameters for both the transverse and perpendicular axes from a recording, since guitar pickups measure the total vibration at a particular point on the string. Typically, the parameters for both SDL models are extracted using the methods described in Section 3.3.5, with the exception of slightly mistuning one of the delay lines to simulate the beating effect. In order to estimate the model parameters directly, Riionheimo utilized genetic algorithms to obtain transverse and perpendicular SDL parameters that matched recorded signals in a perceptual sense [62]. Alternately, Lee employed a hybrid waveguide-signal approach where the waveguide model is augmented with a resonator bank to implement beating and two-stage decay phenomena in the lower frequency partials [43].

Modeling the tension modulation in strings necessitates the use of non-linear techniques to model the pitch-glide phenomenon [79, 8]. In practice, pitch glide is simulated by pre-loading a waveguide or SDL model with an initial string displacement and regularly computing the string's slope to determine an elongation parameter. This parameter drives a time-varying delay, which represents wave speed, to reproduce the tension modulation effect. The caveat to this approach, however, is that commuted synthesis cannot be applied to extract an excitation signal from a recorded tone. For an analysis-synthesis approach, Lee uses a hybrid resonator-waveguide model. The resonator bank is calibrated from a recording to implement pitch glide in the low-frequency partials since, it is argued, these are perceptually more relevant [42].

CHAPTER 4: SOURCE-FILTER PARAMETER ESTIMATION

4.1 Overview

Despite the vast amount of literature dedicated to developing and calibrating physically inspired guitar models, as discussed in Chapter 3, far less research has been dedicated to estimating expression from recorded performances and incorporating these attributes into the synthesis models. It is well known that guitarists employ a variety of techniques to articulate guitar strings, such as varying the loudness, or dynamics, and the picking device (e.g. finger, pick), which characterize their playing style. Thus, identifying these playing styles from a performance is essential to developing a system capable of expressive synthesis.

In this chapter, I propose a novel method to capture expressive characteristics of guitar performance from recordings in accordance with the single delay-loop (SDL) model overviewed in Section 3.3. This approach involves jointly estimating the source and filter parameters of the SDL in accordance with a parametric model for the excitation signal, which captures the expressive attributes of guitar performance. Since the SDL is a source-filter abstraction of the waveguide model, this method treats the source signal as the guitarist's string articulation, while the filter represents the string's response behavior. The motivation for a joint estimation scheme is to account for simultaneous variation of source and filter parameters, which characterizes particular playing styles. Before providing the details of our approach, I briefly overview existing techniques in the literature for modeling expression in guitar synthesis models.

4.2 Background on Expressive Guitar Modeling

Erkut and Laurson present methods to generate plucked tones with different levels of musical dynamics, or relative loudness, by manipulating a reference excitation signal with a known dynamics level. These methods involve designing pluck-shaping filters that can achieve a desired musical dynamics level when applied to the reference excitation signal [14]. Erkut employs a method that deconvolves a fortissimo (very loud) excitation with forte (loud) and piano (soft) excitations in order to derive

their respective pluck-shaping filter coefficients. Laurson used the differences in log-magnitude between two signals with different dynamics and autoregressive filter design techniques to approximate a desired pluck-shaping filter [39]. Both approaches are founded on the argument that a desired level of musical dynamics can be achieved by appropriately filtering a reference excitation signal. A limitation of this approach, however, is the assumption that the string filter parameters remain constant for all plucking styles, which does not always hold.

Cuzzucoli et al. presented a model for synthesizing guitar expression by considering the finger-string interaction for different plucking styles in classical guitar performance [12]. This work considered two plucking styles: apoyando, where the string is displaced quickly by the finger, and tirando, where the finger slowly displaces the string before releasing it. The effects of these finger-string interactions are incorporated into the waveguide model by modifying the wave equation to incorporate the force exerted on the string depending on the plucking style. For example, in the case of apoyando plucking, the force applied to the string is impulsive, while tirando plucks are characterized by a more gradual change in the string's tension. Cuzzucoli's approach relies on offline analysis, and no methods are provided for deriving these parameters from a recorded signal.

Though these approaches adequately model expressive intention(s), offline analysis is required to compute the model's excitation signal separately from the filter. This approach is counter-intuitive from a musical performance perspective, since it is understood by musicians that expression is, in part, the result of a simultaneous interaction between the performer and instrument.

4.3 Excitation Analysis

The SDL model presented in Section 3.3 assumes that plucked-guitar synthesis can be modeled by a linear and time-invariant system. Accordingly, the model output is the result of a convolution between a source signal p(n), a comb filter C(z) approximating the performer's plucking point position, and the string filter model S(z). For analysis-synthesis tasks, the commuted synthesis technique, as overviewed in Section 3.3.4, is used to compute $p_b(n)$ by inverse filtering the recorded tone, y(n), in the frequency domain with S(z), as shown in Equation 4.1:

$$P_b(z) = Y(z) S^{-1}(z) \qquad (4.1)$$

It should be noted that the subscript b on p(n) indicates that the excitation signal contains a bias from the performer's plucking point position. Unless the comb filter C(z) from Section 3.3.2 is known, the excitation signal derived from commuted synthesis will always contain this type of bias.

4.3.1 Experiment: Expressive Variation on a Single Note

To determine whether the SDL model can incorporate expressive attributes of guitar performance, excitation signals corresponding to different articulations of the same note on an electric guitar are analyzed by employing commuted synthesis with Equation 4.1. Assuming the string filter parameters are relatively constant for each performance, one might expect that the excitation signals contain the expressive characteristics that distinguish each playing style. Additionally, any similarities observed between the excitations may permit the development of a parametric input model. To test this hypothesis, recordings of electric guitar performance were analyzed using the following approach. For each plucking style:

1. Vary the relative plucking strength used to excite the string from piano (soft) to forte (loud).
2. Vary the articulation used to excite the string using either a pick or a finger.
3. Calibrate the string filter, S(z), using the methodology described in Section 3.3.5.
4. Extract $p_b(n)$ by inverse filtering the recording, y(n), with S(z).

The tones used for analysis were taken from an electric guitar equipped with a bridge-mounted piezoelectric pickup. These signals are relatively dry, with negligible effects from the instrument's resonant body, so that the recovered excitation signals should primarily indicate the performer's articulation. The bridge-mounted pickup ensures that the output will be observed from the same location on the string, and the recovered excitation signal will only contain a bias due to the plucking point effect.

The top panel of Figure 4.1 shows the recorded tones produced by specific articulations applied to the guitar's open, or unfretted, 1st string, and the corresponding excitation signals obtained using the approach outlined above are shown in the bottom panel. By observation, it is clear that each excitation signal corresponds to the first period of oscillation of its associated signal in the top panel of Figure 4.1, and each has negligible amplitude after this period. This is an intuitive result, since the SDL used for synthesis is tuned to the pitch of the string and its harmonics. By inverse filtering with the SDL, the residual signal is devoid of the periodic and harmonic structure of the

Figure 4.1: Top: Plucked guitar tones representing various string articulations by the guitarist on the open, 1st string (pitch E4, 330 Hz). Bottom: Excitation signals for the SDL model associated with each plucking style (finger and pick, at piano and forte dynamics).

recorded tone. The remaining spikes in the excitation signal correspond to incident and reflected pulses detected by the pickup after the string is released from displacement (see Section 4.3.2).

Despite the similar contour patterns of the excitation signals in Figure 4.1, there are several distinguishing features related to the perceived differences in timbre. The differences between the amplitudes of overlapping impulses correspond to the relative strength of the articulation used to produce the tone. More interesting, however, are the differences between the tones produced with a pick and those produced with the finger, as the former feature sharper transitions near regions of maximum or minimum amplitude displacement. This observation is correlated with the perceived timbre of each tone, since plucks generated with a pick have a more pronounced attack and will

excite the high-frequency harmonics in the string. The common structure of the excitation signals in Figure 4.1 suggests that $p_b(n)$ can be parametrically represented to capture the variations imparted by the guitarist through the applied articulation.

4.3.2 Physicality of the SDL Excitation Signal

The excitation signals shown in Figure 4.1 follow the contours of their counterpart plucked signals. However, the excitation signal is a short transient event that reduces to residual error after one period of oscillation in the corresponding plucked tone. Essentially, the excitation signal indicates one period of oscillation of the vibrating string measured at a particular position along the string. In this case, the acceleration of the string at the guitar's bridge is the variable observed.

The peaks observed in the excitation signals of Figure 4.1 can be explained by observing the output of a bidirectional waveguide model over one period of oscillation. This is shown in Figure 4.2, where the output at the end of the waveguide representing the guitar's bridge position is traced over time. Initially, the amplitude of the acceleration wave is maximal at the moment the string is released from its initial displacement (Figure 4.2a). After time, two separate disturbances form and travel in opposite directions along the string (Figure 4.2b). The initial peak in the excitation signal occurs when the right-traveling wave encounters the bridge position (Figure 4.2c). The amplitude of both traveling waves is inverted after reflecting at the boundaries at the nut and bridge positions (Figure 4.2d). Eventually, the initially left-traveling wave, now with inverted amplitude, encounters the bridge position, forming the second pulse of the excitation signal (Figure 4.2e). After some time, the initial pulse returns and the cycle repeats (Figure 4.2f). As will be discussed in Chapter 6, identifying the pulse locations in the excitation signal can be used to estimate the guitarist's relative plucking position.

Figure 4.2: The output of a waveguide model is observed over one period of oscillation. The top figure in each subplot (a)-(f) shows the position of the traveling acceleration waves along the string length at successive time instances, from t = 0 through one period; the bottom plot in each subplot traces the measured acceleration at the bridge (noted by the x in the top plots) over time.

4.3.3 Parametric Excitation Model

The contour patterns of the excitation signals observed in Figure 4.1 and the simulated waveguide output of Figure 4.2 are consistent with the physical behavior of the vibrating string. This suggests that the variations in the physical behavior of a plucked string due to different articulations can be parametrically represented by capturing the contours of the pulse peaks. Modeling the excitation signal with polynomial segments is a reasonable choice for approximating each contour. By concatenating these polynomial segments, the excitation signal can be represented by a piecewise function

$$p_b(n) = c_{1,0} + c_{1,1} n + \dots + c_{1,K} n^K + \dots + c_{J,0} + c_{J,1} n + \dots + c_{J,K} n^K \qquad (4.2)$$

where $c_{j,k}$ is the kth coefficient of a Kth-order polynomial modeling the jth segment of $p_b(n)$, and each segment is active only between its boundary locations. Therefore, modeling a particular excitation signal requires determining the number of segments required, the polynomial degree used to model each segment and the boundary locations specifying where a particular segment begins and ends.
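A sketch of evaluating the piecewise model of Equation 4.2 is shown below; the segment coefficients and boundary indices in the example are hypothetical values chosen only for illustration.

```python
import numpy as np

def piecewise_excitation(coeffs, bounds, length):
    """Evaluate the piecewise-polynomial excitation of Equation 4.2.
    coeffs[j] holds the K+1 coefficients (c_{j,0}..c_{j,K}, low-to-high
    order) of segment j; bounds are the J+1 sample indices delimiting
    the segments; samples past the last boundary are zero."""
    p = np.zeros(length)
    for j, c in enumerate(coeffs):
        n = np.arange(bounds[j], bounds[j + 1])
        p[n] = np.polyval(c[::-1], n)     # polyval wants high-to-low
    return p

# Hypothetical example: two cubic segments over the first period
p = piecewise_excitation(
    coeffs=[np.array([0.0, 0.05, -1e-3, 5e-6]),
            np.array([0.2, -0.02, 2e-4, -8e-7])],
    bounds=[0, 60, 134], length=400)
```

The joint estimation problem developed next determines these coefficients together with the string filter parameters, rather than fitting them to a pre-computed residual.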

4.4 Joint Source-Filter Estimation

As shown in Section 4.3.2, the SDL excitation signal reflects one period of oscillation observed at a particular location along the string. It was also shown that these signals differ according to the articulation imparted by the guitarist, and a parametric model was proposed to account for these differences. To model the SDL filter in response to different inputs (i.e. string articulations), this section proposes a joint source-filter approach to simultaneously account for variation in the excitation and string filter parameters. The remainder of this section details the approach for estimating these parameters by formulating a convex optimization problem.

4.4.1 Error Minimization

Using the SDL model, plucked string synthesis is assumed to result from a convolution between an input signal and a string filter. To estimate these parameters in a joint framework, the error between the excitation model described by Equation 4.2 and the residual signal must be minimized:

$$e(n) = p_b(n) - \hat{p}_b(n). \qquad (4.3)$$

Here, $p_b(n)$ is the excitation model from Equation 4.2 and $\hat{p}_b(n)$ is the residual obtained by inverse filtering the output with the string filter. By assuming S(z) is an all-pole filter, e(n) can be expressed in the frequency domain by replacing $\hat{p}_b(n)$ with $Y(z)S^{-1}(z)$ to yield

$$E(z) = P_b(z) - Y(z)S^{-1}(z) = P_b(z) - Y(z)\left(1 - H_l(z)H_F(z)z^{-D}\right) \qquad (4.4)$$

where the SDL components discussed in Chapter 3 are used to complete the inverse filtering operation. Making an all-pole assumption on S(z) treats the output of the SDL as a generalized linear prediction problem, where the current output sample y(n) is computed from a linear combination of previous output samples. Due to the periodic nature of the plucked tone, this prediction happens over an interval defined by the loop delay, which is specified by D. Since inverse filtering is a time-domain process, taking the inverse z-transform of E(z) in Equation 4.4 yields

$$e(n) = p_b(n) - y(n) + \alpha_0 y(n-D) + \alpha_1 y(n-D-1) + \dots + \alpha_N y(n-D-N), \qquad (4.5)$$

where $\alpha_0, \alpha_1, \dots, \alpha_N$ are generalized filter coefficients that are to be estimated. This equation can be rearranged to

$$e(n) = p_b(n) + \alpha_0 y(n-D) + \alpha_1 y(n-D-1) + \dots + \alpha_N y(n-D-N) - y(n), \qquad (4.6)$$

where the unknowns due to the source signal $p_b(n)$ and filter ($\alpha_0, \alpha_1, \dots$) are clearly separated from the recorded tone y(n). This form leads to a convenient matrix formulation, as shown in Equation 4.7.

$$\begin{bmatrix} e(1) \\ \vdots \\ e(i) \\ e(i+1) \\ \vdots \\ e(m) \end{bmatrix} = \begin{bmatrix} 1 & 1 & \cdots & 1^K & & & & & y(1-D) & \cdots & y(1-D-N) \\ \vdots & & & \vdots & & & & & \vdots & & \vdots \\ 1 & i & \cdots & i^K & & & & & y(i-D) & \cdots & y(i-D-N) \\ & & & & 1 & (i+1) & \cdots & (i+1)^K & y(i+1-D) & \cdots & y(i+1-D-N) \\ & & & & \vdots & & & \vdots & \vdots & & \vdots \\ & & & & 1 & m & \cdots & m^K & y(m-D) & \cdots & y(m-D-N) \end{bmatrix} x - \begin{bmatrix} y(1) \\ \vdots \\ y(i) \\ y(i+1) \\ \vdots \\ y(m) \end{bmatrix}$$

$$e = Hx - y \qquad (4.7)$$

H contains the time indices corresponding to the boundaries of $p_b(n)$ (illustrated here for two segments meeting at the boundary n = i) and the shifted samples of y(n), and the unknown source-filter parameters are contained in a column vector x defined as

$$x = \begin{bmatrix} c_{1,0} & \cdots & c_{1,K} & \cdots & c_{J,0} & \cdots & c_{J,K} & \alpha_0 & \alpha_1 & \cdots & \alpha_N \end{bmatrix}^T. \qquad (4.8)$$

Full specification of Equation 4.7 requires determining the number of unknown source and filter parameters. The generalized filter depends on N + 1 coefficients, while the excitation signal depends on the number of piecewise polynomials used to model it: J indicates the number of segments and K is the polynomial order for each segment.

4.4.2 Convex Optimization

The source-filter parameters are found by identifying the unknowns in x that minimize Equation 4.7. The complexity of this problem is directly related to the number of segments used to parameterize $p_b(n)$ and the order of the generalized filter used to implement the string decay. In general, the number of unknowns is $J(K + 1) + N + 1$. A common approach for estimating the unknown parameters is to minimize the $L_2$-norm of the error term in Equation 4.7, which leads to

$$\min_x \|e\|_2 = \min_x \|Hx - y\|_2. \qquad (4.9)$$
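Before turning to the solution, the construction of H can be sketched as follows; the zero-based sample indexing and the zero-padding of delayed samples before the signal start are implementation assumptions.

```python
import numpy as np

def build_H(y, bounds, D, N, K):
    """Assemble H and the target vector of Equation 4.7. Row n carries
    the polynomial regressors [1, n, ..., n^K] in the columns of the
    segment containing n, followed by the delayed samples
    y(n-D), ..., y(n-D-N) for the generalized filter taps."""
    J = len(bounds) - 1                  # number of segments
    m = bounds[-1]                       # samples modeled (n = 0..m-1)
    H = np.zeros((m, J * (K + 1) + N + 1))
    for j in range(J):
        for n in range(bounds[j], bounds[j + 1]):
            H[n, j*(K+1):(j+1)*(K+1)] = [n**k for k in range(K + 1)]
    for n in range(m):
        for q in range(N + 1):
            idx = n - D - q
            H[n, J*(K+1) + q] = y[idx] if idx >= 0 else 0.0
    return H, y[:m]
```

Any time sample not covered by a segment contributes only filter columns, consistent with the zero-amplitude assumption on the excitation after the first period.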

Expanding Equation 4.9 yields

$$\min_x \|Hx - y\|_2^2 = (Hx - y)^T (Hx - y) = x^T H^T H x - 2y^T H x + y^T y = \frac{1}{2} x^T F x + g^T x + y^T y \qquad (4.10)$$

where $F = 2H^T H$ and $g^T = -2y^T H$. Equation 4.10 is now in the form of a convex optimization problem. In this form, any locally minimum solution must also be a global solution [6].

Before applying a solver to the optimization problem, the constraints on the source-filter parameters in x must be addressed. For example, depending on the structure used for the loop filter, the constraints may specify bounds on the coefficients that yield a stable filter. Specific constraints for the filter models used will be discussed in Sections 5.2 and 5.3. Regardless of the filter structure used, the constraints regarding the excitation model are consistent. In particular, the segments constituting the excitation should be a smooth concatenation of polynomial functions that are continuous at the boundary locations. As an example, consider an excitation consisting of J = 2 segments, each modeled with a Kth-order polynomial and sharing a boundary located at n = i. The equality condition ensuring that these segments are continuous can be expressed as

$$c_{1,0} + c_{1,1} i + \dots + c_{1,K} i^K = c_{2,0} + c_{2,1} i + \dots + c_{2,K} i^K,$$

which, in matrix form, is notated as

$$\begin{bmatrix} 1 & i & \cdots & i^K & -1 & -i & \cdots & -i^K \end{bmatrix} \begin{bmatrix} c_{1,0} \\ c_{1,1} \\ \vdots \\ c_{1,K} \\ c_{2,0} \\ c_{2,1} \\ \vdots \\ c_{2,K} \end{bmatrix} = 0.$$

The term on the left contains the time indices of the polynomial functions and the column vector

contains the unknown source coefficients. Since the real excitation signals dealt with consist of more than two segments, additional equality conditions are required for each pair of segments sharing a boundary.

The constraints on the source-filter parameters are specified for the optimization problem via equality and inequality conditions, denoted by $A_{eq}$ and A, respectively. By including these constraints, the optimization problem from Equation 4.10 is expressed as

$$\min_x f(x) = \frac{1}{2} x^T F x + g^T x \qquad (4.11)$$

$$\text{subject to } Ax \le b$$

$$A_{eq} x = b_{eq}$$

where the last term of Equation 4.10 is dropped from the objective function f(x) since it is constant with respect to x and does not affect the minimizer. In Equation 4.11, b and $b_{eq}$ specify the bounds on the parameters related to the inequality and equality constraint matrices, respectively.

When written in the form of Equation 4.11, Equation 4.9 is solved using quadratic programming techniques. Several software packages are available for this task, including CVX and the quadprog function in MATLAB's Optimization Toolbox. quadprog employs a trust region algorithm, where a gradient approximation is used to evaluate a small neighborhood of possible solutions in x to determine convergence [47]. CVX is also adept at solving quadratic programs, though it formulates the objective function as a second-order cone problem [18]. CVX is the preferred solver for the work in this thesis because the syntax used to specify the quadratic program is identical to the mathematical description of the minimization problem in Equation 4.10.
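Because the thesis solves the problem with CVX in MATLAB, the sketch below instead uses CVXPY, a Python package with a similarly declarative syntax, to illustrate Equation 4.11; the constraint matrices are assumed to have been assembled from the stability and continuity conditions described above.

```python
import cvxpy as cp

def solve_joint(H, y, A, b, A_eq, b_eq):
    """Solve the constrained least-squares problem of Equation 4.11.
    A, b encode the inequality (stability) constraints and
    A_eq, b_eq the equality (boundary continuity) constraints."""
    x = cp.Variable(H.shape[1])
    objective = cp.Minimize(cp.sum_squares(H @ x - y))
    constraints = [A @ x <= b, A_eq @ x == b_eq]
    cp.Problem(objective, constraints).solve()
    return x.value
```

As in CVX, the solver code mirrors the mathematical statement of the problem nearly line for line, which is the property motivating the choice of solver in the text.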

CHAPTER 5: SYSTEM FOR PARAMETER ESTIMATION

Figure 5.1: Proposed system for jointly estimating the source-filter parameters for plucked guitar tones. The recorded tone y(n) passes through coarse onset detection, pitch estimation ($f_0$) and pitch-synchronous onset localization and segment estimation ($n_0, n_1, \dots, n_J$) before the least-squares problem $\|Hx - y\|^2$ is initialized and solved for the source-filter parameters x.

This chapter presents the details of the implementation of the joint source-filter estimation scheme proposed in Chapter 4. Figure 5.1 provides a diagram of the proposed system, including the major sub-tasks required for estimating the parameters directly from recordings. Section 5.1 discusses the onset localization of the plucked-guitar signal. This is required to determine the pitch of the tone during the attack instant and to localize the indices for the parametric model of the excitation signal. The experiments for application of the joint source-filter scheme are presented in Section 5.2, which includes the problem formulation, solution and analysis of the results.

5.1 Onset Localization

To estimate the SDL excitation signal in the joint framework, the physics of a vibrating string fixed at both end points are exploited. When considering the SDL model without the comb filter effect explicitly accounted for, the excitation signal corresponds to one period of string vibration, which can be identified in the recorded signal. From the physical modeling overview provided in Chapter 3,

when the string is released from an initial displacement, two disturbances are produced that travel in opposite directions along the string. These disturbances are measured by the guitar's pickup as impulse-like signals, where the first pulse is incident from the string's initial displacement and the second is inverted by reflection at the guitar's nut. A simulation of this behavior using acceleration as the wave variable was shown in Section 4.3.2. By identifying these pulses in the initial period of vibration, the portion of the recorded signal corresponding to the excitation signal can be identified.

This section overviews the approach used to identify the boundaries of the excitation within the plucked-guitar signal, which includes locating the incident and reflected pulses. As will be explained in Chapter 6, the spacing of these pulses provides insight for estimating the performer's relative plucking position along the string. The approach utilizes two-stage onset detection and is outlined as follows:

1. Employ coarse onset detection to determine a rough onset time for the attack of the plucked tone.
2. Estimate the pitch of the tone starting from the coarse onset.
3. Using the estimated pitch value, employ pitch-synchronous onset detection to estimate an onset closer to the initial attack of the signal.
4. Search for the local minimum and maximum values within the first period of the signal.

5.1.1 Coarse Onset Detection

Onset detection is an important tool used for many tasks in music information retrieval (MIR) systems, such as the identification of performance events in recorded music. For example, on a large scale it may be of interest to identify the beats in a recording of polyphonic music by looking for the drum onsets. For melody detection on a monophonic signal, the onsets must be found to determine when the instrument is actually playing. A thorough review of onset detection algorithms is provided in [4] and details several sub-tasks of the process, including pre-processing of the audio signal, reducing the audio signal to a detection function and locating the onsets by finding peaks in the detection function.

Obtaining a spectral representation of the audio signal is often the initial step for computing a detection function, since the time-varying energy in the spectrum can indicate when certain transient events occur, such as note onsets. The short-time Fourier transform (STFT) provides a time-varying spectral representation

and may be computed as:

$$Y_k(n) = \sum_{m=-N/2}^{N/2-1} y(m)\, w(m - nh)\, e^{-j 2\pi mk/N}. \qquad (5.1)$$

In Equation 5.1, w(m) is an N-point window function and h is the hop size between adjacent windows. The STFT facilitates the computation of several detection functions for onset detection tasks, including spectral flux. For monophonic recordings of instruments with an impulsive attack, such as the guitar, Bello et al. show that spectral flux performs well in identifying onsets [4]. Spectral flux is calculated as the squared distance between successive frames of the STFT:

$$SF(n) = \sum_{k=-N/2}^{N/2-1} \left\{ R\left( |Y_k(n)| - |Y_k(n-1)| \right) \right\}^2 \qquad (5.2)$$

where $R(x) = (x + |x|)/2$ is a rectification function that accounts for only positive changes in energy while ignoring negative changes.

The coarse onset detection is named such because a relatively large window size of N = 2048 samples is used to compute the STFT in Equation 5.1 and the flux in Equation 5.2. The motivation for using such a long window is to identify the attack portion of the plucked tone, where there is the largest energy increase, while ignoring spurious noise preceding the onset. The corresponding detection function is shown in the top panel of Figure 5.3(a), where there is a clear peak. The onset is taken as the time instant two frames prior to the maximum of the detection function.
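The coarse detector can be sketched as follows; the use of scipy's STFT and the frame-indexing details are implementation assumptions.

```python
import numpy as np
from scipy.signal import stft

def coarse_onset(y, f_s, nfft=2048, hop=512):
    """Coarse onset detection via spectral flux (Equations 5.1-5.2):
    rectified frame-to-frame magnitude increases are squared, summed
    per frame, and the onset is taken two frames before the peak."""
    _, t, Y = stft(y, fs=f_s, nperseg=nfft, noverlap=nfft - hop)
    mag = np.abs(Y)
    diff = np.diff(mag, axis=1)
    flux = np.sum(((diff + np.abs(diff)) / 2) ** 2, axis=0)  # R(x)^2
    peak_frame = int(np.argmax(flux)) + 1    # diff drops frame 0
    onset_frame = max(peak_frame - 2, 0)
    return int(t[onset_frame] * f_s)         # onset index in samples
```

The same routine is reused with a pitch-proportional frame size in the pitch-synchronous stage described below.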

5.1.2 Pitch Estimation

The coarse onset detected in Figure 5.3(a) is still quite far from the attack segment of the plucked signal. Searching for the pulse indices too far from the onset of the signal will likely result in false detections, so a closer estimate is required; this is the purpose of pitch-synchronous onset detection. The pitch of the signal is estimated by taking a window of audio equal to three times the STFT frame length, starting from the coarse onset location. Using this window, the pitch is estimated using the well-known autocorrelation function, which is given by

$$\phi(m) = \frac{1}{N} \sum_{n=0}^{N-1} [y(n+l)w(n)][y(n+l+m)w(n+m)], \quad \text{for } 0 \le m \le N-1, \qquad (5.3)$$

where w(n) is a window of length N. Autocorrelation is used extensively for detecting periodicity in signal processing tasks, since it can reveal underlying structure in signals, especially for speech and music. If $\phi(m)$ for a particular signal is known to be periodic with period P, then that signal is also periodic with the same period [61]. The pitch of the plucked signal is estimated by searching for a global maximum in $\phi(m)$ that occurs after the maximum correlation, i.e. the point of zero lag where m = 0. An example autocorrelation plot is provided in Figure 5.2.

Figure 5.2: Pitch estimation using the autocorrelation function. The lag corresponding to the global maximum indicates the fundamental frequency for a signal with $f_0$ = 330 Hz.
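A sketch of the autocorrelation pitch estimate follows; the zero-crossing heuristic used to skip the zero-lag lobe is an assumption for illustration, not the thesis's exact search rule.

```python
import numpy as np

def estimate_pitch(y, f_s, onset, nfft=2048):
    """Autocorrelation pitch estimate (Equation 5.3) over a window of
    three STFT frames starting at the coarse onset. The first global
    maximum after the zero-lag lobe gives the pitch period."""
    seg = y[onset:onset + 3 * nfft].astype(float)
    seg *= np.hanning(len(seg))
    phi = np.correlate(seg, seg, mode='full')[len(seg) - 1:]
    # heuristic: skip the zero-lag lobe by jumping past the first
    # negative value of phi (assumes an oscillatory autocorrelation)
    start = int(np.argmax(phi < 0))
    period = start + int(np.argmax(phi[start:]))
    return f_s / period
```

For a clean 330 Hz tone at 44.1 kHz, the peak lag falls near 134 samples, matching the loop delay of Equation 3.12.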

5.1.3 Pitch Synchronous Onset Detection

The estimated pitch of the plucked signal is used to recompute the STFT with a frame size equal to half the estimated pitch period, starting from the coarse onset location. The spectral flux is also recomputed using Equation 5.2 and the new frame size. This yields a detection function with much finer time resolution. As an example, the pitch-synchronous onset for a plucked signal is shown in Figure 5.3(b), where the onset is taken as the first locally maximum peak indicated by the detection function. Comparing all the panels of Figure 5.3, it is evident that the two-stage onset detection procedure provides an onset that is sufficiently close to the attack portion of the plucked note.

Figure 5.3: Overview of residual onset localization in the plucked-string signal. (a): Coarse onset localization using a threshold based on spectral flux with a large frame size. (b): Pitch-synchronous onset detection utilizing a spectral flux threshold computed with a frame size proportional to the fundamental frequency of the string. (c): Plucked-string signal with the coarse and pitch-synchronous onsets overlaid.

5.1.4 Locating the Incident and Reflected Pulse

With the pitch-synchronous onset location, identifying the indices of the incident and reflected pulses is accomplished via a straightforward search for the minimum and maximum peaks within the first period of the signal. This period is known from the previous pitch estimation step. The plucked signal from Figure 5.3 is shown again in detail in Figure 5.4 for emphasis. The indices of the pulses are used as boundaries for fitting polynomial curves to model the excitation signal. It should be noted that a straightforward search for the minima and maxima is sensitive to noise preceding the incident pulse; the pitch-synchronous onset detection is capable of ignoring this noise and yielding an onset closer to the incident pulse location.

Figure 5.4: Detail view of the attack portion of the plucked-tone signal in Figure 5.3. The pitch-synchronous onset is marked, as well as the incident and reflected pulses from the first period of oscillation.
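The peak search itself is only a few lines; the helper below assumes the onset index and period (in samples) obtained from the previous steps.

```python
import numpy as np

def locate_pulses(y, onset, period):
    """Find the incident and reflected pulses as the extreme samples
    within the first pitch period after the pitch-synchronous onset
    (Section 5.1.4); the reflected pulse has inverted polarity."""
    first = y[onset:onset + period]
    i_max = onset + int(np.argmax(first))   # one pulse polarity
    i_min = onset + int(np.argmin(first))   # the inverted reflection
    return sorted((i_max, i_min))           # (incident, reflected)
```

These two indices, together with the interpolated boundaries described in Section 5.2, delimit the polynomial segments of the excitation model.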

5.2 Experiment 1

This section presents the application of the joint source-filter estimation scheme proposed in Section 4.4 when the loop filter chosen is a single-pole infinite impulse response (IIR) type. The problem formulation and solution are discussed, as well as the application of the scheme to a corpus of plucked guitar tones.

5.2.1 Formulation

In the literature, the decay rates of the harmonically related partials of plucked-guitar tones are often approximated by a single infinite impulse response (IIR) filter with the following form:

$$H_l(z) = \frac{g}{1 - \alpha z^{-1}} \qquad (5.4)$$

In this formulation, the pole, $\alpha$, is tuned so that the spectral roll-off of the filter's magnitude response approximates the decay rates of the harmonically related partials in the plucked guitar tone. The gain term g in the numerator is tuned to improve the fit. To estimate this type of filter in the joint source-filter framework, Equation 5.4 is substituted for $H_l(z)$ in the SDL string filter S(z):

$$S(z) = \frac{1}{1 - H_l(z)H_F(z)z^{-D_I}} = \frac{1 - \alpha z^{-1}}{1 - \alpha z^{-1} - g H_F(z) z^{-D_I}}. \qquad (5.5)$$

The numerator of Equation 5.5 poses a problem for the joint source-filter estimation approach, because inverse filtering Y(z) with S(z) no longer results in an FIR filtering operation. This is problematic because inverse filtering Y(z) with S(z) in the time domain then requires previous samples of the excitation signal $p_b(n)$, which is unknown.

In practice, this difficulty can be circumvented, and the joint source-filter estimation problem still formulated, by discarding the numerator of S(z) in Equation 5.5 to yield an all-pole filter. This approximation is justified by a few observations about the source-filter system.

First, the magnitude response of S(z), shown in Figure 5.5(c), is dominated by its poles, which create a resonant structure passing frequencies located near the string's harmonically related partials. Examining the values estimated for the loop filter pole in the literature [14, 39, 86, 9], $\alpha$ is typically very small ($\ll 1$).

Figure 5.5: Pole-zero and magnitude plots of a string filter S(z) with $f_0$ = 330 Hz and a loop filter pole located at $\alpha$ = 0.3. The pole-zero and magnitude plots of the system are shown in (a) and (c), and the corresponding plots using an all-pole approximation of S(z) are shown in (b) and (d).

As shown in Figure 5.5(a), this places the corresponding zero in the numerator of S(z) close to the origin of the unit circle, giving it a negligible effect on the filter's magnitude response. Figure 5.5(d) shows that the magnitude response of the all-pole approximation is identical to its pole-zero counterpart in Figure 5.5(c).

The next observation is that the model of the excitation signal consists of a short-duration pulse with zero amplitude after the first period of vibration, as discussed in Section 4.3. The non-zero part of the excitation signal pertains to how the string was plucked, while the remaining part is residual error from the string model. By making a zero-input assumption on the excitation signal after the initial period, the recursion from the numerator of S(z) can be ignored without much effect on the

Taking these observations into account, the numerator of S(z) is discarded and an all-pole approximation is obtained:

\hat{S}(z) = \frac{1}{1 - \lambda z^{-1} - g H_F(z) z^{-D_I}}. \qquad (5.6)

The fractional delay coefficients due to H_F(z) must be addressed before the error minimization between the residual and excitation filter can be formulated (i.e., Equation 4.3). H_F(z) is an Nth-order FIR filter

H_F(z) = \sum_{n=0}^{N} h_F(n) z^{-n}, \qquad (5.7)

where the coefficients for a desired delay can be computed using a number of design techniques. A consequence of realizing a causal fractional delay filter is that an additional integer delay of \lfloor N/2 \rfloor samples is introduced into the feedback loop of \hat{S}(z). In practice, this can be compensated for, to avoid de-tuning the SDL, by subtracting the delay added by H_F(z) from the bulk delay filter z^{-D_I}, as long as N \leq D_I. The required fractional delay D_F and the bulk delay D_I can be determined from the estimated pitch of the guitar tone (Section 5.1.2), and H_F(z) is computed using the Lagrange interpolation technique overviewed in Appendix A. The error minimization from Equation 4.4 can now be specified for this particular case:

E(z) = P_b(z) - Y(z) \left( 1 - \lambda z^{-1} - g (h_0 + h_1 z^{-1} + \cdots + h_N z^{-N}) z^{-D_I} \right). \qquad (5.8)

By expanding Equation 5.8, rearranging terms and taking the inverse z-transform, the error minimization is expressed in the time domain as

e(n) = p_b(n) + \lambda y(n-1) + \gamma_0 y(n - D_I) + \gamma_1 y(n - D_I - 1) + \cdots + \gamma_N y(n - D_I - N) - y(n), \qquad (5.9)

where \gamma_j = g h_j, for j = 0, 1, 2, \ldots, N.
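To make Equation 5.9 concrete, the following sketch (Python/NumPy) assembles the linear system in the unknowns x = [\lambda, \gamma_0, \ldots, \gamma_N]. It is a simplified illustration that invokes the zero-input assumption on p_b(n) discussed above and omits the additional columns for the piecewise-polynomial excitation model; the function and argument names are hypothetical.

```python
import numpy as np

def build_system(y, D_I, N, n_start):
    """Stack Equation 5.9 for n = n_start, ..., len(y)-1 under the
    zero-input assumption p_b(n) = 0 after the first period, so that
    minimizing ||A x - b|| estimates x = [lambda, gamma_0, ..., gamma_N]."""
    assert n_start >= D_I + N + 1
    rows = len(y) - n_start
    A = np.zeros((rows, N + 2))
    for i in range(rows):
        n = n_start + i
        A[i, 0] = y[n - 1]                   # lambda * y(n - 1)
        for j in range(N + 1):
            A[i, 1 + j] = y[n - D_I - j]     # gamma_j * y(n - D_I - j)
    b = y[n_start:]                          # target y(n)
    return A, b
```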

5.2.2 Problem Solution

Using the convex optimization approach presented in Section 4.4.2, minimizing the L2-norm of Equation 5.9 becomes

\min_x \|Hx - y\|_2 \qquad (5.10)
subject to 0.001 \leq \lambda \leq 0.999, \quad 0 \leq \gamma_j \leq 0.999 \text{ for } j = 0, 1, \ldots, N.

The first inequality in the minimization ensures that the estimated loop filter pole will lie within the unit circle for stability and have low-pass characteristics. Though \lambda = 0 is a stable solution, the resulting filter would not impose any damping on the frequency response of the loop filter, so 0.001 was chosen as a lower bound on \lambda. The second inequality constraint relates to the stability of the overall string filter S(z). If the gain g of the loop filter is permitted to exceed unity, certain frequencies could be amplified, which would result in an unstable string filter response. Thus, the product of g with each fractional delay filter coefficient h_j is constrained to avoid this. Each h_j is fixed by the nature of the fractional delay filter design, leaving g as the free parameter. In addition to the inequality constraints, equality constraints were placed on the minimization in Equation 5.10 to handle continuous excitation boundaries, as discussed in Section 4.4.2. The excitation boundaries were identified using the two-stage onset localization scheme from Section 5.1. While this approach yields 3 segments corresponding to the incident and reflected pulses, it was found that additional segments were needed to adequately model the complex contours of the excitation signal. To reduce the modeling complexity, two equally-spaced boundaries were inserted between the incident and reflected pulses, as shown in the top panel of Figure 5.6. Including the boundary after the first period of the signal, this yields a total of 5 boundaries, requiring 6 segments to be modeled. 5th-order polynomial functions were found to provide the best approximation of each segment while maintaining feasibility in the optimization problem, since increasing the order also increases the number of unknown variables. Lower-order functions are unable to capture the details of the signal, while higher-order functions generally resulted in the solver failing to converge on a solution.
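A compact sketch of the constrained minimization in Equation 5.10, written with CVXPY as a stand-in for the MATLAB CVX package used in this work. The design matrix `H` and target `y` are synthetic placeholders for the quantities assembled in the previous sketch, and the numeric bounds follow the reconstruction above.

```python
import cvxpy as cp
import numpy as np

N = 5                                       # illustrative fractional-delay order
rng = np.random.default_rng(0)
H = rng.standard_normal((200, N + 2))       # stand-in for the assembled system
y = rng.standard_normal(200)

x = cp.Variable(N + 2)                      # x = [lambda, gamma_0, ..., gamma_N]
lam, gamma = x[0], x[1:]
prob = cp.Problem(
    cp.Minimize(cp.norm(H @ x - y, 2)),
    [lam >= 0.001, lam <= 0.999,            # pole inside the unit circle, low-pass
     gamma >= 0, gamma <= 0.999])           # keep the loop gain below unity
prob.solve()
```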

5.2.3 Results

The source-filter estimation scheme was applied to a corpus of recorded performances of a guitarist exciting each of the 6 strings at various fret positions. Multiple articulations were performed at each position, which included using a finger or pick and altering the dynamics, or relative hardness, of the excitation. Additional details about the data are provided in Section 6.3. Figure 5.6 demonstrates the analysis and resynthesis of a tone produced by plucking the open 1st string of the guitar. The top panel of Figure 5.6 shows the identification of the boundaries for the excitation signal model within the first period of the recorded tone. The middle panel shows the resynthesized tone and estimated excitation signal using the parameters obtained from the convex optimization. The error computed between the synthetic and recorded tones is shown in the bottom panel of Figure 5.6, along with the error computed between the estimated excitation signal and the residual from inverse filtering. Areas of the error signals with significant amplitude can be attributed to several factors. First, the approximation of the excitation may not capture all the high-frequency details present in the recorded signal. Second, the SDL model has fixed-frequency tuning, whereas the pitch of the recorded tone tends to fluctuate due to changing tension as the string vibrates, which results in misalignment. Finally, the loop filter model assumes that the string's partials decay monotonically over time, even though the decay characteristics of recorded tones are generally more complex. This results in amplitude discrepancy between the analyzed and synthetic signals, which contributes to the error as well.

Figure 5.7 shows that the source-filter estimation approach is capable of estimating the loop filter pertaining to string articulations produced with varying dynamics. Figures 5.7(a) and 5.7(b) show the amplitude decay characteristics of analyzed and synthesized tones produced with a piano articulation, respectively. In this case, the synthetic tone demonstrates the gradual decay characteristics of its analyzed counterpart. As the articulation dynamics are increased to mezzo-forte, the observed decay is more rapid in both the analyzed and synthetic cases in Figures 5.7(c) and 5.7(d). Finally, Figures 5.7(e) and 5.7(f) show a forte articulation defined by a very rapid decay. In all cases, the synthetic signals constructed from the estimated parameters convey the perceptual characteristics of their analyzed counterparts. Figure 5.8 shows a similar plot of analyzed and resynthesized signals for various articulations, but focuses on tones produced on one of the lower strings. In this case, the string's behavior deviates significantly from the SDL model, since the amplitude decay rate fluctuates over time. This is characteristic of tones that exhibit strong beating and tension modulation.

[Figure: three panels of amplitude vs. time (msec) — top: analyzed signal, residual excitation, and excitation boundaries; middle: synthesized output and estimated input signal; bottom: output error and input error.]
Figure 5.6: Analysis and resynthesis of the guitar's 1st string in the open position (E4, f_0 ≈ 330 Hz). Top: original plucked-guitar tone, residual signal and estimated excitation boundaries. Middle: resynthesized pluck and excitation using estimated source-filter parameters. Bottom: modeling error.

[Figure: amplitude envelopes vs. time (sec) in six panels — (a) piano, analyzed; (b) piano, synthetic; (c) mezzo-forte, analyzed; (d) mezzo-forte, synthetic; (e) forte, analyzed; (f) forte, synthetic.]
Figure 5.7: Comparing the amplitude envelopes of synthetic plucked-string tones produced with the parameters obtained from the joint source-filter algorithm against their analyzed counterparts. The tones under analysis were produced by plucking the 1st string at the 2nd fret position (F#4, f_0 = 370 Hz) at piano, mezzo-forte and forte dynamics.

Although these behaviors are not captured using the joint estimation approach, the optimization routine identifies loop filter parameters that provide the best overall approximation of the tone's decay characteristics.

[Figure: amplitude envelopes vs. time (sec) in six panels — (a) piano, analyzed; (b) piano, synthetic; (c) mezzo-forte, analyzed; (d) mezzo-forte, synthetic; (e) forte, analyzed; (f) forte, synthetic.]
Figure 5.8: Comparing the amplitude envelopes of synthetic plucked-string tones produced with the parameters obtained from the joint source-filter algorithm against their analyzed counterparts. The tones under analysis were produced by plucking the 5th string at the 5th fret position (D3, f_0 ≈ 147 Hz) at piano, mezzo-forte and forte dynamics.

To assess the model fit for each signal in the data set, the signal-to-noise ratio (SNR) was

computed as

\text{SNR}_{dB} = 10 \log_{10} \left( \frac{1}{L} \sum_{n=0}^{L-1} \frac{y^2(n)}{(y(n) - \hat{y}(n))^2} \right), \qquad (5.11)

where L is the length of the analyzed guitar tone y(n) and \hat{y}(n) is the re-synthesized tone using the parameters from the joint estimation scheme. This metric provides an indication of the average amplitude distortion introduced by the modeling scheme for a particular signal, so that in the ideal case there is zero amplitude error distorting the signal. Table 5.1 summarizes the mean and standard deviation of the SNR computed for particular articulations on particular strings. For example, the SNR values for all forte plucks produced with the guitarist's finger along the 1st string are computed, and the mean and standard deviation of these values are reported. No distinction is made between different fret positions along a string. It should be noted that the mean SNR value for a particular dynamic (e.g., forte) corresponding to pick articulations is generally lower than that for the same plucking dynamic produced with the guitarist's finger. This can be explained by the action of the plastic pick, which induces rapid frequency excursions in the partials of the string and other nonlinear behaviors such as tension modulation. These effects are prominent near the attack portion of the tone, and the associated string decay does not exhibit the monotonically decaying exponential characteristics used in the single delay-loop model. The linear time-invariant model cannot capture the complexities of the string vibration, and the estimated loop filter provides a best fit to the overall decay characteristics. This leads to a greater amplitude discrepancy between the modeled and analyzed tones, and thus a lower SNR value. For the 3rd string, the SNR values are significantly lower for the pick articulations. A closer inspection revealed that many of these tones exhibited resonant effects from coupling with the guitar's body. This resonance introduces a hump in the tone's amplitude decay envelope after the initial attack. Since the string model does not consider the instrument's resonant body, this effect is not accounted for, which leads to increased amplitude error for the affected portions of the signal. Informal listening tests confirm that the synthetic signals preserve many of the perceptually important characteristics of the original tones, including the transient attack portion of the signal related to the guitarist's articulation.
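Equation 5.11 transcribes directly into a few lines of NumPy. The small error floor `eps` is an implementation detail added here to avoid division by zero and is not part of the thesis formulation.

```python
import numpy as np

def snr_db(y, y_hat, eps=1e-12):
    """Average per-sample SNR in dB between an analyzed tone y and its
    resynthesis y_hat (Equation 5.11)."""
    err = (y - y_hat) ** 2
    return 10.0 * np.log10(np.mean(y ** 2 / np.maximum(err, eps)))
```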

[Table 5.1 — mean ± standard deviation of the SNR (dB) for each string (1-6), for pick and finger articulations at piano, mezzo-forte, and forte dynamics.]
Table 5.1: Mean and standard deviation of the SNR computed using Equation 5.11. The joint source-filter estimation approach was used to obtain parameters for synthesizing the guitar tones based on an IIR loop filter.

5.3 Experiment 2

This section investigates the solution of the joint source-filter estimation scheme when a finite impulse response (FIR) filter is used to implement the loop filter. The problem formulation, solution and results are discussed as well.

5.3.1 Formulation

The z-transform of a generalized, length-N (order N-1) FIR filter is given by

H(z) = \sum_{k=0}^{N-1} h_k z^{-k}, \qquad (5.12)

where each h_k is an impulse response coefficient of the filter. By using this filter structure for the string model's loop filter, the transfer function of S(z) becomes

S(z) = \frac{1}{1 - H_l(z) H_F(z) z^{-D_I}}. \qquad (5.13)

For the plucked-string system defined by the transfer function of S(z), the output is computed entirely by a linear combination of past output samples once the transient-like excitation has reached a zero-input state. Estimating the filter coefficients through the error minimization technique discussed in Section 4.4.1 becomes complicated because the loop filter coefficients are convolved with the coefficients of the fractional delay filter H_F(z), which is also modeled as an FIR filter, and

the contribution of the loop filter cannot be easily separated. In practice, this difficulty is averted by resampling the recorded signal y(n) to a frequency that can be defined by an integer number of delays determined by the bulk delay term D_I, which allows H_F(z) to be dropped. Though this has the effect of adjusting the frequency of the signal to \hat{f}_0 = f_s / D_I, the fractional delay filter can be re-introduced during synthesis to correct the pitch. After the resampling operation, the z-transform of the error minimization becomes

E(z) = P_b(z) - Y(z) S^{-1}(z) = P_b(z) - Y(z) \left( 1 - (h_0 + h_1 z^{-1} + \cdots + h_{N-1} z^{-(N-1)}) z^{-D_I} \right). \qquad (5.14)

Expanding terms and taking the inverse z-transform of Equation 5.14 yields the time-domain formulation of the error minimization

e(n) = p_b(n) + h_0 y(n - D_I) + h_1 y(n - D_I - 1) + \cdots + h_{N-1} y(n - D_I - N + 1) - y(n), \qquad (5.15)

where the loop filter coefficients h_k can be estimated with the convex optimization approach.

5.3.2 Problem Solution

Before solving for the source and filter parameters, several constraints are imposed on the FIR loop filter. Foremost, the loop filter is required to have a low-pass characteristic, to avoid amplifying high-frequency partials. This is consistent with the assumed operation of the loop filter in relation to the behavior of plucked-guitar tones described in Chapter 3, where, in general, high-frequency partials are perceived as decaying faster than lower-frequency partials. The next constraint on the loop filter is that it exhibit a linear phase response, to avoid introducing excessive phase distortion into the frequency response of the string filter S(z). Linear-phase filters also have the convenient property of constant group delay, so they do not drastically de-tune S(z) when the signal is resynthesized. The low-pass constraints on the FIR filter can be formulated by constraining the magnitude response of the filter at DC and Nyquist. At DC (\omega = 0), the filter gain is required to be at most unity, which

yields the following inequality constraint on the filter coefficients:

H(e^{j\omega})\big|_{\omega=0} = h_0 + h_1 + h_2 + \cdots + h_{N-1} \leq 1. \qquad (5.16)

At the Nyquist frequency (\omega = \pi), we require the filter to have zero magnitude response. This is expressed as an equality constraint on the filter coefficients:

H(e^{j\omega})\big|_{\omega=\pi} = h_0 - h_1 + h_2 - \cdots + (-1)^{N-1} h_{N-1} = 0. \qquad (5.17)

The linear phase constraint requires that the filter coefficients are symmetric. This imposes a final set of equality constraints on the coefficients:

h_k = h_{N-1-k} \quad \text{for } k = 0, \ldots, N-1. \qquad (5.18)

The process of identifying the boundaries for the segments of the excitation signal is identical to the procedure described in Section 5.2.2, and 5th-order polynomials are also used for segment fitting. Equation 5.19 summarizes the constrained minimization problem after taking the L2-norm of Equation 5.15 and imposing the constraints from Equations 5.16-5.18, in addition to the constraints placed on the input signal as specified in Section 5.2.2:

\min_x \|Hx - y\|_2 \qquad (5.19)
subject to \sum_{k=0}^{N-1} h_k \leq 1, \quad \sum_{k=0}^{N-1} (-1)^k h_k = 0, \quad h_k = h_{N-1-k} \text{ for } k = 0, \ldots, N-1.
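The constraints of Equations 5.16-5.18 map one-to-one onto a CVXPY problem, sketched below under the same assumptions as the earlier example (synthetic `H` and `y` standing in for the assembled system; CVXPY substitutes for the MATLAB CVX package used in the thesis).

```python
import cvxpy as cp
import numpy as np

N = 3                                        # loop filter length used in the text
rng = np.random.default_rng(0)
H = rng.standard_normal((200, N))            # stand-in for the assembled system
y = rng.standard_normal(200)

h = cp.Variable(N)
signs = np.array([(-1.0) ** k for k in range(N)])
constraints = [cp.sum(h) <= 1,               # DC gain at most unity (Eq. 5.16)
               signs @ h == 0]               # zero response at Nyquist (Eq. 5.17)
constraints += [h[k] == h[N - 1 - k]         # linear phase / symmetry (Eq. 5.18)
                for k in range(N // 2)]
prob = cp.Problem(cp.Minimize(cp.norm(H @ h - y, 2)), constraints)
prob.solve()
```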

[Table 5.2 — mean ± standard deviation of the SNR (dB) for each string (1-6), for pick and finger articulations at piano, mezzo-forte, and forte dynamics.]
Table 5.2: Mean and standard deviation of the SNR computed using Equation 5.11. The joint source-filter estimation approach was used to obtain parameters for synthesizing the guitar tones using an FIR loop filter with length N = 3.

5.3.3 Results

The source-filter estimation scheme using the FIR loop filter was applied to the same corpus of signals used in Experiment 1, and the MATLAB CVX package was again used to solve the minimization of Equation 5.19. Table 5.2 summarizes the mean and standard deviation of the SNR, computed in the same manner as in Experiment 1 using Equation 5.11. These values were computed by re-synthesizing the plucked-guitar tones using an FIR loop filter with length N = 3. The values reported in Table 5.2 are on par with those obtained in Experiment 1; that is, the FIR modeling approach exhibits roughly the same average SNR values and trends for different articulations and strings. However, by comparing the synthetic tones produced by the methods of Experiments 1 and 2, we noted that the FIR filter does not always adequately match the decay rates of the high-frequency partials. This yielded synthetic tones that sounded "buzzy", since the high-frequency partials were not decaying fast enough. We attempted to improve the perceptual qualities of the synthetic tones to better match their analyzed counterparts by increasing the length of the FIR loop filter. However, using filters with length N > 3 often resulted in the overall response of the single delay-loop model becoming unstable. Though the FIR loop filter is inherently stable by design, and constraints were placed on the filter at the DC and Nyquist frequencies, the FIR loop filter may occasionally exhibit gains exceeding unity at mid-range frequencies. Since this filter is located in the feedback loop of the single delay-loop model, the overall response is unstable when the excitation signal has energy at

mid-range frequencies.

5.4 Discussion

This chapter presented the implementation details of the joint source-filter estimation scheme proposed in Chapter 4. This included a two-stage onset detection based on a spectral flux computation to estimate the pitch of the plucked tone and identify the location of the incident pulses used to estimate the source signal. The system was implemented using two different loop filter structures, which characterize the frequency-dependent decay characteristics of the guitar tones. The first implementation utilized a one-pole IIR filter to model the string's decay response. The formulation of the joint estimation scheme using this filter required an all-pole approximation of the single delay-loop transfer function. By applying the estimation scheme using this formulation, it was shown that the modeling scheme was capable of capturing the source signals and string decay responses characteristic of the articulations in the data set. The articulations produced with the guitarist's pick led to more complex string responses, and the source-filter estimation method extracts filter parameters that best approximate these characteristics. Modeling error is attributed to the accuracy of the estimated source signal, which may omit some noise-like characteristics, and to the non-ideal decay characteristics of real strings, which are generally not monotonic as assumed by the model.

The second implementation utilized an FIR loop filter model, which inherently leads to an all-pole transfer function for the single delay-loop model and is thus more flexible in terms of adding taps to improve the fit. Though a low-order (length N = 3) FIR filter performed similarly to the IIR case in terms of SNR, the low-order filter did not adequately taper off the high-frequency characteristics of the tones. Increasing the order of this filter led to unstable single delay-loop transfer functions, due to the loop filter gain occasionally exceeding unity. Thus, the IIR loop filter proved more robust in terms of stability and provided a better match of the string's decay characteristics for high-frequency partials.

CHAPTER 6: EXCITATION MODELING

6.1 Overview

In Chapter 3, physically inspired models of the guitar were discussed, including the popular waveguide synthesis and the related source-filter models. In particular, the source-filter approximation is attractive for analysis and synthesis tasks because these models provide a clear analog to the physical phenomena incurred when exciting a guitar string: that is, an impulsive force from the performer excites the resonant behavior of the string. In Section 4.3, it was shown that analysis via the source-filter approximation can be used to recover excitation signals corresponding to particular string articulations, thereby providing a measure of the performer's expression. In Section 4.4, a technique was proposed to jointly estimate the excitation signal along with the filter model using a piecewise polynomial approximation of the excitation signal, which contains a bias from the performer's relative plucking point position along the string. Including the method proposed in Section 4.4.1, many techniques are available for estimating and calibrating the resonant filter properties of the source-filter model [29, 36, 86], but less research has been invested in the analysis of the excitation signals, which are responsible for reproducing the unique timbres associated with the performer's articulation. This is a complex problem, since there is a nearly infinite number of ways to pluck a string, each of which will yield a unique excitation (under the source-filter model) even when the tones have a similar timbre. In particular, it is desirable to have methods by which particular articulations can be quantified through analysis of the associated excitation signal. For applications, it would also be desirable to manipulate a parametric representation for arbitrary plucked-string synthesis.

In this chapter, a components analysis approach is applied to a corpus of excitation signals derived from recordings of plucked-guitar tones in order to obtain a quantitative representation that models the unique characteristics of guitar articulations. In particular, principal components analysis (PCA) is employed for this task to exploit common features of excitation signals while modeling the finer details using the appropriate principal components. This approach can be viewed as developing a codebook, where the entries are principal component vectors that describe the unique characteristics

of the excitation signals. Additionally, these components are used as features for visualization of particular articulations and for dimensionality reduction. Nonlinear PCA is employed to yield a two-way mapping that isolates specific performance attributes which can be used for synthesizing excitation signals. This research has several applications, including modeling guitar performance directly from recordings in order to capture expressive and perceptual characteristics of a performer's playing style. Additionally, the codebook entries obtained in this chapter can be applied to musical interfaces for control and synthesis of expressive guitar tones.

6.2 Previous Work on Guitar Source Signal Modeling

Existing excitation modeling techniques are based on either the digital waveguide or the related source-filter models. While both are discussed at length in Chapter 3, the source-filter model and its components are briefly overviewed here to re-introduce notation pertinent to the remainder of the chapter. Figure 6.1 shows the model obtained when the bi-directional waveguide model is reduced to a source-filter approximation. The lower block, S(z), of Figure 6.1 is referred to as the single delay-loop (SDL) and consolidates the DWG model into a single delay line z^{-D_I} in cascade with a string decay filter H_l(z) and a fractional delay filter H_F(z). These filters are calibrated such that the total delay D in the SDL satisfies D = f_s / f_0, where f_s and f_0 are the sampling frequency and fundamental frequency, respectively. H_l(z) is designed using the techniques discussed in Chapter 3 [29, 36, 86], while the fractional delay filter can be designed using a number of techniques discussed in Appendix A. The upper block, C(z), of Figure 6.1 is a feedforward comb filter that incorporates the effect of the performer's plucking point position along the string. Since the SDL lacks the bi-directional characteristics of the DWG, C(z) simulates the boundary conditions when a traveling wave encounters a rigid termination. Absent from Figure 6.1 is an additional comb filter modeling the pickup position where the string output is observed. While this affects the resulting excitation signals when commuted synthesis is used for recovery, it is omitted here since the data used for the evaluations were collected using a constant pickup position.

While the SDL is essentially a source-filter approximation of the physical system of a plucked string, there are several benefits associated with modeling tones in this manner. For example, modifying the source signal permits arbitrary synthesis of unique tones even for the same filter model.

[Figure: block diagram — the excitation p(n) feeds the feed-forward comb filter C(z) (containing the delay z^{-λD}) and then the single delay-loop S(z), whose feedback path contains H_l(z), H_F(z) and z^{-D_I}, producing the output y(n).]
Figure 6.1: Source-filter model for plucked-guitar synthesis. C(z) is the feed-forward comb filter simulating the effect of the player's plucking position. S(z) models the string's pitch and decay characteristics.

Also, for analysis tasks it is desirable to model the perceptual characteristics of tones from a recorded performance by recovering the source signal using linear filtering operations (see the discussion of commuted synthesis in Chapter 3), which is possible with a source-filter model.

There are several approaches in the literature for determining the excitation signal for the source-filter model of a plucked guitar. A possible source signal is filtered white noise, which simulates the transient, noise-like characteristics of a plucked string [31]. A well-known technique involves inverse filtering a recorded guitar tone with a properly calibrated string model [29, 36]. When inverse filtering is used, the string model cancels out the tone's harmonic components, leaving behind a residual that contains the excitation in the first few milliseconds. In [39], these residuals are processed with pluck-shaping filters to simulate the performer's articulation dynamics. For improved reproduction of acoustic guitar tones, this approach is extended by decomposing the tone into its deterministic and stochastic components, separately inverse filtering each signal, and adding the residuals to equalize the spectra of the residual [9]. Other methods utilize non-linear processing to spectrally flatten the recorded tone and use the resulting signal as the source, since it preserves the signal's phase information [38, 41]. Lindroos et al. consider the excitation signal to consist of three parts: the picking noise, the first impulse detected by the pickup, and a second, reflected pulse also detected by the pickup at some later time [44]. The picking noise is modeled with low-pass filtered white noise, and the first pulse is modeled with an integrating filter.

Despite the range of modeling techniques described above, these methods are not generalizable for describing a multitude of string articulations. For example, Laurson's approach involves storing the residual signals obtained from inverse-filtering recorded plucks, and filters to shape a reference

residual signal in order to achieve another residual with a particular dynamic level (e.g., piano, forte) [39]. While this approach is capable of morphing one residual into another, the relationship between the pluck-shaping filters and the physical effects of modifying plucking dynamics is somewhat arbitrary. Additionally, this method does not remove the bias of the guitarist's plucking point location, which is undesirable since the plucking point should be a free parameter for arbitrary resynthesis. On the other hand, Lee's approach handles this problem by whitening the spectrum of the recorded tone to remove the spectral bias. However, this requires preserving the phase information, resulting in a signal equal in duration to the recorded tone, which is not a compact representation of the signal.

6.3 Data Collection

6.3.1 Overview

It is understood by guitarists that exactly reproducing a particular articulation on a guitar string is extremely difficult, if not impossible, due to the many degrees of freedom available when exciting the string. These degrees of freedom during the articulation comprise parts of the guitarist's expressive palette, including:

- Plucking device (e.g., pick, finger, nail)
- Plucking location along the string
- Dynamics (i.e., the relative hardness or softness of the articulation)

These techniques have a direct impact on the initial shape of the string, yielding perceptually unique timbres, especially during the attack phase of the tone. It is important to note that, unlike the waveguide model presented in Chapter 3, the SDL does not allow the initial waveshape to be specified via wave variables (e.g., displacement, acceleration). Instead, signal processing techniques must be used to derive the excitation signals through analysis of recorded tones, and it is initially unclear how exactly to parameterize the effects of the plucking device and dynamics once the signals are recovered. Additionally, a significant amount of data is needed to analyze the effects of these expressive parameters on the resulting excitation signals. This section details the approach and apparatus used to collect plucked-guitar recordings containing the expressive attributes listed above. The recovery of the excitation signals from the data will be explained in Section 6.4.

6.3.2 Approach

The plucked-guitar signals under analysis were produced using an Epiphone Les Paul Standard guitar equipped with a Fishman Powerbridge pickup. A diagram of the Powerbridge pickup is shown in Figure 6.2; it features a piezoelectric sensor mounted on each string's saddle on the bridge [15]. Unlike the magnetic pickups traditionally used for electric guitars, the piezoelectric pickup responds to pressure changes due to the string's vibration at the bridge. For the application of excitation modeling, the piezoelectric pickup has several benefits over magnetic pickups, including the measurement of a relatively dry signal that does not include significant resonant effects arising from the instrument's body. Also, magnetic pickups tend to introduce a low-pass filtering effect on the spectra of plucked tones, but the piezo pickups record a much wider frequency range, which is useful for modeling the noise-like interaction between the performer's articulation and the string. Finally, recordings produced with the bridge-mounted piezo pickup can be used to isolate the plucking point location for equalization, which will be explained in Section 6.4.2, since the pickup location is constant at the bridge.

[Figure: labeled parts — piezo crystals, saddle, saddle position screw, bridge.]
Figure 6.2: Front orthographic projection of the bridge-mounted piezoelectric bridge used to record plucked tones. A piezoelectric crystal is mounted on each saddle, which measures pressure during vibration. Guitar diagram obtained from

The guitar was strung with a set of D'Addario 10-gauge nickel-wound strings. The gauge reflects the diameter of the first (highest) string, which is 0.010 inches, while the last (lowest) string

has a 0.046-inch diameter. As is common with electric guitar strings, the lowest 3 strings (4-6) feature a wound construction, while the highest 3 (1-3) are unwound. Recordings were produced using either the fleshy part of the guitarist's finger or a Dunlop Jazz III pick. The data set of plucked recordings was produced by varying the articulation across the fretboard of the guitar using either the guitarist's finger or the pick. For each fret, the guitarist produced a specific articulation five consecutive times for consistency, using both the pick and the finger. The articulations were identified by their dynamic level and consisted of piano (soft), mezzo-forte (medium-loud) and forte (loud). The performer's relative plucking point position along the string was not specified and remained a free parameter during the recordings. The articulations were produced on each of the guitar's six strings using the open string position as well as the first five frets, which yielded approximately 1,000 plucked-guitar recordings. The output of the guitar's bridge pickup was fed directly to an M-Audio Fast Track Pro USB interface, which recorded the audio directly to a Macintosh computer. Audacity, an open-source sound recording and editing tool, was used to record the samples at a sampling rate of 44.1 kHz with 16-bit depth [49].

Due to the difference in construction between the lower and higher strings on the guitar, the recordings were analyzed in two separate groups reflecting the wound and unwound strings. In terms of the acquisition system, this affects how the signals are resampled in Figure 6.3. For the unwound strings, the signals were resampled to 196 Hz, which corresponds to the tuning of the open 3rd string, the lowest pitch possible on the unwound set. Similarly, the wound strings were resampled to 82.4 Hz, which is the pitch of the open 6th string and the lowest note possible in the wound set.

6.4 Excitation Signal Recovery

On the way to modeling the articulations from recordings of plucked-guitar tones, a few pre-processing tasks must be addressed: 1) estimate the residual signal from the plucked-guitar recordings, and 2) remove the bias associated with the guitarist's plucking point position. As discussed in Section 6.2, a limitation of existing excitation modeling methods is that they do not explicitly handle this bias. The system overviewed in Figure 6.3 addresses these tasks, and its various sub-blocks are explained in this section.

[Figure: block diagram — y(n) → pitch estimation (f_0) → residual extraction via inverse filtering or joint estimation (p_b(n)) → plucking point estimation (d_rpp) → residual equalization → p(n).]
Figure 6.3: Diagram outlining the residual equalization process for excitation signals.

6.4.1 Pitch Estimation and Resampling

The initial step of the excitation recovery scheme involves estimating the pitch of the plucked guitar tone. This is achieved using the well-known autocorrelation method, which estimates the pitch over the first 2-3 periods of the signal by searching for the lag corresponding to the maximum of the autocorrelation function (see Section 5.1.2) [61]. The fundamental frequency is computed as f_0 = f_s / \tau_{max}, where f_s is the sampling frequency and \tau_{max} is the lag at the maximum of the autocorrelation function. Since the plucked-guitar tones under analysis have varying fundamental frequencies, a resampling operation is required to compensate for differences in pulse width when the residual is recovered. This is a required pre-processing step before principal components analysis, since the goal is to model differences in articulation that are not related to pitch. Otherwise, the extracted basis vectors would not reflect the differences in articulation, but rather the differences between the fundamental periods of the analyzed tones. The resampling operation on the plucked tone is defined as

\hat{y}(n) = y(\alpha n), \qquad \alpha = T / T_{ref}, \qquad (6.1)

where \alpha is the resampling factor, and T and T_{ref} indicate the periods, in samples, of the estimated pitch of the plucked tone and of the reference frequency, respectively.
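A minimal sketch of these two steps (Python/SciPy); the autocorrelation search bounds (`f_min` and the 1 ms guard around zero lag) are illustrative choices rather than parameters specified in the text.

```python
import numpy as np
from scipy.signal import resample

def estimate_pitch(y, fs, n_periods=3, f_min=80.0):
    """Autocorrelation pitch estimate over the first few periods
    of the pluck (cf. Section 5.1.2)."""
    frame = y[: int(n_periods * fs / f_min)]
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / 1000.0)          # skip the zero-lag peak region
    tau_max = lag_min + int(np.argmax(ac[lag_min:]))
    return fs / tau_max

def normalize_period(y, fs, f0, f_ref):
    """Resample so the pluck's period T = fs/f0 matches the reference
    period T_ref = fs/f_ref (Equation 6.1)."""
    T, T_ref = fs / f0, fs / f_ref
    return resample(y, int(round(len(y) * T_ref / T)))
```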

6.4.2 Residual Extraction

There are several methods of extracting the residual from the recorded tone. The most generalized approach was discussed in Section 4.3 and involves inverse-filtering the recorded tone with the calibrated string model presented in Section 6.2 to yield the residual excitation p_b(n). The approach proposed in Chapter 4 outlines an alternate method to jointly estimate the excitation and filter parameters of a plucked guitar tone. It should be noted that the subscript b on p_b(n) indicates that the residual contains a plucking point bias, which will eventually be removed.

6.4.3 Spectral Bias from Plucking Point Location

The Plucking Point Estimation block in Figure 6.3 is concerned with determining the position where the guitarist has displaced the string. It is well understood in the literature on string physics and digital waveguide modeling that the plucking point position imparts a comb-filter effect on the spectrum of the vibrating string [17, 3, 64]. This occurs because the harmonics that have a node at the plucking position are not excited and, in the ideal case, have zero amplitude. Figure 6.4 shows the residual and its spectrum obtained from plucking an open E string (f_0 = 331 Hz) approximately 8.4 cm from the bridge of an electric guitar. In Figure 6.4(a), the first spike in the residual results from the impulse produced by the string's initial displacement arriving at the bridge pickup. The subsequent spike also results from the initial string displacement, but has an inverted amplitude due to traveling in the opposite direction along the string and reflecting at the guitar's nut. A detailed description of this behavior is provided in Figure 4.2 in Section 4.3. Unlike a pure impulse, which has a flat frequency response, the residual spectrum in Figure 6.4(b) contains deep notches spaced at near-regular frequency intervals.

[Figure: (a) residual amplitude vs. time (msec); (b) residual magnitude (dB) vs. frequency (Hz).]
Figure 6.4: Comb filter effect resulting from plucking a guitar string (open E, f_0 = 331 Hz) 8.4 cm from the bridge. (a) Residual obtained from the single delay-loop model. (b) Residual spectrum. Using Equation 6.2, the notch frequencies are approximately located at multiples of 382 Hz.

By denoting the relative plucking position along

the string as d_{rpp} = l / L_s, where l is the distance from the bridge and L_s is the length of the string, the notch frequencies can be calculated by

f_{notch,n} = \frac{n f_0}{1 - d_{rpp}}, \quad \text{for } n = 0, 1, 2, \ldots \qquad (6.2)

The comb filter bias creates a challenge for parameterizing the excitation signals, since the guitarist's relative plucking position constantly varies depending on the position of the strumming hand and the fretting hand. Even when the guitarist maintains the same plucking distance from the bridge, changing the fretting position along the neck manipulates the relative plucking position by elongating or shortening the effective length of the string. While guitarists vary the relative plucking point location, either consciously or subconsciously, during performance, modeling the excitation signal requires estimating the plucking point position and equalizing to remove its spectral bias. Ideally, it is desirable to recover the pure impulsive signal imparted by the guitarist when striking the string, as shown in Figure 6.9, in order to quantify expressive techniques such as plucking mechanism and dynamics. Such analysis requires estimating the plucking point location from recordings and equalizing the residuals to remove the bias.
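Equation 6.2 is simple enough to check numerically against the example of Figure 6.4; in the snippet below, the value d_rpp ≈ 0.134 is implied by the reported 382 Hz notch spacing rather than stated in the text.

```python
def notch_frequencies(f0, d_rpp, n_notches=5):
    """Comb-filter notch frequencies induced by the relative
    plucking position d_rpp (Equation 6.2)."""
    return [n * f0 / (1.0 - d_rpp) for n in range(n_notches)]

# Open E string plucked ~8.4 cm from the bridge: with d_rpp ~ 0.134,
# the notch spacing is 331 / (1 - 0.134) ~ 382 Hz, as in Figure 6.4.
print(notch_frequencies(331.0, 0.134))
```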

6.4.4 Estimating the Plucking Point Location

Previous techniques in the literature for estimating the plucking point location from guitar recordings have focused on spectral or time-domain analysis. Traube proposed a method of estimating the plucking point location by comparing a sampled-data magnitude spectrum obtained from a recording to synthetic magnitude spectra generated with different plucking point locations [83, 84]. The plucking point location for a particular recording was determined by finding the synthetic string spectrum with a plucking position that minimizes the magnitude error between the measured and ideal spectra. Later, Traube introduced a plucking-point estimation method based on iterative optimization and the so-called log-correlation, which is computed from recordings of plucked tones [81, 82]. The log-correlation is computed by taking the log of the squared Fourier coefficients of the harmonically related partials in a plucked-guitar spectrum and applying the inverse Fourier transform to these coefficients. The log-correlation function yields an initial estimate of the relative plucking position, d_{rpp} = \tau_{min} / \tau_{max}, where \tau_{min} and \tau_{max} are the lags indicating the minimum and maximum of the log-correlation function, respectively. The estimate of d_{rpp} is used to initialize an iterative optimization scheme, which minimizes the difference between ideal and measured spectra, in order to refine d_{rpp} and improve accuracy.

Penttinen et al. exploited time-domain analysis techniques to estimate the plucking position [58, 59]. Using an under-saddle bridge pickup, Penttinen's technique is based on identifying the impulses associated with the string's initial displacement as they arrive at the bridge pickup. Since the initial string displacement produces two impulses traveling in opposite directions, the arrival time between the impulses at the bridge, \Delta t, provides an indication of the guitarist's relative plucking position along the string. Figure 6.5 shows the output of a bridge-mounted piezoelectric pickup for a plucked-guitar tone. By determining the onsets when each pulse arrives at the bridge pickup, Penttinen shows that the relative plucking position can be determined by

d_{rpp} = \frac{f_s - T f_0}{f_s}, \qquad (6.3)

where T = f_s \Delta t indicates the number of samples between the arrivals of the two impulses at the bridge pickup [58, 59]. As d_{rpp} lies in the range (0, 1), the actual distance from the bridge is obtained by multiplying d_{rpp} by the length of the string. Penttinen utilizes a two-stage onset detection to determine T, where the first stage isolates the onset of the plucked tone and the second stage uses the estimated pitch of the tone to extract one period of the waveform. The autocorrelation of the extracted period is used to determine T, since the minimum of the autocorrelation function occurs at the lag where the signal's impulses are out of phase. Figure 6.6(a) shows one cycle extracted from the waveform in Figure 6.5, and Figure 6.6(b) shows the corresponding autocorrelation of that signal. \Delta t is identified by searching for the index corresponding to the minimum of the autocorrelation function.
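A minimal sketch of Penttinen's time-domain estimate (Equation 6.3), assuming `period_samples` already holds exactly one period isolated by the two-stage onset detection.

```python
import numpy as np

def relative_pluck_position(period_samples, fs, f0):
    """Penttinen-style estimate (Equation 6.3): autocorrelate one
    extracted period, find the lag of its minimum (the out-of-phase
    point of the two opposite-polarity pulses), and map it to d_rpp."""
    ac = np.correlate(period_samples, period_samples, mode="full")
    ac = ac[len(period_samples) - 1:]          # non-negative lags only
    T = int(np.argmin(ac))                     # samples between pulse arrivals
    return (fs - T * f0) / fs                  # d_rpp in (0, 1)
```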

[Figure: plucked-tone amplitude vs. time (msec) with two dashed vertical lines separated by Δt.]
Figure 6.5: Plucked-guitar tone measured using a piezoelectric bridge pickup. Vertical dashed lines indicate the impulses arriving at the bridge pickup; Δt indicates the arrival time between impulses.

[Figure: (a) one extracted period, amplitude vs. time (msec); (b) its autocorrelation with the minimum marked.]
Figure 6.6: (a) One period extracted from the plucked-guitar tone in Figure 6.5. (b) Autocorrelation of the extracted period. The minimum is marked and denotes the time lag, Δt, between arriving pulses at the bridge pickup.

There are several strengths and weaknesses associated with the methods proposed by Traube and Penttinen. Traube's approach is generalizable to acoustic guitar tones recorded with an external microphone. However, a relatively large time window, on the order of 100 milliseconds, is required to achieve the frequency resolution needed to resolve the string's harmonically related partials and, thus, compute the log-correlation function. By including multiple periods of string vibration in the analysis, the effect of the plucking position can become obscured, since non-linear coupling of the string's harmonics can regenerate the missing harmonics [16]. By isolating just one period of the waveform near the onset, Penttinen's technique avoids this physical consequence, since the analyzed segment results from the string's initial displacement. However, Penttinen's approach requires the guitar to be equipped with the bridge-mounted pickup to isolate the arrival time of the impulses in the first period of vibration. Also, isolating the first period of vibration is difficult, and success depends on the parameters used in the two-stage onset detection. Handling the effect of a string pickup location at a position other than the bridge is not explicitly addressed by either method. Similar to the spectral bias resulting from the plucking point location, the pickup location also adds a spectral bias, since vibrating modes of the string with a node at the

pickup location will not be measured. Traube's methods are developed for the acoustic guitar recorded with a microphone some distance from the instrument's sound hole. In this case, the pickup is the radiated acoustic energy from all positions along the string and thus shows no particular spectral bias. For electric guitars, if a bridge-mounted pickup is not available, determining the plucking location is particularly difficult due to the lack of consistency in where the pickups are placed on the instrument and in how many are used. The former makes it difficult to determine which impulse (i.e., the left-traveling or the right-traveling pulse) is being measured at the output, and the latter complicates the problem since some guitars blend the signal from two or more pickups.

6.4.5 Equalization: Removing the Spectral Bias

The next step in the excitation acquisition scheme is to remove the comb filter bias associated with the plucking point position; in Figure 6.3, the Residual Equalization block handles this task. The equalization begins by obtaining an estimate of the relative plucking-point location d_{rpp} along the string. Since the signals under analysis were recorded with a bridge-mounted pickup, Penttinen's autocorrelation-based technique was chosen to estimate d_{rpp}. The two-stage onset detection approach presented in Section 5.1 was used to identify the incident and reflected pulses during the initial period of vibration. d_{rpp} is then used to formulate a comb filter to approximate the notches in the spectrum of the residual:

H_{cf}(z) = 1 - \mu z^{-\lfloor \lambda D \rfloor}, \qquad (6.4)

where \lambda = 1 - d_{rpp} and D = f_s / f_0 is the loop delay of the digital waveguide model determining the pitch of the string [74]. \lfloor \lambda D \rfloor denotes the greatest integer less than or equal to the product \lambda D. \mu is a gain factor applied to the delayed signal, which determines the depth of the magnitude response at the notch frequencies; \mu values closer to 1 lead to deeper notches [76]. Intuitively, Equation 6.4 specifies the number of samples, as a fraction of the total loop delay, between the arrival of each impulse at the bridge. The basic comb filter structure of Equation 6.4 and Figure 6.7(a) provides a good approximation of the spectral nulls associated with the plucking point position. However, it is limited to sample-level accuracy, which may not adequately approximate the true notch frequencies in the spectrum.

[Figure: comb filter block diagrams — (a) v(n) through gain μ and delay z^{-λD} summed with the direct path to give u(n); (b) the same structure with a fractional delay filter F(z) in the feedforward path.]
Figure 6.7: Comb filter structures for simulating the plucking point location. (a) Basic structure. (b) Basic structure with a fractional delay filter added to the feedforward path to implement a non-integer delay.

For more precise localization, a fractional delay filter is inserted into the feedforward path to provide the required non-integer delay, as shown in Figure 6.7(b) [88]. Thus, the resulting fractional delay comb filter has the form

H_{cf}(z) = 1 - \mu F(z) z^{-\lfloor \lambda D \rfloor}, \qquad (6.5)

where F(z) provides the fractional precision lost by rounding the product \lambda D. F(z) can be designed using several techniques available in the literature, including all-pass filters and FIR Lagrange interpolation filters, as discussed in Appendix A. Using the comb filter structure of Equation 6.4 or 6.5, p_b(n) can be equalized by inverse filtering:

P(z) = \frac{P_b(z)}{H_{cf}(z)}. \qquad (6.6)

Figure 6.8 demonstrates the effects of equalizing the residual in both the time and frequency domains. Figures 6.8(a) and 6.8(b) show the time- and spectral-domain plots, respectively, of the residual obtained from a plucked-guitar tone. Figure 6.8(b) also plots the frequency response of the estimated comb filter, which approximates the deep notches found in the residual. A 5th-order fractional delay filter was used in the comb filter, and a value of 0.95 was used for the gain term \mu. This value was found to provide the closest approximation of the spectral notches for the signals in the dataset.
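The equalization of Equation 6.6 reduces to an IIR filtering operation; the sketch below implements the integer-delay comb of Equation 6.4 only, leaving out the fractional-delay refinement of Equation 6.5 for brevity.

```python
import numpy as np
from scipy.signal import lfilter

def equalize_residual(p_b, d_rpp, fs, f0, mu=0.95):
    """Remove the plucking-point comb bias by inverse filtering with
    the integer-delay comb of Equation 6.4."""
    D = fs / f0                           # waveguide loop delay in samples
    M = int(np.floor((1.0 - d_rpp) * D))  # delay between pulse arrivals
    # H_cf(z) = 1 - mu * z^{-M}; inverse filtering divides by it, i.e.
    # filters with b = [1] and a = [1, 0, ..., 0, -mu] (stable for mu < 1).
    a = np.zeros(M + 1)
    a[0], a[M] = 1.0, -mu
    return lfilter([1.0], a, p_b)
```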

Figures 6.8(c) and 6.8(d) show the time- and spectral-domain plots when the residual is equalized by inverse filtering. In the spectral domain, inverse comb filtering yields a magnitude spectrum that is relatively free of the deep notches seen in Figure 6.8(b). In the time-domain plot of Figure 6.8(c), this translates into a signal that is much closer to a pure impulse.

[Figure: four panels — (a) residual, amplitude vs. time (msec); (b) residual spectrum and comb filter approximation, magnitude (dB) vs. frequency (Hz); (c) residual with bias removed; (d) original and equalized spectra using the inverse comb filter.]
Figure 6.8: Spectral equalization of a residual signal obtained from plucking a guitar string 8.4 cm from the bridge (open E, f_0 = 331 Hz).

6.4.6 Residual Alignment

After equalization, the final step is to align the processed excitation signals with a reference excitation signal. This ensures that the impulse peak of each signal is aligned in the time domain, to avoid errors in the principal components analysis. In practice, this is accomplished by copying the reference and processed signals and cubing them, which suppresses the samples surrounding the primary peak relative to the peak itself. The cross-correlation is then computed between each cubed signal and the cubed reference pulse. The lag of maximum correlation indicates the shift needed to align each signal with the reference pulse.
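A sketch of this alignment step; the circular shift via `np.roll` is a simplification for illustration (a real implementation would pad or trim instead).

```python
import numpy as np
from scipy.signal import correlate

def align_to_reference(p, ref):
    """Align an equalized excitation to a reference pulse by
    cross-correlating cubed copies (cubing emphasizes the primary
    peak) and shifting by the best lag."""
    xc = correlate(p ** 3, ref ** 3, mode="full")
    lag = int(np.argmax(xc)) - (len(ref) - 1)
    return np.roll(p, -lag)
```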

For excitation signal modeling and parameterization, the residual equalization scheme has several benefits. From an intuitive standpoint, the impulsive signals obtained from equalization are more indicative of the performer's string articulation. Also, signals in this form are simpler to model and therefore better suited to parameterization. Finally, removing the plucking point bias allows the relative plucking point location to remain a free parameter for synthesis applications.

6.5 Component-based Analysis of Excitation Signals

6.5.1 Analysis of Recovered Excitation Signals

By applying the excitation recovery and equalization scheme of the previous section to the corpus of recordings gathered in Section 6.3, analysis of the recovered signals provides insight into the similarities and differences of excitation signals corresponding to various string articulations. Figures 6.9(a) and (b) show excitation signals, overlaid on each other, obtained from plucked-guitar tones produced using either a plastic pick (a) or the player's finger (b).

[Figure: overlaid excitation signals (forte, mezzo-forte, piano), amplitude vs. time (msec), in two panels.]
Figure 6.9: Excitation signals corresponding to strings excited using a pick (a) and finger (b).

For both finger and pick articulations, the dynamics of the pluck consisted of piano (soft), mezzo-forte (moderately loud) and forte (loud). These plots show a common, impulsive contour with additional high-frequency characteristics depending on the dynamics used. Comparing Figures 6.9(a) and (b), it is evident that the signals corresponding to finger articulations are generally wider, whereas the pick excitation signals are narrower and closer to an ideal impulse. Figure 6.10 plots the average magnitude spectrum for each type of articulation in the data set.

[Figure: average magnitude spectra (dB) vs. frequency (Hz) for forte, mezzo-forte and piano articulations, in two panels.]
Figure 6.10: Average magnitude spectra of signals produced with pick (a) and finger (b).

For each type of articulation (finger or pick), increasing the relative dynamics from piano to forte results in increased high-frequency spectral energy. An interesting observation is that piano finger articulations show a significant high-frequency ripple. This may be attributed to the deliberately slower plucking action used to produce these articulations, where the string slides more slowly off the player's finger. When these signals are used to re-synthesize plucked-guitar tones, they often have a qualitative association with the perceived timbre of the resulting tones. Descriptors such as "brightness" are often used to describe the timbre, which generally increases with the dynamics of the articulation. The varying energy in the plots of Figure 6.10 provides quantitative support for this observation.

6.5.2 Towards an Excitation Codebook

Based on the observations of Figures 6.9 and 6.10, we propose a data-driven approach for modeling excitation signals using principal components analysis (PCA). Employing PCA is motivated by the similar, impulse-like structure of the excitation signals shown in Figure 6.9. As discussed, the fine differences between the derived excitation signals can be attributed to the guitarist's articulation and account, in part, for the spectral characteristics of the perceived tones. These differences can be modeled using a linear combination of basis vectors to provide the desired spectral characteristics. The results of this analysis will be used to develop a codebook that consists of the essential components required to accurately synthesize a multitude of articulation signals. At present, PCA has not yet been applied to modeling the excitation signals for source-filter models of plucked-string instruments. However, PCA has been applied to speech coding applications, in which

principal components are used to model voice-source waveforms, including the complex interactions between the vocal tract and glottis [19, 51]. This section presents the application of PCA to the data set and the development of an excitation codebook using the basis vectors. The re-synthesis of excitation signals corresponding to particular string articulations will also be presented.

6.5.3 Application of Principal Components Analysis

The motivation for applying principal components analysis (PCA) to plucked-guitar excitation signals is to achieve a parametric representation of these signals through statistical analysis. In Section 6.5.1, it was shown that excitation signals corresponding to different articulations share a common impulsive contour but have varying high-frequency details depending on the specific articulation. The goal of PCA is to apply a statistical analysis to this data set that is capable of extracting basis vectors which can model these fine details. By exploiting redundancy in the data set, PCA leads to data reduction for a parametric representation of the signals.

PCA is defined as an orthogonal linear transformation of the data set onto a new coordinate system [13]. The first principal axis in this new space explains the greatest variance in the original data set, the second axis captures the greatest remaining variance, and so on. Figure 6.11 depicts the application of PCA to synthetic data in a two-dimensional space. The vectors v_1 and v_2 define the principal component axes of the data set.

[Figure: 2-D scatter of synthetic data with principal axes v_1 and v_2 overlaid.]
Figure 6.11: Application of principal components analysis to a synthetic data set. The vector v_1 explains the greatest variance in the data, while v_2 explains the remaining greatest variance.

The principal components are found by computing the eigenvalues and eigenvectors of the covariance matrix of the data set [5]. This is the well-known Covariance Method for PCA [13].

The initial step involves formulating a data matrix

P = [\, p_1 \; p_2 \; \ldots \; p_N \,]^T, \qquad (6.7)

where each p_i is an M-length column vector corresponding to a particular excitation signal in the data set. The next step involves computing the covariance matrix of the mean-centered data matrix:

\Sigma = E\left[ (P - u)(P - u)^T \right], \qquad (6.8)

where E is the expectation operator and u = E[P] is the empirical mean of the data matrix. The principal component basis vectors are obtained through an eigenvalue decomposition of \Sigma:

V^{-1} \Sigma V = D, \qquad (6.9)

where V = [v_1 \; v_2 \; \ldots \; v_N] is a matrix of eigenvectors of \Sigma and D is a matrix containing the associated eigenvalues along its main diagonal. The LAPACK linear algebra software package is used to compute the eigenvectors and eigenvalues [2]. The columns of V are sorted in order of decreasing eigenvalues in D such that \lambda_1 > \lambda_2 > \cdots > \lambda_N. This step is performed so that the PC basis vectors are arranged in a manner that explains the most variance in the data set. To reconstruct the excitation signals, the correct linear combination of basis vectors is required. The weights are obtained by projecting the mean-centered data matrix onto the eigenvectors:

W = (P - u) V. \qquad (6.10)
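The covariance-method PCA of Equations 6.7-6.10 in a minimal NumPy sketch; NumPy's `eigh` stands in for the LAPACK routines cited above, and `P` is assumed to hold one equalized, aligned excitation per row.

```python
import numpy as np

def pca_covariance(P):
    """Covariance-method PCA: P is (num_signals, M), one equalized and
    aligned excitation per row (Equations 6.7-6.10)."""
    u = P.mean(axis=0)                      # empirical mean
    Sigma = np.cov(P - u, rowvar=False)     # covariance of centered data (Eq. 6.8)
    evals, V = np.linalg.eigh(Sigma)        # eigendecomposition (Eq. 6.9)
    order = np.argsort(evals)[::-1]         # sort by decreasing variance
    evals, V = evals[order], V[:, order]
    W = (P - u) @ V                         # PC weights / scores (Eq. 6.10)
    return u, V, evals, W

# Explained variance of the leading M' components (cf. Equation 6.12):
# ev = evals[:m_prime].sum() / evals.sum()
```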

Equation 6.10 defines an orthogonal linear transformation of the data onto a new coordinate system defined by the basis vectors. The weight matrix W is defined as

W = [\, w_1 \; w_2 \; \ldots \; w_N \,]^T, \qquad (6.11)

where each w_i is an M-length column vector containing the scores (or weights) pertaining to a particular excitation signal in P. These scores indicate how much each basis vector is weighted when reconstructing the signal, and they are also helpful for visualizing the data, as will be discussed in the next section.

6.5.4 Analysis of PC Weights and Basis Vectors

Principal component analysis of the excitation signals is divided into two groups to separately examine the sets of wound and unwound strings, which have different physical characteristics, as described in Section 6.3. For the set of unwound strings, the recovered excitation signals were normalized to a reference length of M = 57 samples, which is approximately twice the length of the period corresponding to the open 3rd string tuned to 196 Hz. For the set of wound strings, the reference length of the excitation signals was set to M = 91 samples, which is approximately twice the period of the open 6th string tuned to 82.4 Hz. It should be noted that normalization was achieved via downsampling, to avoid truncating significant sections of the excitation signal. Downsampling to the lowest possible frequency in each set of strings also avoids the loss of high-frequency information present in the data set. PCA was applied to both groups of excitation signals using the Covariance Method overviewed in Section 6.5.3. To analyze the compactness of each data set, the explained variance (EV) can be computed using the eigenvalues calculated from PCA:

EV = \frac{\sum_{m=1}^{M'} \lambda_m}{\sum_{m=1}^{M} \lambda_m}, \qquad (6.12)

where M' < M. Figure 6.12 plots the explained variance for the sets of unwound and wound strings, respectively. In both cases, the plots of explained variance suggest that the data is fairly low dimensional. Selecting M' = 2 basis vectors accounts for more than 95% of the variance for the set of

Figure 6.12: Explained variance of the principal components computed for the set of (a) unwound and (b) wound strings.

Selecting M' = 20 basis vectors accounts for > 95% of the variance for the set of unwound strings, while M' = 30 is sufficient for > 95% of the variance in the wound set. For insight on the relationship between the basis vectors and the excitation signals, Figure 6.13 plots the first three basis functions alongside example articulations extracted from the portion of the data set consisting of the 1st, 2nd and 3rd strings. The general, impulsive-like contour is captured by the empirical mean of the data set. In the case of the excitations derived from pick articulations, the plotted basis vectors provide the high frequency components just before and after the main impulse. In the case of the finger articulations, these basis vectors are negatively weighted and serve to widen the main impulse. This relationship agrees with the physical occurrence of plucking a string with a pick versus a finger, since the physical characteristics of each plucking device directly affect the shape of the string. Figure 6.14 shows a similar plot for the 4th, 5th and 6th strings, which have different physical characteristics due to their wound construction. By comparing Figures 6.13 and 6.14, it is evident that the extracted basis vectors are very similar in each case. The difference, however, is in the empirical mean vector, which exhibits a pronounced bump immediately after the main impulse. This feature appears to be characteristic of the articulations produced by the finger, which perhaps reflects the slippage of the wound string off of the finger. Figure 6.15 shows how the data pertaining to the string articulations projects into the space defined by the principal component vectors. Figure 6.15(a) shows the projection of articulations from strings 1-3 along the 1st and 2nd components. This projection shows that the data pertaining to specific articulations have a particular arrangement and grouping in this space.

Figure 6.13: Selected basis vectors extracted from plucked-guitar recordings produced on the 1st, 2nd and 3rd strings.

Figure 6.14: Selected basis vectors extracted from plucked-guitar recordings produced on the 4th, 5th and 6th strings.

Figure 6.15: Projection of guitar excitation signals into the principal component space. Excitations from strings 1-3 (a) and 4-6 (b).

In particular, the axis pertaining to the 1st component correlates with the articulation strength, which increases independently for pick and finger articulations. Similarly, the projection of the data pertaining to strings 4-6 is shown in Figure 6.15(b), which shows a different arrangement, but a similar clustering of the data based on the articulation type.
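Once the weights from Equation 6.10 are available, the view in Figure 6.15 is a simple scatter of the first two score columns; a minimal matplotlib sketch (variable names are assumptions):

    import matplotlib.pyplot as plt

    def plot_projection(W, labels):
        # W: score matrix (Eq. 6.10); labels: articulation class of each signal.
        for lab in sorted(set(labels)):
            idx = [i for i, l in enumerate(labels) if l == lab]
            plt.scatter(W[idx, 0], W[idx, 1], label=lab, s=12)
        plt.xlabel('1st Principal Component')
        plt.ylabel('2nd Principal Component')
        plt.legend()
        plt.show()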

6.5.3 Codebook Design

The plots of explained variance in Figure 6.12 demonstrate the relatively low dimensionality of the extracted guitar excitation signals. Here, we present an approach for designing a codebook to further reduce the number of basis vectors required to accurately reconstruct the excitation signals. This step is advantageous for synthesis systems where it is desirable to faithfully capture the perceptual characteristics of the performer-string interaction while minimizing the amount of data required. Also, this approach separately analyzes the principal component weights for pick and finger articulations to determine the best subset of basis vectors for each group of articulations. This method considers that, while PCA yields basis vectors that successively explain the most variance in the data, certain basis vectors may be more essential for synthesizing a particular articulation, based on the magnitude of the associated weight vector. The codebook design procedure is as follows (a sketch implementing it is given after the histogram discussion below):

1. Compute the weight matrix for the data set using Equation 6.10. A weight vector w = [w_1 w_2 ... w_M] is obtained for each excitation signal in the data set.

2. Take the absolute value of each weight vector w and sort the entries in descending order so that |w_1| > |w_2| > ... > |w_M|.

3. Select the first M_top weights from the sorted weight vector, where M_top is an integer.

4. For each of the M_top weights selected, record the occurrence of the associated principal component vector in a histogram.

5. Using the histogram as a guide, select the subset of L basis vectors having the highest occurrences in the histogram (see Figure 6.16), where L < M. This yields a subset of basis vectors V̂ ⊂ V with V̂ = [v_1 v_2 ... v_L]. These form the codebook entries.

Figure 6.16 shows the histogram computed separately for excitation signals associated with pick and finger articulations. It is interesting to note that the weight frequency as a function of principal component number does not decrease monotonically. This suggests that certain component vectors are more essential than others for representing the ensemble of excitation signals for a particular articulation.
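A hedged NumPy sketch of steps 1-5 (the function name and default parameters are assumptions; M_top = 20 follows the caption of Figure 6.16):

    import numpy as np

    def design_codebook(W, M_top=20, L=15):
        # W: weight matrix with one row of PC scores per excitation signal.
        hist = np.zeros(W.shape[1], dtype=int)
        for w in W:
            # Steps 2-3: rank components by weight magnitude, keep the M_top largest.
            top = np.argsort(np.abs(w))[::-1][:M_top]
            hist[top] += 1                      # Step 4: histogram of occurrences
        codebook = np.argsort(hist)[::-1][:L]   # Step 5: the L most frequent vectors
        return codebook, hist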

Figure 6.16: Histogram of basis vector occurrences generated with M_top = 20.

6.5.4 Codebook Evaluation and Synthesis

After the codebook has been designed, a particular excitation signal can be generated using a desired number of codebook entries (i.e., basis vectors) and the appropriate weighting for each entry. Equation 6.13 presents the synthesis equation

p_i = p̄ + Σ_{m=1}^{L} w_{i,m} v̂_m,    (6.13)

where p̄ is the empirical mean and L indicates the number of codebook entries used for re-synthesis. The weight values are obtained by projecting the excitation signal onto the basis vectors. The number of codebook entries used for synthesis depends on the desired accuracy. Figure 6.17 demonstrates the reconstruction obtained by varying the number of entries. It is clear that using a single entry does not capture the high frequency details found in the reference excitation signal. However, using 10 entries approximates the contour of the signal, and 50 entries captures nearly all of the high frequency information. The reconstruction quality can be summarized for the entire data set by computing the signal-to-noise ratio (SNR) for each signal in the set. SNR is defined as

SNR_dB = 10 log_10 [ Σ_n p(n)^2 / Σ_n (p(n) − p̂(n))^2 ],    (6.14)

where p(n) and p̂(n) are the original and reconstructed signals, respectively. Each excitation signal was reconstructed with a varying number of codebook entries, and the SNR was averaged over all excitations at each number of entries. Additionally, separate codebooks were developed for the signals associated with pick and finger articulations to reduce the error when the number of entries is low. Figure 6.18 summarizes the results of this analysis.
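Equations 6.13 and 6.14 translate directly into code; a minimal sketch (function and variable names are assumptions):

    import numpy as np

    def synthesize(p, u, V, codebook):
        # Reconstruct an excitation signal from L codebook entries (Eq. 6.13).
        Vhat = V[:, codebook]          # subset of basis vectors forming the codebook
        w = Vhat.T @ (p - u)           # weights via projection onto the basis
        return u + Vhat @ w

    def snr_db(p, p_hat):
        # Signal-to-noise ratio of the reconstruction (Eq. 6.14).
        return 10.0 * np.log10(np.sum(p**2) / np.sum((p - p_hat)**2))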

Figure 6.17: Excitation synthesis obtained by varying the number of codebook entries: (a) 1 entry, (b) 10 entries, (c) 50 entries.

It is of note that the SNR computed for finger excitation signals is generally higher than the SNR computed for pick excitations, regardless of the number of codebook entries used. Intuitively, this agrees with previous observations of the excitation signals obtained from our data set. In general, the observed signals pertaining to finger articulations were not as complex as the picked articulations (see Figure 6.1).

Figure 6.18: Computed signal-to-noise ratio as the number of codebook entries used to reconstruct the excitation signals is increased.

Thus, the finger articulations may be more accurately represented with fewer components. The results presented in Figure 6.18 make a strong case for applications requiring accurate and expressive synthesis with low data storage requirements. The initial PCA analysis yielded 57 basis vectors (for strings 1-3), each with a length of 57 samples. From Figure 6.18, it is evident that the SNR of the reconstruction increases only marginally when more than 15 codebook entries are used. Fifteen codebook entries require only 26% (15/57) of the data obtained from the initial PCA, which significantly reduces the amount of storage required. At a 16-bit quantization level, 15 codebook entries would require approximately 1.67 kilobytes of storage (15 x 57 samples x 2 bytes = 1710 bytes), which is a modest requirement considering the storage capacities of present-day personal computers and mobile computing devices.

6.6 Nonlinear PCA for Expressive Guitar Synthesis

While the linear PCA technique presented in the previous section provides intuition on the underlying basis functions comprising our data set, it is unclear how exactly the high dimensional component space relates to the expressive attributes of our data. As shown in Figure 6.15, there is a nonlinear arrangement of the data along the axes pertaining to the first two principal components. Moreover, as additional components are needed to accurately reconstruct the source signals, simply sampling the space defined by the first two components is not sufficient for high quality synthesis. On the other hand, it is difficult to visualize and infer the underlying structure of the data by projecting it along additional components.

In this section, we explore the application of nonlinear principal components analysis (NLPCA) to the data extracted from linear PCA in order to derive a low dimensional representation of the data. We show that the reduced dimensional space derived using NLPCA explains the expressive attributes of the excitation signals in the data set. Moreover, this low dimensional representation can be inverted and is therefore well suited for use as an expressive controller driving the original linear components.

6.6.1 Nonlinear Dimensionality Reduction

There are many techniques available in the literature for nonlinear dimensionality reduction, or manifold learning, for the purposes of discovering the underlying nonlinear characteristics of high dimensional data. Such techniques include locally linear embedding (LLE) [65] and Isomap [78]. While LLE and Isomap are useful for data reduction and visualization tasks, their application does not provide an explicit mapping function to project the reduced dimensionality data back into the high dimensional space. For the purpose of developing an expressive control interface, re-mapping the data back into the original space is essential, since we wish to use our linear basis vectors to reconstruct the excitation pulses. To satisfy this requirement, we employ NLPCA via autoassociative neural networks (ANN) to achieve dimensionality reduction with explicit re-mapping functions.

Figure 6.19: Architecture for an autoassociative neural network, consisting of input, mapping, bottleneck, de-mapping and output layers connected by the transformations T_1-T_4.

The standard architecture for an ANN is shown in Figure 6.19 and consists of five layers [34]. The input and mapping layers can be viewed as the extraction function, since they project the input variables into the lower dimensional space specified by the bottleneck layer. The de-mapping and output layers comprise the generation function, which projects the data back into its original dimensionality. Using Figure 6.19 as an example, the ANN can be specified by listing the number of nodes at each layer. The nodes in the mapping and de-mapping layers contain sigmoidal functions and are essential for compressing and decompressing the range of the data to and from the bottleneck layer. An example sigmoidal function is the hyperbolic tangent, which compresses values from the range (−∞, ∞) to (−1, 1). Since the desired values at the bottleneck layer are unknown, direct supervised training cannot be used to learn the mapping and de-mapping functions. Rather, the combined network is learned using back-propagation algorithms to minimize a squared error criterion E = (1/2)‖w − ŵ‖² [34]. From a practical standpoint, the mapping functions are essentially a set of transformation matrices that compress (T_1, T_2) and decompress (T_3, T_4) the dimensionality of the data.

6.6.2 Application to Guitar Data

To uncover the nonlinear structure of the guitar features extracted in Section 6.5.4, NLPCA was applied using 25 scores from the linear components analysis at the input layer of the ANN. Empirically, we found that using 25 scores was sufficient in terms of adequately describing the data and expediting the ANN training. As discussed in Section 6.5.4, 25 linear PCA vectors explain > 95% of the variance in the data set and lead to good re-synthesis. At the bottleneck layer of the ANN, we chose two nodes in order to have multiple degrees of freedom which could be used to synthesize excitation pulses in an expressive control interface. These design criteria yielded an ANN architecture with 25 input and output nodes and a two-node bottleneck layer, which was trained using the NLPCA MATLAB Toolbox [67].
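As a minimal sketch of this architecture (an assumption-laden stand-in for the NLPCA MATLAB Toolbox, using scikit-learn's MLPRegressor trained as an autoassociator; note that MLPRegressor also applies tanh at the bottleneck layer, which may differ from the toolbox, and the mapping-layer size of 10 is an assumption):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def train_nlpca(W25, mapping_nodes=10):
        # Five-layer autoassociator: 25 -> mapping -> 2 -> de-mapping -> 25.
        net = MLPRegressor(hidden_layer_sizes=(mapping_nodes, 2, mapping_nodes),
                           activation='tanh', max_iter=5000)
        net.fit(W25, W25)    # train the network to reproduce its own input
        return net

    def bottleneck(net, W25):
        # Extraction function: forward pass through the mapping layers (T1, T2).
        h = np.tanh(W25 @ net.coefs_[0] + net.intercepts_[0])
        return np.tanh(h @ net.coefs_[1] + net.intercepts_[1])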

Figure 6.20 compares the projection of the data into the linear component space with the reduced dimension space defined by the bottleneck layer of the ANN. Unlike the linear projection shown in Figure 6.20(a), the bottleneck layer of the NLPCA shown in Figure 6.20(b) has unwrapped the nonlinear arrangement of the data so that it is now clustered about linear axes. Figure 6.21 shows an additional linear rotation applied to this new space for a clearer view of how the axes relate to the data set. By examining this space, the data is clearly organized around the orthogonal z_1 and z_2 axes. Selected excitation pulses are also shown, which were synthesized by sampling this coordinate space, projecting back into the linear principal component domain using the transformation matrices (T_3, T_4) from the ANN, and using the resulting scores to reconstruct the pulse with the linear component vectors.

Figure 6.20: Top: Projection of excitation signals into the space defined by the first two linear principal components. Bottom: Projection of the linear PCA weights along the axes defined by the bottleneck layer of the trained ANN.

The nonlinear component defined by the z_1 axis describes the articulation type: points sampled in the space z_1 < 0 pertain to finger articulations, and points sampled for z_1 > 0 pertain to pick articulations. The finger articulations feature a wider excitation pulse, in contrast to the pick, where the pulse is generally more narrow and impulsive. In both articulation spaces, moving from left to right increases the relative dynamics. The second nonlinear component defined by the z_2 axis relates to the contact time of the articulation. As z_2 is increased, the excitation pulse width increases for both articulation types.
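The sampling-and-inversion step can be sketched as follows, reusing the trained network from the earlier sketch (u is the empirical mean and V25 holds the first 25 linear basis vectors as columns; all names are assumptions):

    import numpy as np

    def z_to_pulse(z, net, u, V25):
        # Generation function: de-mapping (T3) and output (T4) layers of the ANN,
        h = np.tanh(z @ net.coefs_[2] + net.intercepts_[2])
        scores = h @ net.coefs_[3] + net.intercepts_[3]   # 25 linear PC scores
        # followed by reconstruction with the linear PCA basis (cf. Eq. 6.13).
        return u + V25 @ scores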

Figure 6.21: Guitar data projected along the orthogonal principal axes defined by the ANN (center). Example excitation pulses resulting from sampling this space are also shown.

6.6.3 Expressive Control Interface

We demonstrate the practical application of this research in a touch-based iPad interface shown in Figure 6.22. This interface acts as a tabletop guitar, where the performer uses one hand to provide the articulation and the other to key in the desired pitch(es). The articulation is applied to the large, gradient square in Figure 6.22, which is a mapping of the reduced dimensionality space shown in Figure 6.21. Moving up along the vertical axis of the articulation space increases the dynamics of the articulation (piano to forte), and moving right to left on the horizontal axis increases the contact time. The articulation area is capable of multi-touch input, so the performer can use multiple fingers within the articulation area to give each tone a unique timbre. The colored keys on the left side of Figure 6.22 allow the user to produce certain pitches.
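The mapping from a touch point to the articulation parameters can be sketched directly from the description above (the coordinate conventions and names are assumptions):

    def touch_to_articulation(x, y):
        # x, y: normalized touch coordinates in [0, 1] within the articulation area.
        strength = y          # moving up increases the dynamics (piano to forte)
        contact = 1.0 - x     # moving right to left increases the contact time
        return strength, contact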

Figure 6.22: Tabletop guitar interface for the components-based excitation synthesis. The articulation is applied in the gradient rectangle (articulation strength versus contact time), while the colored squares allow the performer to key in specific pitches.

Adjacent keys on the horizontal axis are tuned a half step apart, and their color indicates that they are part of the same string, so that only the leading key on the string can be played at once. Diagonal keys on adjacent strings are tuned to a Major 3rd interval, while the off-diagonal keys represent a Minor 3rd interval. This arrangement allows the performer to easily finger different chord shapes. The synthesis engine for the tabletop interface is capable of computing the excitation signal corresponding to the performer's touch point within the articulation space and filtering the resulting excitation signal for multiple tones in real time. The filter module used for the string is implemented with the single delay-loop model shown in Figure 6.1. Though this filter has a large number of delay taps, which depends on the pitch, only a few of these taps have non-zero coefficients, which permits an efficient implementation of the infinite impulse response filtering. Currently, the relative plucking position along the string is fixed, though this may become a free parameter in future versions of the application.
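A minimal sketch of such a string filter (a Karplus-Strong-style single delay loop with a two-point averaging loop filter, used here as an illustrative stand-in for the thesis' loop filter design; parameter names are assumptions):

    import numpy as np

    def sdl_string(excitation, pitch_hz, fs=44100, n_samples=44100, g=0.996):
        # Single delay loop: y[n] = x[n] + g * 0.5 * (y[n-L] + y[n-L-1]).
        # Only two of the feedback taps are non-zero, so the IIR loop stays cheap.
        L = int(round(fs / pitch_hz))      # loop delay in samples, set by the pitch
        y = np.zeros(n_samples)
        x = np.zeros(n_samples)
        x[:min(len(excitation), n_samples)] = excitation[:n_samples]
        for n in range(n_samples):
            if n - L - 1 >= 0:
                y[n] = x[n] + g * 0.5 * (y[n - L] + y[n - L - 1])
            elif n - L >= 0:
                y[n] = x[n] + g * 0.5 * y[n - L]
            else:
                y[n] = x[n]
        return y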
The excitation signal can be updated in real time during performance, which is made possible by the iPad's support of hardware-accelerated vector libraries. These include the matrix multiplication routines used to project the low dimensional user input into the high dimensional component space. Through our own testing, we found that the excitation signal is typically computed