Singing Voice Detection. Applications of Music Processing.


Singing Voice Detection
Lecture: Music Processing, Applications of Music Processing
Christian Dittmar, International Audio Laboratories Erlangen
christian.dittmar@audiolabs-erlangen.de

Singing voice detection is an important prerequisite for:
- Music segmentation
- Music thumbnailing (preview version)
- Singing voice transcription
- Singing voice separation
- Lyrics alignment
- Lyrics recognition

Singing Voice Detection
Goal: detect singing voice activity during the course of a recording.
Assumptions:
- Real-world, polyphonic music recordings are analyzed
- The singing voice performs the dominant melody above the accompaniment
Challenges:
- Complex characteristics of the singing voice
- Large diversity of accompaniment; instruments may play the same melody as the singing
- Pitch-fluctuating instruments may be similar to the singing voice
[Figure: spectrogram excerpts contrasting a stable instrument pitch with the fluctuating pitch of a singing voice over time]

Audio Feature Extraction
Common approach: frame-wise extraction of audio features, followed by classification via machine learning.
Frame-wise processing: the signal is cut into frames x(n) of blocksize K, advanced by hopsize Q, and weighted with a window function w(n).
For each analysis frame, compute:
- Time-domain features
- Spectral features
- Cepstral features
- others
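The frame-wise processing described above can be sketched as follows; the Hann window and the concrete parameter values are illustrative choices, not taken from the lecture:

```python
import numpy as np

def frame_signal(x, blocksize=2048, hopsize=512):
    """Cut a mono signal into overlapping, windowed analysis frames.

    Returns an array of shape (num_frames, blocksize); each frame is
    weighted by a Hann window w(n). Parameter names follow the slide
    (blocksize K, hopsize Q).
    """
    w = np.hanning(blocksize)                      # window function w(n)
    num_frames = 1 + (len(x) - blocksize) // hopsize
    frames = np.empty((num_frames, blocksize))
    for t in range(num_frames):
        start = t * hopsize
        frames[t] = x[start:start + blocksize] * w
    return frames

# Usage: one second of a 440 Hz tone at 22050 Hz sampling rate
sr = 22050
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
frames = frame_signal(x)
print(frames.shape)  # (40, 2048)
```

Each row of `frames` is then fed to the feature extractors discussed next.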

Audio Feature Extraction
Time-domain features:
- Zero Crossing Rate (ZCR): discriminates high-pitched from low-pitched content

Spectral features:
- Can be computed on a linear or a logarithmic frequency spacing
[Figure: STFT magnitude (dB) vs. Gabor wavelet magnitude (dB) of the same excerpt]
- Spectral Flatness (SF), Spectral Centroid (SC), and many others
- Linear Prediction Coefficients (LPC): encode the spectral envelope

Cepstral features (singing voice as an example):
- Source-filter model: the signal is the convolution of an excitation (vibration of the vocal folds) with a filter (resonance of the vocal tract)
- In the magnitude spectrum, excitation and filter are multiplicative; in the log-magnitude spectrum, they are additive
- Liftering separates the log-magnitude spectrum into a smooth spectral envelope and a fine-structured excitation
[Figure: observed log-magnitude spectrum, extracted spectral envelope, and excitation spectrum]
Applications to audio signals: speech recognition, speaker recognition, singing voice detection, genre classification, instrument recognition, chord recognition, etc.

Machine Learning
Learning principles:
- Unsupervised learning: find structures in the data
- Supervised learning: a human observer provides the ground truth
- Semi-supervised learning: a combination of the above principles
- Reinforcement learning: feedback of confident classifications into the training

The Feature Space
Geometric and algebraic interpretation of machine-learning problems:
- Features contain numerical values; concatenating several features gives the dimensionality M
- The data set contains N observations (cardinality N)

Illustrative example: SF and SC of 6 complex tones, with magnitude spectrum s(k) and bin frequencies f(k), k = 1, ..., K:

  SF = ( prod_{k=1}^{K} s(k) )^{1/K} / ( (1/K) sum_{k=1}^{K} s(k) )

  SC = sum_{k=1}^{K} f(k) s(k) / sum_{k=1}^{K} s(k)
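A minimal sketch of three of the features named above (ZCR, SC, SF), computed from plain STFT magnitude spectra; the function names and the 440 Hz test signals are my own illustrative choices:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign differs (ZCR)."""
    return np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))

def spectral_centroid(s, f):
    """SC = sum_k f(k) s(k) / sum_k s(k): the spectrum's center of mass."""
    return np.sum(f * s) / np.sum(s)

def spectral_flatness(s, eps=1e-12):
    """SF = geometric mean / arithmetic mean of the magnitude spectrum;
    high for noise-like frames, near 0 for tonal frames."""
    s = s + eps  # guard against log(0)
    return np.exp(np.mean(np.log(s))) / np.mean(s)

# Usage: contrast a white-noise frame with a 440 Hz tone frame.
sr, K = 22050, 2048
f = np.fft.rfftfreq(K, d=1.0 / sr)             # frequency axis in Hz
noise = np.random.default_rng(0).standard_normal(K)
tone = np.sin(2 * np.pi * 440 * np.arange(K) / sr)
s_noise = np.abs(np.fft.rfft(noise * np.hanning(K)))
s_tone = np.abs(np.fft.rfft(tone * np.hanning(K)))
print(spectral_flatness(s_noise), spectral_flatness(s_tone))   # high vs. near 0
print(spectral_centroid(s_noise, f), spectral_centroid(s_tone, f))
print(zero_crossing_rate(noise), zero_crossing_rate(tone))
```

The noise frame yields a much higher flatness, centroid, and ZCR than the tone frame, which is exactly the separation the six-tone example above relies on.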

The Feature Space
Each feature contributes one value per observation, so M = 2; the number of observations is N = 6:

  File                     Centroid   Flatness
  lpnoisetone.wav            258.62        .59
  noisetone.wav               52.73        .99
  hpnoisetone.wav             55.3         .92
  harmonicnoisetone.wav       46.5         .27
  pianotone.wav               47.93        .
  harmonictone.wav            43.95        .

Mapping of features: plot SC on the y-axis and SF on the x-axis.
[Figure: scatter plot of Spectral Flatness vs. Spectral Centroid with unnormalized axes; the six tones form separate clusters]

Target class labels are provided by manual annotation.

Classification methods:
- k-Nearest Neighbours (kNN)
- Decision Trees (DT)

Distance measures used by kNN:

  L1 distance (Manhattan):  d_1(x, y) = sum_{m=1}^{M} |x_m - y_m|
  L2 distance (Euclidean):  d_2(x, y) = ( sum_{m=1}^{M} (x_m - y_m)^2 )^{1/2}
  Linf distance (Maximum):  d_inf(x, y) = max( |x_1 - y_1|, ..., |x_M - y_M| )
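A kNN classifier with the three distance measures above can be sketched on a toy 2-D feature space; the feature values and class labels below are made up for illustration and do not reproduce the six-tone example:

```python
import numpy as np

# Hypothetical 2-D feature space (flatness, centroid in Hz) with made-up
# values; class 0 = noise-like, class 1 = tonal.
X_train = np.array([[0.99, 520.0], [0.92, 550.0], [0.59, 2580.0],
                    [0.01, 480.0], [0.01, 440.0], [0.27, 460.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x, X, y, k=3, metric="l2"):
    """Majority vote among the k training points nearest to x,
    using one of the three distance measures from the slide."""
    diff = X - x
    if metric == "l1":      # Manhattan
        d = np.sum(np.abs(diff), axis=1)
    elif metric == "l2":    # Euclidean
        d = np.sqrt(np.sum(diff ** 2, axis=1))
    else:                   # "linf": maximum / Chebyshev
        d = np.max(np.abs(diff), axis=1)
    nearest = np.argsort(d)[:k]
    return np.bincount(y[nearest]).argmax()

print(knn_predict(np.array([0.95, 530.0]), X_train, y_train))  # 0 (noise-like)
```

Note that with unnormalized axes, as in the scatter plot above, the centroid (hundreds of Hz) dominates all three distances over the flatness (0 to 1); in practice the features would be normalized first.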

Further classifiers:
- Random Forests (RF): ensembles of decision trees
- Gaussian Mixture Models (GMM): weighted sums of Gaussian components with means μ and covariances Σ
- Support Vector Machines (SVM): decision function of the form sgn(w · x + b)
- Deep Neural Networks (DNN): stacked nonlinear layers trained by minimizing a loss function
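A GMM density of the form p(x) = sum_g w_g N(x; μ_g, Σ_g) can be evaluated as follows; diagonal covariances are assumed here purely to keep the sketch short, and all numeric values are illustrative:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of a multivariate Gaussian with diagonal covariance at x."""
    norm = np.prod(2.0 * np.pi * var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))

def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_g w_g N(x; mu_g, Sigma_g), with diagonal Sigma_g."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Usage: a two-component mixture in a 2-D feature space.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
variances = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
print(gmm_pdf(np.array([0.0, 0.0]), weights, means, variances))
```

Fitting the weights, means, and covariances to data is normally done with the EM algorithm, which is omitted here.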

Further methods:
- Hidden Markov Models: transition probabilities between GMMs
- Sparse Representation Classifier: sparse linear combination of training data
- Boosting: combine many weak classifiers
- Convolutional Neural Networks
- Recurrent Neural Networks
- Multiple Kernel Learning
- others

Singing Voice Detection with Mel-scale Cepstral Coefficients
Each frame is passed through a Mel filter bank, yielding a feature vector x_t. Classification proceeds segment by segment, producing a vocal/non-vocal label sequence such as V V V N N N N N N N N N N V V V V N N. A segment of W frames is labeled as singing when its average log-likelihood under the singing model S exceeds that under the background model:

  (1/W) sum_i log p(x_i | S)  vs.  (1/W) sum_i log p(x_i | background)

Both likelihoods are modeled with a Gaussian Mixture Model (GMM):

  p(x) = w_1 N(x; μ_1, Σ_1) + w_2 N(x; μ_2, Σ_2) + ... + w_G N(x; μ_G, Σ_G)

Audio Mosaicing
NMF-inspired audio mosaicing [Driedger et al., ISMIR 2015]:
- Example: target signal: Beatles, "Let It Be"; source signal: buzzing bees
- Non-negative matrix factorization (NMF) approximates a non-negative matrix (fixed) as the product of components and activations (both learned)
- Proposed audio mosaicing approach: the target's spectrogram (fixed) is approximated as the source's spectrogram (fixed) times learned activations, yielding the mosaic's spectrogram
- Mosaic signal: "Let It Bee"
- The iterative updates preserve temporal context; the core idea is to support the development of sparse diagonal activation structures
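The mosaicing idea, templates fixed to source frames and only the activations learned, can be sketched with the standard multiplicative NMF update for the KL divergence; the additional constraints from Driedger et al. that promote sparse diagonal activation structures are omitted here:

```python
import numpy as np

def learn_activations(V, W, iters=300, eps=1e-9):
    """Approximate the target spectrogram V by W @ H with W held fixed.

    W's columns are magnitude-spectrum frames taken from the source
    recording (the templates); only the activations H are learned, via
    the standard multiplicative update for the KL divergence."""
    rank, n_frames = W.shape[1], V.shape[1]
    H = np.random.default_rng(0).random((rank, n_frames)) + eps
    col_sums = W.sum(axis=0, keepdims=True).T       # shape (rank, 1)
    for _ in range(iters):
        H *= (W.T @ (V / (W @ H + eps))) / (col_sums + eps)
    return H

# Toy usage: the target V is exactly representable by two source frames.
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])   # three source frames as templates
V = np.array([[2.0, 0.0],
              [0.0, 3.0]])        # two target frames
H = learn_activations(V, W)
print(np.round(W @ H, 2))         # close to V
```

In the real mosaicing application, W @ H gives the mosaic's magnitude spectrogram, which is converted back to a time-domain signal.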

Audio Mosaicing: further examples
- Target signal: Chic, "Good Times"; source signal: whale sounds
- Target signal: Adele, "Rolling in the Deep"; source signal: race car
Mosaic signals: https://www.audiolabs-erlangen.de/res/ir/25-isir-letitbee

Drum Source Separation
Signal model: the mixture's STFT magnitude V is decomposed into component spectrograms (e.g., one per drum instrument), which are converted back to the time domain via the inverse STFT.
[Figure: log-frequency spectrograms of the mixture and the separated components (relative amplitude over time)]

Drum Sound Separation
Decomposition via NMFD (non-negative matrix factor deconvolution), which can be guided by:
- Score-based information (drum notation)
- Audio-based information (training drum sounds)
The time-varying templates are lateral slices from the tensor W; the rows of H hold the corresponding activations over time.
[Figure: log-frequency spectrogram with NMFD template slices W and activation rows of H]
Audio examples: https://www.audiolabs-erlangen.de/res/ir/26-ieee-taslp-drumseparation
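NMFD extends NMF to time-varying templates: each template is a short spectrogram snippet W[tau], and the model is V ≈ sum_tau W[tau] · shift(H, tau), where shift moves the activations tau frames to the right. A sketch of just this convolutive reconstruction (the learning updates are omitted, and the toy template is invented for illustration):

```python
import numpy as np

def nmfd_reconstruct(W, H):
    """Convolutive reconstruction V_hat = sum_tau W[tau] @ shift(H, tau).
    W: (T, n_bins, rank) time-varying templates; H: (rank, n_frames)."""
    T, n_bins, rank = W.shape
    n_frames = H.shape[1]
    V_hat = np.zeros((n_bins, n_frames))
    for tau in range(T):
        H_shift = np.zeros_like(H)
        H_shift[:, tau:] = H[:, :n_frames - tau]  # shift right by tau frames
        V_hat += W[tau] @ H_shift
    return V_hat

# Toy usage: one two-frame drum template, triggered once at frame 1.
W = np.zeros((2, 3, 1))
W[0, 0, 0] = 1.0   # template frame 0: energy in bin 0 (attack)
W[1, 1, 0] = 1.0   # template frame 1: energy in bin 1 (decay)
H = np.array([[0.0, 1.0, 0.0, 0.0]])
V_hat = nmfd_reconstruct(W, H)
print(V_hat)
```

A single activation thus "stamps" the whole multi-frame template into the reconstruction, which is what lets NMFD model the attack-decay structure of drum sounds.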