Applications of Music Processing


Lecture Music Processing: Applications of Music Processing. Christian Dittmar, International Audio Laboratories Erlangen, christian.dittmar@audiolabs-erlangen.de

Singing Voice Detection. Important pre-requisite for: music segmentation, music thumbnailing (preview version), singing voice transcription, singing voice separation, lyrics alignment, and lyrics recognition.

Singing Voice Detection. Detect singing voice activity during the course of a recording. Assumptions: real-world, polyphonic music recordings are analyzed, and the singing voice performs the dominant melody above the accompaniment. (Figure: recording with time axis in seconds.)

Singing Voice Detection. Challenges: complex characteristics of the singing voice; large diversity of accompaniment music; the accompaniment may play the same melody as the singing; pitch-fluctuating instruments may be similar to singing (stable pitch vs. fluctuating pitch).

Singing Voice Detection. Common approach: frame-wise extraction of audio features, followed by classification via machine learning. (Figure: feature sequence with time axis in seconds.)

Audio Feature Extraction. Frame-wise processing: hopsize Q, blocksize K, window function w(n), signal frame x(n). Compute for each analysis frame: time-domain features, spectral features, cepstral features, and others.
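
The frame-wise processing can be sketched in numpy; the blocksize, hopsize, and Hann window below are illustrative choices, not values from the lecture.

```python
import numpy as np

def frame_signal(x, K=2048, Q=512):
    """Split signal x into overlapping frames of blocksize K and hopsize Q,
    each multiplied by a window function w(n) (here: Hann)."""
    w = np.hanning(K)
    n_frames = 1 + (len(x) - K) // Q
    frames = np.stack([x[i * Q : i * Q + K] * w for i in range(n_frames)])
    return frames  # shape (n_frames, K); features are computed per row

x = np.random.randn(44100)       # one second of noise at 44.1 kHz
frames = frame_signal(x)
print(frames.shape)              # (83, 2048)
```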

Audio Feature Extraction. Time-domain features: the Zero Crossing Rate (ZCR) discriminates high-pitched vs. low-pitched content; Linear Prediction Coefficients (LPC) encode the spectral envelope.
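
A minimal ZCR sketch illustrating the high- vs. low-pitched distinction; the sampling rate and test frequencies are made up for the example.

```python
import numpy as np

def zero_crossing_rate(frame):
    # fraction of sample pairs whose sign differs
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 100 * t)     # low-pitched tone: few crossings
high = np.sin(2 * np.pi * 2000 * t)   # high-pitched tone: many crossings
print(zero_crossing_rate(low), zero_crossing_rate(high))
```

The higher tone crosses zero far more often per frame, so its ZCR is larger.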

Audio Feature Extraction. Spectral features: spectrogram with linear vs. logarithmic frequency spacing; Spectral Flatness (SF), Spectral Centroid (SC), and many others. (Figure: STFT spectrogram [dB] with linear frequency axis vs. Gabor wavelet spectrogram [dB] with logarithmic frequency axis, frequency in Hz over time in seconds.)

Audio Feature Extraction. Cepstral features, with the singing voice as an example. The voice signal is convolutive: excitation * filter, where the excitation is the vibration of the vocal folds and the filter is the resonance of the vocal tract. In the magnitude spectrum the model becomes multiplicative (excitation times filter); in the log-magnitude spectrum it becomes additive (excitation + filter). Liftering then separates the signal into a smooth spectral envelope and the fine-structured excitation. (Figure: observed spectrum, spectral envelope, and excitation spectrum over frequency in Hz.)
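
The liftering chain (log-magnitude, inverse DFT, keep low quefrencies, DFT) can be sketched as follows; the lifter length and the harmonic test tone are illustrative assumptions.

```python
import numpy as np

def spectral_envelope(frame, n_lifter=30):
    """Estimate the smooth spectral envelope by low-pass liftering the
    real cepstrum: log|X| -> inverse DFT -> keep low quefrencies -> DFT."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0            # low quefrencies (smooth envelope)
    lifter[-(n_lifter - 1):] = 1.0     # mirrored half of the real cepstrum
    envelope = np.fft.rfft(cepstrum * lifter).real
    return log_mag, envelope

# Harmonic test tone: fine spectral structure on top of a smooth envelope
fs, f0 = 16000, 200
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * f0 * h * t) / h for h in range(1, 20))
log_mag, env = spectral_envelope(frame * np.hanning(1024))
```

The liftered envelope is bandlimited in quefrency, so it varies much more slowly across frequency than the raw log-magnitude spectrum.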

Machine Learning. Application to audio signals: speech recognition, speaker recognition, singing voice detection, genre classification, instrument recognition, chord recognition, etc.

Machine Learning. Learning principles: unsupervised learning finds structures in the data; supervised learning relies on ground truth provided by a human observer; semi-supervised learning combines the above principles; reinforcement learning feeds confident classifications back into the training.

The Feature Space. Geometric and algebraic interpretation of ML problems: features contain numerical values; the concatenation of several features gives the dimensionality M; the data set contains N observations (cardinality N). Illustrative example: SF and SC of six complex tones, computed from the magnitude spectrum s(k), k = 0, ..., K-1:

SF = \frac{\left( \prod_{k=0}^{K-1} s(k) \right)^{1/K}}{\frac{1}{K} \sum_{k=0}^{K-1} s(k)}, \qquad SC = \frac{\sum_{k=0}^{K-1} f(k)\, s(k)}{\sum_{k=0}^{K-1} s(k)}
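
A minimal numpy implementation of the two formulas; the epsilon guards the logarithm against zero bins, and the flat/peaky test spectra are made up for illustration.

```python
import numpy as np

def spectral_flatness(s):
    # geometric mean / arithmetic mean of the magnitude spectrum s(k)
    s = np.asarray(s, dtype=float) + 1e-12
    return np.exp(np.mean(np.log(s))) / np.mean(s)

def spectral_centroid(s, f):
    # magnitude-weighted mean of the bin frequencies f(k)
    return np.sum(f * s) / np.sum(s)

flat = np.ones(512)        # white-noise-like spectrum -> SF close to 1
peaky = np.zeros(512)
peaky[10] = 1.0            # single spectral peak -> SF close to 0
print(spectral_flatness(flat), spectral_flatness(peaky))
print(spectral_centroid(peaky, np.arange(512.0)))  # centroid at bin 10
```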

The Feature Space. Each feature has one value (M = 2); number of observations N = 6.

File                 Spectral Centroid   Spectral Flatness
lpnoisetone.wav           258.62              0.59
noisetone.wav             512.73              0.99
hpnoisetone.wav           550.13              0.92
harmonicnoise.wav         146.50              0.27
pianotone.wav              47.93              0.01
harmonictone.wav           43.95              0.01

The Feature Space. Each feature has one value (M = 2); number of observations N = 6. Mapping of features: SC to the y-axis, SF to the x-axis; scatter plot with unnormalized axes. (Figure: scatter plot of Spectral Flatness vs. Spectral Centroid for lpnoisetone.wav, noisetone.wav, hpnoisetone.wav, harmonicnoisetone.wav, pianotone.wav, harmonictone.wav.)

The Feature Space. Each feature has one value (M = 2); number of observations N = 6. Mapping of features: SC to the y-axis, SF to the x-axis; scatter plot with unnormalized axes. Target class labels are provided by manual annotation:

Spectral Centroid   Spectral Flatness   Target Label
     258.62              0.59                0
     512.73              0.99                0
     550.13              0.92                0
     146.50              0.27                1
      47.93              0.01                1
      43.95              0.01                1

Classification methods. k-Nearest Neighbours (kNN). (Figure: singing voice vs. accompaniment classes with an unknown data point.)

Classification methods. k-Nearest Neighbours (kNN) with different distance measures between feature vectors x and y of dimension M:

L1 distance (Manhattan): d_1 = \sum_{m=1}^{M} |x_m - y_m|
L2 distance (Euclidean): d_2 = \sqrt{\sum_{m=1}^{M} (x_m - y_m)^2}
L∞ distance (Maximum): d_\infty = \max(|x_1 - y_1|, \ldots, |x_M - y_M|)

(Figure: singing voice vs. accompaniment classes with an unknown data point.)
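
A kNN sketch over the two-dimensional SF/SC feature space from the lecture's example table (labels 0 for the noise-like tones, 1 for the harmonic ones); the query points are made up.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3, dist="l2"):
    """Classify feature vector x by majority vote among its k nearest
    training points under an L1, L2, or L-infinity distance."""
    diff = X_train - x
    if dist == "l1":
        d = np.sum(np.abs(diff), axis=1)          # Manhattan
    elif dist == "l2":
        d = np.sqrt(np.sum(diff ** 2, axis=1))    # Euclidean
    else:
        d = np.max(np.abs(diff), axis=1)          # Maximum / Chebyshev
    nearest = np.argsort(d)[:k]
    return np.bincount(y_train[nearest]).argmax()

# Columns: [Spectral Flatness, Spectral Centroid], rows as in the table
X = np.array([[0.59, 258.62], [0.99, 512.73], [0.92, 550.13],
              [0.27, 146.50], [0.01,  47.93], [0.01,  43.95]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.05, 60.0]), X, y))   # near the harmonic tones
```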

Classification methods. Decision Trees (DT). (Figure: singing voice vs. accompaniment classes with an unknown data point.)

Classification methods. Random Forests (RF). (Figure: singing voice vs. accompaniment classes with an unknown data point.)

Classification methods. Gaussian Mixture Models (GMM). (Figure: singing voice vs. accompaniment classes with an unknown data point; each class is modeled by a Gaussian with mean and covariance Σ.)

Classification methods. Gaussian Mixture Models (GMM). (Figure: singing voice vs. accompaniment classes with an unknown data point; each class is modeled by several Gauss components.)

Classification methods. Support Vector Machines (SVM): the class of an unknown point is given by the sign of the decision function, sgn(⟨w, x⟩ + b). (Figure: singing voice vs. accompaniment classes with an unknown data point.)

Classification methods. Deep Neural Networks (DNN). (Figure: singing voice vs. accompaniment classes with an unknown data point.)

Classification methods. Deep Neural Networks (DNN), trained by minimizing a loss function. (Figure: singing voice vs. accompaniment classes with an unknown data point.)

Classification methods. Further methods: Hidden Markov Models (transition probabilities between GMMs), Sparse Representation Classifiers (sparse linear combination of training data), Boosting (combining many weak classifiers), Convolutional Neural Networks, Recurrent Neural Networks, Multiple Kernel Learning, and others.

Singing Voice Detection. Mel-scale Frequency Cepstral Coefficients: each signal frame is passed through a filter bank, yielding a feature vector x_t. A Gaussian Mixture Model (GMM) with G components models the feature distribution:

p(x) = \sum_{g=1}^{G} w_g \, \mathcal{N}(x; \mu_g, \Sigma_g)

Segment-by-segment classification over windows of W frames: a segment is labeled singing (V) if \sum_{i=0}^{W-1} \log p(x_{t+i} \mid S) exceeds \sum_{i=0}^{W-1} \log p(x_{t+i} \mid M), and accompaniment (N) otherwise, where S and M denote the models trained on singing and on accompaniment, respectively.
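
The segment-wise likelihood comparison can be sketched as below; the one-component "GMMs" with hand-set parameters and the 2-D MFCC-like features are stand-ins for trained models, purely for illustration.

```python
import numpy as np

def gmm_logpdf(X, weights, means, variances):
    """Log-density of a diagonal-covariance GMM: p(x) = sum_g w_g N(x; mu_g, var_g)."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = -0.5 * np.sum((X - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
        logs.append(np.log(w) + ll)
    return np.logaddexp.reduce(logs, axis=0)

def classify_segment(X, gmm_singing, gmm_accomp):
    """Sum the frame log-likelihoods over the segment; pick the larger total."""
    ll_s = gmm_logpdf(X, *gmm_singing).sum()
    ll_m = gmm_logpdf(X, *gmm_accomp).sum()
    return "singing" if ll_s > ll_m else "accompaniment"

# Hypothetical models and a 20-frame segment drawn near the singing model
singing = ([1.0], [np.array([2.0, 2.0])], [np.array([1.0, 1.0])])
accomp  = ([1.0], [np.array([-2.0, -2.0])], [np.array([1.0, 1.0])])
X = np.random.default_rng(0).normal(2.0, 1.0, size=(20, 2))
print(classify_segment(X, singing, accomp))
```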

Audio Mosaicing. Target signal: Beatles, "Let it Be". Source signal: bees. Mosaic signal: "Let it Bee".

NMF-Inspired Audio Mosaicing [Driedger et al., ISMIR 2015]. In non-negative matrix factorization (NMF), a fixed non-negative matrix is approximated by a product of components and activations, both of which are learned. In the proposed audio mosaicing approach, the components are instead fixed to the source's spectrogram (frequency x time_source), and only the activation matrix (time_source x time_target) is learned, so that their product approximates the target's spectrogram and yields the mosaic's spectrogram (frequency x time_target).
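
The fixed-template core of this idea can be sketched with standard KL-divergence multiplicative updates; the toy matrices are made up, and the diagonal-structure constraints of Driedger et al. are omitted here.

```python
import numpy as np

def learn_activations(V, W, n_iter=200, eps=1e-9):
    """NMF with the template matrix W held fixed (columns = source spectra);
    only the activations H are updated, via the standard KL multiplicative rule."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    return H

# Toy example: two source spectra; the target mixes them over time
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V_target = np.array([[2.0, 0.0],
                     [0.0, 3.0],
                     [2.0, 3.0]])
H = learn_activations(V_target, W)
print(np.round(W @ H, 2))   # close to V_target
```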

Basic NMF-Inspired Audio Mosaicing. (Figure: the spectrogram of the target, frequency x time_target, is approximated by the spectrogram of the source, frequency x time_source, multiplied by the activation matrix, time_source x time_target, giving the spectrogram of the mosaic.)

Basic NMF-Inspired Audio Mosaicing. The activation matrix is learned by iterative updates. To preserve temporal context, the core idea is to support the development of sparse, diagonal activation structures. (Figure: as before, with diagonal structures emerging in the activation matrix.)


Audio Mosaicing. Target signal: Chic, "Good Times". Source signal: whales. Mosaic signal: https://www.audiolabs-erlangen.de/resources/mir/2015-ismir-letitbee

Audio Mosaicing. Target signal: Adele, "Rolling in the Deep". Source signal: race car. Mosaic signal: https://www.audiolabs-erlangen.de/resources/mir/2015-ismir-letitbee

Drum Source Separation

Drum Source Separation. Signal model: the magnitude spectrogram V, obtained via the STFT, is decomposed into drum components, and the component signals are reconstructed via the inverse STFT (iSTFT). (Figure: log-frequency spectrogram and relative amplitude over time in seconds.)

Drum Sound Separation. Decomposition via NMFD (Non-negative Matrix Factor Deconvolution): score-based information (drum notation) initializes the rows of H, and audio-based information (training drum sounds) initializes the lateral slices of W. (Figure: log-frequency template slices and activations over time in seconds.)
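
NMFD extends NMF to a convolutive model, V ≈ Σ_τ W(:,:,τ) · (H shifted right by τ), so each component template spans several spectrogram frames. A sketch of just this reconstruction (the forward model), with made-up toy shapes:

```python
import numpy as np

def nmfd_reconstruct(W, H):
    """Convolutive NMF model: frame n of V receives template frame tau of
    each component, activated tau frames earlier."""
    K, R, T = W.shape            # bins x components x template frames
    _, N = H.shape               # components x time frames
    V = np.zeros((K, N))
    for tau in range(T):
        H_shift = np.zeros_like(H)
        H_shift[:, tau:] = H[:, :N - tau]    # shift activations right by tau
        V += W[:, :, tau] @ H_shift
    return V

# One two-frame drum template: low-frequency hit, then high-frequency decay
W = np.zeros((2, 1, 2))
W[:, 0, 0] = [1.0, 0.0]
W[:, 0, 1] = [0.0, 1.0]
H = np.array([[1.0, 0.0, 0.0, 1.0, 0.0]])    # onsets at frames 0 and 3
V = nmfd_reconstruct(W, H)
print(V)
```

Each onset in H stamps the full two-frame template into V, which is why NMFD can capture the characteristic temporal evolution of a drum hit.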

Drum Sound Separation. (Figure: log-frequency spectrogram and relative amplitude of the separated components over time in seconds.) https://www.audiolabs-erlangen.de/resources/mir/2016-ieee-taslp-drumseparation