Deep learning architectures for music audio classification: a personal (re)view
Jordi Pons (jordipons.me, @jordiponsdotme)
Music Technology Group, Universitat Pompeu Fabra, Barcelona

Acronyms (the following slides assume you know these concepts!):
- MLP: multi-layer perceptron (feed-forward neural network)
- RNN: recurrent neural network
- LSTM: long short-term memory
- CNN: convolutional neural network
- BN: batch normalization

Outline
- Chronology: the big picture
- Audio classification: state-of-the-art review
- Music audio tagging as a study case

Deep learning & music papers: milestones
[bar chart: # papers per year, y-axis 0 to 80]
- MLP from symbolic data for automatic music composition (Lewis, 1988)
- RNN from symbolic data for automatic music composition (Todd, 1988)
- LSTM from symbolic data for automatic music composition (Eck and Schmidhuber, 2002)
- MLP learns from spectrograms for note onset detection (Marolt et al., 2002)
- CNN learns from spectrograms for music audio classification (Lee et al., 2009)
- End-to-end learning for music audio classification (Dieleman et al., 2014)

Deep learning & music papers: data trends
[same # papers chart, annotated with the data trends over time: symbolic data, spectrograms, raw audio data]

Deep learning & music papers: some references
- Dieleman et al., 2014. "End-to-end learning for music audio," in International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Lee et al., 2009. "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems (NIPS).
- Marolt et al., 2002. "Neural networks for note onset detection in piano music," in Proceedings of the International Computer Music Conference (ICMC).
- Eck and Schmidhuber, 2002. "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," in Proceedings of the Workshop on Neural Networks for Signal Processing.
- Todd, 1988. "A sequential network design for musical applications," in Proceedings of the Connectionist Models Summer School.
- Lewis, 1988. "Creation by Refinement: A creativity paradigm for gradient descent learning networks," in International Conference on Neural Networks.

Outline
- Chronology: the big picture
- Audio classification: state-of-the-art review
- Music audio tagging as a study case

The deep learning pipeline
input (waveform, or any audio representation!) → machine learning / deep learning model → output (phonetic transcription, describe music with tags, event detection)

The deep learning pipeline
input (waveform, or any audio representation!) → front-end → back-end → output (phonetic transcription, describe music with tags, event detection)

The deep learning pipeline: input?

How to format the input (audio) data?
- Waveform: end-to-end learning
- Pre-processed waveform, e.g. spectrogram

The deep learning pipeline: front-end?
input (waveform or spectrogram) → front-end: ?

Three questions guide the front-end design:
- based on domain knowledge?
- filters config?
- input signal? (waveform or pre-processed waveform)

CNN front-ends for audio classification
- Waveform (end-to-end learning): sample-level, a stack of small 3x1 filters
- Pre-processed waveform (e.g. spectrogram): small-rectangular filters, a stack of 3x3 filters

based on domain knowledge? | filters config?           | input signal?          | front-end
no                         | minimal filter expression | waveform               | sample-level (3x1, 3x1, ..., 3x1)
no                         | minimal filter expression | pre-processed waveform | small-rectangular filters (3x3, 3x3, ..., 3x3)
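To make the "minimal filter expression" idea concrete, below is a minimal PyTorch sketch of a sample-level waveform front-end in the spirit of Lee et al. (2017); the filter counts, depth, and input length are illustrative assumptions, not the paper's exact configuration. The small-rectangular variant is analogous, with stacked 3x3 nn.Conv2d layers over a spectrogram.

```python
# Sample-level front-end: short 1-D convolutions (kernel size 3) with
# max-pooling, applied directly to raw audio (sketch; sizes are assumptions).
import torch
import torch.nn as nn

class SampleLevelFrontEnd(nn.Module):
    def __init__(self, n_filters=64, n_blocks=5):
        super().__init__()
        layers = [nn.Conv1d(1, n_filters, kernel_size=3, stride=3),  # strided first layer
                  nn.BatchNorm1d(n_filters), nn.ReLU()]
        for _ in range(n_blocks):
            layers += [nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
                       nn.BatchNorm1d(n_filters), nn.ReLU(),
                       nn.MaxPool1d(3)]  # each block shrinks time by 3x
        self.net = nn.Sequential(*layers)

    def forward(self, waveform):       # waveform: (batch, 1, samples)
        return self.net(waveform)      # latent feature map: (batch, n_filters, frames)

x = torch.randn(4, 1, 59049)           # ~3.7 s at 16 kHz (3^10 samples, assumed)
print(SampleLevelFrontEnd()(x).shape)  # torch.Size([4, 64, 81])
```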

Domain knowledge to design CNN front-ends
- Waveform (end-to-end learning): frame-level filters (filter length: 512, stride: 256), mirroring an STFT's window length and hop size
- Pre-processed waveform (e.g. spectrogram): explicitly tailoring the CNN towards learning temporal or timbral cues, with vertical or horizontal filters

based on domain knowledge? | filters config?                      | input signal?          | front-end
no                         | minimal filter expression            | waveform               | sample-level (3x1, 3x1, ..., 3x1)
no                         | minimal filter expression            | pre-processed waveform | small-rectangular filters (3x3, 3x3, ..., 3x3)
yes                        | single filter shape in 1st CNN layer | waveform               | frame-level
yes                        | single filter shape in 1st CNN layer | pre-processed waveform | vertical OR horizontal
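A frame-level front-end reduces, in its first layer, to a single long 1-D convolution whose filter length and stride play the role of an STFT's window length and hop size. A minimal PyTorch sketch, assuming the 512/256 setup from the slide and an arbitrary filter count:

```python
# Frame-level waveform front-end (sketch): each filter "sees" one frame of
# 512 samples, hopped by 256, analogous to an STFT window/hop.
import torch
import torch.nn as nn

frame_level = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=512, stride=256),  # 32 learned "window" filters (assumed)
    nn.ReLU(),
)
x = torch.randn(4, 1, 16000)   # 1 s of 16 kHz audio (assumed)
print(frame_level(x).shape)    # torch.Size([4, 32, 61]) -> 61 frames
```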

DSP wisdom to design CNN front-ends
- Waveform (end-to-end learning): frame-level, many shapes! (a longer filter is an efficient way to represent, e.g., 4 periods of a lower frequency)
- Pre-processed waveform (e.g. spectrogram): explicitly tailoring the CNN towards learning temporal and timbral cues, with vertical and/or horizontal filters

based on domain knowledge? | filters config?                      | input signal?          | front-end
no                         | minimal filter expression            | waveform               | sample-level (3x1, 3x1, ..., 3x1)
no                         | minimal filter expression            | pre-processed waveform | small-rectangular filters (3x3, 3x3, ..., 3x3)
yes                        | single filter shape in 1st CNN layer | waveform               | frame-level
yes                        | single filter shape in 1st CNN layer | pre-processed waveform | vertical OR horizontal
yes                        | many filter shapes in 1st CNN layer  | waveform               | frame-level (many shapes)
yes                        | many filter shapes in 1st CNN layer  | pre-processed waveform | vertical AND/OR horizontal
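A sketch of how vertical and horizontal filters can coexist in a first CNN layer: vertical filters span most of the frequency axis (timbral cues), horizontal filters span a long time context (temporal cues), and their pooled outputs are concatenated. Filter shapes and counts here are illustrative assumptions loosely inspired by Pons et al. (2016), not the exact architecture.

```python
# Musically motivated multi-shape front-end on a log-mel spectrogram (sketch).
import torch
import torch.nn as nn

class MultiShapeFrontEnd(nn.Module):
    def __init__(self, n_mels=96):
        super().__init__()
        # vertical: wide in frequency, narrow in time -> timbral cues
        self.vertical = nn.Conv2d(1, 32, kernel_size=(int(0.9 * n_mels), 7))
        # horizontal: narrow in frequency, wide in time -> temporal cues
        self.horizontal = nn.Conv2d(1, 32, kernel_size=(7, 165))

    def forward(self, spec):                                   # (batch, 1, n_mels, frames)
        v = torch.relu(self.vertical(spec)).max(dim=2).values  # pool out frequency
        h = torch.relu(self.horizontal(spec)).max(dim=2).values
        t = min(v.shape[-1], h.shape[-1])                      # crop to a common length
        return torch.cat([v[..., :t], h[..., :t]], dim=1)      # (batch, 64, time)

spec = torch.randn(4, 1, 96, 187)        # ~3 s of frames (assumed)
print(MultiShapeFrontEnd()(spec).shape)  # torch.Size([4, 64, 23])
```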

CNN front-ends for audio classification: references
- Sample-level: Lee et al., 2017. "Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms," in Sound and Music Computing Conference (SMC).
- Small-rectangular filters: Choi et al., 2016. "Automatic tagging using deep convolutional neural networks," in Proceedings of the ISMIR (International Society for Music Information Retrieval) Conference.
- Frame-level (single shape): Dieleman et al., 2014. "End-to-end learning for music audio," in International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Vertical: Lee et al., 2009. "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems (NIPS).
- Horizontal: Schlüter & Böck, 2014. "Improved musical onset detection with convolutional neural networks," in International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Frame-level (many shapes): Zhu et al., 2016. "Learning multiscale features directly from waveforms," arXiv:1603.09509.
- Vertical and horizontal (many shapes): Pons et al., 2016. "Experimenting with musically motivated convolutional neural networks," in 14th International Workshop on Content-Based Multimedia Indexing (CBMI).

The deep learning pipeline: back-end?
input (waveform or spectrogram) → front-end (several CNN architectures) → back-end: ?

What is the back-end doing?
[diagram: two inputs of different lengths → front-end → latent feature map → back-end → same-length output]
The back-end adapts a variable-length feature map to a fixed output size.

Back-ends for variable-length inputs (...music is generally of variable length!)
- Temporal pooling: max-pool or average-pool the temporal axis. Pons et al., 2017. "End-to-end learning for music audio tagging at scale," in Proceedings of the ML4Audio Workshop at NIPS.
- Attention: weighting latent representations according to what is important. C. Raffel, 2016. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching," PhD thesis.
- RNN: summarization through a deep temporal model. Vogl et al., 2018. "Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks," in Proceedings of the ISMIR Conference.
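For instance, temporal pooling can be sketched in a few lines of PyTorch: max- and average-pool the latent feature map over time, so any input length collapses to a fixed-size vector before a small classifier (layer sizes and tag count are assumptions):

```python
# Temporal-pooling back-end (sketch): variable-length feature map -> fixed output.
import torch
import torch.nn as nn

class TemporalPoolingBackEnd(nn.Module):
    def __init__(self, n_features=64, n_tags=50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * n_features, 256), nn.ReLU(),
                                 nn.Linear(256, n_tags), nn.Sigmoid())

    def forward(self, feature_map):                 # (batch, n_features, time)
        pooled = torch.cat([feature_map.max(dim=2).values,   # max over time
                            feature_map.mean(dim=2)], dim=1) # mean over time
        return self.mlp(pooled)                     # (batch, n_tags), one prob per tag

backend = TemporalPoolingBackEnd()
print(backend(torch.randn(4, 64, 81)).shape)   # torch.Size([4, 50])
print(backend(torch.randn(4, 64, 500)).shape)  # torch.Size([4, 50]) -> any length works
```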

Back-ends for fixed-length inputs
Common trick: let's assume a fixed-length input!
- Fully convolutional stacks: adapting the input to the output with a stack of CNNs & pooling layers. Choi et al., 2016. "Automatic tagging using deep convolutional neural networks," in Proceedings of the ISMIR Conference.
- MLP: map a fixed-length feature map to a fixed-length output. Schlüter & Böck, 2014. "Improved musical onset detection with convolutional neural networks," in Proceedings of the ICASSP.
...and this trick works very well!
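A minimal PyTorch sketch of the fixed-length trick, in the spirit of Choi et al. (2016) but with an assumed, much smaller layer configuration: pooling layers shrink a fixed-size spectrogram patch until a dense layer can map it to tag probabilities. The final nn.Linear only works for this exact patch size, which is precisely the point:

```python
# Fixed-length back-end (sketch): CNN + pooling stack over a fixed-size patch.
import torch
import torch.nn as nn

fully_conv = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),   # 96x187 -> 48x46
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),  # 48x46  -> 24x11
    nn.Flatten(),
    nn.Linear(64 * 24 * 11, 50), nn.Sigmoid(),  # 50 tags (assumed), fixed input size only
)
patch = torch.randn(4, 1, 96, 187)  # fixed-length log-mel patch (assumed size)
print(fully_conv(patch).shape)      # torch.Size([4, 50])
```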

The deep learning pipeline: output
input (waveform or spectrogram) → front-end (several CNN architectures) → back-end (MLP, RNN, attention) → output (phonetic transcription, describe music with tags, event detection)

Outline
- Chronology: the big picture
- Audio classification: state-of-the-art review
- Music audio tagging as a study case: Pons et al., 2017. "End-to-end learning for music audio tagging at scale," in ML4Audio Workshop at NIPS (summer internship @ Pandora)

The deep learning pipeline: input?
input? → front-end → back-end → output (describe music with tags)

How to format the input (audio) data?
- Waveform: already zero-mean & one-variance; NO pre-processing!
- Log-mel spectrogram: STFT & mel mapping (reduces the size of the input by removing perceptually irrelevant information), logarithmic compression (reduces the dynamic range of the input), then zero-mean & one-variance normalization.
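The log-mel pipeline above can be sketched with librosa; the file name and all parameter values (sample rate, FFT size, hop, number of mel bands) are illustrative assumptions:

```python
# Log-mel input pipeline (sketch).
import librosa
import numpy as np

y, sr = librosa.load("song.mp3", sr=16000)                # placeholder file; mono waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=512,           # STFT window (assumed)
                                     hop_length=256,      # STFT hop (assumed)
                                     n_mels=96)           # mel mapping (assumed)
log_mel = librosa.power_to_db(mel)                        # logarithmic compression
log_mel = (log_mel - log_mel.mean()) / log_mel.std()      # zero-mean & one-variance
```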

The deep learning pipeline: input
input (waveform / log-mel spectrogram) → front-end → back-end → output (describe music with tags)

The deep learning pipeline: front-end?
input (waveform / log-mel spectrogram) → front-end? → back-end → output (describe music with tags)

based on domain knowledge? | filters config?                      | input signal?          | front-end
no                         | minimal filter expression            | waveform               | sample-level (3x1, 3x1, ..., 3x1)
no                         | minimal filter expression            | pre-processed waveform | small-rectangular filters (3x3, 3x3, ..., 3x3)
yes                        | single filter shape in 1st CNN layer | waveform               | frame-level
yes                        | single filter shape in 1st CNN layer | pre-processed waveform | vertical OR horizontal
yes                        | many filter shapes in 1st CNN layer  | waveform               | frame-level (many shapes)
yes                        | many filter shapes in 1st CNN layer  | pre-processed waveform | vertical AND/OR horizontal

Our conclusions: front-end performance
- waveform: sample-level >> frame-level (many shapes) > frame-level (single shape)
- spectrogram: vertical and/or horizontal > vertical or horizontal; vertical and/or horizontal ~ small-rectangular filters (but vertical and/or horizontal filters consume less memory!)

Studied front-end for the waveform model: sample-level (Lee et al., 2017)

Studied front-end for the spectrogram model: vertical and horizontal musically motivated CNNs (Pons et al., 2016, 2017)

The deep learning pipeline: front-end
input (waveform / log-mel spectrogram) → front-end (sample-level / vertical and horizontal) → back-end → output (describe music with tags)

The deep learning pipeline: back-end?
input (waveform / log-mel spectrogram) → front-end (sample-level / vertical and horizontal) → back-end? → output (describe music with tags)

Studied back-end (music is of variable length!): temporal pooling (Dieleman et al., 2014)

The deep learning pipeline: back-end
input (waveform / log-mel spectrogram) → front-end (sample-level / vertical and horizontal) → back-end (temporal pooling) → output (describe music with tags)

Results: waveform vs. spectrogram

model (training set size) | ROC-AUC | PR-AUC
Waveform (100k)           | 89.16%  | 49.25%
Waveform (1M)             | 90.13%  | 52.08%
GBT+features (1.2M)       | 91.61%  | 54.27%
Spectrogram (100k)        | 91.54%  | 57.86%
Spectrogram (1M)          | 92.14%  | 59.35%

spectrogram model > waveform model
Domain-knowledge intuitions are valid guides for designing deep models.

Let's listen to some music: human labels
female vocals, triple meter, acoustic, classical music, baroque period, string ensemble

Let's listen to some music: the baseline in action
acoustic, triple meter, string ensemble, classical music, baroque period, classic period

Let's listen to some music: our model in action
acoustic, string ensemble, classical music, baroque period, compositional dominance of lead vocals, major

Deep learning architectures for music audio classification: a personal (re)view
Jordi Pons (jordipons.me, @jordiponsdotme)
Music Technology Group, Universitat Pompeu Fabra, Barcelona