Large-scale Music Identification Algorithms and Applications

1 Large-scale Music Identification Algorithms and Applications Eugene Weinstein, PhD Candidate New York University, Courant Institute Department of Computer Science Depth Qualifying Exam June 20th, 2007

2 Talk Outline
- Introduction, motivation
- Algorithms for music identification
  - Acoustic modeling using Gaussian Mixture Models
  - Song set representation using Finite-State Transducers
  - Experiments on clean, noisy, and distorted data
- Theoretical results
  - New bounds on the size of a factor automaton
- Conclusion

3 Introduction
- Music identification scenario: match a few seconds of audio to large song database
- Many potential applications: music search, content monitoring, song analysis
- Algorithmic, theoretical, and practical challenges, e.g.,
  - Recording can be distorted due to noise, transmission over limited channels
  - Might only get short audio snippet from any point in song

4 Past Work
- Most past work based on hashing, e.g., [Haitsma et al. 01]
  - Exact match required between training and test features
- HMM system over music sound events: [Batlle et al. 02]
  - Similar to speech recognition with unknown phone set
  - Hypothesize phones in iterative process:
    1. Run decoding on training corpus, obtain labels
    2. Estimate phones based on decoding output

5 Overview
- Start with database of 15,000+ songs
- Compute MFCC features over audio
- Cluster song segments to get initial music phone set
- Learn phone set and train acoustic model for each phone
- Generate compact recognition transducer
- Identify songs using Viterbi decoding
[Figure: processing pipeline: Waveform, Cepstra, Phone Set / Acoustic Model, Transcription, Recognition Transducer]

6 Acoustic Model Training
- Segment training audio based on spectral change
- Initialize model using k-means clustering over segments
- Iterative training, repeat:
  1. Run decoder with current model, get transcriptions
  2. Use transcription counts to train GMM for each phone
- Edit distance measures convergence
[Figure: edit distance vs. training iteration [ICASSP 07]]
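
The convergence check can be sketched as follows. This is a minimal illustration, assuming convergence is judged by the average Levenshtein distance between the transcriptions produced in two consecutive training iterations. The function names and the toy phone sequences are hypothetical, and the slides do not say how the distance is normalized.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phone labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mean_edit_distance(old, new):
    """Average per-song edit distance between two iterations' transcriptions."""
    return sum(edit_distance(o, n) for o, n in zip(old, new)) / len(old)


# Toy transcriptions of two songs from consecutive iterations (hypothetical).
iter_1 = [["mp_72", "mp_240", "mp_2"], ["mp_736", "mp_28", "mp_20"]]
iter_2 = [["mp_72", "mp_240", "mp_2"], ["mp_736", "mp_240", "mp_20"]]
print(mean_edit_distance(iter_1, iter_2))
```

A value that stops decreasing from one iteration to the next would suggest that the phone set and transcriptions have stabilized.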

7 Finite-state Transducers
- Finite automata with input and output labels on transitions, possibly weighted
- Widely used in speech and text processing, computational biology, etc.
- We view each song transcription as a string (music phones are the symbols) and construct an FST to map substrings (factors) to songs

8 Full-song Recognition
- Want transducer mapping complete music phone sequences to corresponding songs (no snippets for now)
- Idea: one state chain per song
- Transition to final state has song identifier as output label (all other output labels are ε)
- Using generic automata operations, we construct deterministic minimal transducer for efficient search
[Figure: full-song transducer over music phones (mp_72, mp_240, mp_736, ...) whose final transitions output Beatles--Let_It_Be, madonna--ray_of_light, and van_halen--right_now]
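
The "one chain per song" idea can be sketched with a prefix tree. This is only an illustration, not the transducer construction used in the talk: a prefix tree shares common prefixes, as determinization would, but it is not minimized. The phone sequences below are invented. Only the song names and the mp_N label style come from the slide.

```python
def build_song_trie(transcriptions):
    """Prefix tree over phone sequences; the song id sits at the end of each chain."""
    root = {"next": {}, "song": None}
    for song, phones in transcriptions.items():
        node = root
        for p in phones:
            node = node["next"].setdefault(p, {"next": {}, "song": None})
        node["song"] = song  # plays the role of the output label on the last transition
    return root


def identify_full_song(trie, phones):
    """Return the song whose complete transcription equals `phones`, else None."""
    node = trie
    for p in phones:
        node = node["next"].get(p)
        if node is None:
            return None
    return node["song"]


# Hypothetical transcriptions, for illustration only.
songs = {
    "Beatles--Let_It_Be": ["mp_72", "mp_240", "mp_2"],
    "madonna--ray_of_light": ["mp_72", "mp_736", "mp_20"],
    "van_halen--right_now": ["mp_736", "mp_28", "mp_889"],
}
trie = build_song_trie(songs)
print(identify_full_song(trie, ["mp_72", "mp_736", "mp_20"]))
print(identify_full_song(trie, ["mp_736", "mp_20"]))  # a snippet, not a full song
```

Only complete transcriptions are matched here, which is why the snippet case needs the factor construction discussed next.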

9 Snippet (Factor) Acceptor
- Need to recognize song parts, or snippets
- Make all states initial & final
- Drop output labels
- Determinize, minimize
- Recognizes all snippets
- But doesn't identify songs!
[Figure: the full-song transducer converted step by step into a factor acceptor]
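
To make concrete what a plain factor acceptor lacks, here is a brute-force sketch that maps every factor of every transcription to the songs containing it. It is not the automaton construction from the talk and it does not scale. It only shows the mapping that an identifying transducer has to realize. The transcriptions are hypothetical.

```python
from collections import defaultdict


def build_factor_index(transcriptions):
    """Map each contiguous phone subsequence (factor) to the set of songs containing it."""
    index = defaultdict(set)
    for song, phones in transcriptions.items():
        n = len(phones)
        for i in range(n):
            for j in range(i + 1, n + 1):
                index[tuple(phones[i:j])].add(song)
    return index


# Hypothetical transcriptions, for illustration only.
toy_songs = {
    "Beatles--Let_It_Be": ["mp_72", "mp_240", "mp_2"],
    "madonna--ray_of_light": ["mp_736", "mp_240", "mp_20"],
}
factor_index = build_factor_index(toy_songs)
print(len(factor_index))                        # number of distinct factors
print(factor_index.get(("mp_240", "mp_2")))     # factor unique to one song
print(sorted(factor_index.get(("mp_240",))))    # factor shared by both songs
```

Short factors can belong to several songs, so a snippet has to be long enough to be unambiguous.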


13 Weighted Factor Acceptor [ICASSP 07]
- Use numerical song ids as weights on transitions
- Automata operations preserve total weight along a given path [Mohri 97]
[Figure: weighted factor acceptor with arcs such as mp_72/0, mp_736/1, mp_20/1, shown next to the full-song transducer]

14 Final Transducer
- To construct transducer, turn weights into output labels
- After decoding, sum decoder outputs, get numeric song id
- Full-song transducer: 27.5M states, 27.6M transitions
- Final factor transducer: 32.7M states, 59.6M transitions
- Automaton recognizing any factor of song transcriptions only ~2x bigger than that of entire songs
- This is unexpected, considering there are 15,000 songs, 1,700 average phones per song: # possible factors = 15,000 1,
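
A rough back-of-the-envelope check of why the factor transducer size is surprising. The sketch assumes, for simplicity, that every song has exactly the average length of 1,700 phones. It counts factor occurrences, which is an upper bound on the number of distinct factors. The result is an illustration and is not claimed to be the figure shown on the slide.

```python
num_songs = 15_000
avg_phones = 1_700  # assumed identical for every song (simplification)

# A string of length n has n * (n + 1) / 2 non-empty substring occurrences.
factor_occurrences = num_songs * avg_phones * (avg_phones + 1) // 2
print(f"factor occurrences: {factor_occurrences:.2e}")

# Sizes reported on the slide.
full_states, full_arcs = 27.5e6, 27.6e6
factor_states, factor_arcs = 32.7e6, 59.6e6
state_ratio = round(factor_states / full_states, 2)
arc_ratio = round(factor_arcs / full_arcs, 2)
print("state ratio:", state_ratio, "transition ratio:", arc_ratio)
```

Under these assumptions there are on the order of 10^10 factor occurrences, while the factor transducer has only tens of millions of states and transitions.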

15 Experiments
- Database: 15,455 songs in MP3 format
  - Average song duration: 3.9 minutes, so > 1,000 hours of audio
- Big test set: one 10-second snippet per song
  - Test identification on clean in-set data
- Small test set: 1,762 in-set and 1,856 out-of-set snippets
  - Test noise robustness and rejection of out-of-set songs

16 Results: Identification
- FS: full song (best possible performance): 99.7%
- PS: partial songs, 10-second snippets: 99.5%
- Test on a range of beam sizes in Viterbi search
- All tests faster than real-time (0.48 real-time for beam=12)
[Figure: accuracy vs. beam size for FS and PS [ICASSP 07]]

17 Experiments: Detection
- Detection: distinguish in-set from out-of-set songs
- Test detection capability using SVMs
- Construct universal background model by clustering GMM components across all phones
- SVM features: log-likelihood of best path with in-set acoustic model, background model, and their difference
- Radial basis function kernel with a sweep over the parameter space (C and γ)

18 Noise, Distortions [ISMIR 07]
- Noise condition: additive white noise (harsh environment)
  - Add noise at different mixing levels
- Speed-up and slow-down of audio (no pitch shifting)
  - Different rate multipliers
- MP3 encode/decode
  - Different bitrates

19 Results: Noise, Distortions [ICASSP 07, ISMIR 07]

Condition                 Identification Accuracy   Detection Accuracy
Clean                     99.4%                     96.9%
White SNR                 98.5%                     96.8%
White SNR                 85.5%                     94.5%
White SNR                 39.0%                     93.2%
White SNR                 11.1%                     93.5%
Speed up by 2%            96.0%                     96.0%
Slow down by 2%           96.4%                     96.4%
Speed up by 10%           43.2%                     87.7%
Slow down by 10%          45.7%                     85.8%
MP3 re-encode 64kbps      98.1%                     96.6%
MP3 re-encode 32kbps      95.5%                     95.3%

20 Summary So Far
- Phone set for music ID can be learned automatically
  - Match audio without relying on direct match between feature values; should be robust to signal variation
  - Acoustic modeling techniques applicable to speech, etc.
- We formulate music ID as a search problem
  - Use well-established techniques from speech and text processing (FSTs, GMMs) to make effective system
- FST framework allows efficient matching of song snippets
  - Compact factor transducer can be constructed

21 String Matching in Music ID [CIAA 07]
- Automata allow us to solve string matching problem
- But will our approach generalize to larger data sets?
- We thus address a theoretical question
  - What is the size of the smallest deterministic automaton accepting the factors of a set of strings U?
- For efficiency, U can be represented with an automaton A
  - Or, set of strings may be given directly as an automaton
- More general question: What is the size of the factor automaton of A?

22 Past Work
- Factor automaton of a string x has at most $2|x| - 2$ states and $3|x| - 4$ transitions [Crochemore 85; Blumer et al. 86]
  - Can be constructed by a linear-time online algorithm
- Size bounds for a set of strings U have also previously been studied [Blumer et al. 87]
  - If $\|U\|$ is the sum of the lengths of all the strings in U, the factor automaton of U has at most $2\|U\| - 1$ states and $3\|U\| - 3$ transitions
- We prove a substantially better bound here
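
For the single-string case, the linear-time online construction can be sketched as below. This is the standard textbook suffix-automaton algorithm, offered as an illustration and not taken from the talk. Treating every state as accepting makes it accept exactly the factors of the text. The suffix automaton built this way has at most 2|x| - 1 states, one more than the bound quoted for the minimal factor automaton, because it is not re-minimized after all states are made final.

```python
def build_suffix_automaton(text):
    """Online construction; returns the transition table as a list of dicts (state 0 is initial)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in text:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the new transition along the suffix-link chain until ch is already defined.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone takes over the shorter strings that reached q.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt


def is_factor(nxt, s):
    """With every state treated as final, any path from state 0 spells a factor."""
    state = 0
    for ch in s:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


text = "abcbc"
automaton = build_suffix_automaton(text)
num_states = len(automaton)
num_transitions = sum(len(t) for t in automaton)
print(num_states, num_transitions, is_factor(automaton, "cbc"), is_factor(automaton, "cc"))
```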

23 Suffix Automaton
- We start out with an automaton A recognizing strings in U
- Let S(A) and F(A) be the deterministic minimal automata recognizing the suffixes and factors of A, respectively
- To construct S(A): make each state of A initial (by adding epsilons), determinize, minimize
- To construct F(A): make each state of S(A) final, minimize
- Consequence: $|F(A)| \le |S(A)|$
[Figure: example automaton A over {a, b, c}, A with ε-transitions added, and the resulting S(A)]
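
The construction steps above can be sketched with a plain subset construction. The sketch assumes A is given as a prefix tree of a toy string set (a stand-in for the deterministic minimal automaton), starts from the set of all states to model "make each state initial", and omits the minimization step, so the result is deterministic but not necessarily minimal. All names are hypothetical.

```python
def build_trie(strings):
    """Prefix tree as a list of transition dicts (state 0 is the root) plus its final states."""
    trans, finals = [{}], set()
    for s in strings:
        state = 0
        for ch in s:
            if ch not in trans[state]:
                trans.append({})
                trans[state][ch] = len(trans) - 1
            state = trans[state][ch]
        finals.add(state)
    return trans, finals


def determinize_all_initial(trans, finals):
    """Subset construction starting from the set of all states of A."""
    start = frozenset(range(len(trans)))
    dfa, stack = {}, [start]
    while stack:
        subset = stack.pop()
        if subset in dfa:
            continue
        moves = {}
        for q in subset:
            for ch, r in trans[q].items():
                moves.setdefault(ch, set()).add(r)
        dfa[subset] = {ch: frozenset(t) for ch, t in moves.items()}
        stack.extend(dfa[subset].values())
    # A subset is final for suffixes if it contains a final state of A.
    suffix_finals = {s for s in dfa if s & finals}
    return start, dfa, suffix_finals


def run(start, dfa, s):
    """Subset reached after reading s, or None if s cannot be read."""
    state = start
    for ch in s:
        state = dfa[state].get(ch)
        if state is None:
            return None
    return state


trans, finals = build_trie(["abc", "bd"])
start, dfa, suffix_finals = determinize_all_initial(trans, finals)
print(run(start, dfa, "ab") is not None)      # "ab" is a factor
print(run(start, dfa, "ab") in suffix_finals)  # but not a suffix
```

Making every reachable subset final turns the suffix recognizer into a factor recognizer, which is why the factor automaton can never need more states than the suffix automaton.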


27 Size Bound: Strategy
- Goal: a bound on $|F(A)|$ in terms of $|A|$
- Work on bounding $|S(A)|$; consider suffixes only for now
- Idea: each state in S(A) accepts a distinct set of suffixes, so count the number of possible sets of suffixes
- The suffix sets can be arranged in a hierarchy, which is directly related in size to A
- Motivated by similar arguments for single-string case in [Blumer et al. 86]; string sets in [Blumer et al. 87]

28 Suffix Sets
- Automaton A is k-suffix unique if no two strings accepted by A share the same k-length suffix. Suffix-unique if k = 1
- Define end-set(x): set of states in A reachable after reading x
  - e.g., end-set(ac) = {2, 3, 4, 5}
- $x \equiv y$ denotes end-set(x) = end-set(y)
  - This is a right-invariant equivalence relation
  - [x] is the equivalence class of x
[Figure: example automaton A]

29 Notation
- $N_{str}$ is number of strings accepted by A
- If q is a state of S(A), suff(q) is set of suffixes accepted from q
  - e.g., suff(3) = {ab, ba}
- N(q) is the set of states in A from which a non-empty string in suff(q) can be read to reach a final state
  - e.g., N(3) = {2, 1}
[Figure: example automaton A and its suffix automaton S(A)]


34 Suffix Set Inclusion
- Lemma: Let A be a suffix-unique automaton and let q and q' be two states of S(A) such that $N(q) \cap N(q') \neq \emptyset$. Then
  - $\mathrm{suff}(q) \subseteq \mathrm{suff}(q')$ and $N(q) \subseteq N(q')$, or
  - $\mathrm{suff}(q') \subseteq \mathrm{suff}(q)$ and $N(q') \subseteq N(q)$
- Proof: Let paths in S(A) to q and q' be labeled with u and u'.
- Thus A must have a state $p \in N(q) \cap N(q')$ such that both ...
- Thus, there exist paths $v \in \mathrm{suff}(q)$ and $v' \in \mathrm{suff}(q')$ from p to final
[Figure: paths labeled u, u', v, v' through state p in A and through states q, q' in S(A)]

35 Suffix Set Inclusion
- Since A is suffix-unique, any string accepted by A and ending in v must also end in uv
- Thus, any path from initial to p must end in u
- By same reasoning, it must also end in u'
- Hence, u is a suffix of u', or vice versa
- Assume the former, then $\mathrm{suff}(q') \subseteq \mathrm{suff}(q)$, thus $N(q') \subseteq N(q)$; the other case follows similarly. QED.
[Figure: paths labeled u, u', v, v' through state p in A and through states q, q' in S(A)]

36 Suffix-unique Bound
- Theorem: If A is a suffix-unique deterministic and minimal automaton, then the number of states of S(A) is bounded as $|S(A)|_Q \le 2|A|_Q - 3$. [CIAA 07]
- Proof (sketch):
  - Lemma: For any two states of the suffix automaton, either suffix sets are disjoint, or one includes the other
  - We can show that each state q of S(A) corresponds to a distinct equivalence class [x], count these to get bound
  - The equivalence sets induce a suffix sets hierarchy which we will analyze

37 Suffix Sets: Non-branching
- Suffix sets either disjoint or inclusive: hierarchy
- Count branching, non-branching nodes separately
  - Exclude super-final state F with no outgoing transitions
- Let q be a state in S(A) with equivalence class [x], x longest
- The only way to have a branching node is if there exist factors ax, bx ($a \neq b$) (since $\equiv$ is a right-equivalence relation)
- So q is only non-branching when x is a prefix or suffix
  - Empty prefix ε not included in non-degenerate cases
- Total non-branching nodes: $N_{nb} \le |A|_Q - 2 + N_{str}$

38 Suffix Sets: Branching
- If $a_1, \ldots, a_{N_{str}}$ are the distinct final symbols of each string accepted by A, then each $[a_i]$ is a child of the root $[\epsilon]$
- Let tree rooted at $[a_i]$ have $n_{a_i}$ leaves ($n_{a_i} - 1$ branching nodes)
- Total number of leaves is $|A|_Q - 2$ (not initial and super-final)
- Total branching: $N_b \le \sum_{i=1}^{N_{str}+k} (n_{a_i} - 1) + 1 \le |A|_Q - 2 - N_{str}$
- Total size of tree: $N_{nb} + N_b \le 2|A|_Q - 4$
- Add super-final state, get $|S(A)|_Q \le 2|A|_Q - 3$. QED.
[Figure: tree with root [ε] and children [a_1], [a_2], ..., [a_{N_str}], ..., [a_{N_str+k}]]

39 Final Size Result
- If A is a deterministic minimal automaton representing a set of strings U, then
  - $|S(U)|_Q \le 2|A|_Q - 2$, $|F(U)|_Q \le 2|A|_Q - 2$
  - $|S(U)|_E \le 3|A|_E - 4$, $|F(U)|_E \le 3|A|_E - 4$
- Substantial improvement over previous: $|S(U)|_Q \le 2\|U\| - 1$, $|F(U)|_E \le 3\|U\| - 3$
- When A is k-suffix unique accepting n strings and $A_k$ is the part of A after removing all suffixes of length k:
  - $|S(A)|_Q \le 2|A_k|_Q + 2kn - 3$, $|F(A)|_Q \le 2|A_k|_Q + 2kn - 3$
  - $|S(A)|_E \le 2|A_k|_E + 3kn - 3k - 1$, $|F(A)|_E \le 2|A_k|_E + 3kn - 3k - 1$
- Proof idea: add terminal symbols to make string set suffix-unique, construct suffix automaton, remove symbols
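
A small numeric comparison of the two state bounds on a toy string set. The sketch uses the prefix tree of U as a stand-in for A, which is an assumption: the minimal automaton has no more states than the prefix tree, so the value computed here is at least as large as the bound stated in terms of $|A|_Q$. The string set is hypothetical.

```python
def prefix_tree_size(strings):
    """Number of nodes in the prefix tree of `strings`, including the root."""
    prefixes = {""}
    for s in strings:
        for i in range(1, len(s) + 1):
            prefixes.add(s[:i])
    return len(prefixes)


U = ["abcab", "abcba", "abd"]            # toy string set (hypothetical)
total_length = sum(len(s) for s in U)    # ||U||, the sum of the string lengths
tree_nodes = prefix_tree_size(U)

previous_bound = 2 * total_length - 1    # 2||U|| - 1
new_bound = 2 * tree_nodes - 2           # 2|A|_Q - 2, with the prefix tree standing in for A
print(total_length, tree_nodes, previous_bound, new_bound)
```

The more prefixes the strings share, the further the prefix-tree-based figure falls below the length-based one.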

40 Music ID Experiments [CIAA 07, ISMIR 07]
- In our music ID application, we have $|F(A)|_E \approx 2.1\,|A|_E$
- Factor automaton size scales linearly with # of songs
[Figure: automaton size vs. # songs; curves: # States factor, # Arcs factor, # States/Arcs Non-factor]

41 Music ID Experiments
- For 15,000+ songs, transcription set is 45-suffix unique
- Number of collisions among song suffixes/factors drops off rapidly with increasing length
[Figure: non-unique songs vs. k (suffix length)]
[Figure: non-unique factors vs. factor length]

42 Automata Summary
- We have addressed the size of a factor automaton of a set of strings, or more generally of another automaton
- We have proven substantially better size bounds
- This suggests factor automata are useful for indexing potentially very large sets of strings
- Our conclusions are verified experimentally in our music identification system

43 Future/Ongoing Work
- More experiments: test accuracy in presence of different kinds of noise, distortions
- Analyze song structure
  - Find repeated phone sequences: chorus detection, etc.
  - Find common sequences between songs
- Work on an on-line linear time algorithm for suffix/factor automaton construction
- Do a finer theoretical analysis
  - Get rid of the kn term in the k-suffix unique bound

44 References
- E. Weinstein and P. Moreno. Music Identification with Weighted Finite-State Transducers. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii.
- M. Mohri, P. Moreno, and E. Weinstein. Factor Automata of Automata and Applications. To appear at the International Conference on Implementation and Application of Automata (CIAA), July 2007, Prague, Czech Republic.
- M. Mohri, P. Moreno, and E. Weinstein. Music identification, detection, and analysis in adverse conditions. To appear at the International Conference on Music Information Retrieval (ISMIR), September 2007, Vienna, Austria.
- M. Bacchiani and M. Ostendorf. Joint lexicon, acoustic unit inventory and model design. Speech Communication, 29:99-114, November.
- E. Batlle, J. Masip, and E. Guaus. Automatic song identification in noisy broadcast audio. In IASTED International Conference on Signal and Image Processing, Kauai, Hawaii.
- A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, and R. McConnell. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34.
- A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M.T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31-55.
- M. Crochemore. Transducers and repetitions. Theoretical Computer Science, 45(1):63-86.
- M. Fink, M. Covell, and S. Baluja. Social and interactive television application based on real time ambient audio identification. EuroITV 2006, May.
- J. Haitsma, T. Kalker, and J. Oostveen. Robust audio hashing for content identification. In Content-Based Multimedia Indexing (CBMI), Brescia, Italy, September.
- M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2).
- M. Mohri. Statistical Natural Language Processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press.
- M. Mohri, F. C. N. Pereira, and M. Riley. Weighted Finite-State Transducers in Speech Recognition. Computer Speech and Language, 16(1):69-88.

45 The End
Thank You!
