Music Recommendation using Recurrent Neural Networks


Ashutosh Choudhary * ashutoshchou@cs.umass.edu
Mayank Agarwal * mayankagarwa@cs.umass.edu

Abstract

A large amount of information is contained in the sequence of songs in a particular playlist. Typical collaborative filtering based recommendation approaches do not use this information. This project models music recommendation as a sequence prediction problem and explores the application of Long Short Term Memory (LSTM) networks to it.

1 Introduction

Music listening activity typically spans a sequence of songs rather than a single song in isolation. The relative position of each song in the user's playlist is significant, and user satisfaction can change considerably if the same songs are played in a different order. Typical collaborative filtering approaches do not model this sequential aspect of music recommendation.

The Long Short Term Memory (LSTM) network is a Recurrent Neural Network (RNN) architecture that, unlike traditional RNNs, is capable of learning long-term dependencies in the data. Since a user's playing history can span years, LSTMs are well suited to modelling music recommendation as a sequence problem and to learning the long-term dependencies in the user's music preferences.

This project experiments with the application of LSTM networks to music recommendation. The network is trained on the Million Song Dataset together with the Last.fm user playing history dataset.

2 Related Work

Liebman et al. [1] model the music playlist recommendation problem as a sequential decision making task. The paper introduces a new Reinforcement Learning based method for music recommendation, and concludes that modelling music recommendation as a sequential decision making task gives the proposed method a small but significant boost in performance compared to reasoning about song preferences alone.

Van den Oord et al.
in [2] apply convolutional neural networks to learn latent factors from music audio and use these to recommend songs to users. The paper shows that recent advances in deep learning, combined with the suggested approach, translate very well to the music recommendation task. Learning from audio data alone, the suggested model produces sensible recommendations on the Million Song Dataset.

3 LSTM

LSTMs have been used extensively in language models and text generation tasks. The reason for this reliance on LSTMs in sequence based tasks is their ability to propagate long-term dependencies effectively through a sequence of objects. In the case of text, these objects are either one-hot encoded vectors or word embeddings. An embedding is a fixed-size vector representation of an input datapoint, i.e. the input feature vector. LSTMs achieve this effective gradient propagation by using memory units called cell states together with a combination of gates. The gates have the added advantage of learning to give appropriate weight to the more prominent or essential parts of the sequence.

Other types of RNNs, such as Gated Recurrent Units (GRUs), are also used for similar tasks. Unlike LSTMs, however, GRUs have fewer gates and hence less control over gradient flow, and they have no separate memory unit, so the complete memory is exposed at every step. Both gated networks are superior to a simple RNN, but there is no conclusive evidence that one is superior to the other [3].

Given a sequence of inputs $X = \{x_1, x_2, \ldots, x_{n_X}\}$, an LSTM associates each time step with an input gate, a memory gate and an output gate, denoted by $i_t$, $f_t$ and $o_t$. Let $e_t$ represent the input embedding unit, i.e. a word in a sentence or a song in a playlist. If $c_t$ and $h_t$ represent the cell state vector and the produced hidden state at time $t$, then the vector representation $h_t$ for each time step is given by:

\[ \begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} W \cdot \begin{bmatrix} h_{t-1} \\ e_t \end{bmatrix} \tag{1} \]

\[ c_t = f_t \odot c_{t-1} + i_t \odot l_t \tag{2} \]

\[ h_t = o_t \odot \tanh(c_t) \tag{3} \]

where $W_i, W_f, W_o, W_l \in \mathbb{R}^{K \times 2K}$ are the blocks of $W$, $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.

4 Sequence-to-Sequence Models

As the name suggests, sequence-to-sequence (seq-to-seq) models generate a sequence with another sequence as input. Seq-to-seq models have been used extensively in machine translation and dialog systems. Introduced by Cho et al. [4], a basic seq-to-seq model consists of two LSTMs: an encoder that processes the input sequence and a decoder that generates the output sequence. The encoder is used to obtain a vector representation of the input sequence. This representation is the final hidden state of the encoder, which is fed to the decoder as part of its input at each time step. At each time step the decoder therefore has information about the input sequence, the partial output sequence up to that time step, and the sequence element generated in the previous time step. To increase the capacity of the model, multi-layer cells have been used successfully in seq-to-seq models [5].

If the input sequence is very long, terms seen far back in the sequence may be forgotten by a basic seq-to-seq model, even though gated units are able to remember fairly long sequences. Sequence lengths of more than about 30 terms have been reported to lead to gradients decaying to zero, so no useful information is passed on from terms that are more than a threshold number of steps away. To give the decoder more direct access to the input, an attention mechanism was introduced in [6]. With this mechanism, the decoder can peek into the input at every decoding step.
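As a rough illustration of equations (1)-(3) in Section 3, the following sketch computes one LSTM time step in plain Python. The function names (`sigmoid`, `matvec`, `lstm_step`) and the list-of-rows weight layout are assumptions made for this example only; they are not taken from the project's implementation, and biases are left out because equation (1) has none.

```python
import math


def sigmoid(v):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(W_i, W_f, W_o, W_l, h_prev, c_prev, e_t):
    """One LSTM time step following equations (1)-(3).

    Each W_* is a K x 2K matrix (hypothetical layout: list of K rows).
    h_prev, c_prev and e_t are lists of length K.
    """
    z = list(h_prev) + list(e_t)                  # [h_{t-1}; e_t], length 2K
    i_t = [sigmoid(a) for a in matvec(W_i, z)]    # input gate
    f_t = [sigmoid(a) for a in matvec(W_f, z)]    # memory gate
    o_t = [sigmoid(a) for a in matvec(W_o, z)]    # output gate
    l_t = [math.tanh(a) for a in matvec(W_l, z)]  # candidate values
    # Equation (2): keep part of the old cell state, add gated candidates.
    c_t = [f * c + i * l for f, c, i, l in zip(f_t, c_prev, i_t, l_t)]
    # Equation (3): expose a gated, squashed view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Running an encoder over a playlist would then amount to calling `lstm_step` once per song embedding and carrying `h_t` and `c_t` forward; the last `h_t` plays the role of the fixed-size sequence representation described in Section 4.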
Figure 1: A basic seq-to-seq model with encoder-decoder architecture.

5 Proposed model

We model a user playlist as a sequence of songs and recommend another sequence, i.e. a playlist, to the user. This is a minimalistic approach in which no rating data needs to be available for a user-song pair in order to recommend the next few songs. The approach therefore appears to sidestep the cold start and data sparsity problems. A sequence-to-sequence model was chosen to recommend a sequence of songs to the user based on a sequence they have already listened to. A new user's first song is the song that is most likely to be listened to first by a user. The deep RNN learns this because the start of each decoder sequence is passed as a special SOS ("start of sequence") symbol.

Each song is represented as an embedding of fixed size 65. This embedding is either learned from the sequence data or extracted from additional data, as explained later in the datasets section. An encoder with LSTM units encodes the sequence of songs as a fixed-size vector, and a decoder outputs another sequence of songs, which in our problem statement corresponds to the recommended playlist. Since the playlist of a single user, as provided in the dataset, spans years, an attention mechanism is an obvious extension to the model, as discussed in the future works section. The average playlist length is around 900 songs.

The equations for the encoder and decoder LSTMs remain the same as in (1), (2) and (3). The supervised learning of song sequences rests on the premise that the next song played is the ground truth output for the current time step. Thus, for a playlist of songs $S = (s_0, s_1, \ldots, s_N)$, with $e(s_t)$ the embedding of the $t$-th input song, the $x$ and $y$ vectors are:

\[ x = S[0 \ldots N-1] = e(s_0), e(s_1), \ldots, e(s_{N-1}) \tag{4} \]

\[ y = S[1 \ldots N] = e(s_1), e(s_2), \ldots, e(s_N) \tag{5} \]

where each $e(s_t) \in \mathbb{R}^{1 \times D}$. The final encoder state, obtained after the entire sequence $x$ has been processed by the encoder, is the required vector representation of the user's playlist. This vector $h_{\max_t} \in \mathbb{R}^{1 \times H}$ is concatenated with each input $e(s_t)$ in the decoder to generate, or recommend, a song appropriate for that time step. This is the recommended or predicted song $\hat{y}$. The loss is defined on the decoder's final output and back-propagated end-to-end through the entire decoder-encoder model. If embeddings are being learned, the loss is back-propagated to the embedding layer as well. The loss used is cross-entropy on log softmax. The softmax function is given by:

\[ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, 2, \ldots, K \tag{6} \]

Softmax thus provides a probability distribution over all possible songs in the song vocabulary, while the ground truth probability distribution is a one-hot encoded vector of the vocabulary size. The cross-entropy between these two probability distributions is calculated using the formula:

\[ H(p, q) = -\sum_{x} p(x) \log q(x) \tag{7} \]

Here, $p(x)$ is the actual (ground truth) probability distribution while $q(x)$ is the predicted probability distribution.

6 Datasets

The Million Song Dataset (MSD) [7] is a freely available collection of audio features and metadata for a million contemporary popular music tracks. The dataset is available in HDF5 format and contains the track, song, release, and artist information for every track included in the dataset. The complete MSD is 273 GB in size, which presents a storage challenge and increases the time required for pre-processing. We therefore decided to work with a subset of the MSD containing 10,000 songs randomly sub-sampled from the original million.

While the Million Song Dataset provides the audio features and metadata for songs, training also requires playlist data for multiple users. For this purpose we chose the Last.fm dataset [8], which contains the listening history of 992 unique users and nearly 19 million user-song pairs. Thus, while the MSD provides the song metadata and audio features, the Last.fm dataset provides the users' song listening history.

6.1 Data preparation

The two datasets used for this project lack a common key with which to correlate their entries. The datasets were therefore loosely correlated by matching on the artist's MusicBrainz ID and the track name. We call this a loose approach because a small number of records in the song database remained unmatched, either because they lacked the artist MusicBrainz ID or because of a minor difference in the track name.

6.2 Training, Test, and Validation splits

Playlist data for 992 users was split into three components: 70% as training data, 15% as validation data, and 15% as test data. The split was carried out at the user level, i.e. a user and their entire playing history are tagged as either training data, test data, or validation data. In the end, the training data contained 58,913 user-song pairs, the validation data contained 11,349 user-song pairs, and the test data contained 14,536 user-song pairs.

7 Features

The features used to encode the song and the user information are listed in Table 1 and Table 2 respectively. The song encoding is a 56-dimensional vector, while the user encoding is 6-dimensional. Thus, every data sample is a 62-dimensional vector.

8 Evaluation metric

To evaluate the performance of the music recommender system, we chose the Mean Average Precision (MAP) evaluation metric. MAP is a ranked precision metric that places emphasis on highly ranked correct predictions [9]. Mean Average Precision is defined as:

\[ MAP = \frac{\sum_{q=1}^{Q} AP(q)}{Q} \]

where $AP(q)$ is the average precision for user $q$ and $Q$ is the number of users. The average precision is defined as:

\[ AP(n) = \frac{\sum_{k=1}^{n} P(k)}{m} \]

where $P(k)$ is the precision at cutoff point $k$ in the song recommendation list, $n$ is the length of the list, and $m$ is the number of correctly predicted songs.

Feature name | Description
artist familiarity | 0-1 scale familiarity as determined by the EchoNest API
artist hotness | 0-1 scale hotness of the artist as determined by the EchoNest API
artist latitude, and artist longitude | Latitude of the country the artist is based in
artist tags | Top 3 tags associated with the artist
total beats | Total number of beats in the song
danceability | 0-1 scale danceability of the song as determined by the EchoNest API
duration | Duration (in seconds) of the song
energy | Energy in the song from the listener's point of view
key | The key the song is in
loudness | Loudness of the song in dB
mode | Mode the song is in (major/minor)
release | Album name
total sections | Total number of sections present in the song
section pitches | Pitches of the longest 3 segments in the song
song hotness | 0-1 scale hotness of the song as determined by the EchoNest API
tempo | Tempo of the song
time signature | Time signature of the song
year | Year of release

Table 1: Song features

Feature name | Description
Gender | Gender of the user
Age | Age of the user
√Age | Square root of the age of the user
Age² | Square of the age of the user
Latitude | Latitude of the country the user is in
Longitude | Longitude of the country the user is in

Table 2: User features

9 Experiments and Results

9.1 Baseline

The baseline for the experiments is a simple collaborative filtering technique using matrix factorization. For the sake of comparison, the matrix factorization baseline considers only user-item pairs for which ratings are present. The deep model baseline is a simple LSTM network without the encoding-decoding logic. The LSTM is trained on the sequence of songs listened to by the user, and is trained to start predicting after receiving the special SOS symbol.
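As a rough illustration of the evaluation metric from Section 8, the sketch below computes the average precision of one ranked recommendation list and the mean over users. It rests on an assumption that is not spelled out in the definition above: P(k) is accumulated only at ranks k where the k-th recommended song is correct, which is the common convention, and m is the number of correct songs in the list. The function names and the (recommended, relevant) input layout are hypothetical, and the recommended songs are assumed to be distinct.

```python
def average_precision(recommended, relevant):
    """Average precision of one ranked recommendation list.

    Assumption: P(k) only contributes at ranks where the k-th song is a
    correct prediction, and m is the number of correct predictions.
    """
    relevant = set(relevant)
    hits = 0             # m: correct predictions seen so far
    precision_sum = 0.0  # running sum of P(k) at correct ranks
    for k, song in enumerate(recommended, start=1):
        if song in relevant:
            hits += 1
            precision_sum += hits / k  # P(k) = hits within the top k, over k
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_user):
    """MAP: mean of AP(q) over Q users.

    per_user is a list of (recommended, relevant) pairs, one per user.
    """
    if not per_user:
        return 0.0
    total = sum(average_precision(rec, rel) for rec, rel in per_user)
    return total / len(per_user)
```

For example, with recommendations `["a", "b", "c", "d"]` and relevant songs `{"a", "c"}`, the correct ranks are 1 and 3, so the average precision is (1/1 + 2/3) / 2 = 5/6.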
In one part of the baseline experiments, the embedding for each song is learned from the data and hence carries information about its neighboring songs. In the other part, the embedding is hand-crafted from additional data, as mentioned in the dataset section. To obtain recommendations in this case, the top-k (top-30) songs at each time step are extracted based on their probability values after the softmax layer.

9.2 Experiments

Experiments have been performed using different configurations of each component of the seq-to-seq model. As in the baseline, two options for song embeddings have been used: learned from data, and extracted from additional data. The number of layers in the LSTM cell is varied over the set {1, 2, 5}. Initial learning rates for AdamOptimizer are taken from the set {1e-5, 1e-4, 1e-3}. The decay rate is left at its default value.

9.3 Results

Results are tabulated in Table 3.

Model | Embedding type (Learned, Extracted) | Number of layers (1 layer, 2 layers, 5 layers) | Top K for MAP (top 10, top 20, top 30)
Matrix Factorization | N.A. N.A. N.A. N.A. N.A. N.A.
Basic LSTM |
Seq-to-Seq |

Table 3: Results and comparison with baselines

10 Conclusions

This project aims to utilize the sequence information present in music playlists to provide better recommendations to the user. Results show that modelling the music recommendation problem as a sequential decision making task through an LSTM and a seq-to-seq model improves the Mean Average Precision of the recommender system. In comparison with the baseline matrix factorization model, which does not take the sequence information into consideration, the LSTM and seq-to-seq models give better recommendations and better MAP values. This supports our premise that modelling music recommendation as a sequential decision problem improves the quality of the system's recommendations.

References

[1] Liebman, et al. DJ-MC: A reinforcement-learning agent for music playlist recommendation. Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems.
[2] Van den Oord, et al. Deep content-based music recommendation. Advances in Neural Information Processing Systems.
[3] Chung, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint (2014).
[4] Cho, Kyunghyun, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint (2014).
[5] Sutskever, et al. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems.
[6] Bahdanau, et al. Neural machine translation by jointly learning to align and translate. arXiv preprint (2014).
[7] Thierry Bertin-Mahieux, et al. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
[8] last.fm - listen to free music and watch videos with the largest music catalogue online. last.fm/
[9] Kaggle - Mean Average Precision. MeanAveragePrecision
