Face2Mus: A Facial Emotion Based Internet Radio Tuner Application

Yara Rizk, Maya Safieddine, David Matchoulian, Mariette Awad
Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon
{yar01, mhs36, dsm05, mariette.awad}@aub.edu.lb

Abstract: In this paper we propose Face2Mus, a mobile application that streams music from online radio stations after identifying the user's emotions, without interfering with the device's usage. Face2Mus streams songs from online radio stations and classifies them into emotion classes based on audio features using an energy-aware support vector machine (SVM) classifier. In parallel, the application captures images of the user's face using the smartphone or tablet's camera and classifies them into one of three emotions, using a multiclass SVM trained on facial geometric distances and wrinkles. The audio classification based on a regular SVM achieved an overall testing accuracy of 99.83% when trained on the Million Song Dataset subset, whereas the energy-aware SVM exhibited an average degradation of 1.93% when a 59% reduction in the number of support vectors (SVs) is enforced. The image classification achieved an overall testing accuracy of 87.5% using leave-one-out validation on a home-made image database. The overall application requires 272 KB of storage space, 12 to 24 MB of RAM and a startup time of approximately 2 minutes. Aside from its entertainment potential, Face2Mus has possible uses in music therapy for improving people's well-being and emotional state.

Keywords: Affect recognition, audio and image classification, mobile applications, support vector machine

I. INTRODUCTION
The emergence of smartphones has rendered mobile applications the latest trend in the software development world, creating a multi-billion dollar industry [1], with music related apps attracting a large number of downloads [2, 3].
In addition, the use of music as a therapeutic tool has emerged as a promising alternative in the medical field to help treat ailments such as stress, pain, and high blood pressure; it affects the mood and overall health of individuals [4]. Therefore, a smart, emotion sensitive music app would be of high value for both entertainment and medical purposes. In this paper, we introduce such an application as a new contribution to the currently popular affective computing research area. Face2Mus is a mobile application that streams music from online radio stations based on the user's emotion. This emotion is deduced from image-based emotion recognition algorithms applied to the user's face, captured by the device's camera. No physiological sensors are needed, leading to a more accessible application with a simpler interface. An accuracy of 87.5% was achieved on a home-made image database with leave-one-out cross validation using a multiclass SVM. Song features were retrieved from a web service, simplifying the song classification preprocessing. Testing an energy-aware support vector machine (SVM), an SVM with fewer SVs than the regular SVM, on a subset of the Million Song Dataset gave an overall accuracy of 97.96% versus 99.83% for the regular SVM, while reducing the number of support vectors (SVs) on average by 59%. In what follows, Section 2 summarizes related work in the fields of affective computing and application development, Section 3 presents the proposed solution, Section 4 reports the experimental results obtained, and Section 5 concludes with follow-on research directions.

II. LITERATURE REVIEW
Many mobile applications have been developed tackling user emotion recognition [5, 6] and emotion based music classification [7, 8] separately, but none have combined them as suggested in this paper.
Published work on this topic includes Affective DJ, a stand-alone device which chooses calming or energetic songs from a local playlist based on the individual's affective state deduced from skin conductivity [9]; unlike our proposed work, their method requires sensors not available in widely available smartphones. A lot of work has been done on image-based affect recognition. Classification using facial features such as the eyebrows, mouth, eyes, etc. [10, 11, 12, 13] achieved accuracies above 90%. For example, [11] reported a 93% average accuracy whereas [13] reached 96% for some emotion classes. Using physiological features such as skin conductivity [9], cardiac activity, muscle tension [10], etc. attained comparable results. In some cases, both facial features and body gestures were used [14] to achieve accuracies in the range of 82 to 96% [15]. Although physiological features were more indicative of a person's emotional state than facial expressions, they required special measuring equipment, which is not feasible for a mobile application.

Several methods were implemented in the literature to track facial features in images for emotion recognition. One method, developed by Ekman, extracted action units (AUs) that describe a specific feature in the face such as a raised eyebrow [16]. Many papers used this method for emotion recognition [13, 17, 18]. Reference [17] achieved accuracies ranging from 64.29% to 100% on certain AUs; [13] realized comparable results. Another approach represented facial features by motion units (MUs), which describe the direction of motion of a feature and its concentration [18, 19], information not captured by AUs. Reference [19] used various classifiers whose classification error ranged from 4.43 to 13.23%. However, this method is computationally expensive [20] and person dependent [18]; it does not generalize well. Hence, AUs would potentially be better suited to a mobile application with limited computational resources.

Considering affect recognition in music, researchers have developed many theories to assign emotions or moods to songs based on their audio features. Hevner's theory, a categorical theory, divided moods into eight clusters of adjectives over an adjective circle [21]. On the other hand, Russell's valence vs. arousal axes [22] and Thayer's energy vs. stress axes [23] are examples of dimensional models that work over continuous multidimensional spaces. Several methods derived from the above mentioned theories exist to identify the emotion in songs. Some used audio features including mode, tempo, harmony and rhythm [24]; others analyzed the lyrics of the song using natural language processing, and some combined both. Reference [25] extracted tempo, loudness and harmony using the Short Time Fourier Transform (STFT) to build an audio spectrogram that carries the required information. Other sound descriptors are low-level audio descriptors (LLDs) such as spectral descriptors, harmonic descriptors, and perceptual descriptors, the latter including Mel-Frequency Cepstral Coefficients (MFCCs), loudness, sharpness, spread, and roughness [25].
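To make the dimensional models above concrete, the following sketch maps a (valence, arousal) point to a coarse mood label by quadrant; the 0.5 thresholds and the label names are illustrative assumptions, not values taken from [21, 22, 23].

```python
def mood_from_dimensions(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to a coarse mood label.

    Assumes both dimensions are normalized to [0, 1]; the 0.5 quadrant
    boundaries are illustrative, not from any cited model.
    """
    if valence >= 0.5 and arousal >= 0.5:
        return "happy"   # positive and energetic
    if valence < 0.5 and arousal < 0.5:
        return "sad"     # negative and subdued
    # Mixed quadrants: low arousal reads as calm, high arousal as tense.
    return "calm" if arousal < 0.5 else "tense"

print(mood_from_dimensions(0.8, 0.7))  # happy
print(mood_from_dimensions(0.2, 0.3))  # sad
```

A categorical scheme like Hevner's would instead assign one of a fixed set of adjective clusters directly, with no intermediate continuous axes.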
In addition to these descriptors, several open source libraries are available to extract audio features, including Java MIR (jMIR) [26], which extracts low level, high level and cultural features, and the C++ Library for Audio and Music (CLAM), which extracts features like mode and harmony [26, 27]. The Echonest is a music intelligence web service that analyzes music and offers audio analysis APIs returning details related to tempo, rhythm, time signature, loudness, mode, and key confidence [28]. Diverse classifiers were used to perform the classification tasks given the extracted features. Reference [29] implemented a hybrid music mood and theme classifier which combined an SVM with a Radial Basis Function (RBF) kernel trained on audio features and a Naïve Bayes Multinomial classifier trained on social tags. A weighted sum of the classifiers, based on the highest probability distribution, was adopted to maximize precision, recall, accuracy and the F1 measure.

III. FACE2MUS
Face2Mus is an Android application that streams music from online radio stations based on the emotion of the app user, deduced by emotion recognition techniques applied to their facial images. The workflow of Face2Mus is illustrated in Fig. 1. To infer the user's emotion, Face2Mus captures images of the user's face using the mobile device's camera, at an adaptive rate that depends on how sustained this emotion is. Android's native face detection localizes the face in the frame, crops it out, and locates the midpoint and distance between the eyes. The cropped image and extracted parameters are sent to a server running Matlab to perform the emotion recognition. On the server, a tracking algorithm locates the points of interest (POIs) based on color and geometric properties of the facial features. Then, wrinkles and distances between POIs are generated; these features are shown in Fig. 2. They include wrinkles on the forehead, between the eyebrows and on the cheeks, as well as distances between points on the mouth and eyes (describing the curvature of the mouth) and between the eyebrows and eyes. Finally, a multiclass SVM categorizes the frame into one of three emotions (neutral, happy, or sad) based on these features.

In parallel, the metadata of now-playing internet radio station songs, broadcast by Streamfinder [30], is retrieved by the device. Based on this metadata, the song's audio features such as tempo, mode, loudness and time signature are retrieved from Echonest [28]. A variant of the SVM classifier categorizes the songs into one of three classes: calm, happy, and sad. Since SVM's decision hyperplane is expressed in terms of SVs, which can reach up to 50% of the training set, the real time classification and low memory requirements necessary for Face2Mus's success might be unattainable using a regular SVM. Therefore, LMSVM [31] was adopted to generate a decision hyperplane with a reduced SV set. The reduced SV set requires fewer computations and less memory, leading to lower power consumption when classifying a new song, which renders LMSVM energy aware. This is achieved by clustering the training data, selecting heterogeneous clusters (clusters with points from different classes) and training on these boundary points to reduce the hyperplane's complexity. Finally, the Song Selector block uses the emotional state of the individual and the list of emotion-tagged songs to decide on an appropriate song to play.

IV. EXPERIMENTAL RESULTS
The application was tested for its overall functionality and classification accuracy. First, the training and testing results of the audio and image classifiers are presented separately, followed by the overall performance of Face2Mus. The application was tested on a Samsung Galaxy Tab 10.1 with a server running on a 64-bit Intel Core i5 processor. The application was developed using the Android 3.1 API, and the image processing and classification code was written in Matlab R2011b.
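The distance-based facial features of Section III can be sketched as follows; the landmark names and the normalization by the inter-eye distance are hypothetical illustrations, not the paper's exact feature definitions.

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) landmarks."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def geometric_features(landmarks):
    """Build a small distance-based feature vector from facial landmarks.

    `landmarks` maps hypothetical point-of-interest names to (x, y) pixel
    coordinates. Each distance is normalized by the inter-eye distance so
    the features do not depend on how close the face is to the camera.
    """
    eye_dist = dist(landmarks["left_eye"], landmarks["right_eye"])
    return [
        dist(landmarks["mouth_left"], landmarks["mouth_right"]) / eye_dist,
        dist(landmarks["mouth_top"], landmarks["mouth_bottom"]) / eye_dist,
        dist(landmarks["left_brow"], landmarks["left_eye"]) / eye_dist,
        dist(landmarks["right_brow"], landmarks["right_eye"]) / eye_dist,
    ]

# A toy frontal face: a wide mouth relative to the eyes, as in a smile.
pts = {
    "left_eye": (30, 40), "right_eye": (70, 40),
    "left_brow": (30, 30), "right_brow": (70, 30),
    "mouth_left": (35, 75), "mouth_right": (65, 75),
    "mouth_top": (50, 70), "mouth_bottom": (50, 80),
}
print(geometric_features(pts))  # [0.75, 0.25, 0.25, 0.25]
```

A real feature vector of this kind would feed the multiclass SVM directly, one value per distance.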

Fig. 1. Face2Mus overall block diagram.
Fig. 2. Facial features.
Fig. 3. Sample images from the database.

A. Image Database
The image classifiers were trained on a home-made database of colored images captured by a Samsung Galaxy Tab 10.1 under similar lighting conditions. The database is composed of 120 images of eight people, 5 images for each acted emotion per person. The images were classified into 3 classes: neutral, sad and happy. For illustration purposes, a subset of these images for each class is displayed in Fig. 3.

B. Image Classification
Face2Mus was trained and tested on neural network and SVM classifiers, from the Neural Network toolbox in Matlab and the libsvm library [36], respectively. A grid search was performed to find the best classifier architecture for the task at hand. Majority vote classifiers given different feature sets (distances only, or distances and wrinkles) were compared. Finally, the leave-one-out validation technique was used to validate the results of the classifiers.

The majority vote architecture composed of one vs. one SVM classifiers, using an RBF kernel on a feature set containing distances only, produced the best results. The results for some of the SVM architectures are reported in Table I. Most of the misclassified images were from the sad class. This was not surprising because the neutral and sad images were similar and difficult to distinguish. Furthermore, the feature set containing distances only performed better than the one containing distances and wrinkles: the wrinkles contained relatively high noise and therefore did not contribute enough useful information to improve the classification accuracy.

Table II reports the testing accuracy of the individual classifiers before aggregating them into a majority vote architecture. The test set is composed of 24 images that were not used while training the various classifiers; one image per person per class. The best classifiers were saved on the server and used to classify images sent to the server by the application running on the tablet. The size of the files saved on the server was approximately 72 KB. The process of identifying POIs, generating the features (distances only) and classifying an image needed approximately 0.25 seconds to complete. Including wrinkles, the run time increased to approximately 0.67 seconds. Since the wrinkles did not improve the classification accuracy and significantly worsened the run time of the image processing blocks, they were removed from the feature set.

TABLE I. IMAGE CLASSIFICATION RESULTS
Aggregation Rule      | Features             | Accuracy (%)
1 vs. 1 majority vote | Distances            |
1 vs. 1 majority vote | Distances + Wrinkles |

TABLE II. INDIVIDUAL CLASSIFIERS RESULTS
Features             | Class             | SVM Regularization | RBF Sigma | Accuracy (%)
Distances            | Neutral vs. Happy |                    |           |
Distances            | Neutral vs. Sad   |                    |           |
Distances            | Sad vs. Happy     |                    |           |
Distances + Wrinkles | Neutral vs. Happy |                    |           |
Distances + Wrinkles | Neutral vs. Sad   |                    |           |
Distances + Wrinkles | Sad vs. Happy     |                    |           |

C. Audio Database
Classifier architectures were trained and tested on the Million Song Dataset [37] subset, which included 10,000 instances and 4 features: mode, tempo, time signature and loudness. The instances were clustered into three classes using the k-means clustering algorithm. The database was not balanced: 2683 instances were tagged as sad, 5765 as happy, and 1552 as calm.
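The cluster-and-select idea behind LM-SVM [31], used for the audio classifier, can be sketched with scikit-learn on toy two-class data; this is an illustrative reconstruction under assumed parameters, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data standing in for the labeled song feature vectors.
X, y = make_blobs(n_samples=600, centers=[[-2.0, 0.0], [2.0, 0.0]],
                  cluster_std=1.5, random_state=0)

# Step 1: cluster the training data, ignoring the labels.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Step 2: keep only heterogeneous clusters, i.e. clusters that contain
# points from both classes; these lie near the decision boundary.
keep = np.zeros(len(X), dtype=bool)
for c in range(km.n_clusters):
    members = km.labels_ == c
    if np.unique(y[members]).size > 1:
        keep |= members

# Step 3: train an SVM on the boundary points only; fewer training points
# typically yield fewer support vectors, hence cheaper classification.
full = SVC(kernel="rbf").fit(X, y)
reduced = SVC(kernel="rbf").fit(X[keep], y[keep])

print("boundary points:", int(keep.sum()), "of", len(X))
print("SVs full vs. reduced:", int(full.n_support_.sum()),
      int(reduced.n_support_.sum()))
print("agreement:", float((full.predict(X) == reduced.predict(X)).mean()))
```

In the paper's terms, the reduced model plays the role of the energy-aware classifier: it trades a small loss in accuracy for a much smaller SV set to evaluate per song.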

D. Audio Classification
A neural network (NN) and a set of one vs. all LM-SVM classifiers were trained to classify the songs given the 4 features. A grid search was performed to find the optimal box constraint and kernel parameter values. Several kernels, including the Gaussian and linear kernels, were also tested. The training was validated using 5-fold cross validation. The best results were produced by the multiclass SVM models; the accuracies of the individual one vs. all classifiers are reported in Table III. The classifiers were combined into a one vs. all hierarchical structure and achieved 99.83% accuracy when tested on 600 songs that were not included in the training set. The order of the classifiers did not affect the results. The number of SVs was 6.06, 4.60 and 5.18% for the Calm, Sad and Happy classes respectively when using the regular SVM. The SV set cardinality was reduced to 1.68, 2.35 and 2.29% with a minor reduction in prediction accuracy, as shown in Table III. Table IV shows the percentage degradation in accuracy and the percentage of SV reduction obtained. Although the generalization capabilities of the classifier are slightly degraded, the gains in computational resources saved are significant.

The best models obtained from the training of the various song classifiers were saved on the Android tablet and used to classify new songs obtained from the audio acquisition block. The computation time depends on the number of SVs used by the model; the time to classify a new song on the Android device was measured to be approximately 4.1 seconds on average. The memory needed to save the classifier models also depends on the number of SVs; the size of these files was approximately 32 KB.

TABLE III. AUDIO CLASSIFIER RESULTS
           | Accuracy (%)         | Number of SVs (%)
Classifier | Calm | Sad | Happy   | Calm | Sad  | Happy
SVM        |      |     |         | 6.06 | 4.60 | 5.18
LM-SVM     |      |     |         | 1.68 | 2.35 | 2.29

TABLE IV. PERCENTAGE DEGRADATION AND REDUCTION
                            | Calm | Sad | Happy | Average
Degradation in Accuracy (%) |      |     |       | 1.93
Reduction in SV (%)         |      |     |       | 59

TABLE V. RUN TIME FOR EACH BLOCK OF FACE2MUS
Block                                              | Run Time (seconds)
Image Acquisition & Face Localization & Extraction | 5.78
Transmit data to the server                        |
POI Localization                                   | 0.15
Feature Generation                                 |
Image Classification                               |
Audio Acquisition                                  | 9.78
Audio Features Acquisition                         | 1.00
Song Classification                                | 4.10

E. Overall Face2Mus Performance
The size of the application is approximately 272 KB, small compared to similar applications and to the standard built-in storage of tablets, which is at least 16 GB. The application uses between 12 and 24 MB of RAM out of the 1 GB available on the device. It also needs approximately 2 minutes to start playing the first song, mainly due to the time required to obtain several songs from Streamfinder and classify them; this time can vary with the speed of the internet connection. The run times for the individual blocks of Fig. 1 are listed in Table V. The Audio Acquisition time is the average time to obtain one song from Streamfinder given a satisfactory internet connection; inadequate connectivity results in many connection timeouts, significantly slowing down the application. The lack of information on the remaining play time of a song presents a challenge: the application might store a link to a song that is almost complete.

To reduce the startup time, the structure of the application could be modified to run the audio processing blocks on the server instead of the device. The server would then continuously tag songs from online radio stations and store them in a playlist; whenever a user logs in, the application would simply identify the user's emotion and request a corresponding song from the server. Based on a limited number of trials performed with the application, the results were satisfactory despite the shortcomings of some of the blocks mentioned above.

V. CONCLUSION
This work proposed an automated online song streaming application which plays songs based on the emotion of the user. The emotion is detected using facial feature identification, tracking and classification. The song classifiers achieved an average accuracy of 99.83% over all classes when using the multiclass SVM, and of 97.96% when using LM-SVM with an average SV reduction of 59%. The image classifiers achieved 87.5% accuracy using a one vs. one majority vote SVM classifier tested on a balanced test set of 24 images. The overall application needs about 1 MB of storage, 12 to 24 MB of RAM and approximately 2 minutes to start up. Further improvements, and the focus of future work, can be made especially in the image classification blocks, where accuracy can be enhanced by generating more robust features. Real time performance can be further improved by eliminating network latency for image classification and by running the audio classification on a server.

ACKNOWLEDGMENT
This work was funded by the University Research Board at the American University of Beirut.

REFERENCES
[1] M. Walsh (2011, October 7). Mobile App Biz Soars: $12B by 2015 [Online]. mediapost.com/publications/article/160163/mobileapp-biz-soars-12bby-2015.html
[2] The Independent (2011, November 9). Mobile application trends for 2012: the top ten applications [Online]. independent.co.uk/life-style/gadgets-and-tech/news/mobileapplication-trends-for-2012-the-top-ten-applications html
[3] J. Imam (2012, June 16). Young listeners opting to stream, not own music [Online]. edition.cnn.com/2012/06/15/tech/web/music-streaming/index.html
[4] C.E. Guzzetta, "Effects of relaxation and music therapy on patients in a coronary care unit with presumptive acute myocardial infarction," Heart & Lung: The Journal of Critical Care, vol. 18, issue 6, p. 609.
[5] Asanka Senavirathna (2012). Face Mood Detector [Online]. play.google.com/store/apps/details?id=com.wideapps.android.facemooddetector
[6] DSS (2012). Mood Scanner [Online]. play.google.com/store/apps/details?id=com.dikkar.moodscanner
[7] Syntonetic (2011). Moodagent [Online]. itunes.apple.com/lb/app/moodagent/id ?mt=8
[8] JVC Kenwood Corporation (2012). Kenwood Music Control [Online]. play.google.com/store/apps/details?id=com.jvckenwood.kmc&hl=en
[9] R. Picard et al., "A new affect-perceiving interface and its application to personalized music selection," in Proc. 1998 Workshop on Perceptual User Interfaces.
[10] J.N. Bailenson et al., "Real-time classification of evoked emotions using facial feature tracking and physiological responses," International Journal of Human-Computer Studies, vol. 66, issue 5.
[11] M.S. Bartlett, G. Littlewort, I. Fasel and J.R. Movellan, "Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction," in Conf. on Computer Vision and Pattern Recognition Workshop, 2003, p. 53.
[12] N. Agarwal et al., "Mood Detection: Implementing a facial expression recognition system" (CS229 project, 2009).
[13] M.S. Bartlett et al., "Measuring facial expressions by computer image analysis," Psychophysiology, vol. 36.
[14] H. Gunes, M. Piccardi, and M. Pantic, "From the lab to the real world: Affect recognition using multiple cues and modalities," in J. Or, editor, Affective Computing: Focus on Emotion Expression, Synthesis, and Recognition. Vienna, Austria.
[15] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision.
[16] P. Ekman and W. Friesen, Facial Action Coding System. Consulting Psychologists Press.
[17] Y. Tian et al., "Recognizing action units for facial expression analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 2, 2001.
[18] P. Lucey et al., "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Computer Vision and Pattern Recognition Workshop on Human-Communicative Behavior.
[19] Y. Sun et al., "Authentic emotion detection in real-time video," in Int'l Workshop on Human-Computer Interaction, Lecture Notes in Computer Science, vol. 3058, Springer, 2004.
[20] H. Tao and T.S. Huang, "Connected vibrations: A modal analysis approach to non-rigid motion tracking," in CVPR.
[21] K. Hevner, "Experimental studies of the elements of expression in music," American Journal of Psychology, vol. 48.
[22] J.A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39.
[23] R.E. Thayer, The Biopsychology of Mood and Arousal. New York: Oxford University Press.
[24] O.C. Meyers, "A Mood-Based Music Classification and Exploration System," Master's thesis, Massachusetts Institute of Technology.
[25] T. Jehan, "Creating Music by Listening," PhD thesis, Massachusetts Institute of Technology.
[26] C. McKay, "Automatic Music Classification with jMIR," PhD thesis, McGill University.
[27] X. Amatriain, P. Arumi, D. Garcia, "CLAM: A Framework for Efficient and Rapid Development of Cross-platform Audio Applications," in Proc. 14th Annual ACM Int'l Conf. on Multimedia.
[28] The Echonest (2011). Dev Center [Online].
[29] K. Bischoff, C.S. Firan, R. Paiu, W. Nejdl, C. Laurier, M. Sordo, "Music Mood and Theme Classification: A Hybrid Approach," in Proc. 10th ISMIR (Int'l Society for Music Information Retrieval) Conf.
[30] StreamFinder (n.a.). Commercial Online Radio Station Data API Access [Online]. streamfinder.com/commercial-internetradio-api/
[31] Y. Rizk, N. Mitri, and M. Awad, "A Local Mixture Based SVM for an Efficient Supervised Binary Classification," in International Joint Conference on Neural Networks, 2013, Dallas, TX.
[32] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27.
[33] T. Bertin-Mahieux, D. Ellis, B. Whitman and P. Lamere, "The Million Song Dataset," in Proc. 12th Int'l Society for Music Information Retrieval Conf.


More information

Auto-tagging The Facebook

Auto-tagging The Facebook Auto-tagging The Facebook Jonathan Michelson and Jorge Ortiz Stanford University 2006 E-mail: JonMich@Stanford.edu, jorge.ortiz@stanford.com Introduction For those not familiar, The Facebook is an extremely

More information

An Hybrid MLP-SVM Handwritten Digit Recognizer

An Hybrid MLP-SVM Handwritten Digit Recognizer An Hybrid MLP-SVM Handwritten Digit Recognizer A. Bellili ½ ¾ M. Gilloux ¾ P. Gallinari ½ ½ LIP6, Université Pierre et Marie Curie ¾ La Poste 4, Place Jussieu 10, rue de l Ile Mabon, BP 86334 75252 Paris

More information

Study Impact of Architectural Style and Partial View on Landmark Recognition

Study Impact of Architectural Style and Partial View on Landmark Recognition Study Impact of Architectural Style and Partial View on Landmark Recognition Ying Chen smileyc@stanford.edu 1. Introduction Landmark recognition in image processing is one of the important object recognition

More information

INTRODUCTION TO DEEP LEARNING. Steve Tjoa June 2013

INTRODUCTION TO DEEP LEARNING. Steve Tjoa June 2013 INTRODUCTION TO DEEP LEARNING Steve Tjoa kiemyang@gmail.com June 2013 Acknowledgements http://ufldl.stanford.edu/wiki/index.php/ UFLDL_Tutorial http://youtu.be/ayzoubkuf3m http://youtu.be/zmnoatzigik 2

More information

SOUND SOURCE RECOGNITION AND MODELING

SOUND SOURCE RECOGNITION AND MODELING SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental

More information

Classification of Voltage Sag Using Multi-resolution Analysis and Support Vector Machine

Classification of Voltage Sag Using Multi-resolution Analysis and Support Vector Machine Journal of Clean Energy Technologies, Vol. 4, No. 3, May 2016 Classification of Voltage Sag Using Multi-resolution Analysis and Support Vector Machine Hanim Ismail, Zuhaina Zakaria, and Noraliza Hamzah

More information

Speech/Music Change Point Detection using Sonogram and AANN

Speech/Music Change Point Detection using Sonogram and AANN International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 6, Number 1 (2016), pp. 45-49 International Research Publications House http://www. irphouse.com Speech/Music Change

More information

Understanding the city to make it smart

Understanding the city to make it smart Understanding the city to make it smart Roberta De Michele and Marco Furini Communication and Economics Department Universty of Modena and Reggio Emilia, Reggio Emilia, 42121, Italy, marco.furini@unimore.it

More information

Applications of Music Processing

Applications of Music Processing Lecture Music Processing Applications of Music Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Singing Voice Detection Important pre-requisite

More information

Rhythmic Similarity -- a quick paper review. Presented by: Shi Yong March 15, 2007 Music Technology, McGill University

Rhythmic Similarity -- a quick paper review. Presented by: Shi Yong March 15, 2007 Music Technology, McGill University Rhythmic Similarity -- a quick paper review Presented by: Shi Yong March 15, 2007 Music Technology, McGill University Contents Introduction Three examples J. Foote 2001, 2002 J. Paulus 2002 S. Dixon 2004

More information

Live Hand Gesture Recognition using an Android Device

Live Hand Gesture Recognition using an Android Device Live Hand Gesture Recognition using an Android Device Mr. Yogesh B. Dongare Department of Computer Engineering. G.H.Raisoni College of Engineering and Management, Ahmednagar. Email- yogesh.dongare05@gmail.com

More information

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Noha KORANY 1 Alexandria University, Egypt ABSTRACT The paper applies spectral analysis to

More information

Keyword: Morphological operation, template matching, license plate localization, character recognition.

Keyword: Morphological operation, template matching, license plate localization, character recognition. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Automatic

More information

A Comparison Study of Image Descriptors on Low- Resolution Face Image Verification

A Comparison Study of Image Descriptors on Low- Resolution Face Image Verification A Comparison Study of Image Descriptors on Low- Resolution Face Image Verification Gittipat Jetsiktat, Sasipa Panthuwadeethorn and Suphakant Phimoltares Advanced Virtual and Intelligent Computing (AVIC)

More information

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches Performance study of Text-independent Speaker identification system using & I for Telephone and Microphone Speeches Ruchi Chaudhary, National Technical Research Organization Abstract: A state-of-the-art

More information

The Automatic Classification Problem. Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification

The Automatic Classification Problem. Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification Parallel to AIMA 8., 8., 8.6.3, 8.9 The Automatic Classification Problem Assign object/event or sequence of objects/events

More information

CROSS-LAYER FEATURES IN CONVOLUTIONAL NEURAL NETWORKS FOR GENERIC CLASSIFICATION TASKS. Kuan-Chuan Peng and Tsuhan Chen

CROSS-LAYER FEATURES IN CONVOLUTIONAL NEURAL NETWORKS FOR GENERIC CLASSIFICATION TASKS. Kuan-Chuan Peng and Tsuhan Chen CROSS-LAYER FEATURES IN CONVOLUTIONAL NEURAL NETWORKS FOR GENERIC CLASSIFICATION TASKS Kuan-Chuan Peng and Tsuhan Chen Cornell University School of Electrical and Computer Engineering Ithaca, NY 14850

More information

AUTOMATIC EYE DETECTION IN FACIAL IMAGES WITH UNCONSTRAINED BACKGROUNDS

AUTOMATIC EYE DETECTION IN FACIAL IMAGES WITH UNCONSTRAINED BACKGROUNDS AUTOMATIC EYE DETECTION IN FACIAL IMAGES WITH UNCONSTRAINED BACKGROUNDS Dr John Cowell Dept. of Computer Science, De Montfort University, The Gateway, Leicester, LE1 9BH England, jcowell@dmu.ac.uk ABSTRACT

More information

Face Detection: A Literature Review

Face Detection: A Literature Review Face Detection: A Literature Review Dr.Vipulsangram.K.Kadam 1, Deepali G. Ganakwar 2 Professor, Department of Electronics Engineering, P.E.S. College of Engineering, Nagsenvana Aurangabad, Maharashtra,

More information

Pose Invariant Face Recognition

Pose Invariant Face Recognition Pose Invariant Face Recognition Fu Jie Huang Zhihua Zhou Hong-Jiang Zhang Tsuhan Chen Electrical and Computer Engineering Department Carnegie Mellon University jhuangfu@cmu.edu State Key Lab for Novel

More information

Campus Location Recognition using Audio Signals

Campus Location Recognition using Audio Signals 1 Campus Location Recognition using Audio Signals James Sun,Reid Westwood SUNetID:jsun2015,rwestwoo Email: jsun2015@stanford.edu, rwestwoo@stanford.edu I. INTRODUCTION People use sound both consciously

More information

Audio Similarity. Mark Zadel MUMT 611 March 8, Audio Similarity p.1/23

Audio Similarity. Mark Zadel MUMT 611 March 8, Audio Similarity p.1/23 Audio Similarity Mark Zadel MUMT 611 March 8, 2004 Audio Similarity p.1/23 Overview MFCCs Foote Content-Based Retrieval of Music and Audio (1997) Logan, Salomon A Music Similarity Function Based On Signal

More information

COMPARATIVE PERFORMANCE ANALYSIS OF HAND GESTURE RECOGNITION TECHNIQUES

COMPARATIVE PERFORMANCE ANALYSIS OF HAND GESTURE RECOGNITION TECHNIQUES International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 9, Issue 3, May - June 2018, pp. 177 185, Article ID: IJARET_09_03_023 Available online at http://www.iaeme.com/ijaret/issues.asp?jtype=ijaret&vtype=9&itype=3

More information

FACE VERIFICATION SYSTEM IN MOBILE DEVICES BY USING COGNITIVE SERVICES

FACE VERIFICATION SYSTEM IN MOBILE DEVICES BY USING COGNITIVE SERVICES International Journal of Intelligent Systems and Applications in Engineering Advanced Technology and Science ISSN:2147-67992147-6799 www.atscience.org/ijisae Original Research Paper FACE VERIFICATION SYSTEM

More information

Distance Estimation and Localization of Sound Sources in Reverberant Conditions using Deep Neural Networks

Distance Estimation and Localization of Sound Sources in Reverberant Conditions using Deep Neural Networks Distance Estimation and Localization of Sound Sources in Reverberant Conditions using Deep Neural Networks Mariam Yiwere 1 and Eun Joo Rhee 2 1 Department of Computer Engineering, Hanbat National University,

More information

An Accurate Algorithm for Generating a Music Playlist based on Facial Expressions

An Accurate Algorithm for Generating a Music Playlist based on Facial Expressions An Accurate Algorithm for Generating a Music Playlist based on Facial Expressions Anukriti Dureha Computer Science and Engineering Department Amity School of Engineering & Technology, Amity University,

More information

Moodify. A music search engine by. Rock, Saru, Vincent, Walter

Moodify. A music search engine by. Rock, Saru, Vincent, Walter Moodify A music search engine by Rock, Saru, Vincent, Walter Explore music through mood Create a Web App that recommends songs based on how the user is feeling - 7 supported moods Joy Love Sad Surprise

More information

SONG RETRIEVAL SYSTEM USING HIDDEN MARKOV MODELS

SONG RETRIEVAL SYSTEM USING HIDDEN MARKOV MODELS SONG RETRIEVAL SYSTEM USING HIDDEN MARKOV MODELS AKSHAY CHANDRASHEKARAN ANOOP RAMAKRISHNA akshayc@cmu.edu anoopr@andrew.cmu.edu ABHISHEK JAIN GE YANG ajain2@andrew.cmu.edu younger@cmu.edu NIDHI KOHLI R

More information

AUTOMATED MALARIA PARASITE DETECTION BASED ON IMAGE PROCESSING PROJECT REFERENCE NO.: 38S1511

AUTOMATED MALARIA PARASITE DETECTION BASED ON IMAGE PROCESSING PROJECT REFERENCE NO.: 38S1511 AUTOMATED MALARIA PARASITE DETECTION BASED ON IMAGE PROCESSING PROJECT REFERENCE NO.: 38S1511 COLLEGE : BANGALORE INSTITUTE OF TECHNOLOGY, BENGALURU BRANCH : COMPUTER SCIENCE AND ENGINEERING GUIDE : DR.

More information

An Automated Face Reader for Fatigue Detection

An Automated Face Reader for Fatigue Detection An Automated Face Reader for Fatigue Detection Haisong Gu Dept. of Computer Science University of Nevada Reno Haisonggu@ieee.org Qiang Ji Dept. of ECSE Rensselaer Polytechnic Institute qji@ecse.rpi.edu

More information

Mel Spectrum Analysis of Speech Recognition using Single Microphone

Mel Spectrum Analysis of Speech Recognition using Single Microphone International Journal of Engineering Research in Electronics and Communication Mel Spectrum Analysis of Speech Recognition using Single Microphone [1] Lakshmi S.A, [2] Cholavendan M [1] PG Scholar, Sree

More information

A Mathematical model for the determination of distance of an object in a 2D image

A Mathematical model for the determination of distance of an object in a 2D image A Mathematical model for the determination of distance of an object in a 2D image Deepu R 1, Murali S 2,Vikram Raju 3 Maharaja Institute of Technology Mysore, Karnataka, India rdeepusingh@mitmysore.in

More information

Convolutional Neural Networks: Real Time Emotion Recognition

Convolutional Neural Networks: Real Time Emotion Recognition Convolutional Neural Networks: Real Time Emotion Recognition Bruce Nguyen, William Truong, Harsha Yeddanapudy Motivation: Machine emotion recognition has long been a challenge and popular topic in the

More information

Book Cover Recognition Project

Book Cover Recognition Project Book Cover Recognition Project Carolina Galleguillos Department of Computer Science University of California San Diego La Jolla, CA 92093-0404 cgallegu@cs.ucsd.edu Abstract The purpose of this project

More information

Biometrics Final Project Report

Biometrics Final Project Report Andres Uribe au2158 Introduction Biometrics Final Project Report Coin Counter The main objective for the project was to build a program that could count the coins money value in a picture. The work was

More information

Main Subject Detection of Image by Cropping Specific Sharp Area

Main Subject Detection of Image by Cropping Specific Sharp Area Main Subject Detection of Image by Cropping Specific Sharp Area FOTIOS C. VAIOULIS 1, MARIOS S. POULOS 1, GEORGE D. BOKOS 1 and NIKOLAOS ALEXANDRIS 2 Department of Archives and Library Science Ionian University

More information

AI Application Processing Requirements

AI Application Processing Requirements AI Application Processing Requirements 1 Low Medium High Sensor analysis Activity Recognition (motion sensors) Stress Analysis or Attention Analysis Audio & sound Speech Recognition Object detection Computer

More information

A Novel Fuzzy Neural Network Based Distance Relaying Scheme

A Novel Fuzzy Neural Network Based Distance Relaying Scheme 902 IEEE TRANSACTIONS ON POWER DELIVERY, VOL. 15, NO. 3, JULY 2000 A Novel Fuzzy Neural Network Based Distance Relaying Scheme P. K. Dash, A. K. Pradhan, and G. Panda Abstract This paper presents a new

More information

Multimodal Face Recognition using Hybrid Correlation Filters

Multimodal Face Recognition using Hybrid Correlation Filters Multimodal Face Recognition using Hybrid Correlation Filters Anamika Dubey, Abhishek Sharma Electrical Engineering Department, Indian Institute of Technology Roorkee, India {ana.iitr, abhisharayiya}@gmail.com

More information

Determining Guava Freshness by Flicking Signal Recognition Using HMM Acoustic Models

Determining Guava Freshness by Flicking Signal Recognition Using HMM Acoustic Models Determining Guava Freshness by Flicking Signal Recognition Using HMM Acoustic Models Rong Phoophuangpairoj applied signal processing to animal sounds [1]-[3]. In speech recognition, digitized human speech

More information

A New Fake Iris Detection Method

A New Fake Iris Detection Method A New Fake Iris Detection Method Xiaofu He 1, Yue Lu 1, and Pengfei Shi 2 1 Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China {xfhe,ylu}@cs.ecnu.edu.cn

More information

A SURVEY ON HAND GESTURE RECOGNITION

A SURVEY ON HAND GESTURE RECOGNITION A SURVEY ON HAND GESTURE RECOGNITION U.K. Jaliya 1, Dr. Darshak Thakore 2, Deepali Kawdiya 3 1 Assistant Professor, Department of Computer Engineering, B.V.M, Gujarat, India 2 Assistant Professor, Department

More information

Research on Hand Gesture Recognition Using Convolutional Neural Network

Research on Hand Gesture Recognition Using Convolutional Neural Network Research on Hand Gesture Recognition Using Convolutional Neural Network Tian Zhaoyang a, Cheng Lee Lung b a Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China E-mail address:

More information

Dimension Reduction of the Modulation Spectrogram for Speaker Verification

Dimension Reduction of the Modulation Spectrogram for Speaker Verification Dimension Reduction of the Modulation Spectrogram for Speaker Verification Tomi Kinnunen Speech and Image Processing Unit Department of Computer Science University of Joensuu, Finland Kong Aik Lee and

More information

Demosaicing Algorithm for Color Filter Arrays Based on SVMs

Demosaicing Algorithm for Color Filter Arrays Based on SVMs www.ijcsi.org 212 Demosaicing Algorithm for Color Filter Arrays Based on SVMs Xiao-fen JIA, Bai-ting Zhao School of Electrical and Information Engineering, Anhui University of Science & Technology Huainan

More information

Song Shuffler Based on Automatic Human Emotion Recognition

Song Shuffler Based on Automatic Human Emotion Recognition Recent Advances in Technology and Engineering (RATE-2017) 6 th National Conference by TJIT, Bangalore International Journal of Science, Engineering and Technology An Open Access Journal Song Shuffler Based

More information

A Proposal for Security Oversight at Automated Teller Machine System

A Proposal for Security Oversight at Automated Teller Machine System International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.18-25 A Proposal for Security Oversight at Automated

More information

Hand Gesture Recognition System Using Camera

Hand Gesture Recognition System Using Camera Hand Gesture Recognition System Using Camera Viraj Shinde, Tushar Bacchav, Jitendra Pawar, Mangesh Sanap B.E computer engineering,navsahyadri Education Society sgroup of Institutions,pune. Abstract - In

More information

Design and Implementation of an Audio Classification System Based on SVM

Design and Implementation of an Audio Classification System Based on SVM Available online at www.sciencedirect.com Procedia ngineering 15 (011) 4031 4035 Advanced in Control ngineering and Information Science Design and Implementation of an Audio Classification System Based

More information

Multi-User Blood Alcohol Content Estimation in a Realistic Simulator using Artificial Neural Networks and Support Vector Machines

Multi-User Blood Alcohol Content Estimation in a Realistic Simulator using Artificial Neural Networks and Support Vector Machines Multi-User Blood Alcohol Content Estimation in a Realistic Simulator using Artificial Neural Networks and Support Vector Machines ROBINEL Audrey & PUZENAT Didier {arobinel, dpuzenat}@univ-ag.fr Laboratoire

More information

An Optimization of Audio Classification and Segmentation using GASOM Algorithm

An Optimization of Audio Classification and Segmentation using GASOM Algorithm An Optimization of Audio Classification and Segmentation using GASOM Algorithm Dabbabi Karim, Cherif Adnen Research Unity of Processing and Analysis of Electrical and Energetic Systems Faculty of Sciences

More information

Segmentation using Saturation Thresholding and its Application in Content-Based Retrieval of Images

Segmentation using Saturation Thresholding and its Application in Content-Based Retrieval of Images Segmentation using Saturation Thresholding and its Application in Content-Based Retrieval of Images A. Vadivel 1, M. Mohan 1, Shamik Sural 2 and A.K.Majumdar 1 1 Department of Computer Science and Engineering,

More information

Face Detection System on Ada boost Algorithm Using Haar Classifiers

Face Detection System on Ada boost Algorithm Using Haar Classifiers Vol.2, Issue.6, Nov-Dec. 2012 pp-3996-4000 ISSN: 2249-6645 Face Detection System on Ada boost Algorithm Using Haar Classifiers M. Gopi Krishna, A. Srinivasulu, Prof (Dr.) T.K.Basak 1, 2 Department of Electronics

More information

Image Extraction using Image Mining Technique

Image Extraction using Image Mining Technique IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719 Vol. 3, Issue 9 (September. 2013), V2 PP 36-42 Image Extraction using Image Mining Technique Prof. Samir Kumar Bandyopadhyay,

More information

Text Emotion Detection using Neural Network

Text Emotion Detection using Neural Network International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 7, Number 2 (2014), pp. 153-159 International Research Publication House http://www.irphouse.com Text Emotion Detection

More information

An Un-awarely Collected Real World Face Database: The ISL-Door Face Database

An Un-awarely Collected Real World Face Database: The ISL-Door Face Database An Un-awarely Collected Real World Face Database: The ISL-Door Face Database Hazım Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs (ISL), Universität Karlsruhe (TH), Am Fasanengarten 5, 76131

More information

Robust Hand Gesture Recognition for Robotic Hand Control

Robust Hand Gesture Recognition for Robotic Hand Control Robust Hand Gesture Recognition for Robotic Hand Control Ankit Chaudhary Robust Hand Gesture Recognition for Robotic Hand Control 123 Ankit Chaudhary Department of Computer Science Northwest Missouri State

More information

AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS

AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS Kuldeep Kumar 1, R. K. Aggarwal 1 and Ankita Jain 2 1 Department of Computer Engineering, National Institute

More information

Realizing Human-Centricity: Data-Driven Services

Realizing Human-Centricity: Data-Driven Services Realizing Human-Centricity: Data-Driven Services Ajay Chander R&D Lead, Data Driven Life Innovations Fujitsu Laboratories of America January 22, 2014 INTERNAL USE ONLY Copyright 2014 FUJITSU LIMITED Context:

More information

Multimedia Forensics

Multimedia Forensics Multimedia Forensics Using Mathematics and Machine Learning to Determine an Image's Source and Authenticity Matthew C. Stamm Multimedia & Information Security Lab (MISL) Department of Electrical and Computer

More information

Classification in Image processing: A Survey

Classification in Image processing: A Survey Classification in Image processing: A Survey Rashmi R V, Sheela Sridhar Department of computer science and Engineering, B.N.M.I.T, Bangalore-560070 Department of computer science and Engineering, B.N.M.I.T,

More information

Support Vector Machine Classification of Snow Radar Interface Layers

Support Vector Machine Classification of Snow Radar Interface Layers Support Vector Machine Classification of Snow Radar Interface Layers Michael Johnson December 15, 2011 Abstract Operation IceBridge is a NASA funded survey of polar sea and land ice consisting of multiple

More information

Blue Eyes Technology with Electric Imp Explorer Kit Ankita Shaily*, Saurabh Anand I.

Blue Eyes Technology with Electric Imp Explorer Kit Ankita Shaily*, Saurabh Anand I. ABSTRACT 2018 IJSRST Volume 4 Issue6 Print ISSN: 2395-6011 Online ISSN: 2395-602X National Conference on Smart Computation and Technology in Conjunction with The Smart City Convergence 2018 Blue Eyes Technology

More information

Electronic disguised voice identification based on Mel- Frequency Cepstral Coefficient analysis

Electronic disguised voice identification based on Mel- Frequency Cepstral Coefficient analysis International Journal of Scientific and Research Publications, Volume 5, Issue 11, November 2015 412 Electronic disguised voice identification based on Mel- Frequency Cepstral Coefficient analysis Shalate

More information

How to build an autonomous anything

How to build an autonomous anything How to build an autonomous anything Loren Shure Application Engineering MathWorks 2015 The MathWorks, Inc. 1 2 3 4 5 6 7 Autonomous Technology 8 Autonomous Technology Having the power for self-governance

More information

Integrated Driving Aware System in the Real-World: Sensing, Computing and Feedback

Integrated Driving Aware System in the Real-World: Sensing, Computing and Feedback Integrated Driving Aware System in the Real-World: Sensing, Computing and Feedback Jung Wook Park HCI Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA, USA, 15213 jungwoop@andrew.cmu.edu

More information

Human Authentication from Brain EEG Signals using Machine Learning

Human Authentication from Brain EEG Signals using Machine Learning Volume 118 No. 24 2018 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ Human Authentication from Brain EEG Signals using Machine Learning Urmila Kalshetti,

More information

MUSICAL GENRE CLASSIFICATION OF AUDIO DATA USING SOURCE SEPARATION TECHNIQUES. P.S. Lampropoulou, A.S. Lampropoulos and G.A.

MUSICAL GENRE CLASSIFICATION OF AUDIO DATA USING SOURCE SEPARATION TECHNIQUES. P.S. Lampropoulou, A.S. Lampropoulos and G.A. MUSICAL GENRE CLASSIFICATION OF AUDIO DATA USING SOURCE SEPARATION TECHNIQUES P.S. Lampropoulou, A.S. Lampropoulos and G.A. Tsihrintzis Department of Informatics, University of Piraeus 80 Karaoli & Dimitriou

More information

Classification of Road Images for Lane Detection

Classification of Road Images for Lane Detection Classification of Road Images for Lane Detection Mingyu Kim minkyu89@stanford.edu Insun Jang insunj@stanford.edu Eunmo Yang eyang89@stanford.edu 1. Introduction In the research on autonomous car, it is

More information

A Smart Home Design and Implementation Based on Kinect

A Smart Home Design and Implementation Based on Kinect 2018 International Conference on Physics, Computing and Mathematical Modeling (PCMM 2018) ISBN: 978-1-60595-549-0 A Smart Home Design and Implementation Based on Kinect Jin-wen DENG 1,2, Xue-jun ZHANG

More information

A TWO-PART PREDICTIVE CODER FOR MULTITASK SIGNAL COMPRESSION. Scott Deeann Chen and Pierre Moulin

A TWO-PART PREDICTIVE CODER FOR MULTITASK SIGNAL COMPRESSION. Scott Deeann Chen and Pierre Moulin A TWO-PART PREDICTIVE CODER FOR MULTITASK SIGNAL COMPRESSION Scott Deeann Chen and Pierre Moulin University of Illinois at Urbana-Champaign Department of Electrical and Computer Engineering 5 North Mathews

More information

A Review of Related Work on Machine Learning in Semiconductor Manufacturing and Assembly Lines

A Review of Related Work on Machine Learning in Semiconductor Manufacturing and Assembly Lines A Review of Related Work on Machine Learning in Semiconductor Manufacturing and Assembly Lines DI Darko Stanisavljevic VIRTUAL VEHICLE DI Michael Spitzer VIRTUAL VEHICLE i-know 16 18.-19.10.2016, Graz

More information

Human-Computer Intelligent Interaction: A Survey

Human-Computer Intelligent Interaction: A Survey Human-Computer Intelligent Interaction: A Survey Michael Lew 1, Erwin M. Bakker 1, Nicu Sebe 2, and Thomas S. Huang 3 1 LIACS Media Lab, Leiden University, The Netherlands 2 ISIS Group, University of Amsterdam,

More information