Personalized Identification of Unusual User Events in Mobile Cloud Systems via a Hybrid Fusion Design


University of Colorado, Boulder
CU Scholar, Computer Science Graduate Theses & Dissertations
Spring 2013

Personalized Identification of Unusual User Events in Mobile Cloud Systems via a Hybrid Fusion Design

Junho Ahn, University of Colorado at Boulder, junho.ahn@colorado.edu

Recommended Citation:
Ahn, Junho, "Personalized Identification of Unusual User Events in Mobile Cloud Systems via a Hybrid Fusion Design" (2013). Computer Science Graduate Theses & Dissertations.

This Dissertation is brought to you for free and open access by Computer Science at CU Scholar. It has been accepted for inclusion in Computer Science Graduate Theses & Dissertations by an authorized administrator of CU Scholar. For more information, please contact cuscholaradmin@colorado.edu.

Personalized Identification of Unusual User Events in Mobile Cloud Systems via a Hybrid Fusion Design

by

Junho Ahn

B.A., Hongik University, 2006
M.S., Yonsei University, 2008

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science, 2013.

This thesis entitled:
Personalized Identification of Unusual User Events in Mobile Cloud Systems via a Hybrid Fusion Design
written by Junho Ahn
has been approved for the Department of Computer Science.

Prof. Richard Han
Prof. Shivakant Mishra
Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Ahn, Junho (Ph.D., Computer Science)
Personalized Identification of Unusual User Events in Mobile Cloud Systems via a Hybrid Fusion Design
Thesis directed by Prof. Richard Han

We demonstrate the feasibility of constructing a mobile cloud system that efficiently, conveniently and accurately fuses multimodal smartphone sensor data to identify and log unusual personal events in mobile users' daily lives. Our myblackbox system is designed to leverage a smartphone as a personalized blackbox-like recorder. Within the system, we develop new location-based classifiers for audio and accelerometer data that are personalized and noise-resistant. The system incorporates a hybrid architectural design that combines unsupervised classification of audio, accelerometer and location data with supervised joint fusion classification to achieve good accuracy, customization, convenience and scalability. We identify the best supervised learning algorithm for fusing multi-modal mobile sensor data for unusual event identification and characterize its improvement in accuracy over location-based audio and activity classifiers. Finally, we show the feasibility of the myblackbox concept by implementing and evaluating an end-to-end system that combines Android smartphones with a cloud server, deployed with fifteen users over a one-month period.

Dedication

This thesis is dedicated to my family & friends.

Acknowledgements

Much of the research in this thesis was conducted in collaboration with Prof. Richard Han. My PhD was primarily funded by the National Science Foundation and the University of Colorado.

Contents

Chapter 1  Introduction
    1.1  Thesis Statement
    1.2  Research Contributions

Chapter 2  Related Works

Chapter 3  System Challenges and Design
    3.1  Design assumptions and goals
    3.2  Algorithm Challenges and Design
    3.3  System Design Challenges
    3.4  System Architecture

Chapter 4  User Behavior Classifiers
    4.1  Data Collection
    4.2  Location-based Audio Classifier
        Training the Basic Audio Classifier
        Location-based Audio Modeling for Unusual Event Detection
        Performance Results of the Audio Classifiers
    4.3  Location-based Activity Classifier
    4.4  Detecting Unusual Locations

Chapter 5  Fusion Algorithms and Evaluation
    Fusion algorithms
    Optimal Fusion Algorithms
    Determining fusion parameters
        Results for an optimal classification period
        Optimal threshold for the CI algorithm
        Convergence speed of training
    General versus Personalized Fusion Model
    Fusion Performance vs. Individual Classifiers

Chapter 6  End-to-End myblackbox Mobile Cloud System

Chapter 7  myblackbox Performance Evaluation
    Accuracy of the Fusion Algorithms
        Noise Removal
    Fusion Performance vs. Location-based Activity and Audio Classifiers
    Performance of Location Classifier
    myblackbox System Evaluation
    Summary

Chapter 8  Discussion and Future Work

Chapter 9  Conclusions

Bibliography

Tables

Table 4.1: Audio results
Example of one user's audio classification results for two different locations
Gaussian distribution results of the above audio classification results for one user
Test sample results and classifications
Audio classification results according to the mobile's carrying location
Example of one user's activity classification results for two different locations
Gaussian distribution results of the above activity classification results for one user

Figures

Figure 2.1: Existing unusual location event applications
Figure 3.1: Process for building an unusual event detection model using mobile sensor data
Figure 3.2: Diagram of unsupervised and supervised learning algorithms with general versus personalized model choices
Figure 3.3: Architectures of the myblackbox mobile component and the myblackbox cloud server
Figure 3.4: Diagram to store sensor data on the mobile phone
Figure 4.1: Phone survey: Displaying sensor data
Figure 4.2: Sound patterns detected by the MFCC algorithm: (a) low level noise, (b) talking voice, (c) music sound, and (d) angry sound pattern frequencies
An example showing a similar percentage pattern of the four audio classifications for one subject's repeated visits in one location
(a) Histogram of 30-minute audio classifications for one subject's repeated visits to the same location, (b) Quantile-Quantile plot using the histogram data
An example for measuring standard deviations for each audio type over 10 visits for one subject
Standard deviations of four audio classifications measured for 20 subjects in two different locations: 1 and 2
Figure 4.7: Comparison of our location-based audio classifier for detecting unusual user events with a basic existing approach that is location-independent
Accuracy variation of unusual voice classification depending on mobile placement on the human body
Unusual high impact activities measured on the mobile phone
High impact-based normal activity detected from the mobile phone survey subjects' data
An example showing a similar percentage pattern for three types of normal impact activity for one subject's repeated visits in one location
Quantile-Quantile plot using 30-minute activity classifications for one subject's repeated visits to the same location
(a) Standard deviations of three impact activities measured for 20 subjects in two different locations: 1 and 2, (b) An example for measuring standard deviations for each impact activity over 12 visits for one subject
Comparison of our approach with an existing approach for detecting unusual user events in shopping centers
Location history collected from 20 subjects for one week
Accuracy measurements of six thresholds (1% to 6%)
Standard deviation measurement of location history collected from 20 subjects for one week
Classification design using the four binary algorithms
Precision, recall, f-measure, and accuracy of four classification algorithms for detecting normal events
Precision, recall, f-measure, and accuracy of four classification algorithms for detecting unusual events
The accuracy of four classification algorithms according to individual survey subject for identifying unusual events
Figure 5.5: Precision, recall, and accuracy according to a period for detecting normal events (a) activity data (b) audio data
Precision, recall, and accuracy according to a period for detecting unusual events (a) activity data (b) audio data
Precision, recall, and accuracy of the confidence interval in detecting normal events (a) activity data (b) audio data
Precision, recall, and accuracy of the confidence interval in detecting unusual events (a) activity data (b) audio data
Precision, recall, and accuracy according to the number of days of training data for detecting normal events (a) activity data (b) audio data
Precision, recall, and accuracy according to the number of days of training data for detecting unusual events (a) activity data (b) audio data
Accuracy evaluation of the General model versus the Personalized model for identifying unusual mobile user events
Comparing performance results of individual classifiers and the fusion algorithm using one week's data
Screen shot of myblackbox application
Collection server using Spring Roo
Web server using MongoDB
Precision, recall, and accuracy of four classification algorithms for detecting normal events
Precision, recall, and accuracy of four classification algorithms for detecting unusual events
Average of precision, recall, f-measure, and accuracy for detecting unusual events using CI
Average of precision, recall, f-measure, and accuracy for detecting normal events using CI
Individual accuracy measurement for event detection using CI
Comparing performance results of the hybrid fusion algorithm to location-based activity and audio classifiers using one month of data
Figure 7.7: Performance results of our location classifier using one month's data
One hour of CPU usage of the myblackbox application
Evaluation of (a) network and (b) storage I/O usage of the myblackbox application
Smartphone statistics of participants using the myblackbox application
Emulation results of data transfer from a smartphone to our cloud server for one 30-minute period
myblackbox application's battery usage for one 30-minute period

Chapter 1  Introduction

People who find themselves in dangerous or emergency situations, such as crimes, health emergencies, or car accidents, will usually attempt to call 911 from their smartphones to get help. The NENA (National Emergency Number Association) [30] has estimated that 240 million emergency calls are made to 911 in the U.S. every year, and the FCC (Federal Communications Commission) [28] reports that 70 percent of these calls are placed from smartphones. The smartphone is thus an essential device that can help people be rescued quickly in emergency situations. However, in an emergency, people are often too injured, or otherwise unable, to call 911 or a relative to get help. According to statistics [23] from a 1999 national survey reported on the Keep Schools Safe website, more than one in three high school students had been in a physical fight within the previous year. In the midst of such situations, students involved in physical fights may be seriously wounded or hurt without being able to get help. More generally, people might be involved in a fight, a violent attack, or a health emergency, or trapped in a vehicle or structure. In these cases, people will often yell, cry, or scream out loud to obtain help [44]. If people in the vicinity hear these unusual human voice sounds, or cries for help, they can seek help for the victims by calling 911 for them. However, if nobody can come to the victims' rescue or call 911 for them, they may be seriously wounded or continuously victimized or endangered and never receive help.

Traditional blackboxes [5, 7, 8, 6, 11] are used in emergency situations to record the events leading up to a disaster, such as a plane crash, in order to aid investigators in determining the causative factors. They are typically designed to record the most recent data, and are ruggedly built to withstand loss of power and

extreme physical stress. In myblackbox, our idea is to adopt the basic spirit of recording unusual events with black boxes and extend this functionality into the mobile phone domain. myblackbox provides a basic system that efficiently and accurately detects and records unusual events of smartphone users, both on the mobile and in the cloud [92, 9, 1, 83, 40], based on mobile sensor data.

There already exist some human event detection systems [52, 68, 89, 42, 90] similar to ours, which are used to identify unusual human behaviors and patterns using activity, video, or audio sensors. These systems, however, are either not developed for the mobile phone at all (e.g., they use only a stationary video recorder) or are minimally implemented on the smartphone (e.g., they only use the activity sensor of the mobile). For our current research, we have designed our system to utilize the following sensors of the mobile phone simultaneously: audio, location, activity, and mobile status, in order to detect and log mobile users' daily events, and then to classify them as either normal or unusual events. We define unusual/abnormal events to be infrequently generated behaviors of mobile phone users, such as extremely increased or decreased physical activity, infrequently visited locations, or unusual audio identification. To decide on the threshold frequency for determining whether an event was unusual or not, we needed feedback from the users. The threshold frequency may vary across users, so we queried each participant in our studies on the mobile to verify whether that user would label an infrequently occurring event as an unusual event in their life. From this survey, we determined, for example, that the majority of the 15 subjects in our research experiment confirmed that the places they visited less than 2% of the time during a one-month period constituted unusual events in their normal schedule.

By building myblackbox, our goal is to develop a system that is robust and general purpose enough to support all but the most demanding of such applications. Today's smartphones possess a wide array of heterogeneous sensors, e.g. location, audio, accelerometer, etc. Therefore, our challenge is to devise a practical strategy that fuses together data from all of the mobile sensors in order to accurately detect and record unusual human behavior events of users of our system. It must also account for other types of limitations imposed by the mobile scenario, such as sporadic disconnection. Fundamental and common to all these types of applications is the underlying core system of mobile + cloud that efficiently detects and

logs normal and unusual mobile user events.

1.1 Thesis Statement

We demonstrate the feasibility of constructing a mobile cloud system that efficiently, conveniently and accurately fuses multimodal smartphone sensor data to identify and log unusual personal events in mobile users' daily lives. We devise a hybrid fusion algorithm that achieves accurate, convenient, customized, and scalable identification of unusual user events, and demonstrate the feasibility of the myblackbox concept by implementing the hybrid fusion design into a real-world end-to-end system.

1.2 Research Contributions

This thesis work claims the following research contributions:

(1) Improved classifiers combining location history with audio sensor data, as well as location history with accelerometer sensor data, were developed to identify unusual user events by exploiting personal historical data and filtering noise.

(2) A hybrid architecture was designed that combines unsupervised classification of location-based audio and location-based activity with supervised joint fusion classification to achieve acceptable accuracy, customization, convenience, and scalability.

(3) This work identified the best supervised learning algorithm for fusing together multi-modal sensor data for identifying unusual user events and characterized its improvement in accuracy over location-based audio and activity classifiers.

(4) A complete end-to-end mobile cloud system, called myblackbox, for efficient detection and logging of unusual user events was implemented and evaluated in a real-world deployment on Android smartphones and a cloud server.

We begin by describing related work in this field in Section 2. Section 3 describes the design goals and assumptions of the myblackbox mobile cloud system, as well as the algorithmic and architectural

challenges faced by our research. Section 4 describes the new location-based audio and location-based activity classifiers that were developed for unusual event detection. Section 5 describes our approach for fusing together the results of the location-based audio and activity classifiers in a manner that minimizes user involvement while achieving reasonable accuracy, and identifies the best performing fusion algorithm. In Sections 6 and 7, we describe the end-to-end implementation of the myblackbox system and evaluate its robustness and scalability. Ideas for future work are presented in Section 8. The conclusions are presented in the final section, Section 9.

Chapter 2  Related Works

There are mobile-based applications and systems [85, 73, 80, 74, 66, 79, 94, 81, 82, 63] which are able to predict users' behaviors using mobile phone sensors. These applications use adaptive algorithms that save the mobile's battery power and are sometimes used to infer or predict mobile users' future behavior or behavior patterns. The mobile users' behaviors (activity, movement, location, audio pattern, etc.) are measured by the mobile's sensors (accelerometer, GPS, Wi-Fi, audio, etc.). The accelerometer sensor on the mobile phone can be used for identifying users' activity and movement patterns. The GPS and Wi-Fi sensors are used for locating users in outdoor or indoor areas. The audio sensor is used for classifying sounds in a user's immediate environment (e.g., talking, music, background noise, etc.) to help identify a user's location or behavior. Each sensor's functionality in these applications is used to build optimal or efficient algorithms that can predict a mobile user's behavior or conserve battery power on the mobile phone. These applications, however, are limited in their scope and either do not combine any sensor data measurements at all or combine only some, but not all, sensor data. They therefore do not provide a comprehensive classification of a user's behavior that constitutes an unusual event in a user's daily life pattern, since they can only measure one behavior at a time (e.g., movement into an unusual/new location, unusual sounds, unusual activity). For our research presented here, we have further investigated how to design and implement a similar mobile-based application in which multi-sensor data can be fused together and used to identify when unusual events occur in a mobile user's daily behavior patterns.

Specifically, there are existing systems and algorithms [85, 73, 80, 74, 66, 91] that are used to identify a normal user's location based on the mobile GPS sensor. These systems and algorithms have not focused

on unusual location events, but rather on solving energy-accuracy trade-off problems, using adaptive methods that expend more power when higher accuracy is needed and save power when less accuracy is required. To address the limitations of simple duty cycling, adaptive algorithms [73, 85] have been proposed as a solution. A-Loc [73] continually fine-tunes the energy consumption to meet changing accuracy requirements by using the GPS, Wi-Fi, Bluetooth, and cell-tower sensors on the mobile phone. A user chooses a destination and the algorithm adjusts to use the most suitable sensor, according to the destination's distance from the user. If a user is far from the destination, the algorithm uses a cell tower or Wi-Fi to localize the user and to conserve battery power. However, if the user is close to the destination, the algorithm uses the GPS sensor to obtain a more accurate location, which consumes considerable battery power. This algorithm is also used to detect a user's daily walking path and gait, but as noted, it requires the user to first select a destination in order to use the adaptive algorithm. It is also unable to provide accurate localization data the farther a user is from the destination. RAPS (rate-adaptive positioning system) [85] localizes a user's location without requiring the user to select a destination, by using an adaptive algorithm supported by mobile sensors such as GPS, Wi-Fi, Bluetooth, cell tower, and accelerometer. This algorithm has increased location accuracy and consumes less battery power, while also measuring the user's walking path and pattern. VTrack [91] improves localization accuracy by using a map matching scheme and a travel-time estimation method that interpolates sparse data to identify the most probable road segments driven by the user and to attribute travel times to those segments. Existing research in this area is primarily focused on saving mobile battery power or improving accuracy, and is not used to identify unusual location events.

Some other GPS tracking systems [42, 88, 54] have been used to identify unusual locations encountered by Alzheimer's patients, who use mobile phones installed with these systems. These systems learn the patients' daily-visited areas and movement range within their houses, or within a prescribed indoor/outdoor area of a certain range, and then divide this range into safe and unsafe areas. The patients usually stay within one or two main area ranges, and these systems simply identify when the patients wander into new/unusual, possibly unsafe, locations. This GPS tracking approach is sufficient for purposes of monitoring Alzheimer's patients, but is of limited use when applied to normal mobile users, because it focuses

on tracking users' movement within a very limited range. Normal everyday mobile users visit many more places and move within much wider ranges throughout their daily lives.

In reviewing other related research work, we found human event detection systems similar to ours that are used to identify human behavior and patterns using activity, video, or audio sensors. In one research study [42, 90], the researchers used the activity sensor on the mobile phone for detecting any unusual falling events of the subjects. This research focused on detecting falling behaviors of elderly people or hospital patients, but is limited in its application for measuring other activity and behavior patterns of the general population. Other research [52, 36] we investigated made use of a video camera to record and measure subjects' behaviors, by installing a video camera in one room where the research took place. However, this approach is limited to a specific range and fixed location, whereas our mobile application is transportable and easily deployable for multiple users over a wide area and range. In other studies, we found audio detection systems being used to detect unusual events in subjects' immediate environments. The audio systems [68, 89] for detecting unusual, and possibly dangerous, events were more simplified systems than ours, employing audio sensors in a specific area to capture any threatening or unusual human voice sounds, such as screaming, shouting, etc. Such a system could detect and classify unusual sound events of people in a prescribed, limited area, but it was only tested in a laboratory setting, with no background noise (e.g., traffic, large crowds talking) present. The two main limitations of a basic system like this are that not all abnormal, loud audio sounds are actual unusual events, and that fully testing such a system would require implementing it in a real-world setting. Further testing in a real-world setting, with training data and a mechanism to analyze all human voice and background noise sounds in the area of the deployed sensors, is needed to verify the system's accuracy in correctly identifying loud human-generated sounds as unusual events in the subjects' lives.

Existing unusual event detection applications [25, 24] provide family members' current locations or historical locations to detect unusual location events, as shown in Figure 2.1. The Family Locator Monitor [25] and Family Locator PRO [24] applications are able to share users' locations among family members using the GPS on the mobile phone. However, they require family members who use these applications to monitor another member, and to manually check that member's location frequently. These mobile applications also

Figure 2.1: Existing unusual location event applications.

only use one dimension of the mobile's sensor data.

For our research, we classified audio data using existing audio classification algorithms that can measure combined audio patterns (i.e., low level noise, talking voice, music, and loud emotional sounds). Some sound detection applications [95, 89, 79, 94] are capable of classifying sounds into general sound types, such as a gunshot, a screaming voice sound, speaking voice sounds, etc. The popular algorithms used in these applications to classify audio data are the MFCC (Mel-Frequency Cepstral Coefficients) [39, 78, 55, 93] and GMM (Gaussian Mixture Models) [61, 51, 43, 56] algorithms. Our system also used the MFCC algorithm to extract sound features and the GMM algorithm to match the mobile users' data against the trained features to identify the mobile users' audio patterns. We describe in detail in this thesis how we have used these algorithms in our system to develop a location-based audio classifier for unusual user events.

A key theme of this research is fusing together multi-modal mobile sensor data to arrive at an improved classification of unusual user events. We investigated tree-based binary fusion classification algorithms that could be used to identify when mobile users encounter an unusual situation or are involved in an unusual event (different from their daily patterns). We compared four popular fusion classification algorithms, Bagging [77], Adaboost [96], SVM (Support Vector Machine) [49], and CI (Confidence Interval), to find the best fusion algorithm for use in our system. Bagging is a bootstrap aggregating algorithm that uses a machine learning ensemble method to build an improved classification model, using average predictions or majority voting from multiple models. The Adaboost algorithm also uses the ensemble method, building a strong classification model from weak classifiers to improve classification accuracy.

SVM is a supervised learning algorithm which is used to build an optimal linear classification model. CI is a statistical method used to determine the interval range within which the probability of a given hypothesis can be said to be true or not. We built a classification model according to each of these tree-based binary fusion algorithms and compared them to find the best fusion algorithm to use in our system.
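As a rough illustration of such a comparison, the sketch below trains three of the candidate fusion classifiers with scikit-learn and applies a simple confidence-interval rule to the same features. The feature layout (per-window probabilities produced by the location-based audio and activity classifiers) and the synthetic data are assumptions made for this example only; they are not the study's dataset or its exact training procedure.

```python
# Illustrative comparison of candidate fusion classifiers on synthetic data.
# Assumed feature layout: two columns holding the per-30-minute-window
# probabilities produced by the location-based audio and activity classifiers.
import numpy as np
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(0.0, 1.0, size=(n, 2))
# Toy ground truth: a window is "unusual" when either probability is extreme,
# plus a little label noise so no classifier is trivially perfect.
y = ((X < 0.05) | (X > 0.95)).any(axis=1).astype(int)
flip = rng.random(n) < 0.05
y[flip] = 1 - y[flip]

models = {
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")

# Confidence-interval (CI) style rule: flag any window whose per-sensor
# probability falls outside a central band; no supervised training required.
ci_pred = ((X < 0.05) | (X > 0.95)).any(axis=1).astype(int)
print("CI rule:", (ci_pred == y).mean())
```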

Chapter 3  System Challenges and Design

3.1 Design assumptions and goals

The foremost design requirement for our proposed system is to achieve blackbox-like functionality for mobile users by both detecting unusual user events based on smartphone sensor data and logging those events for later retrieval in the cloud and on the mobile. In case the mobile is lost, logging of data to the cloud acts as a reliability mechanism that can help locate the last known whereabouts of a missing mobile user and/or recover the history of the mobile. Additional benefits accrue from having a cloud server store a log of a user's event trace. For example, parents wanting to monitor the safety of their children can log in to the myblackbox cloud server to inspect the record of activities forwarded from their kid's myblackbox mobile application, even if the kid's mobile is disconnected or off. Moreover, the parents can create callbacks on the server that notify them when a specific set of circumstances occurs. Caching logged data on the mobile brings its own benefits. It helps bridge temporary disconnection issues faced by wireless users even in seemingly well-connected cities. Moreover, if a user is lost hiking in a remote location without their cell phone, the discovery of their mobile and its logged data can help locate the last known whereabouts of that user as well as reconstruct a history of the lost user's actions leading up to that event, which may help in search and rescue.

Our system is designed to satisfy a variety of other design goals. Due to the diversity and dynamics of indoor and outdoor environments, user locations, and user activities, we need to accurately identify the unusual events that users generate (with fine granularity and robustness), with low latency, and without incurring too much overhead on the mobile device. Further, our system should use a combination of sensors

to improve classification of whether a user is experiencing an unusual or abnormal situation. Also, our system should log seamlessly and automatically in the background, and not require much, or any, user intervention. In addition, our mobile application is designed to be sufficiently energy-conserving that the phone may operate for at least twelve hours a day with the application running. Finally, we intend to leverage a cloud server for logging multiple users' unusual events in a scalable manner.

Our solution does not make unrealistic assumptions about the existence of specialized infrastructure, such as elaborate sensors, to assist with any of the above tasks. We assume only the capabilities and existence of sensors common to most standard smartphones (e.g. today's iPhones [13] and Android phones [3]), such as audio sensors capable of capturing continuous sounds, GPS capable of measuring locations, and accelerometers capable of measuring activities. We assume that Wi-Fi works indoors, which we've verified to be practically true in typical indoor settings. We assume there may be occasional wireless disconnection. We do not assume the existence of gyroscopes on the phone, since not all smartphones support them. We expect users to behave in a normal manner, namely carrying and using the phone as they typically would with our application running in the background, and exhibiting other usual behavior such as recharging the phone every evening or night [27, 41, 60]. Our task is then to show how, under these limiting assumptions, we can still construct a system that successfully supports our goals of achieving unusual event detection in practical, real-life settings.

3.2 Algorithm Challenges and Design

Given the above system goals, we are faced with the following algorithmic challenges.

How can we classify personal user behaviors using multi-modal mobile data? We confined ourselves to three common mobile sensors, namely location, activity, and audio. Figure 3.1 shows that our approach was to first build single-mode behavior classifiers and to then fuse their results into an integrated decision to identify unusual events [72]. The intent was to exploit the increased accuracy that should result from utilizing multiple sensors to make a collective decision identifying an event. The individual user behavior classifiers categorized mobile users' activity (e.g., stationary, slow walking,

Figure 3.1: Process for building an unusual event detection model using mobile sensor data.

walking, running), audio (e.g., low level sound, music, talking, loud emotional voice), and locations (ranked list of frequently visited locations). Based on data obtained from 20 mobile users, as explained later, we observed that an activity's normality or abnormality is context sensitive to both location and user. That is, a given user tends to exhibit repeatable activity at a set of frequented locations, such as sitting at work or running at the gym. Thus, simply identifying whether a user is running or not is insufficient for accurately identifying abnormality, which is a function of the user's location. Instead, it is important to develop a location-based model that identifies what is a user's normal behavior at each location. In addition, it is important that a different location-based model be developed for each distinct user. It may be normal for the audio data to reveal loud noises and screaming at a user's home if that user particularly enjoys watching action or horror movies at home. For another user, this may be highly unusual. Therefore, on top of the basic single-mode classifiers, we built location-aware audio models and location-aware activity models for each user. These are then used to generate early stage classification results based only on pairwise data. To combine all three modalities of data and improve accuracy, namely to combine the results of the location-based audio models with the location-based activity models, we employed fusion algorithms as described below.

What kind of hybrid fusion algorithm should we use to fuse multi-modal data to improve accuracy?

A variety of fusion algorithms exist that could be employed in our system. We investigated four common fusion algorithms, namely Bagging, Adaboost, Support Vector Machines (SVM), and Confidence Intervals (CI), and compared their accuracies. Other factors beyond accuracy were also considered, such as the complexity of training the algorithms and the complexity of implementing them on resource-constrained mobile devices. We found the Bagging-based hybrid fusion algorithm to perform the best of all of these algorithms for fusing the three modalities of mobile sensor data and accurately detecting unusual events in the mobile users' behavior patterns. However, we selected the trained CI-based algorithm, the second best performing of the four algorithms for our application, because it achieved similar accuracy and its computational complexity was the simplest to implement on the mobile phone. Thus, based on the CI algorithm, we implemented our unusual event detection algorithm for the myblackbox system.

How can we automatically detect unusual user events and thereby minimize inconvenience to the end user, i.e. not require the mobile user to manually label significant unusual events to train the system before it can be made useful to the individual? As noted above, developing classifiers that are personalized to each user's normal/abnormal behavior should yield the best results for unusual event classification. However, we wish to minimize and ideally eliminate manual labeling of data by the mobile end user, yet achieve personalized classification. Our approach was to employ a hybrid fusion approach in which unsupervised learning [71, 35, 64, 53] (without training data) is employed for the initial phase of location-based audio and location-based activity classification, while a supervised learning approach (with labeled training data) [84, 59, 70, 46] is used for the fusion classifier. Unsupervised learning means we can forgo user training and labeling, yet develop reasonably accurate models of unusual events that are customized to both location and user. These models can be developed on a user's mobile device in an automated manner, without requiring user input. We then rely on supervised learning at the next fusion stage to boost the accuracy.

Ultimately, doesn't each mobile user still need to be involved in the training loop to label whether a fusion result is truly a significant unusual personal event? We avoid this burdensome manual labeling requirement through the following key finding from our research: we can approximate the accuracy of a

Figure 3.2: Diagram of unsupervised and supervised learning algorithms with general versus personalized model choices.

personally trained fusion classifier with a classifier trained jointly from a general collection of mobile users. That is, our research results show that we can develop a general fusion classifier based on supervised learning that is trained by hiring, say, N users/strangers and paying them to label their data, and that this general fusion classifier achieves a similar accuracy to a personalized fusion classifier. In this way, we eliminate as well the need for the user to train the fusion classifier.

Figure 3.2 zooms in on the two critical classification stages of Figure 3.1, and summarizes the combined decision-making process of our hybrid fusion design for the myblackbox system. We now see that the pattern recognition and fusion algorithm steps comprise the unsupervised learning stage and the supervised learning stage, respectively. In the pattern recognition stage, which is based on cumulative historical user data, we chose a personalized model over a general model because we needed to measure individual users' behavior patterns over time as accurately as possible. If we had chosen a general model, it would not have been as sensitive to each individual's unique daily behavior patterns. For example, for this unsupervised learning algorithm we measured the probability that each individual's 30-minute segments of personalized behavior pattern data fell within their user-specific historical normal range of cumulatively measured daily behavior patterns. In the fusion stage, which is our supervised learning algorithm that uses labeled training data (user subjects' feedback on our classification algorithms), we determined that developing

a general unusual event model based on generic users' labeled data achieves essentially commensurate accuracy in our system compared to a personalized model. Therefore, our hybrid fusion design is hybrid in two dimensions: it combines both unsupervised and supervised learning, and it also combines both personalized and general models of unusual user event behavior. The end result is a hybrid fusion design that is both accurate and convenient for the mobile end users.

3.3 System Design Challenges

In addition to algorithmic challenges, we also faced the following system challenges, given the design goals of our system:

How do we devise an end-to-end system that combines mobile smartphones with remote cloud execution while balancing the resource constraints of mobile devices with the desire for our cloud server to scale to thousands of users? Our myblackbox system was initially designed to handle the data of hundreds of mobile users at one time. At first, we were challenged in determining where to place execution of the myblackbox algorithms and procedures: on the mobile phone or on the server. The mobile phone has more limited resources than the server, such as lower battery power and a slower CPU; however, the server cannot always handle the processing or network traffic overhead of a large number of users simultaneously, when the algorithmic processing takes inordinate amounts of time for each user's data. Our mobile application has to collect and send mobile sensor data periodically to either the phone or the server for processing. The myblackbox system also utilizes classification algorithms (location, activity and audio) and a fusion algorithm, which need to be run on either the phone or the server, to detect unusual events occurring in a user's daily life. For example, suppose our mobile application records one and a half minutes (six 15-second audio files) of audio every 30 minutes. If it then sends these 90-second audio files (12 Mbytes every 30 minutes) to the server, with 100 users there would be approximately 1.2 gigabytes of data that the network would have to process all at once. This large amount of data transfer would slow down the network considerably; the more

users, the less capable the network would be of handling these data. We found that the server also had an overhead processing time of 45 seconds to classify the six 15-second audio files collected every 30 minutes; thus, in 30 minutes the server could handle only 40 mobile users' data (one 90-second audio recording from each of them). Considering these limitations of the server, we determined that a more scalable solution was to distribute data collection and algorithmic processing for the myblackbox system to each mobile user's phone, and not execute these on the server. Hence, in our basic experiment, we implemented and ran our system on each user's smartphone and measured the processing times of each sensor's data collection to be: approximately 7.5 seconds every 5 minutes for audio classification, within 0.1 second for each activity classification, within 1 second every 30 minutes for location classification, and within 1 second every 30 minutes to process our fusion algorithm. We found that audio classification takes the most processing time, so we analyzed scalability based on it. While this approach alleviates the load on the server and network, it increases the processing on the mobile handheld. Our reasoning was that today's multi-core smartphones should have the ability to process classification of individual data, but that it was essential that we find suitable low complexity classifiers for hybrid fusion that are also accurate. We demonstrate later how we were able to achieve such classification on mobile devices.

To support blackbox-like remote logging to the cloud server in a scalable and bandwidth-efficient manner, we chose to send only summaries of the latest data from the mobile to the cloud server. We set up the mobile application to classify each individual mobile sensor's data, and to summarize and store the results, using our fusion algorithm, on the smartphone. Each mobile user's summary results are then sent to the server. In this manner, we found we could reduce the overhead of the server by running our application on the users' mobile phones. In addition, we tested and emulated this process and found our system could handle a maximum of 20,000 mobile users on our server.
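The capacity estimates above follow from simple arithmetic; the short sketch below only restates the numbers quoted in the text.

```python
# Back-of-the-envelope restatement of the capacity figures quoted above.
audio_mb_per_user_per_window = 12      # six 15-second WAV files per 30-minute window
users = 100
burst_gb = audio_mb_per_user_per_window * users / 1024
print(f"Upload burst per 30-minute window: {burst_gb:.2f} GB")    # ~1.2 GB

server_seconds_per_user = 45           # time to classify one user's six audio files
window_seconds = 30 * 60
print("Users served per window:", window_seconds // server_seconds_per_user)  # 40
```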

3.4 System Architecture

Given the above system goals and challenges, we made the following design decisions. The mobile and cloud server components are assembled according to the architecture shown in Figure 3.3.

Figure 3.3: Architectures of the myblackbox mobile component and the myblackbox cloud server.

In this architecture, the myblackbox application fuses together information from sensors such as the microphone, GPS, and accelerometer to estimate the likelihood of an unusual event. The fusion algorithm is built on top of individual classifiers for each of the three following sensor dimensions: audio, location, and activity. These three sensors have thus far provided relatively strong indicators of unusual mobile user events, though we continue to explore other sensing dimensions. The architecture of myblackbox consists of two logical components:

Mobile component: This component runs on the user's smartphone. The mobile component implements two important functions. First, it implements an unusual event detection system that automatically fuses the phone's sensor measurements in order to detect any unusual events experienced by the user. Second, it implements blackbox functionality to record all sensor measurements, which are stored as clues and evidence on the phone. This information is automatically sent to the myblackbox server in the cloud. The myblackbox application reads sensor data from the mobile phone in order to identify both usual and unusual mobile user events. The mobile application continuously activates the accelerometer sensor to measure a user's activity and periodically activates the audio sensor every 5 minutes to record audio data as a wav file for 15-second periods. After collecting the audio data, it analyzes the audio file with the MFCC and GMM algorithms and classifies the audio as numeric data. It also records the user's location with either Wi-Fi, every 3 minutes, or with the GPS sensor every 5 minutes, when Wi-Fi is not available. By using the GPS sensor only when Wi-Fi is not available, the myblackbox application is able to conserve more battery power.

We determined that our mobile application on the smartphone cannot share the audio sensor simultaneously with other mobile applications (e.g., phone calling, game playing, Skype, etc.). Whenever our myblackbox application is activating the audio sensor, other applications (such as phone calling) cannot utilize it. Because of this issue, we decided it would be best not to conflict with any possible audio-sensor use by the research participants who were using our system. Thus, we did not activate the audio sensor whenever the mobile user was interacting with their phone in any way, e.g. turning on the screen (to surf the Web, text, etc.) or calling. Using the methods described above, our application continually collects 30-minute segments of a mobile user's location, activity and audio sensor data on the mobile phone, and then uses the 95% Confidence Interval (CI) to determine whether the user is experiencing an unusual or normal daily event. This sampling policy is sketched below.
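The following is a minimal sketch of the sampling policy just described. The helper callables (wifi_available, phone_in_use, record_audio, locate_wifi, locate_gps) are hypothetical placeholders standing in for the Android-side implementation, which is not shown in the text.

```python
# Minimal sketch of the myblackbox sampling policy: 15 s of audio every 5 minutes
# (skipped while the phone is in use), Wi-Fi localization every 3 minutes, and
# GPS every 5 minutes only when Wi-Fi is unavailable.
import time

AUDIO_PERIOD_S = 5 * 60
WIFI_PERIOD_S = 3 * 60
GPS_PERIOD_S = 5 * 60

def sensing_loop(wifi_available, phone_in_use, record_audio, locate_wifi, locate_gps):
    last = {"audio": 0.0, "wifi": 0.0, "gps": 0.0}
    while True:
        now = time.monotonic()
        # Audio is skipped while the user interacts with the phone (screen on,
        # calling), so the microphone is never contended with other apps.
        if now - last["audio"] >= AUDIO_PERIOD_S and not phone_in_use():
            record_audio(duration_s=15)
            last["audio"] = now
        # Location: prefer Wi-Fi, falling back to GPS only when Wi-Fi is absent.
        if wifi_available():
            if now - last["wifi"] >= WIFI_PERIOD_S:
                locate_wifi()
                last["wifi"] = now
        elif now - last["gps"] >= GPS_PERIOD_S:
            locate_gps()
            last["gps"] = now
        time.sleep(1)
```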

Our myblackbox mobile application also provides blackbox-like functionality to periodically store the sensing data and classification data of a user's audio, location, and activity in a database on the smartphone, as shown in Figure 3.4.

Figure 3.4: Diagram to store sensor data on the mobile phone.

For the audio sensor data collection and storage, the audio is stored as a wav file with a different file name every 5 minutes. We design five tables (raw sensor data, 30-minute summary data, historical data, 30-minute prediction data, and login data) in a database on the smartphone to store the raw sensor data and our analysis data. We build the tables to store raw sensor data every 5 seconds, collected from the mobile's audio, activity, and location sensors whenever data is readily available from these sensors. Every 30 minutes, we build a table for the percentage-based summary data, which is calculated from the raw activity, audio and location sensor data. This table stores a log of the percentage of time a user spends engaged in varying activities, and a user's variation in audio and movement patterns. This 30-minute data from the summary table is used to build the historical data table of the mobile user. The historical table includes the averages and standard deviations of a user's cumulative sensor data calculated from the summary table data, beginning at day 4 and extending up to a period of one month, based upon the user's normal distribution patterns. We also build a prediction table to store the probability values of the 30-minute summary sensor data for each of the audio, activity and location measurements. Using these probability values, we then store a binary classification result of whether the 30-minute period of data constituted a normal or unusual event for the user. Lastly, we design a login table, where the mobile user's password and login ID information is stored. This table is used to provide access to our system's website, where users can view graphs of their historical personal behavior data. The 30-minute summary data and 30-minute prediction data are stored for up to 30 days, and then removed from the mobile's databases. Raw sensor data is stored on the phone for up to 3 days only. Users can easily access their raw, summary, and historical data on their own smartphones during these time periods. We do not store the raw sensor data and audio files on the server because of user privacy issues and some users' limited mobile data plans.
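A minimal sketch of these five on-phone tables is shown below using SQLite. Only the table roles come from the text; the column names and types are illustrative assumptions, not the application's actual schema.

```python
# Illustrative SQLite layout for the five on-phone tables described above.
import sqlite3

conn = sqlite3.connect("myblackbox.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS raw_sensor_data (      -- every ~5 s, kept 3 days
    ts INTEGER, sensor TEXT, value TEXT, audio_file TEXT);
CREATE TABLE IF NOT EXISTS summary_30min (        -- per-window percentages, kept 30 days
    window_start INTEGER, location_id INTEGER,
    pct_stationary REAL, pct_walking REAL, pct_running REAL,
    pct_noise REAL, pct_talking REAL, pct_music REAL, pct_loud_voice REAL);
CREATE TABLE IF NOT EXISTS historical (           -- running mean/stddev per location, day 4 onward
    location_id INTEGER, metric TEXT, mean REAL, stddev REAL, n_windows INTEGER);
CREATE TABLE IF NOT EXISTS prediction_30min (     -- probabilities and normal/unusual flag, kept 30 days
    window_start INTEGER, p_audio REAL, p_activity REAL, p_location REAL, unusual INTEGER);
CREATE TABLE IF NOT EXISTS login (                -- credentials for the web service
    user_id TEXT, password_hash TEXT);
""")
conn.commit()
```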

myblackbox Cloud Server: The remote blackbox server component of our system also records and stores all event data collected from each mobile user, providing an extra degree of redundancy in case the mobile device is lost, destroyed or stolen. Logged data, clues, and evidence can be recovered despite disconnection from the mobile device. The server also provides a web service for mobile users who wish to access their own historical personal behavior data through a web browser (or for parents to monitor the safety of their children). We design the collection component of our server to handle XML-based files, which are easily processed by computer. This component receives 30-minute periodic data, 30-minute prediction data, historical data, and login data from users' smartphones. Each file of data received also includes the unique mobile ID that identifies each user's individual data. We also design our collection component to ensure that no data packets sent from the mobile devices will be lost. If the server correctly receives the data, it sends an acknowledgement back to the phone. If the mobile phone receives an error message or no acknowledgement from the server, it simply re-sends the unacknowledged data in the subsequent 30-minute period. We also build a web component for the mobile users to access their own historical behavior pattern data through a browser. To access the web component, the user needs to provide a personal ID and password on their mobile device. By having users' IDs entered and stored only on their smartphones, we can better protect users' privacy. Users can then log into our web server to view their individual historical myblackbox system data: a map of visited locations and graphs of daily behavior patterns.
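A minimal sketch of this acknowledge-and-resend behavior is given below, assuming an HTTP upload endpoint; the URL, payload handling, and use of the requests library are illustrative assumptions rather than the system's actual transport.

```python
# Sketch of the upload-with-acknowledgement behavior described above: anything
# that is not acknowledged is kept and re-sent in the next 30-minute window.
import requests

def upload_summaries(server_url, pending_payloads, timeout_s=10):
    still_pending = []
    for payload in pending_payloads:           # each payload: one XML summary file
        try:
            resp = requests.post(server_url, data=payload,
                                 headers={"Content-Type": "application/xml"},
                                 timeout=timeout_s)
            if resp.status_code != 200:
                still_pending.append(payload)   # server error: retry next window
        except requests.RequestException:
            still_pending.append(payload)       # disconnected: retry next window
    return still_pending
```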

Chapter 4  User Behavior Classifiers

In this section we describe the first stage of our hybrid classification process, namely the design of our location-based audio and activity classifiers. We further discuss the performance of both classifiers by evaluating them with respect to survey data obtained from 20 mobile users.

4.1 Data Collection

In order to obtain data for training and testing our location-based audio and activity classifiers, we constructed a mobile data collection system. We built an Android-based mobile application that would periodically record raw sensor data (audio, accelerometer, location values, etc.), as shown in Figure 4.1. The sampling rates of the different sensors were adjusted to conserve power. The program measured the average type of accelerometer-based activity (e.g., walking, jumping, shaking the phone, etc.) every three seconds and obtained and recorded subjects' locations using GPS and Wi-Fi every 3 minutes. We also recorded all audio sounds in the user's immediate vicinity on the mobile phone for 15 seconds every 5 minutes, and stored the orientation, proximity, and light sensor data along with these recorded sounds, to determine the user's behavior related to the audio sounds. The battery power of the mobile phone, when running this entire application, lasted for about 12 hours. This was enough time to measure the user's daily behavior patterns and activities, because the users were normally back at home in the evenings and would recharge the phone at the end of each day.

The data was collected from the mobile sensors of 20 mobile users during a one-week period. For the data collection from the 20 different subjects, we used four mobile phones: two HTC Nexus Ones [12] and

Figure 4.1: Phone survey: Displaying sensor data (accelerometer, proximity, orientation, GPS, Wi-Fi, audio).

two HTC Inspires, with our application installed on them, which we lent to the participants to carry with them for one week. We paid $10 for participation in this study and required the subjects to carry the phone with them at all times in their pocket or purse, keeping it in the same place as their own mobile phone, during the data collection period. While subjects were carrying the phone, the sensors installed on it periodically recorded external sounds and the users' activity and locations. The data was collected and stored on the phone during the time the subjects were actively participating in the study and carrying the phone, and was retrieved from the phone after the participants returned the phones at the end of the study. This research was approved by the University of Colorado's Institutional Review Board (IRB) [69].

The collected data was then used to develop location-based models of audio and activity. First, for each user we identified static locations where the user dwelled for 30 minutes or longer. For each such location, we developed both a model measuring the distribution of the kinds of sounds recorded at that location, as well as a separate model incorporating the distribution of different kinds of activity initiated at that location. Four days of historical data were used for training the classifiers, and three days of data were used for testing/evaluation. Note that the test data was divided into 30-minute epochs, and the basic audio and activity classifiers were run on each epoch to generate distributions of the kinds of sounds

or activities experienced during that epoch at that location. If the distribution of the test audio or activity deviated too much from the historical norms for a given user at a given location, or more precisely if the probability of a given audio or activity classification was less than 5% or greater than 95%, then this was detected as an unusual event. An unusual event was also detected if a static location had less than a 2% probability of occurring. The location-based audio classifier is described in Section 4.2. The location-based activity classifier is described in Section 4.3. The unusual event detection based only on location is described in Section 4.4.
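A minimal sketch of this per-location test follows, assuming each classification percentage is modeled with a Gaussian built from the user's historical mean and standard deviation at that location; the data structures and example numbers are illustrative, not taken from the study.

```python
# A 30-minute window is flagged as unusual if any classification percentage
# falls outside the user's historical range at that location (Gaussian CDF
# below 5% or above 95%), or if the location itself accounts for less than
# 2% of the user's visit history.
from statistics import NormalDist

def window_is_unusual(window_pcts, history, location_share,
                      low=0.05, high=0.95, rare_location=0.02):
    if location_share < rare_location:           # infrequently visited place
        return True
    for kind, pct in window_pcts.items():
        mean, std = history[kind]                 # per-(user, location) statistics
        if std == 0:
            continue
        p = NormalDist(mean, std).cdf(pct)
        if p < low or p > high:                   # outside the historical norm
            return True
    return False

# Hypothetical example: a user who rarely hears loud voices at this location.
history = {"loud_voice": (2.0, 1.0), "talking": (20.0, 8.0),
           "music": (10.0, 6.0), "noise": (68.0, 10.0)}
window = {"loud_voice": 9.5, "talking": 18.0, "music": 12.0, "noise": 60.5}
print(window_is_unusual(window, history, location_share=0.15))   # True
```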

4.2 Location-based Audio Classifier

We sought to develop audio classifiers that could identify common, everyday sounds encountered by users of mobile phones. We classified four types of sounds (low level noise, normal speaking voice, music, and loud emotional voice, i.e. yelling, crying, screaming and loud angry voice sounds) to detect and analyze the audio patterns mobile users generate or encounter in their daily lives. We used the audio data collected from the mobile phone and paper survey of 20 subjects to develop a location-based audio classifier for unusual events.

4.2.1 Training the Basic Audio Classifier

We begin by discussing the design and implementation of our basic audio classifier, which categorized sounds into four types using existing audio classification algorithms. Sound detection algorithms [95, 89, 79, 94] are capable of classifying sounds into general sound types, such as a gunshot, a screaming voice, speaking voices, etc. These algorithms cannot detect the specific meanings of sounds in human speech, but they can detect and classify sound types such as low level noise, explosions, gunshots, screaming, and the human speaking voice. Recently, some of these algorithms have been implemented on the mobile phone [79, 94]. They have been shown to successfully identify unusual human voice sounds, but have primarily been tested and implemented in laboratory-type settings. Our proposed model uses a combination of existing algorithms to develop a practical, real-life application. Our proposed audio-detection algorithm is designed to be more robust and to apply to outdoor as well as indoor settings, including the ability to identify mixed low level noise sounds such as building construction noise, traffic noise, car horn sounds, etc.

To build our audio classification model, we initially investigated people's audio environments to find the types of sounds that are generated and to determine the categories of audio that would be sufficient for our research. We surveyed 10 users, asking them to list the places they most frequently visit in their normal daily lives. The five main types of locations they mentioned were: home, work place, shopping centers, restaurants, and bars. We then visited these five types of locations and recorded the sounds in these environments. We found that the audio recorded in these locations could be divided into three main types of sounds: background noise, human talking voices, and music. We also researched existing audio classification algorithms [95, 89, 79, 94] to learn which sounds the researchers who developed these algorithms were trying to detect. Their algorithms also classified audio into three main categories: background sound, talking voice, and music. In addition, their algorithms further classified human voice sounds as unusual whenever the voice sounds generated were high frequency, such as screaming or yelling voice sounds. Thus, according to our own investigation and the existing audio classification research, we decided to base our audio classification model on four main categories of sound: background noise, talking voice, music sounds, and unusual (high-frequency) voice sounds.

We decided on four main classifications for our sound detection algorithm: low level noise, normal talking speech, music (anything from rock to classical), and loud emotional voice sounds. We used two algorithms, MFCC (Mel-Frequency Cepstral Coefficients) [39, 78, 55, 93] and GMM (Gaussian Mixture Models) [61, 51, 43, 56, 20]. The sound features extracted with the MFCC algorithm are compared with the existing models of sound classifications built with the GMM algorithm, and if a match is found, one of the four classification types is identified. The GMM algorithm [20] has been found to be statistically efficient for analyzing clustered data and for high density estimation. This meant that the GMM algorithm was ideal for purposes of our audio classification model that is based on location, and especially on frequently-visited location pattern data (e.g., home and workplace). The MFCC and GMM algorithms were implemented to run on each recorded stereo audio file, at a fixed sampling rate, on the Android-based mobile phone.

algorithm collected 12 data points of sound and analyzed them to classify the sound as one of the four types. Each data point consisted of 13 frequencies extracted by the MFCC algorithm. We used the GMM algorithm to build the audio training models, using the audio data collected from our 20 survey subjects as well as music audio and YouTube [21] audio data collected on the internet. To create the low level noise and talking voice models, we used the recorded sounds collected from the 20 phone survey subjects and trained the models using the GMM algorithm with three probability states. We used 50 low level noise samples, collected from each of the 20 mobile users' phones, recorded in different places such as home, school, sidewalks, markets, and restaurants, and 50 talking voice samples, collected from the 20 survey subjects during periods when they were conversing with friends, family, etc. For training the music model, we used 60 music samples, combining 10 samples recorded by the survey subjects with 50 music audio samples downloaded from websites; it was necessary to supplement the subjects' phone survey data with web data because the number of background music recordings from the survey subjects was not large enough to cover a variety of music types. The combined phone and web music samples consisted of rock, rap, popular, and classical music. For the loud emotional voice model, we collected audio from 50 YouTube videos of people involved in actual accidents, fighting scenarios, etc., who were screaming, crying, or shouting loudly, and trained the model on these recordings with the GMM algorithm.

Figure 4.2 shows the frequencies extracted by the MFCC algorithm from sounds collected during our experiment: low level noise, talking voice, music, and loud (angry) emotional voice. The MFCC algorithm provides fine detail on the frequency bands used to define the sound features extracted from the sounds. The figure illustrates that these four sounds have their own distinctive frequency patterns, which can be used to detect the sound patterns that mobile users generate or encounter (their own or others' speaking or angry voices, music, low level noise, etc.). Although the GMM algorithm classified each recorded sound into one of the four classifications, it did not always classify all of the sounds correctly in our experiments.
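As a rough illustration of this pipeline (not the exact thesis implementation), the sketch below extracts 13 MFCC features per half-second frame with librosa and scores each snippet against one scikit-learn GMM per sound class. The file paths, label names, and parameters other than the 13 MFCCs and three mixture components are assumptions.

```python
# Minimal sketch: 13 MFCCs per half-second frame, one 3-component GMM per
# sound class, snippet labeled by the highest-likelihood model.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

CLASSES = ["low_level_noise", "talking_voice", "music", "loud_emotional_voice"]

def mfcc_frames(wav_path, n_mfcc=13):
    """Return an (n_frames, 13) matrix of MFCC features for one recording."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    hop = int(sr * 0.5)  # roughly one feature vector every half second
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop).T

def train_models(training_files):
    """training_files maps class name -> list of wav paths for that class."""
    models = {}
    for label in CLASSES:
        feats = np.vstack([mfcc_frames(p) for p in training_files[label]])
        models[label] = GaussianMixture(n_components=3, covariance_type="diag",
                                        random_state=0).fit(feats)
    return models

def classify_snippet(models, feats):
    """Label a snippet (rows of MFCC vectors) by the highest-likelihood GMM."""
    scores = {label: gmm.score(feats) for label, gmm in models.items()}
    return max(scores, key=scores.get)
```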

Figure 4.2: Sound patterns detected by the MFCC algorithm for (a) low level noise, (b) talking voice, (c) music sound, and (d) angry voice frequencies.

For example, the first experiment in Table 4.1 shows the results of the classification algorithm for sounds recorded in a food court: mixed sounds measured while the researcher walked around the food court area. This recording of mixed food court sounds was classified as loud emotional voice 4.34 percent of the time, talking voice 23.47 percent, music 6.95 percent, and low level noise 65.21 percent of the time. In actuality, however, only talking voices and low level noise were being generated in this environment. The 6.95 percent of music and the 4.34 percent of loud voice sounds were incorrectly identified by the algorithm, likely because of some of the high-pitched, loud kitchen equipment sounds encountered in this food court. None of the high-pitched sounds identified as loud emotional voice sounds were actual human voice sounds.

Table 4.1: Audio results
Visit   loud voice   talking   music    low level noise
1st     4.34%        23.47%    6.95%    65.21%
2nd     2.38%        17.85%    14.28%   65.47%
3rd     3.12%        7.81%     20.31%   68.75%
4th     1.51%        7.57%     15.15%   75.75%

4.2.2 Location-based Audio Modeling for Unusual Event Detection

Figure 4.3: An example showing a similar percentage pattern of the four audio classifications for one subject's repeated visits to one location.

Figure 4.4: (a) Histogram of 30-minute audio classifications for one subject's repeated visits to the same location, (b) Quantile-Quantile plot using the histogram data.

A simple audio-only detector of unusual events would be to define the loud emotional voice sounds produced by the basic audio classifier as unusual events encountered in the mobile phone user's life, as is done in other previous research studies [95, 89, 79]. The emotional voice sounds expressed by people involved in unusual situations, such as yelling, crying, screaming, and loud angry voices [44], can be used as an unusual event indicator. People frequently vocalize such sounds when they are involved in accidents or crime situations, either as victims or perpetrators. We watched 50 YouTube videos of actual dangerous situations, including fighting, car accidents, and angry threatening confrontations. The people in these videos were yelling or screaming at each other, crying loudly, or shouting in loud angry voices. If the audio-detection algorithm on the mobile phone could correctly classify these loud emotional voice sounds as distinct from other sounds (such as speaking voices or low level noise), then we could identify uncommon, sometimes dangerous situations encountered by mobile phone users. However, as seen in Table 4.1, all of the occurrences identified as loud voices would have generated false unusual events.

To overcome the limitations of this simple audio classification approach, we used location-based pattern recognition to improve our classification model. First, we repeated the measurements of the noise in the same food court multiple times and found that a similar percentage mix of sounds occurred across all of the experiments, as seen in Table 4.1. We also analyzed the percentage pattern of the four audio classifications for each of the 20 phone survey subjects when they repeatedly visited the same place. Figure 4.3 shows the results for one subject's recorded audio when repeatedly visiting the same location. For each of the 10 times that the subject visited this place, the mixed percentage of quiet, talking, music, and loud emotional sounds was very similar. Additionally, we investigated the entire audio data pattern for one mobile user, collected over a period of more than one month, to test our hypothesis that a mobile user's data would follow the normal distribution. We analyzed this subject's classified audio data for repeated visits to the same location, as shown in Figure 4.4(a), and measured how closely the classified audio data matched the normal distribution using a Q-Q (Quantile-Quantile) plot [16], a graphical method for comparing the input distribution with the normal distribution by plotting their quantiles against each other. If the input data is approximately normal, it overlays a straight line, as shown in Figure 4.4(b), which shows the Q-Q plot of this subject's audio pattern data. We found that most of the data closely overlaid the line and thus followed the normal distribution.
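This normality check can be reproduced with standard tools. The sketch below, with illustrative placeholder values, draws the Q-Q plot against the normal distribution using SciPy; the Shapiro-Wilk test at the end is our own complementary check, not part of the thesis analysis.

```python
# Sketch of a Q-Q normality check for one user's per-segment percentages
# of a single audio type at one location. All values are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Percentage of each 30-minute segment classified as talking voice,
# across repeated visits to the same location (placeholder values).
talking_pct = np.array([23.5, 17.9, 7.8, 7.6, 15.1, 20.3, 12.4, 18.7, 14.2, 16.0])

# Points close to the reference line indicate an approximately normal pattern.
stats.probplot(talking_pct, dist="norm", plot=plt)
plt.title("Q-Q plot of talking-voice percentages at one location")
plt.show()

# A large p-value is consistent with the normality assumption used by the
# location-based CI model.
w, p = stats.shapiro(talking_pct)
print(f"Shapiro-Wilk W={w:.3f}, p={p:.3f}")
```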

Figure 4.5: An example of measuring standard deviations for each audio type over 10 visits for one subject.

Figure 4.6: Standard deviations of the four audio classifications measured for 20 subjects in two different locations, 1 and 2.

Next, we analyzed audio data collected from 20 survey subjects who frequented one location repeatedly to see how much the percentage of audio sound classifications would vary across all visits. Figure 4.6

shows, in two charts, the standard deviations of the four audio classifications measured for each of the 20 subjects' audio data, calculated as shown in Figure 4.5, when they frequented two different locations. When the subjects frequented the first place (chart 1), the standard deviations for each of the four audio types were quite small for all 20 subjects, less than 10 points for any one audio type, out of a maximum of 100 points for the combined four audio classifications. When the 20 subjects frequented the second place (chart 2), a similarly small standard deviation, less than 12 points, and a similarly stable mix of the four audio types was found across their visits. Comparing the two charts, the mix of audio classifications varied for each subject according to the location visited. On average across all subjects, location 2 produced more talking classifications than location 1, which could indicate that location 2 was the workplace for most subjects and location 1 was their home. The conclusion is that the distribution of audio sounds is relatively stable for a given location and user, i.e., there is not much deviation from the norm, and that a different audio model must be developed for each location. We leverage these findings to develop a location-based audio classifier that detects unusual events when the audio distribution deviates too much from the norm.

By investigating the above mixed percentage patterns of audio classifications, we found that when the mobile survey subjects repeatedly visited the same place, the percentage mix of audio types for each subject was very similar across visits. To determine how similar these patterns were, we used the Gaussian (normal) distribution to find the probability of how close each subject's audio type was to its normal classification probability. The normal Gaussian distribution is bell-shaped and its probability is calculated from the average and standard deviation. We used this normal distribution to analyze our historical location-based audio data (i.e., audio sets collected in the same location). If the probability is close to 0.5 (the center of the distribution), the measured data is classified as a normal audio pattern (i.e., a normal event), but if the probability is far from the center of the distribution, the audio event is classified as an infrequently generated event (i.e., an unusual event). Therefore, by choosing suitable thresholds, such as a confidence interval, we can identify infrequently generated unusual user events in different locations using this normal distribution method. To use this approach, we assume that mixed percentage patterns of

audio classifications follow the normal distribution.

Table 4.2: Example of one user's audio classification results for two different locations
Location   low level noise   talking   music   loud voice
1st        89.02%            9.76%     0%      1.22%
1st        65.98%            32.99%    0%      1.03%
1st        79.27%            15.85%    0%      4.88%
2nd        33.93%            55.36%    3.57%   7.14%
2nd        68.18%            16.67%    9.09%   6.06%
2nd        33%               55%       5%      7%

Table 4.2 shows an example of the audio classification results produced by the GMM algorithm for one user's audio data, collected when the user visited two different locations three times each. Both our approach and existing approaches classify audio data with the GMM algorithm in this way. However, most existing approaches operate only on single occurrences of loud audio events, such as screaming or shouting, flagging any loud voice sound as an unusual event. Under that approach, every classification result that includes loud voice sounds would be detected as an unusual event. However, we know from our research and from ground truth data (survey feedback) collected from users, such as the user in this example, that none of these occurrences of loud voice sounds were actual unusual events.

Table 4.3: Gaussian distribution results of the above audio classification results for one user
Location   low level noise   talking   music   loud voice

Instead, we seek to measure how much the audio distribution deviates from the norm at a given location for a given user. To achieve this, we used the Gaussian normal distribution method explained above for each audio type at each location for each user, and then calculated the location-based distribution probabilities for a given user-location measurement. To determine the thresholds by which to classify the audio data as either a normal or an unusual event, we used the standard confidence interval (CI) thresholds of the normal distribution bell curve. Using the CI technique, if these probabilities are either less than 5% or greater than 95%, then an unusual event is detected.
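A minimal sketch of this location-based CI rule follows, assuming NumPy and SciPy. The per-type normal model and the 5%/95% cutoffs mirror the description above; the historical percentages are illustrative values in the spirit of Table 4.2, not the thesis data.

```python
# Per-user, per-location normal model of class percentages, with the
# 5%/95% CI rule for flagging a visit as an unusual event.
import numpy as np
from scipy.stats import norm

def fit_location_model(history):
    """history: (n_visits, n_types) percentage matrix for one user/location."""
    return history.mean(axis=0), history.std(axis=0, ddof=1)

def is_unusual(visit, mean, std, lo=0.05, hi=0.95):
    """True if any type's cumulative probability leaves the [lo, hi] band."""
    probs = norm.cdf(visit, loc=mean, scale=np.maximum(std, 1e-6))
    return bool(np.any((probs < lo) | (probs > hi))), probs

# Historical mix of (noise, talking, music, loud) percentages at one location.
history = np.array([[89.0, 9.8, 0.0, 1.2],
                    [66.0, 33.0, 0.0, 1.0],
                    [79.3, 15.9, 0.0, 4.9]])
mean, std = fit_location_model(history)
unusual, probs = is_unusual(np.array([33.9, 55.4, 3.6, 7.1]), mean, std)
print(unusual, probs.round(2))
```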

For example, Table 4.3 shows the generated probabilities corresponding to the same rows in Table 4.2 when compared against the trained location-based normal model. We see that rows 1-3, corresponding to three visits to location 1, are roughly similar in their distribution, and therefore their CI values are clustered around a normal value of 0.5. However, row 5, corresponding to the 2nd visit to location 2, has a distribution of measurements that looks somewhat anomalous and is a candidate to consider as an unusual event. That is, the percentage of quiet is much higher than in the other measurements, the percentage of talking is much lower than in rows 4 and 6, and the measured percentage of music is much higher than normal. Indeed, the corresponding row 5 in Table 4.3 reflects these observations, with CI values of 0.88 for quiet and 0.07 for talking, and a correspondingly elevated value for music. However, even though row 5 looks quite different, it is almost but still not sufficiently different in any dimension from the historical norms to trigger an unusual event detection, since the trigger threshold is set at either below 0.05 or above 0.95. This illustrates that our CI-based approach is quite stringent in identifying unusual events. We describe how we decide optimal CI thresholds in detail in Section 5.

Our current myblackbox system is location-based, and thus is limited to detecting and identifying unusual versus normal mobile user events only when a user is in a stationary position. The system cannot currently detect or classify unusual user events while a user is in transit, moving from one location to another. We initially tried to incorporate this capability, but found that the data were irregular, and that the ground truth data collected from some initial participants was also irregular in how they identified normal versus unusual walking paths. For future work, we hope to expand our system to also operate accurately for users while they are in transit.

4.2.3 Performance Results of the Audio Classifiers

Effectiveness of unusual event detection using the location-based audio classifier: We compared the detection performance of our location-based audio classifier with the basic existing approach in which any loud sound is viewed as unusual. Our approach used location-based mixed-audio pattern classification that employed the normal Gaussian distribution and detected as unusual any event with a probability outside the CI 5% and 95% thresholds.

Figure 4.7: Comparison of our location-based audio classifier for detecting unusual user events with a basic existing approach that is location-independent.

We found that for less noisy environments (e.g., office, home) our approach was only slightly better than the existing approach, but that for noisier environments, such as the shopping center example here, our approach was much better. Figure 4.7 shows the recall, precision, and accuracy of our approach compared to those of the existing approach when five users visited shopping centers more than seven times. With our approach, we improved accuracy and f-measure by 46 and 11 percentage points, respectively, over the existing approach. Our audio classification approach achieves higher accuracy because it is audio-pattern based, whereas the existing approach is based only on human voice level; our approach uses mixed-audio pattern recognition and is location-based, whereas the existing approach is single-audio and non-location-based. Thus, our approach is more accurate for identifying unusual audio events in noisy locations, because it measures and analyzes similar noise patterns that are repeated in the same noisy areas, whereas the other approach is overly sensitive to noise in such areas.

Basic audio classification performance: Next, we consider the effectiveness of our basic audio classification subsystem. As described earlier, we used training data to develop the GMM models for low level noise, talking voice, music, and loud emotional voice. These four sound pattern models were developed from the frequencies collected by the MFCC algorithm, which were then classified by the GMM algorithm. We then used 40 test sound samples, separate from our training data, to analyze each of the four classification groups. Each sound sample lasted from 30 seconds to 1 minute and was segmented into half-second snippets.

Table 4.4: Test sample results and classifications
Sounds (%)   Loud voice   Normal voice   Music   Low level

We measured the accuracy of our model across the four sound pattern classifications: low level noise, talking voice, music, and loud emotional voice. Table 4.4 shows the accuracy of our model for classifying unusual (loud) voice, talking voice, music, and background sound patterns. The detection errors for loud emotional voice sound patterns consisted of misclassifications as talking voice, as music (4.59 percent), and as background sound (2.03 percent). For normal voice sound patterns, the algorithm incorrectly detected other sound patterns only 9.58 percent of the time. The accuracies for music and background noise sound pattern detection are likewise shown in Table 4.4. Some loud emotional voice misclassifications also arose from the music samples, 5.66 percent of which were classified as unusual voice sound patterns; this was because some of the singers' voices in the music were high-pitched, screaming voices, e.g., within loud rock music samples.

The mobile phone's carrying location can affect the accuracy of our loud emotional voice classification algorithm, because a sound will sometimes only partially reach the mobile audio sensor. We therefore measured the accuracy of our audio classification algorithm according to where the user carries the phone on the body.

Figure 4.8: Accuracy variation of unusual voice classification depending on mobile placement on the human body.

According to a survey [67] about the mobile phone's carrying location, the phone is usually carried in a trouser pocket, an upper-body pocket, a shoulder bag, a backpack, or a belt enhancement. We placed the mobile phone in a user's hand, pocket, purse, and backpack, as shown in Figure 4.8. We played 20 actual emotional human voice sounds (i.e., screaming, yelling, loud angry voices) downloaded from YouTube, and the mobile phone recorded these sounds. The mobile phone application classified the recorded sounds, and the accuracy of these classifications was compared across the different carrying positions. The accuracy of emotional sound classification varied according to the carrying location of the phone: 100% for hand-carrying, and varying levels of accuracy for the other positions.

Table 4.5: Audio classification results according to the mobile phone's carrying location
Location   Name                Loud voice classification percentage
A          Hand
B          Purse
C          Trouser pocket
D          Upper-body pocket
E          Backpack

Table 4.5 shows the accuracy of loud emotional voice classification according to the mobile phone's carrying location. We placed the mobile phone in five places (A: hand, B: purse, C: trouser pocket, D: upper-body pocket, E: backpack). All of the loud emotional voice events were accurately detected for 835 of the sound events when the user carried the mobile phone by hand, and the hand-carried results were compared with the other positions. The number of events detected varied according to the phone's carrying location. Other than hand-carrying, the highest classification accuracy was 91 percent, when the phone was carried in an upper-body pocket, and the lowest was 76 percent, when carried in a purse. The purse was made of leather, so the experiment-generated sounds had difficulty penetrating it. Although classification accuracy was reduced to 76 percent in the case of the leather purse, the mobile phone still accurately classified three quarters of the sound snippets as emotional human voice sounds. According to the survey [67], mobile users carry the phone in the same location 93% of the time. Thus, while the carrying position affects the accuracy of audio classification, the classified patterns do not change much in practice because the phone is usually carried in the same position.

4.3 Location-based Activity Classifier

We developed a location-based activity classifier that is similar in concept to our location-based audio classifier. The intent was to develop an activity model for each static location of each user and to classify as unusual any measurement that deviated substantially from the normal activity model. Research on human defensive behavior [44] describes how people act in threatening situations: when threatened, human beings usually attack, run, or freeze. These behaviors involve either high impact activity or long stationary activity. Such dramatic, abrupt, high impact activity can be measured easily by the mobile phone's sensors, namely the GPS, Wi-Fi, and accelerometer sensors. We used these sensors to survey a wide range of user activities, from normal everyday activities to uncommon, unusual events.

We measured the user's activities using the three-axis accelerometer (x, y, and z axes). The impact strength is calculated by measuring the amplitude between the positive peak and the negative peak on each accelerometer axis. We calculated the average impact of the user's activities from the three axes, regardless of the orientation of the phone, using Equation 4.1:

Strength(t) = Avg(Length(HighPeak_x, LowPeak_x, t), Length(HighPeak_y, LowPeak_y, t), Length(HighPeak_z, LowPeak_z, t))    (4.1)

To detect an individual's activity, we first calculate the individual's average normal walking impact using the accelerometer on the mobile phone. We measured the average of the user's daily movement activities using the GPS sensor when the user was walking outdoors. Equation 4.1, running on the mobile phone, calculated the walking impact over time and stored the average impact of the user's daily walking activities.
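The sketch below illustrates one way to compute Equation 4.1 over windows of raw accelerometer samples and to map each window to an activity label relative to the user's personal walking baseline. The window length and the fractional thresholds are assumptions for illustration, not values taken from the thesis implementation.

```python
# Sketch of Equation 4.1 and a simple strength-based activity labeling.
import numpy as np

def impact_strength(window):
    """One window of (win, 3) accelerometer samples: per-axis peak-to-peak
    amplitude, averaged over the x, y, and z axes (orientation independent)."""
    return np.mean(window.max(axis=0) - window.min(axis=0))

def windowed_strengths(samples, win=50):
    """samples: (n, 3) accelerometer readings; returns one strength per window."""
    return np.array([impact_strength(samples[i:i + win])
                     for i in range(0, len(samples) - win + 1, win)])

def classify_window(strength, walking_avg, still_frac=0.3, run_frac=1.5):
    """Map a window's impact strength to stationary/walking/running relative
    to the user's average walking impact; the fractions are assumed values."""
    if strength < still_frac * walking_avg:
        return "stationary"
    if strength > run_frac * walking_avg:
        return "running"
    return "walking"
```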

Figure 4.9: Unusual high impact activities (getting hit by someone, fighting with someone, being pushed, falling) measured on the mobile phone, compared with normal walking.

Initially, we tried to identify and classify an event as unusual whenever the mobile phone detected a high impact activity with the phone's accelerometer sensor, as is similarly done in other previous research studies [42, 90]. To identify what some types of high impact activities might look like when recorded by the accelerometer, we artificially generated unusual activities with two people acting them out. Figure 4.9 shows the amplitudes recorded by the phone's accelerometer for hitting, fighting, pushing, and falling, compared with the accelerometer reading of a person's normal walking. The amplitude measurements for all of these activities are higher than that of normal walking. However, this simple approach was not always suitable for detecting and classifying unusual events. Figure 4.10 shows data collected from the 20 mobile phone survey subjects, in which the accelerometer measured various normal human activities, i.e., jumping, lying down on a bed, running, sitting down quickly, pressing a car brake pedal, pressing a bike's hand brake, dancing, walking, shaking the phone, dropping the phone, turning the phone over, and strongly pressing the phone's screen. We compared the accelerometer's amplitude measurements for each of these activities to the amplitude of a person's normal walking. Reviewing the results of both investigations, illustrated in the two figures, we found that this simple approach of flagging any high impact activity as an unusual event often yielded many false positives: many normal activities generated impact amplitudes higher than the average normal walking measurement.

To overcome this limitation of the impact activity classification algorithm, we again used location-based pattern recognition, as in the location-based audio classifier, to help identify and classify unusual events. By further analyzing the activity data collected from the 20 people, we found that the percentage

breakdown of mobile users' impact activity types (stationary, walking, running) was often the same and often repeated in the same general areas.

Figure 4.10: High impact readings for normal activities (lying down on a bed, sitting down quickly, bike braking, jumping, running, car braking, dancing, shaking the phone, dropping the phone, turning the phone over, pressing the screen, and normal walking) detected from the mobile phone survey subjects' data.

We therefore analyzed the percentage mix of the three most frequently observed normal impact activities when each phone survey subject repeatedly visited the same location. Figure 4.11 shows an example of one subject who frequented the same place 12 times; the percentage mix of stationary, walking, and running was similar across all 12 visits. Additionally, we investigated the entire activity data pattern for one mobile user, collected over a period of more than one month, to test our hypothesis that the mobile user's data would follow the normal distribution. We analyzed this subject's classified activity data for repeated visits to the same location and measured how closely the classified activity data matched the normal distribution using a Q-Q (Quantile-Quantile) plot. Figure 4.12 shows that the Q-Q plot of the subject's activity pattern data closely overlaid the straight line and thus followed the normal distribution.

We also analyzed this normal activity data for all 20 survey subjects when they frequented two different locations. Figure 4.13(a) shows, in two charts, the standard deviations of the three impact activities measured for each of the 20 subjects, calculated as shown in Figure 4.13(b), when they frequented two different locations. When the subjects frequented the first place (chart 1), the standard deviations for each of the three impact activities were quite small for all 20 subjects, less than 5 points for any one impact activity type, out of a maximum of 100 points for the combined three types.

Figure 4.11: An example showing a similar percentage pattern for the three types of normal impact activity for one subject's repeated visits to one location.

Figure 4.12: Quantile-Quantile plot using 30-minute activity classifications for one subject's repeated visits to the same location.

Figure 4.13: (a) Standard deviations of the three impact activities measured for 20 subjects in two different locations, 1 and 2; (b) an example of measuring standard deviations for each impact activity over 12 visits for one subject.

When the 20 subjects frequented the second place (chart 2), a similarly small standard deviation of less than 10 points was observed. Comparing the two charts, the mix of the three activity classifications varied for each subject according to the location visited. On average across all subjects, location 2 produced more movement classifications (i.e., running and walking) than location 1, which might again indicate that location 2 was the workplace for most subjects and location 1 was their home. As with audio, the conclusion is that the distribution of activities is relatively stable for a given location and user, i.e., there is not much deviation from the norm, and that a different activity model must be developed for each location. We leverage these findings to develop a location-based activity classifier that detects unusual events when the activity distribution deviates too much from the norm.

By investigating the above mixed percentage patterns of activity types, we found that when the mobile survey subjects repeatedly visited the same place, the percentage mix of activity types for each subject was very similar across visits. To determine how similar these patterns were, we used the Gaussian (normal) distribution [48, 38, 47, 62] to find the probability of how close each subject's activity type was to its normal activity probability. For this approach, we assume that mixed percentage patterns of activity types follow the normal distribution. Table 4.6 shows an example of the activity classification results identified by a high-impact activity method for one user's activity data, collected when the user visited two different locations three times each. Both our approach and existing approaches classify activity data using high-impact activity methods in this way.

Table 4.6: Example of one user's activity classification results for two different locations
Location   stationary   walking   running
1st        99.71%       0%        0.29%
1st        98.93%       0.53%     0.27%
1st        99.46%       0%        0.27%
2nd        90.93%       1.42%     0.85%
2nd        90.37%       2.14%     2.14%
2nd        92.57%       1.06%     0.8%

However, most existing approaches operate only on single occurrences of activity events, such as jumping or falling (e.g., possible accidents), flagging any extreme activity movement as an unusual event. Under that approach, every classification result that includes an extreme activity event would be detected as an unusual event. However, we know from our research and from ground truth data (survey feedback) collected from users, such as the user in this example, that none of these high impact occurrences were actual unusual events.

Table 4.7: Gaussian distribution results of the above activity classification results for one user
Location   stationary   walking   running

Instead, we seek to measure how much the activity distribution deviates from the norm at a given location for a given user. To achieve this, we developed a normal distribution model for each activity type at each location for each user, just as for audio, and then calculated the location-based Gaussian distribution probabilities for a given measurement. Using the CI technique, if these probabilities are either less than 5% or greater than 95%, then an unusual activity event is detected. For example, Table 4.7 shows the generated probabilities corresponding to the same rows in Table 4.6 when compared against the trained location-based normal model. We see that rows 1-3, corresponding to the three visits to location 1, are roughly similar in their distribution, and therefore their CI values are clustered around a normal value of 0.5. Indeed, none of the measurements deviates significantly from the norm according to Table 4.7, and therefore no unusual events would be generated among the six samples shown.

Figure 4.14: Comparison of our approach with the existing approach for detecting unusual user events in shopping centers.

We compared the detection performance of our approach for detecting unusual activity events with the existing high impact approach. We found that for more stationary environments (e.g., office, home) our approach was only slightly better than the existing approach, but that for frequented environments in which users are more mobile, such as the shopping center example here, our approach was much better. Figure 4.14 shows the recall, precision, f-measure, and accuracy of our approach compared to those of the existing approach when five users visited shopping centers more than seven times. With the single-activity event classification method of the existing approach, we classified only high impact activity and identified high impact activities as unusual events. For our approach, which uses mixed-activity pattern classification, we again employed the normal Gaussian distribution probability with the CI 5% and 95% thresholds to determine whether any activity events were unusual. With our approach, we improved accuracy and f-measure by 38 and 18 percentage points, respectively, over the existing approach.

4.4 Detecting Unusual Locations

The location data is stored on the phone whenever the user moves from one location to another, and this information is used to build a historical location-based map. The historical map is then used to estimate whether the mobile user is in a new location (a place never visited before), in a habitually visited place, or in an infrequently visited place.

Figure 4.15: Location history collected from 20 subjects for one week.

First, we analyzed the location data collected individually from each of the 20 subjects, using the GPS or Wi-Fi sensors, to come up with a list of locations where a user spent substantial time. In particular, a significant static or visited location is defined as one where a mobile user stayed within a 100 meter radius for more than 30 minutes. As noted earlier, this definition was essential for helping us create location-specific audio and activity models. In addition, we felt we could reuse the list of such visited locations to identify unusual locations where the user spent significant visiting time. This definition succinctly captures the case where a user travels to a new location, such as a new store or park, for the first time or rarely. We only need to compare the endpoint of the travel to determine whether it was unusual, and need not compare the entire path, or each 100 m increment along a transited path. Thus, our definition represents a low complexity solution for flagging certain types of unusual location behavior. Note that this definition does not capture the case where a user travels between two frequently visited locations but uses a new or rare path; this could be a direction for future work.

AvgPercentage(rank) = Avg(TotalTime(Location(User_x)) / TotalTime(AllLocations(User_x)), rank)    (4.2)

Based on this list of visited locations for each user, we then sought to create a ranked list showing where each user spent the most time, the second most time, and so on.
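A simple sketch of this 100 m / 30 minute rule follows, assuming GPS fixes are available as (timestamp in seconds, latitude, longitude) tuples; the clustering here is a simplified dwell detector for illustration, not the system's actual implementation.

```python
# Sketch: detect "significant static locations" from a GPS fix stream.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two fixes, in meters."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def static_visits(fixes, radius_m=100.0, min_stay_s=30 * 60):
    """Return (anchor_lat_lon, start_ts, end_ts) for each qualifying stay."""
    visits, i = [], 0
    while i < len(fixes):
        t0, lat0, lon0 = fixes[i]
        j = i
        # Extend the stay while successive fixes remain within the radius.
        while j + 1 < len(fixes) and haversine_m(
                lat0, lon0, fixes[j + 1][1], fixes[j + 1][2]) <= radius_m:
            j += 1
        if fixes[j][0] - t0 >= min_stay_s:
            visits.append(((lat0, lon0), t0, fixes[j][0]))
        i = j + 1
    return visits
```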

We determined the average proportion of total time each subject stayed within each different location within a one-week period, as shown in Equation 4.2. We individually measured the percentage of staying time in each of 10 locations for each of the 20 users, and ranked the locations in descending order of staying time. We then calculated the average percentage of staying time for each location in the same ranked position, 1 through 10, across all 20 subjects. The most infrequently visited places, which were usually visited for the shortest amount of time (either brand new locations or rarely visited ones), were likely to be unusual events in the subjects' lives. Figure 4.15 shows the average percentage of staying, or visiting, time subjects spent in each of the 10 different locations. We found that the subjects spent 92% of their one-week time period in four main repeatedly visited places: the 20 subjects spent two thirds (65%) of their total week's time in the first ranked location, 15% in the second ranked location, and 8% and 5%, respectively, in the third and fourth ranked locations.

Figure 4.16: Accuracy measurements of six thresholds (1% to 6%).

To determine the percentage cutoff for what constituted an unusual event in the subjects' lives, we validated with ground truth data that 2% was a reasonable, general threshold for identifying an unusual event for a subject (i.e., a place visited very infrequently or a brand new location visited in a one-week period). The majority of the 20 subjects confirmed that the places they visited less than 2% of the time, as shown in Figure 4.16, were actual places that constituted unusual events in their normal weekly schedule. For purposes of our analysis, and for querying the subjects, we defined an infrequently visited location as a place that was visited less than once a week for a short period of time or a place the subject had never visited before.
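For illustration, the ranking in Equation 4.2 and the 2% rarity rule might be computed as in the sketch below; the visit representation (location identifier and staying time in seconds) is an assumption.

```python
# Sketch: rank locations by share of weekly staying time and flag rare ones.
from collections import defaultdict

def location_time_shares(visits):
    """visits: iterable of (location_id, seconds_stayed) over one week."""
    totals = defaultdict(float)
    for loc, secs in visits:
        totals[loc] += secs
    grand = sum(totals.values()) or 1.0
    # Descending share of total staying time, i.e. the ranked list above.
    return sorted(((loc, t / grand) for loc, t in totals.items()),
                  key=lambda kv: kv[1], reverse=True)

def unusual_locations(visits, threshold=0.02):
    """Locations whose share of weekly staying time is below the 2% cutoff."""
    return [loc for loc, share in location_time_shares(visits) if share < threshold]
```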

Figure 4.17: Standard deviation of the location history collected from 20 subjects for one week.

Figure 4.17 shows the standard deviations of the length of time the 20 users spent at each location within a one-week period. We calculated these deviations for each of the 10 locations, previously rank-ordered from maximum to minimum visited time, across all 20 users' data. For example, for location 1, the location visited for the longest period of time by each user, we applied the standard deviation formula across all 20 users' data for that location to obtain the first standard deviation shown in the figure, and we repeated this process for each subsequent ranked location. The first location's deviation was 23%, larger than all the other ranked locations' deviations. The deviation shrinks for lower-ranked locations, showing that the staying time at each ranked location varies within a stable range.

Chapter 5

Fusion Algorithms and Evaluation

In this chapter, we describe how we identify an optimal fusion algorithm for unusual event detection. The goal of jointly fusing all three dimensions of mobile sensor data, activity, audio, and location, into a joint decision about whether an event is unusual is to improve the accuracy of the overall classification. The general idea is that jointly combining information from all the sensor modalities should strengthen the case that an event is unusual. In the following, we identify CI as our most desirable fusion algorithm, explore its performance as we adjust various parameters, make the case that supervised training of a general CI model approximates the accuracy of supervised training of a personalized CI model (thus avoiding manual labeling by the user), and show that the general CI model is more accurate than the simpler location-based audio and activity classifiers of the prior chapter.

5.1 Fusion algorithms

Since the final overall decision is binary, an event is classified either as unusual or not, we investigated binary fusion algorithms. We evaluated four different fusion algorithms and compared their accuracy for unusual event detection on the mobile users' daily activity, audio, and location data collected from the mobile sensors (accelerometer, audio, GPS, and Wi-Fi). We limited our algorithmic evaluation to data that the location sensors indicated came from repeatedly visited locations of the mobile users. The users' audio data analyzed with these algorithms comprised four audio patterns: low level noise, talking voice, music, and loud emotional voice.

The accelerometer data comprised three different impact patterns: stationary, walking, and running. These seven patterns were evaluated with the algorithms to determine whether a mobile user might be experiencing an unusual event, i.e., demonstrating unusual behavior or being involved in an unusual situation. We investigated these classification algorithms to find the best one for our system to use in determining unusual behavior events from these seven patterns. We evaluated four popular fusion classification algorithms: Bagging [77, 87], Adaboost [96, 58, 57, 86], SVM (Support Vector Machine) [49, 50, 37, 45, 65], and CI (Confidence Interval). These algorithms combine the location, activity, and audio classification results, using an OR expression together with their own optimized methods, to classify events as normal or unusual: whenever the result from one or more of the three main classifiers (i.e., location, activity, or audio) is above the given threshold, an unusual event is detected. We briefly describe these algorithms below.

Bagging: Bagging (bootstrap aggregating) is a machine learning ensemble method [75, 76] that trains each classifier individually (e.g., audio-talking voice, accelerometer-walking impact, etc.) on a random redistribution of the training set to improve the statistical classification. It creates different models from bootstrap samples of the training data, and builds an improved classification model using the averaged predictions and majority voting of the multiple models.

Adaboost: Adaboost is a boosting algorithm, another machine learning ensemble method, used to improve the performance of individual weak classifiers (e.g., audio-talking voice, accelerometer-walking impact, etc.). It is a widely used boosting algorithm that weights a set of weak classifiers according to a function of the classification error. Each weak classifier has a different accuracy, and Adaboost weights each of these classifiers accordingly to achieve a stronger classification. It constructs a stronger classifier as a linear combination of the weaker classifiers h_t(x), as shown in Equation 5.1:

f(x) = Σ_{t=1}^{T} α_t h_t(x)    (5.1)

Figure 5.1: Classification design using the four binary algorithms.

In the equation above, AdaBoost generates a new classifier in each of a series of rounds t = 1, 2, ..., T. A distribution of weights α is updated over the data set for the new classification. On each round, the algorithm applies weights to each of the individual weak classifiers and creates a stronger combined classification.

SVM (Support Vector Machine): SVM is another supervised learning algorithm, used for classification and regression. SVM is a binary, margin-based classification algorithm [49]. Given data in a multidimensional feature space (i.e., the seven patterns from the audio and accelerometer classifiers), it draws an optimal linear hyperplane that defines a boundary, serving as the threshold for classifying normal versus unusual event data. The algorithm finds the hyperplane with the maximum margin.

CI (Confidence Interval): The confidence interval method uses a normal distribution to determine the probability that the evaluated data falls above or below a specific threshold. It indicates the reliability of an estimate, such as whether the data falls outside the normally expected range (e.g., above a 95% confidence interval). The CI is a range of values above and below a hypothesized finding, and the method can be used as a binary classification algorithm according to whether or not an input value falls in that range.

We compared the performance of these four binary fusion classification algorithms, on our training and testing data, to find the best fusion algorithm to use. Figure 5.1 shows the classification design using the four binary fusion algorithms.
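As a hedged sketch of the comparison in Figure 5.1, the three supervised fusion classifiers could be trained side by side with scikit-learn on the seven-dimensional percentage vectors (four audio, three activity), with the CI classifier implemented separately by the threshold rule sketched earlier. The feature matrices and labels are assumed to be prepared elsewhere, and scikit-learn's default base estimators stand in for the thesis's exact ensemble construction.

```python
# Sketch: train Bagging, AdaBoost, and SVM on 7-dimensional segment features
# (percentages of the four audio and three activity classes) with binary
# unusual-event labels, and compare their f-measures on held-out data.
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

classifiers = {
    "Bagging": BaggingClassifier(n_estimators=10, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=10, random_state=0),
    "SVM": SVC(kernel="linear"),
}

def compare(X_train, y_train, X_test, y_test):
    """Fit each fusion classifier and report its f-measure on the test set."""
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        scores[name] = f1_score(y_test, clf.predict(X_test))
    return scores
```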

Comparing the four algorithms' performance accuracy, we then found the best of the four to use as the optimal classification model for our system.

5.2 Optimal Fusion Algorithms

We evaluated the seven behavior patterns of our mobile users (low level noise, talking voice, music, loud emotional voice, stationary status, walking activity, and running activity), measured by the audio and activity classifiers, with our four fusion algorithms to determine the best algorithm for identifying whether a mobile user is involved in an unusual situation. We used a four-day period of training data and a three-day period of testing data, analyzing 30-minute segments of the seven days of pooled data for the 20 mobile user subjects. The 20 subjects' training data were used to train four classification models designed to detect unusual events, one for each algorithm (Bagging, Adaboost, CI at 95 percent, and SVM). For each algorithm, we first used the training data (comprising ground truth data and the four days of the 20 subjects' collected behavior patterns) to build the classification model. We then used these models as the baseline from which to analyze the testing data and see which of the four algorithms achieved the best overall combined recall, precision, f-measure, and accuracy on the audio and activity data. For the Bagging and Adaboost algorithms, which require input parameters for their iteration cycles, we found that the fourth iteration yielded the best results, so we used this stage to obtain their performance measurements. A confidence interval of 95 percent yielded the best results in our previous investigation, so we used this cutoff for the CI algorithm: if the probability of a given audio or activity classification is less than 5% or greater than 95%, it is detected as an unusual event.

We analyzed the recall, precision, and accuracy of the four classification models, Bagging, Adaboost, CI, and SVM, to find the best performing one for identifying both normal and abnormal events. Figure 5.2 shows the performance of the four classification algorithms when detecting normal events. The best algorithm for detecting normal events was Bagging, with the highest combined performance of the four: 0.98 recall, 0.97 precision, 0.98 f-measure, and 0.96 accuracy.

Figure 5.2: Precision, recall, f-measure, and accuracy of the four classification algorithms for detecting normal events.

Figure 5.3: Precision, recall, f-measure, and accuracy of the four classification algorithms for detecting unusual events.

The least proficient of the four for normal events was the Adaboost algorithm, with 0.98 recall, 0.93 precision, 0.95 f-measure, and 0.92 accuracy.

Figure 5.3 shows the performance of the four classification algorithms when detecting abnormal/unusual events. Once again, the Bagging algorithm was the best performing of the four for detecting unusual events, with 0.84 recall, 0.87 precision, 0.86 f-measure, and 0.96 accuracy. The least proficient was again the Adaboost algorithm, with 0.56 recall, 0.81 precision, 0.66 f-measure, and 0.92 accuracy. One of the reasons we believe the Bagging algorithm performed best is its iterative process of finding the best average cutoff given the variance in the pooled subjects' measurements. The ground truth data collected from the 20 subjects, about which events they considered unusual, contained a lot of variance, since the subjects often defined unusual and usual events differently. The training data included and was based upon this data, so our training models' performance measurements also contained this variance. The Bagging algorithm is the least sensitive of the four to high variance or deviation in the data. The Adaboost algorithm, by contrast, handles high variance poorly, since it is very sensitive to variance and uses an iterative weighting process based on the variance across the data points in the data set; in this experiment, Adaboost was the poorest performing of the four algorithms. The CI algorithm, with a 95 percent CI, provided the second best results for our model, but was less optimal than Bagging. The SVM algorithm classified unusual events with a linear threshold; this approach was also less effective than both the Bagging and CI algorithms.

Figure 5.4 shows how the accuracy varied across each of the 20 subjects for each of the four classification models. Once more, the best performing algorithm was Bagging, which most accurately predicted the occurrence of unusual events for 38 percent of the users. The CI, SVM, and Adaboost algorithms most accurately predicted the occurrence of unusual events for 32, 17, and 11 percent of the users, respectively. The average accuracy across all 20 users for Bagging, Adaboost, CI, and SVM was 0.97, 0.92, 0.95, and 0.93, respectively.
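The recall, precision, f-measure, and accuracy figures quoted throughout this chapter are the standard binary-classification measures. The sketch below, using scikit-learn's metrics on binary unusual-event labels, is an illustration of how they can be computed rather than the evaluation code used in the thesis.

```python
# Sketch: the four performance measures for binary unusual-event predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def event_metrics(y_true, y_pred):
    """Binary labels: 1 = unusual event, 0 = normal event."""
    return {
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f-measure": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```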

Figure 5.4: The accuracy of the four classification algorithms for identifying unusual events, by individual survey subject.

Considering these findings across individuals and the results from the pooled testing data of the 20 users, we found the Bagging algorithm to be the best overall for classifying unusual events. However, the CI algorithm performed nearly as well in most cases, and also has the virtue of a relatively low complexity implementation. These factors favored incorporating the CI algorithm into our actual end-to-end system design.

5.3 Determining fusion parameters

Given our interest in the CI algorithm for its relative accuracy and simplicity, we wanted to understand how its performance varied as we adjusted various parameters, such as the duration of time over which data is collected and a classification made, the CI threshold for triggering detection of an unusual event, and the amount of training data, i.e., how fast the classifier converges.

5.3.1 Results for an optimal classification period

We sought to find an optimal time period over which to identify and classify unusual events. If the time period over which data is collected and grouped for classification is too long, it may mix unusual events with normal behavior and thus dilute detection of unusual events. If the time period is too short, there may not be enough data to distinguish between unusual and normal behavior.

Figure 5.5: Precision, recall, and accuracy according to classification period for detecting normal events: (a) activity data, (b) audio data.

Figure 5.6: Precision, recall, and accuracy according to classification period for detecting unusual events: (a) activity data, (b) audio data.

We analyzed four different time periods of 15, 30, 60, and 90 minute intervals, using the default CI of 95 percent as the classifier for investigating these intervals. We measured accuracy, precision, and recall in two cases: 1) for normal event detection, where the hypothesized normal event was defined as the true event; and 2) for abnormal event detection, where the hypothesized abnormal event was defined as the true event. Figure 5.5 shows the precision, recall, and accuracy for detecting normal events, measured for the four different time periods. We found that the 30 minute period had the best accuracy for detecting normal events, 0.98 for the activity data and 0.97 for the audio data. The precision and recall measurements were also very high: 0.98 and 0.99 for the activity data, and 0.99 and 0.98 for the audio data. All of the performance measurements (accuracy, precision, and recall) were quite high, with a very small standard deviation, for every time period, especially given that there were many normal events with true positive results.

We then applied these same performance measurements to detecting abnormal/unusual events within these time periods, to see which time period had the highest accuracy for detecting abnormal events. Figure 5.6 shows the precision, recall, and accuracy for detecting abnormal events across each of these time periods. The accuracy remained at the same high level for detecting abnormal events as for normal events. The combined pattern of accuracy, precision, and recall, however, was again highest for the 30 minute period. For the activity data of the abnormal events, precision and recall were 0.94 and 0.67, respectively, and for the audio data, 0.85 and 0.91, respectively. The 15 minute period correctly detected abnormal events, but more frequently misidentified usual events as abnormal. The 60 and 90 minute periods more often failed to identify unusual events because the time period was long enough to normalize unusual events into normal events.

5.3.2 Optimal threshold for the CI algorithm

We sought to find an optimal confidence interval for each audio and activity classifier. We assembled the 30-minute segments of the activity and audio data collected from the 20 phone survey subjects into two separate files of seven days' worth of data to be analyzed.
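As a small sketch of the threshold sweep performed in this subsection, a candidate confidence level c flags a segment whenever any of its per-type probabilities falls below 1 - c or above c, and the detection quality is then scored against the ground-truth labels. The arrays and their shapes are assumptions for illustration.

```python
# Sketch: sweep CI levels (90%, 95%, 99%) and score unusual-event detection.
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

def sweep_ci(probs, y_true, levels=(0.90, 0.95, 0.99)):
    """probs: (n_segments, n_types) per-type Gaussian probabilities;
    y_true: binary ground-truth labels (1 = unusual event)."""
    results = {}
    for c in levels:
        y_pred = np.any((probs < 1 - c) | (probs > c), axis=1).astype(int)
        results[c] = {
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "accuracy": accuracy_score(y_true, y_pred),
        }
    return results
```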

Figure 5.7: Precision, recall, and accuracy of the confidence interval in detecting normal events: (a) activity data, (b) audio data.

Figure 5.8: Precision, recall, and accuracy of the confidence interval in detecting unusual events: (a) activity data, (b) audio data.

We divided the one-week period of data into a four-day period for the training data, which was used for determining the normal distribution of the subjects' audio and activity data, and a three-day period, which was used for the testing data. We analyzed precision, recall, and accuracy on the testing data with CIs of 90 percent, 95 percent, and 99 percent, measuring them in two cases: 1) for normal event detection, where the hypothesized normal event was defined as the true event; and 2) for abnormal event detection, where the hypothesized abnormal event was defined as the true event.

Figure 5.7 shows the precision, recall, and accuracy at each confidence interval when detecting normal events. All three measurements were quite high at every CI, more than 0.97 each for the activity data and more than 0.96 each for the audio data. The CI of 95 percent yielded the best result for detecting normal events, with 0.98 for the activity data and 0.97 for the audio data; the 95 percent CI was best because the combined pattern of accuracy, recall, and precision was the highest overall at that level. We then analyzed the same segment of data with the same method to find the optimal CI for detecting unusual events. Figure 5.8 shows the precision, recall, and accuracy at each confidence interval when detecting abnormal events. This analysis also showed that the 95 percent CI was best for abnormal event detection. Although precision was highest at the 90 percent CI, because the unusual event detection range was wider and true positives increased, recall was lower at this level because of increased false negatives. At the 99 percent CI, precision decreased due to increased false positives. Therefore, the combined totals of accuracy, precision, and recall yielded the best result at the 95 percent CI.

5.3.3 Convergence speed of training

We analyzed the audio and activity sensor data to find the optimal number of days of training data with which to build our model for event detection. We used the three-day pool of testing data, with the 30-minute segmentation, from the 20 mobile phone subjects, and analyzed the data with the CI of 95 percent.

Figure 5.9: Precision, recall, and accuracy according to the number of days of training data for detecting normal events: (a) activity data, (b) audio data.

Figure 5.10: Precision, recall, and accuracy according to the number of days of training data for detecting unusual events: (a) activity data, (b) audio data.

Figure 5.11: Accuracy evaluation of the General model versus the Personalized model for identifying unusual mobile user events.

We found that the performance measurements for the activity and audio training data were consistently high from day one through day four for normal event detection. Even at the lowest-performing point (day one), the precision, recall, and accuracy measurements for the activity data were 0.97, 0.99, and 0.96, and for the audio data were 0.96, 0.93, and 0.91, respectively. However, the day-four measurements for the normal training data yielded the highest performance of any of the four days: at day four the measurements for the activity data increased to 0.98, 0.99, and 0.98, and for the audio data to 0.99, 0.98, and 0.97, respectively. For unusual event detection, we analyzed the sensor data and obtained the performance measurements for days one through four shown in Figure 5.10. The precision and recall measurements for day one's training data were 0.76 and 0.50 for the activity data, and 0.52 and 0.69 for the audio data, respectively. These precision and recall results were quite low in contrast to those obtained at day one for normal event detection. By day four, however, the precision and recall measurements for unusual event detection had increased substantially, to 0.94 and 0.67, respectively, for the activity data, and 0.85 and 0.91 for the audio data. Noting these results, and that day four provided the highest precision and recall readings across all four days for both normal and abnormal event detection, we felt confident in using a four-day pool of the subjects' training data for our analysis and retaining the remaining three days of the subjects' mobile survey data for our testing data.
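To make the CI-based detection concrete, the following minimal sketch (in Python, with illustrative values and helper names that are not taken from the thesis implementation) fits a per-location normal distribution to a few days of 30-minute training segments and flags a test segment as unusual when its value falls outside the chosen confidence interval.

    from statistics import mean, stdev

    # Two-sided z-scores for the confidence intervals explored above.
    Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

    def train_interval(training_values, ci=0.95):
        """Fit a normal-distribution interval to one feature (e.g., the fraction
        of 'walking' samples in a 30-minute segment) observed at one location."""
        mu, sigma = mean(training_values), stdev(training_values)
        z = Z[ci]
        return (mu - z * sigma, mu + z * sigma)

    def classify_segment(value, interval):
        """A segment whose feature falls outside the interval is flagged unusual."""
        lo, hi = interval
        return "unusual" if (value < lo or value > hi) else "normal"

    # Hypothetical example: per-segment walking fractions at one user's office,
    # four days of training followed by three test segments.
    train = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12]
    test = [0.11, 0.45, 0.13]

    interval = train_interval(train, ci=0.95)
    print([classify_segment(v, interval) for v in test])  # ['normal', 'unusual', 'normal']

Widening the interval to 99 percent flags fewer segments as unusual, while narrowing it to 90 percent flags more; the experiments above select 95 percent as the best overall trade-off.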

General versus Personalized Fusion Model

A key advantage of the general model is that if its accuracy can be shown to be roughly equivalent to that of a personalized classification model, then we can substitute the general model for the personalized one and thus avoid asking the user to manually label periodic data as unusual or not, which would otherwise be required to develop a personalized classification model under supervised learning. If the general model functions at a similarly high accuracy as the personalized model, we can be confident that hiring a small set of completely unrelated users to develop the general model is sufficient, and we can incorporate this general classifier into the mobile application to achieve nearly equivalent unusual event detection to that attained by a personally trained classification model.

To compare the accuracy of the general model with that of the personalized model, we used the highest-performing algorithm for the general model analysis, the Bagging algorithm, and the highest accuracy measurement of each individual's set of data for the personalized model. Figure 5.11 shows, side by side, the accuracy achieved for each participant's data when the general model was used versus when the personalized model was used. To obtain the accuracy measurement for each individual's data under the personalized model, we chose the highest accuracy measurement among that person's data. To obtain the accuracy measurement for the general model for each individual user, we used the Bagging algorithm trained on the combined training data of the 19 other users, excluding the chosen user's training data. Overall, we found the average personalized-model accuracy of 0.96 to be slightly better than the average general-model accuracy of 0.94, by 0.02. Although the personalized model's accuracy was higher for 50% of the users (10 of them), the general model's accuracy was higher for 15% and the same for 35% of the remaining users, as shown in the figure. Although the accuracy of the general model was a little lower than that of the personalized model, its accuracy was sufficiently high, and sufficiently consistent across all users, for us to be confident in adopting it for our research purposes. As a result of this comparative analysis, we adopted the general fusion model, because it was shown to be comparably accurate and more suitable for building our myblackbox system, making the system more scalable and usable for its users.

Figure 5.12: Comparing performance results of individual classifiers and the fusion algorithm using one week's data.

5.5 Fusion Performance vs. Individual Classifiers

To study how well the increased sensor dimensionality of our hybrid fusion approach improves classification accuracy compared to the more limited modalities of the classifiers derived in the previous chapter, we evaluated the performance of our fusion algorithm against the location-based audio and activity classifiers from the previous chapter. Figure 5.12 shows the accuracy, precision, recall, and f-measure [10] of our application's success in detecting unusual mobile user events for the three algorithms over the one-week period of 20 users' data. The accuracy for detecting unusual events was highest for the CI-based fusion algorithm compared to the location-based audio and activity algorithms, with measurements of 0.96, 0.94, and 0.87, respectively. The location-based activity classifier, when used alone, generated many false-negative classifications of unusual user events, as indicated by its 0.21 recall measurement. Thus the location-based activity classifier was the poorest performing of the three algorithms and the least useful for detecting unusual events on its own. The recall measurement for the location-based audio classifier was much better, at 0.62. However, we found that we could further improve unusual event detection for mobile users of our application by using our CI-based fusion algorithm, which performed best, with a recall measurement of 0.80 in this analysis.

The f-measure, which combines the recall and precision measurements, further validates that our CI-based fusion classifier is the highest performing of the three classifiers. The fusion f-measure of 0.84 was notably higher than both the activity classifier (0.32) and the audio classifier (0.76) results. Given the f-measure results and their individual components, recall and precision, we concluded that we could improve unusual event detection for mobile users of our application by using our CI-based fusion algorithm, which was the highest performing of the algorithms. The CI-based algorithm yielded a recall measurement 18 percentage points higher than that of the audio classifier. This may be helpful in situations where false negatives are very harmful for users of an unusual event detection system. For example, if the system is trying to detect an abnormal fight and produces a false negative, a fight occurred but went undetected.
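For reference, precision, recall, and f-measure can be computed directly from per-segment predictions against ground truth. The short sketch below does so for a single classifier, treating unusual segments as the positive class; the label lists are illustrative and are not data from the study.

    def prf(true_labels, predicted_labels, positive="unusual"):
        """Precision, recall, and f-measure for one classifier, with 'unusual' as the positive class."""
        tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == positive and p == positive)
        fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == positive and p != positive)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f_measure

    # Illustrative 30-minute segment labels (not study data).
    truth = ["normal", "normal", "unusual", "normal", "unusual", "normal"]
    fusion = ["normal", "normal", "unusual", "unusual", "unusual", "normal"]
    print(prf(truth, fusion))  # (0.667, 1.0, 0.8)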

Chapter 6

End-to-End myblackbox Mobile Cloud System

To study the feasibility of the myblackbox concept and the hybrid fusion classifier design, we built an end-to-end myblackbox mobile cloud system that enables mobile users to collect and monitor their own daily behavior pattern data on their phones, and a cloud server to collect the users' summary behavior data. We implemented the myblackbox mobile application and published it on the Android Market [2] to provide this behavior-tracking application to users and to test its real-world feasibility on the mobile phone. We also developed a cloud server that stores each user's summary data in a database and provides users password-protected access to a display of their historical behavior-pattern data upon request. The mobile application automatically transfers each user's summary data from the phone to the cloud server every 30 minutes. Previously we had tested our fusion algorithm manually on a server, but had not implemented it on a smartphone until now. Before collecting data from users through the Android Market, we had our research approved by the Institutional Review Board (IRB) [69]. After collecting one month of myblackbox system data from 15 mobile users who participated in our Android Market study, we were able to evaluate the feasibility of our fusion algorithm running on the smartphone. Our myblackbox system is still published on the Android Market and data is currently being collected.

The myblackbox mobile application [32] consists of four screens available to the users. Figure 6.1 shows screen shots of these four screens. The first three screens display the user's current behavior data. Screen 1 shows a daily map of the locations the user visited during the most recent period; it displays up to 100 data points and is updated every 3 to 5 minutes.

Figure 6.1: Screen shots of the myblackbox application.

Screen 2 shows the user's behavior patterns for a 24-hour period. The daily behaviors collected from each user are reported as percentages of occurrences of the different user activities logged in each 30-minute period. Data were collected from the user's audio, location, movement activity, and other signals measured by the sensors on the phone. This second screen is scrollable from the first to the last hour of the current day, and it is refreshed at midnight each day. The third screen shows summary results for each 30-minute segment of the user's behavior data, with our system's prediction of whether that 30-minute period constituted a normal or an unusual event in the user's daily schedule. The fourth screen was used initially to collect users' login information for accessing our web server and summaries of their historical behavior data. On this screen we also collected non-mandatory background information (gender, profession, and age) from users who were willing to provide it. Users only needed to provide a login and password on this screen if they wished to access their historical data on our web server.

For purposes of this research study we focused on analyzing one month of data collected from 15 subjects who used our Android Market-deployed application continuously for a four-week period. We collected and analyzed both location and activity measurements of the users. The users' location data was collected using the GPS and WiFi on the phone, and the users' daily activities and behaviors were collected using the phone's accelerometer, orientation, proximity, audio, and light sensors. Initially we tried to run the mobile sensors at all times and save the data in the mobile phone's database to capture a continuous, detailed record of the users' behaviors. However, the mobile sensors were power hungry and the storage space on the mobile phone is limited, so we adjusted how long and how often each sensor recorded.

Figure 6.2: Collection server using Spring Roo.

To conserve battery power, we set our program to measure accelerometer-based activities (e.g., walking, jumping, shaking the phone) every five seconds and to obtain and record subjects' locations using GPS every 5 minutes and WiFi [17] every 3 minutes. We also recorded the audio in the user's immediate area on the mobile phone for 15 seconds every 5 minutes, and stored the orientation, proximity, and light sensor data along with these recorded sounds to determine the user's behavior related to the audio. The battery of the mobile phone, when running the entire application, lasted for about 13 hours. This was enough time to measure a user's daily behavior patterns and activities, because users were normally back at home in the evenings and able to recharge the phone at the end of each day.

For the myblackbox server we built two components: a collection component and a web component. The collection component of the server retrieves and logs sensor data from users' mobile phones. It also stores the summary data, processed on the phone from the sensor and prediction data, which is analyzed every 30 minutes on each smartphone. The mobile application compiles this summary data into a JSON [22] file and sends it to the collection component of the server. The server receives the JSON data using the Spring Roo framework [31, 26] and stores it in a MongoDB [29] database, as shown in Figure 6.2. The web component [33] of the myblackbox server accesses the MongoDB database and processes the data in order to display the user's location and activity data as a geographical map and a line graph, respectively.
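The collection component itself is implemented with Spring Roo in Java; purely as an illustrative sketch of the storage step in Python, the snippet below parses one 30-minute summary payload and inserts it into MongoDB. The database name, collection name, and field names are assumptions chosen for illustration, not the system's actual schema.

    import json
    from pymongo import MongoClient  # illustrative only; the thesis server uses Spring Roo / Java

    client = MongoClient("mongodb://localhost:27017")
    summaries = client["myblackbox"]["summaries"]  # hypothetical database/collection names

    def store_summary(json_payload: str) -> None:
        """Parse one 30-minute summary sent by the phone and store it."""
        doc = json.loads(json_payload)
        summaries.insert_one(doc)

    # Hypothetical payload resembling what a phone might send every 30 minutes.
    payload = json.dumps({
        "device_id": "example-device-1",
        "period_start": "2013-03-01T09:00:00",
        "location": {"lat": 40.0076, "lon": -105.2659},
        "activity_percent": {"stationary": 70, "walking": 25, "running": 5},
        "audio_percent": {"background": 80, "voice": 20},
        "prediction": "normal",
    })
    store_summary(payload)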

Figure 6.3: Web server using MongoDB.

The mobile users can then view up to a month's worth of historical records summarizing their personal behavior patterns, including a map of visited locations, activity and audio patterns, and mobile phone status, as shown in Figure 6.3.

For the smartphone we created five tables within an SQLite [19] database stored on the phone, and for our server we built four mirror tables in the MongoDB database. The major difference between the databases is that the mobile phone's SQLite database contains one additional table, a raw sensor data table, which the MongoDB database does not need. The five tables stored by the mobile application are: raw sensor data, 30-minute period summary data, 30-minute period prediction data, historical data, and user login data. The mobile application analyzes activity, location, audio recognition, and other mobile status data every five seconds and stores the results as raw sensor data. The application summarizes the data logged in the raw sensor data table and stores it in the 30-minute period summary data table. The historical data table is created by calculating the average activity patterns of one user's summary data for each location visited. The 30-minute period prediction data table then uses both the historical data and 30-minute period summary data tables to predict a user's next 30-minute behavior pattern in a given location. The mobile phone sends four of its five tables to the server, all but the raw sensor data table. The server stores all of the data collected and processed on the mobile phone in its MongoDB database, and additionally receives each phone's unique ID when these tables are sent to the server. The login data table, sent from the phone to the server, is used to grant access to the user's historical summary data on the server, if the user chooses to enter login data on the phone.
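To make the phone-side storage concrete, the sketch below creates simplified versions of the five tables using Python's sqlite3 module. The column names and types are assumptions for illustration; the actual Android schema is not reproduced in the thesis text.

    import sqlite3

    # Simplified, illustrative schema for the five phone-side tables; column names are assumptions.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS raw_sensor_data (
        ts TEXT, location_id INTEGER, activity TEXT, audio_class TEXT, phone_status TEXT);
    CREATE TABLE IF NOT EXISTS summary_30min (
        period_start TEXT, location_id INTEGER, activity_percent TEXT, audio_percent TEXT);
    CREATE TABLE IF NOT EXISTS prediction_30min (
        period_start TEXT, location_id INTEGER, predicted_label TEXT);
    CREATE TABLE IF NOT EXISTS historical_data (
        location_id INTEGER, avg_activity_percent TEXT, avg_audio_percent TEXT);
    CREATE TABLE IF NOT EXISTS user_login (
        username TEXT, password_hash TEXT);
    """

    conn = sqlite3.connect("myblackbox_sketch.db")
    conn.executescript(SCHEMA)

    # The raw_sensor_data table stays on the phone; the other four are mirrored to the server.
    TABLES_SENT_TO_SERVER = ["summary_30min", "prediction_30min", "historical_data", "user_login"]
    conn.close()

Only the four tables named in TABLES_SENT_TO_SERVER would be serialized and uploaded, mirroring the design described above; the raw sensor data table stays on the phone.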

Chapter 7

myblackbox Performance Evaluation

In this chapter, we evaluate the performance of the end-to-end myblackbox system deployed in the previous chapter.

7.1 Accuracy of the Fusion Algorithms

Based on the one month's worth of data collected from our end-to-end real-world deployment, we investigated the eight behavior patterns of our mobile users (low-level background noise, talking voice, music, loud emotional voice, stationary status, slow walking, walking activity, and running activity), measured by the audio and activity classifiers, with our four fusion algorithms to determine the best algorithm for identifying whether a mobile user is involved in an unusual or a normal situation. We used a fifteen-day period of training data and a fifteen-day period of testing data, analyzed in 30-minute segments, for the 15 mobile user subjects. The 15 subjects' training data were used to train four classification models designed to detect unusual events. We used four algorithms (Bagging, Adaboost, CI at 95 percent, and SVM) to develop our unusual event classification models. For each algorithm, we first used the training data (comprising the ground truth data and the fifteen days of behavior pattern data collected from the 15 subjects) to build the classification model. We then used these models to analyze the testing data and see which of the four algorithms yielded the best overall combined recall, precision, and accuracy for the audio and activity data.

Figure 7.1: Precision, recall, and accuracy of four classification algorithms for detecting normal events.

Figure 7.2: Precision, recall, and accuracy of four classification algorithms for detecting unusual events.

Figure 7.1 shows the precision, recall, accuracy, and f-measure (the combination of recall and precision) performance measurements of the four classification algorithms when they are used to detect normal events. We found the best algorithm for detecting normal events to be the Bagging algorithm, with the highest combined performance measurements of the four algorithms: 0.95 recall, 0.93 precision, 0.95 f-measure, and 0.91 accuracy. The least proficient of the four was the Adaboost algorithm, with 0.91 recall, 0.96 precision, 0.94 f-measure, and 0.90 accuracy. Figure 7.2 shows the performance measurements of the four classification algorithms when they are used to detect abnormal/unusual events. Once again, the Bagging algorithm was the best performing of the four for detecting unusual events, with the highest accuracy measurement of 0.91, along with 0.66 recall, 0.73 precision, and 0.69 f-measure. The least proficient of the four was again the Adaboost algorithm, with 0.82 recall, 0.65 precision, 0.73 f-measure, and 0.90 accuracy. We found that the low-complexity CI algorithm, with a 95 percent CI, provided the second-best results for our model and was only slightly less optimal than the Bagging algorithm. In addition, the CI algorithm performed as consistently and efficiently as the Bagging algorithm. Because its performance was highly comparable to that of Bagging, and because it has the lowest complexity of the four fusion algorithms, we determined the CI algorithm to be the most suitable hybrid fusion algorithm for implementing our myblackbox application on the mobile phone.

The Adaboost algorithm was especially effective at yielding fewer false negatives than any of the other algorithms, as demonstrated by its recall measurement of 0.82. Adaboost is, however, one of the worst of these algorithms when dealing with high variance in the data, since it is very sensitive to extreme variance and uses an iterative weighting process based on the variance across the data points in the data set. It had notably lower accuracy than the others when analyzed over just one week of data, as shown in an earlier chapter. With one month's worth of data, however, Adaboost was less sensitive to the variance in the data and its accuracy became comparable to that of the other three algorithms. Although Adaboost was the least proficient algorithm in terms of precision, generating a higher number of false positives, it achieved the highest recall measurement. Since all of the fusion algorithms achieved commensurate accuracy in detecting unusual events, it may be useful to apply Adaboost when there is a premium on keeping the rate of false negatives low and not as much emphasis is placed on false positives.
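The supervised fusion models compared above can be reproduced in outline with off-the-shelf learners. The sketch below uses scikit-learn to train Bagging, AdaBoost, and SVM classifiers on fused per-segment feature vectors and reports precision, recall, and f-measure for the unusual class. The synthetic features and labels are placeholders rather than the deployment data, and the CI baseline would be computed separately, as in the earlier CI sketch.

    import numpy as np
    from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import precision_recall_fscore_support

    # Placeholder fused feature vectors: one row per 30-minute segment, e.g.
    # [stationary%, slow-walk%, walk%, run%, background%, voice%, music%, loud-voice%].
    rng = np.random.default_rng(0)
    X_train = rng.random((200, 8))
    y_train = rng.integers(0, 2, 200)   # 1 = unusual, 0 = normal (synthetic labels)
    X_test = rng.random((60, 8))
    y_test = rng.integers(0, 2, 60)

    models = {
        "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
        "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
        "SVM": SVC(kernel="rbf"),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        p, r, f, _ = precision_recall_fscore_support(
            y_test, model.predict(X_test), average="binary", pos_label=1, zero_division=0)
        print(f"{name}: precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")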

7.2 Noise Removal

We based our CI algorithm computations for each individual user of our application on that user's normal distribution of behavior pattern data, collected over one-month periods in the same locations. In our earlier algorithmic analysis, we found that the 95 percent confidence interval yielded the second-best results when comparing the accuracy of the various algorithms (Bagging, Adaboost, SVM, and confidence interval). We therefore chose the CI algorithm, with a 95 percent cutoff, for our feasibility study of the fusion algorithm running on the smartphone in the real world.

When we analyzed the 15 participants' data for 30-minute segments in which the users were stationary, we identified two activities, game-playing and phone calling, that yielded false positives of unusual event detection for our 95% CI algorithm. When we initially included these two stationary user events, we found unexpected noise data (false-positive identifications of unusual user events) in our results that decreased the accuracy of our normal distribution calculation for each user, because these activities were incorrectly identified as unusual events when they were actually normal events. For example, consider user-movement activity data measured when a user is stationary (with the phone carried in a pocket), contrasted with data measured when the same user is holding the device and playing a game on it. The stationary game-playing activity is incorrectly flagged as an unusual event relative to the normal activity expected when the user is simply sitting stationary with the phone on their person. If we were to include both instances of stationary user activity (sitting and game-playing) in the distribution, the accuracy of the normal distribution calculation would be significantly decreased. This example illustrates the difficulty in accurately measuring any user-phone interaction activities, such as texting, web-surfing, or Skyping. In addition to the noise data generated by user-phone interaction activities such as game-playing and texting, we discovered that users' phone calling activity also could not be incorporated accurately into our normal distribution measurement. The myblackbox application on the mobile phone cannot record audio while a user is making a phone call, because only one application can access the audio device at a time.

Figure 7.3: Average precision, recall, f-measure, and accuracy for detecting unusual events using CI, with and without noise removal.

Additionally, when users charge their smartphone, no activity is generated, and if we were to include this charging data in our normal distribution, it would interfere with normal and unusual event detection. Given these findings, we determined that each of these activities (user-phone interaction, phone calling, and battery charging) could safely be classified as a normal user event, but that they contributed extra noise to our historical normal distribution (ND) calculation, so we eliminated all of them from our measurement. To remove these incorrectly classified and difficult (or impossible) to measure user events, we used status sensors on the phone to check whether the mobile user was calling, charging, or turning on the screen. When the mobile phone was charging, we did not include this activity in our historical ND measurement, but we did access the audio sensor and collect any audio data that was generated. For example, if a user charged their phone at night by the bed, we could still collect the sounds generated, such as low regular breathing versus talking (e.g., in their sleep or talking near the phone). Similarly, the audio sensor collected voice data while the phone was charging on a counter or table whenever the user or another person was standing next to the phone. By combining the status sensors, which determine that the user had turned on the screen (e.g., to play a game, web-surf, or text) or initiated a voice call, with the audio sensor detecting the user's voice, we detected periods of phone use or calling and eliminated the corresponding audio and activity patterns from our normal distribution's calculation of unusual event detection.
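As a concrete illustration of this filtering step, the minimal sketch below drops raw 30-minute samples whose phone-status flags indicate a call, charging, or screen-on interaction before the historical normal distribution is computed. The record structure and flag names are assumptions for illustration rather than the application's actual schema.

    from statistics import mean, stdev

    def is_noise(sample):
        """Status-sensor flags indicating calling, charging, or screen-on phone interaction."""
        return sample.get("in_call") or sample.get("charging") or sample.get("screen_on")

    def historical_distribution(samples, feature="walking_percent"):
        """Mean and standard deviation of one behavior feature, after noise removal."""
        values = [s[feature] for s in samples if not is_noise(s)]
        return mean(values), stdev(values)

    # Hypothetical 30-minute samples at one location (flags and values are made up).
    samples = [
        {"walking_percent": 20, "in_call": False, "charging": False, "screen_on": False},
        {"walking_percent": 0,  "in_call": False, "charging": True,  "screen_on": False},  # dropped
        {"walking_percent": 25, "in_call": False, "charging": False, "screen_on": False},
        {"walking_percent": 5,  "in_call": True,  "charging": False, "screen_on": True},   # dropped
        {"walking_percent": 22, "in_call": False, "charging": False, "screen_on": False},
    ]
    print(historical_distribution(samples))  # computed from only the three clean samples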

Figure 7.4: Average precision, recall, f-measure, and accuracy for detecting normal events using CI, with and without noise removal.

We investigated three performance measurements (accuracy, precision, and recall) for detecting unusual events in the mobile users' daily behavior patterns while the 15 participants of our study carried their mobile phones for a period of one month. We analyzed and compared the measurements in two cases: 1) using the original collected data, and 2) after identifying and removing the noise data (e.g., phone calling, game playing, charging). Figure 7.3 shows the measurements for precision, recall, f-measure, and accuracy of correctly detecting unusual events using the CI, averaged across the 15 mobile user participants. When we analyzed the original raw data of the 15 mobile users, we found the average accuracy to be 0.88, but after removing the noise data the accuracy increased to 0.90. Removing the noise data also improved the average precision by decreasing false positives that indicated unusual events when there were none. Figure 7.4 shows that our method of removing noise data was likewise successful for normal event detection: by removing noise data in this scenario too, we improved the overall accuracy and recall measurements by reducing the percentage of false negatives when predicting normal events. Figure 7.5 plots each individual user's accuracy under the 95% CI in two ways: one line shows the accuracy achieved on the raw data, while the other shows the accuracy achieved after all noise data was removed. After removing the noise data, we found that the accuracy across these 15 mobile users increased by anywhere from slightly over 0% to 6%. We found our noise removal method to be suitable for increasing the accuracy of user event detection.

Figure 7.5: Individual accuracy measurements for event detection using CI, with and without noise removal.

7.3 Fusion Performance vs. Location-based Activity and Audio Classifiers

We investigated how well our hybrid fusion algorithm performs by comparing it to the location-based audio and activity classifiers. We analyzed the performance of these three algorithms over one month's worth of data from 15 users of the myblackbox application and system. Figure 7.6 shows the accuracy, precision, recall, and f-measure [10] measurements. The accuracy for detecting unusual events was highest for the CI-based fusion algorithm compared to the audio and activity algorithms, with measurements of 0.91, 0.90, and 0.85, respectively. The activity classifier, when used alone, generated many false-negative classifications of unusual user events, as indicated by its 0.12 recall measurement. Thus the activity classifier was the poorest performing of the three algorithms and the least useful for detecting unusual events on its own. The recall measurement for the audio classifier was much better, at 0.60. However, we found that we could further improve unusual event detection for mobile users of our application by using our CI-based fusion algorithm, which performed best, with a recall measurement of 0.71 in this analysis. The f-measure, which combines the recall and precision measurements, further validates that our CI-based fusion classifier is the highest performing of the three classifiers. The fusion f-measure of 0.69 was higher than both the activity classifier (0.18) and the audio classifier (0.60) results.

Figure 7.6: Comparing performance results of the hybrid fusion algorithm to location-based activity and audio classifiers using one month of data.

Figure 7.7: Performance results of our location classifier using one month's data (predicted normal locations, actual normal locations, and total actual locations per participant).

Figure 7.8: One hour of CPU usage of the myblackbox application.

7.4 Performance of Location Classifier

We investigated how well our location algorithm performs by analyzing the performance of our location classifier over one month's worth of data from the 15 users. We collected ground truth data from each of the 15 users to determine how well our algorithm identified normal or unusual visited locations for each user. We detected 133 locations visited by the participants during the one-month period, an average of about 9 locations per participant. We also calculated the accuracy of our location classifier in dividing the visited locations into unusual and normal locations. Figure 7.7 shows, for each participant, the number of actual normal locations identified from the ground truth data we collected, compared to the number of normal locations our classifier predicted, along with the combined total of actual normal and actual unusual locations. We checked our previously selected 2% cutoff, used for predicting unusual locations, by running our algorithm against this ground truth data and measuring the resulting accuracy of our approach.

7.5 myblackbox System Evaluation

We evaluated the performance of the end-to-end myblackbox application and system in terms of its CPU/storage/network footprint, robustness, scalability, and energy usage. In order to measure our system's data usage requirements on the mobile phone, we performed a test with an HTC Inspire 4G phone [18] that had the Android operating system, our application, and a monitoring application installed on it.

Figure 7.9: Evaluation of (a) network and (b) storage I/O usage of the myblackbox application.

For a period of approximately 90 minutes, we measured the CPU, storage I/O, and network data transfer usage required to operate our myblackbox application on the mobile phone. For our application to be practical for use by many people, it must not crash or interfere with other applications while running on users' mobile phones, and it needs to support a large number of users simultaneously for the system to be scalable. We analyzed the CPU usage of our mobile application using the System Monitor mobile application [34], which can log the CPU usage of mobile phones. Figure 7.8 shows the CPU usage of the myblackbox application, monitored over an 80-minute window. We operated our myblackbox application while running the System Monitor application as a background service. Over about 90 minutes, our application collected and measured the summary data and prediction data gathered over three 30-minute periods. The average CPU usage was only 14%, which should minimally interfere with most other mobile applications. However, when classifying the recorded audio files (recorded every 5 minutes for 15 seconds), CPU usage increased to 100% for 7 seconds. We are considering techniques to spread this classification over a longer duration in order to reduce peak CPU usage. While measuring CPU usage, we also investigated the network data transfer and the mobile phone's storage I/O when operating the myblackbox application for approximately 90 minutes, as shown in Figure 7.9. The average storage I/O usage of our application on the mobile phone was 2%, with a maximum usage of 32%, a maximum of 666 kilobytes per second, and an average of about 1 kilobyte per second.
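The paragraph above mentions that we are considering spreading audio classification over a longer duration to reduce peak CPU load. One simple form such a technique could take, sketched here with assumed chunk sizes and a placeholder feature extractor, is to process the 15-second clip in short chunks and yield the CPU between chunks instead of classifying the whole recording in one burst.

    import time

    CHUNK_SECONDS = 3      # assumed chunk length; the recorded clip itself is 15 seconds
    PAUSE_SECONDS = 0.5    # idle gap between chunks to keep CPU bursts short

    def extract_features(chunk):
        """Placeholder for per-chunk audio feature extraction (not the thesis implementation)."""
        return sum(chunk) / max(len(chunk), 1)

    def classify_clip(samples, sample_rate=8000):
        """Process a recorded clip incrementally instead of in one long burst."""
        chunk_len = CHUNK_SECONDS * sample_rate
        features = []
        for start in range(0, len(samples), chunk_len):
            features.append(extract_features(samples[start:start + chunk_len]))
            time.sleep(PAUSE_SECONDS)  # yield the CPU between chunks
        # A real classifier would map the aggregated features to an audio class;
        # here we simply return the aggregate as a stand-in.
        return sum(features) / len(features)

    # Hypothetical 15 seconds of audio samples at 8 kHz (all zeros, for illustration only).
    print(classify_clip([0.0] * (15 * 8000)))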


More information

Enhancing Bluetooth Location Services with Direction Finding

Enhancing Bluetooth Location Services with Direction Finding Enhancing Bluetooth Location Services with Direction Finding table of contents 1.0 Executive Summary...3 2.0 Introduction...4 3.0 Bluetooth Location Services...5 3.1 Bluetooth Proximity Solutions 5 a.

More information

IoT. Indoor Positioning with BLE Beacons. Author: Uday Agarwal

IoT. Indoor Positioning with BLE Beacons. Author: Uday Agarwal IoT Indoor Positioning with BLE Beacons Author: Uday Agarwal Contents Introduction 1 Bluetooth Low Energy and RSSI 2 Factors Affecting RSSI 3 Distance Calculation 4 Approach to Indoor Positioning 5 Zone

More information

AGENTLESS ARCHITECTURE

AGENTLESS ARCHITECTURE ansible.com +1 919.667.9958 WHITEPAPER THE BENEFITS OF AGENTLESS ARCHITECTURE A management tool should not impose additional demands on one s environment in fact, one should have to think about it as little

More information

Long Range Acoustic Classification

Long Range Acoustic Classification Approved for public release; distribution is unlimited. Long Range Acoustic Classification Authors: Ned B. Thammakhoune, Stephen W. Lang Sanders a Lockheed Martin Company P. O. Box 868 Nashua, New Hampshire

More information

Thesis: Bio-Inspired Vision Model Implementation In Compressed Surveillance Videos by. Saman Poursoltan. Thesis submitted for the degree of

Thesis: Bio-Inspired Vision Model Implementation In Compressed Surveillance Videos by. Saman Poursoltan. Thesis submitted for the degree of Thesis: Bio-Inspired Vision Model Implementation In Compressed Surveillance Videos by Saman Poursoltan Thesis submitted for the degree of Doctor of Philosophy in Electrical and Electronic Engineering University

More information

NTU Robot PAL 2009 Team Report

NTU Robot PAL 2009 Team Report NTU Robot PAL 2009 Team Report Chieh-Chih Wang, Shao-Chen Wang, Hsiao-Chieh Yen, and Chun-Hua Chang The Robot Perception and Learning Laboratory Department of Computer Science and Information Engineering

More information

Pervasive and mobile computing based human activity recognition system

Pervasive and mobile computing based human activity recognition system Pervasive and mobile computing based human activity recognition system VENTYLEES RAJ.S, ME-Pervasive Computing Technologies, Kings College of Engg, Punalkulam. Pudukkottai,India, ventyleesraj.pct@gmail.com

More information

User-Centric Power Management For Mobile Operating Systems

User-Centric Power Management For Mobile Operating Systems Wayne State University Wayne State University Dissertations 1-1-2016 User-Centric Power Management For Mobile Operating Systems Hui Chen Wayne State University, Follow this and additional works at: http://digitalcommons.wayne.edu/oa_dissertations

More information

Transformation to Artificial Intelligence with MATLAB Roy Lurie, PhD Vice President of Engineering MATLAB Products

Transformation to Artificial Intelligence with MATLAB Roy Lurie, PhD Vice President of Engineering MATLAB Products Transformation to Artificial Intelligence with MATLAB Roy Lurie, PhD Vice President of Engineering MATLAB Products 2018 The MathWorks, Inc. 1 A brief history of the automobile First Commercial Gas Car

More information

MOBILE COMPUTING 1/29/18. Cellular Positioning: Cell ID. Cellular Positioning - Cell ID with TA. CSE 40814/60814 Spring 2018

MOBILE COMPUTING 1/29/18. Cellular Positioning: Cell ID. Cellular Positioning - Cell ID with TA. CSE 40814/60814 Spring 2018 MOBILE COMPUTING CSE 40814/60814 Spring 2018 Cellular Positioning: Cell ID Open-source database of cell IDs: opencellid.org Cellular Positioning - Cell ID with TA TA: Timing Advance (time a signal takes

More information

Bluetooth Low Energy Sensing Technology for Proximity Construction Applications

Bluetooth Low Energy Sensing Technology for Proximity Construction Applications Bluetooth Low Energy Sensing Technology for Proximity Construction Applications JeeWoong Park School of Civil and Environmental Engineering, Georgia Institute of Technology, 790 Atlantic Dr. N.W., Atlanta,

More information

Stress Testing the OpenSimulator Virtual World Server

Stress Testing the OpenSimulator Virtual World Server Stress Testing the OpenSimulator Virtual World Server Introduction OpenSimulator (http://opensimulator.org) is an open source project building a general purpose virtual world simulator. As part of a larger

More information

Week 6: Location tracking and use

Week 6: Location tracking and use Week 6: Location tracking and use Constandache, Bao, Azizyan, and Choudhury. Did You See Bob?: Human Localization using Mobile Phones Philip Cootey pcootey@wpi.eduedu CS 525w Mobile Computing (03/01/11)

More information

3432 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 10, OCTOBER 2007

3432 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 10, OCTOBER 2007 3432 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 53, NO 10, OCTOBER 2007 Resource Allocation for Wireless Fading Relay Channels: Max-Min Solution Yingbin Liang, Member, IEEE, Venugopal V Veeravalli, Fellow,

More information

CymbIoT Visual Analytics

CymbIoT Visual Analytics CymbIoT Visual Analytics CymbIoT Analytics Module VISUALI AUDIOI DATA The CymbIoT Analytics Module offers a series of integral analytics packages- comprising the world s leading visual content analysis

More information

Technologies that will make a difference for Canadian Law Enforcement

Technologies that will make a difference for Canadian Law Enforcement The Future Of Public Safety In Smart Cities Technologies that will make a difference for Canadian Law Enforcement The car is several meters away, with only the passenger s side visible to the naked eye,

More information

第 XVII 部 災害時における情報通信基盤の開発

第 XVII 部 災害時における情報通信基盤の開発 XVII W I D E P R O J E C T 17 1 LifeLine Station (LLS) WG LifeLine Station (LLS) WG was launched in 2008 aiming for designing and developing an architecture of an information package for post-disaster

More information

Integrated Driving Aware System in the Real-World: Sensing, Computing and Feedback

Integrated Driving Aware System in the Real-World: Sensing, Computing and Feedback Integrated Driving Aware System in the Real-World: Sensing, Computing and Feedback Jung Wook Park HCI Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA, USA, 15213 jungwoop@andrew.cmu.edu

More information

TIME- OPTIMAL CONVERGECAST IN SENSOR NETWORKS WITH MULTIPLE CHANNELS

TIME- OPTIMAL CONVERGECAST IN SENSOR NETWORKS WITH MULTIPLE CHANNELS TIME- OPTIMAL CONVERGECAST IN SENSOR NETWORKS WITH MULTIPLE CHANNELS A Thesis by Masaaki Takahashi Bachelor of Science, Wichita State University, 28 Submitted to the Department of Electrical Engineering

More information

Comparison of Receive Signal Level Measurement Techniques in GSM Cellular Networks

Comparison of Receive Signal Level Measurement Techniques in GSM Cellular Networks Comparison of Receive Signal Level Measurement Techniques in GSM Cellular Networks Nenad Mijatovic *, Ivica Kostanic * and Sergey Dickey + * Florida Institute of Technology, Melbourne, FL, USA nmijatov@fit.edu,

More information

COUNTRIES SURVEY QUESTIONNAIRE

COUNTRIES SURVEY QUESTIONNAIRE COUNTRIES SURVEY QUESTIONNAIRE The scope of part A of this questionnaire is to give an opportunity to the respondents to provide overall (generic) details on their experience in the safety investigation

More information

4G Broadband: Bridging to Public Safety Land Mobile Networks

4G Broadband: Bridging to Public Safety Land Mobile Networks Andrew Seybold, Inc., 315 Meigs Road, A-267, Santa Barbara, CA 93109 805-898-2460 voice, 805-898-2466 fax, www.andrewseybold.com 4G Broadband: Bridging to Public Safety Land Mobile Networks June 2, 2010

More information

Aerospace Sensor Suite

Aerospace Sensor Suite Aerospace Sensor Suite ECE 1778 Creative Applications for Mobile Devices Final Report prepared for Dr. Jonathon Rose April 12 th 2011 Word count: 2351 + 490 (Apper Context) Jin Hyouk (Paul) Choi: 998495640

More information