On-site Traffic Accident Detection with Both Social Media and Traffic Data

Zhenhua Zhang
Civil, Structural and Environmental Engineering
University at Buffalo, The State University of New York, Buffalo, NY, USA

Qing He 1
Civil, Structural and Environmental Engineering and Industrial and Systems Engineering
University at Buffalo, The State University of New York, Buffalo, NY, USA
Email: qinghe@buffalo.edu

1. Introduction

Social media receives increasing attention as a source of crowdsourced information for traffic operations and management. One recent line of research uses social media (e.g., tweets) to detect on-site traffic accidents (Gu et al., 2016). However, the shortcomings of using tweets directly as a detector of traffic accidents are almost as obvious as the merits. Two major challenges must be addressed before tweets can be used in traffic accident detection.

First, compared to events that arouse enormous public concern, such as key basketball games, extreme weather or traditional festivals, the influence of traffic accidents is comparatively small. From our observation, tweets related to traffic accidents are therefore few in number. What's more, most of them are confined to a small area and to a relatively short time interval.

Second, the challenge in tweets lies in the inherent complexity and unstructured nature of the data, in particular language ambiguity (Chen et al., 2014). A tweet is limited to 140 characters, which is not long enough for accurate automatic language processing using keyword pairs. For example, "internet traffic is slow" and "internet shows traffic is slow" may deliver totally different information. In addition, it remains unknown how effective social-media-based detection methods are compared with traditional loop-detector-based methods.

To address the above challenges, we propose a method that combines traffic-related metrics and tweet information for accurate real-time detection of on-site traffic accidents.
In principle, the fusion of multi-source data provides significant advantages over single-source data (Hall and Llinas, 1997), and the integration of data sources is expected to produce more synthetic and informative results.

2. Data Description and Models

The study area is the vast road network of Northern Virginia (NOVA). The area has long been known for its heavy traffic (Cervero, 1994). It is a typical rural road network with more than 1,200 signalized intersections. Each intersection is equipped with an average of 12 lane-based loop detectors, amounting to nearly 15,000 detectors in NOVA.

(1) Traffic data, including traffic flow and occupancy, were collected by these loop detectors at an interval of 15 minutes for 12 months, from January 2014 to December 2014.
(2) Tweet data were collected through the Twitter Streaming API with a geo-location filter. Filtering by coordinates, we extracted only tweets posted from the NOVA region. There are more than 584,000 tweets throughout the year 2014, and all of them have specific date and location information.
(3) The accident data were extracted from the traffic incident database maintained by the Virginia Department of Transportation (VDOT), with detailed time and location information. The accidents include collisions, disabled vehicles, vehicles on fire, etc.

1 Corresponding author.
We employ support vector machines (SVMs) as our classification model. We first label the tweets manually and then select the corresponding features separately from the traffic data and the tweet data. In training the classification model, we further use 5-fold cross-validation to improve the accuracy of the fitted model.

3. Feature selection based on tweet data

Tweet features are extracted from the keywords of candidate tweets. From the whole tweet database, we obtain nearly 1,500 eligible tweets by applying the following rules:

(1) Include tweets that contain any of: accident, incident, crash, collision, head on, damage, pile up, rear end, rear-end, sideswipe, lost control, rolled over, roll over, tailgating.
(2) Include words that are relevant to accidents but apparently misspelled or personally modified, such as acident, incdent, etc.
(3) Include other variations of accident-related words, such as word pairs written with a hyphen (roll-over, etc.).
(4) Exclude words related to transportation authorities or news media.

In single feature selection, each tweet is further decomposed into separate words, which are called tokens in this paper. We then select the useful token features in three steps: stop-word filtering, keyword stemming and correlated-word filtering. The first two steps are illustrated in Figure 1.

Figure 1 Steps of token filtering and stemming. The example tweets T1 ("I saw a traffic accident in front.") and T2 ("Car damage on Route 1.") from the tweet database T are tokenized, filtered for stop words, and stemmed.

In correlated-word filtering, we select those tokens that may correlate with our traffic accident label. The correlation benchmark we choose is the phi coefficient.
The coefficient (usually denoted as $\phi$) between two binary variables $x$ and $y$ is calculated as:

\[
\phi = \frac{n_{11}\, n_{00} - n_{10}\, n_{01}}{\sqrt{n_{1\bullet}\, n_{0\bullet}\, n_{\bullet 0}\, n_{\bullet 1}}},
\]

where $n_{ab}$ is the count of observations with $x = a$ and $y = b$; when $a$ or $b$ is replaced by $\bullet$, the count covers both values (0 and 1) of $x$ or $y$, respectively. Tokens whose $\phi$ exceeds 0.1 in absolute value are selected. Following this rule, 46 tokens are selected; part of them are shown in Table 1.

Table 1 Selected token features and their correlations (partial list)

Features   Correlation   Features     Correlation   Features   Correlation
traffic    0.246         66           0.169         virginia   0.146
near       0.208         accidently   -0.159        glad       0.146
car        0.194         closed       0.158         95         0.146
bad        0.172         major        0.158         270        0.146
495        0.169         accident     0.154         lanes      0.135

From Table 1, some of the tokens may be accounted for by geographic uniqueness, such as virginia; some point to road names, such as 95 and 270; others belong to relevant topics, such as traffic and accident.

Potentially, the co-occurrence of certain keyword pairs in a tweet may indicate the existence of a traffic accident. We further select features from paired tokens by studying the association rules between the paired tokens and the manual label. The association rules can be unveiled by the Apriori algorithm, which finds regularities in large-scale binary data using two major probabilities, confidence and support:

\[
\mathrm{conf}(l_i \Leftarrow t_{j_1} \wedge t_{j_2} \wedge \dots \wedge t_{j_m}) = \frac{\mathrm{supp}(l_i \wedge t_{j_1} \wedge t_{j_2} \wedge \dots \wedge t_{j_m})}{\mathrm{supp}(t_{j_1} \wedge t_{j_2} \wedge \dots \wedge t_{j_m})},
\qquad
\mathrm{supp}(t_j) = \frac{\mathrm{sizeof}(\{T_i : t_j \in T_i\})}{\mathrm{sizeof}(\{T_i\})},
\]

where $l_i$ is the manual label, $t_j$ is a token and $T_i$ is a tweet. By setting the support threshold to 0.1 and the confidence threshold to 0.5, our results show that most paired tokens contain accident. A paired feature of a given tweet equals 1 if the tweet contains the corresponding paired tokens and 0 otherwise. Part of the features are shown in the following list:

loop inner; accident mile; accident road; accident exit; accident major; accident close; accident south; accident left; accident bad; loop accident; accident involve; accident block lane

4. Feature selection based on traffic data

For each detector, we evenly divide the traffic occupancy into N separate groups. For each traffic occupancy group, we take the median of the corresponding traffic flow values as the traffic signature. We use the median because it is less affected by outliers than the mean. The traffic signature of a detector $d$ is defined as the vector of these traffic flow values:

\[
F^{d} = (F^{d}_{1}, F^{d}_{2}, \dots, F^{d}_{o}, \dots, F^{d}_{N}),
\]

where $F^{d}_{o}$ is the median value of traffic flow given a range of occupancy $o$ in detector $d$. For each detector, the traffic signature is therefore a vector of N traffic flow values.
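As an illustration only, the construction of a traffic signature described above might be sketched in Python as follows. The sketch assumes that occupancy is expressed in percent (0 to 100) and that "evenly divide" means equal-width occupancy groups; the function name `traffic_signature`, the parameter `occ_max` and the toy records are hypothetical and do not come from the study data.

```python
from statistics import median


def traffic_signature(records, n_bins, occ_max=100.0):
    """Return a list of N values: the median flow in each occupancy group.

    records: iterable of (occupancy, flow) pairs for one detector.
    ASSUMPTION: groups are equal-width bins over [0, occ_max];
    a group with no observations is reported as None.
    """
    width = occ_max / n_bins
    bins = [[] for _ in range(n_bins)]
    for occ, flow in records:
        # Clamp the index so that occupancy == occ_max falls in the last group.
        idx = min(int(occ / width), n_bins - 1)
        bins[idx].append(flow)
    return [median(b) if b else None for b in bins]


# Toy (occupancy %, flow) records, made up for illustration.
records = [(5, 100), (8, 120), (9, 500), (30, 300), (35, 320), (90, 40)]
signature = traffic_signature(records, n_bins=4)
print(signature)
```

In the first group the flow value 500 is an outlier, and the median (120) is unaffected by it, which is the reason given above for preferring the median to the mean.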
We assume that the relationship between traffic flow and occupancy at a given location does not change over time. However, the traffic signatures of different detectors may be quite different. Detectors with similar traffic signatures should be clustered into the same group. To cluster the traffic signatures of the detectors, we employ the K-means algorithm without predefining the clustering centers and the number of clusters. We finally cluster the nearly 15,000 detectors into 15 different clusters (groups). For each cluster, the traffic flows over a specified occupancy interval are distributed around their cluster center. Further, the outliers can be quantified by a probabilistic method that measures their degree of deviation. Our empirical examinations show that the distribution of the traffic flow in a particular cluster and occupancy interval follows a Gaussian distribution, as shown in Figure 2. Therefore, the traffic outliers can be quantified by the probability $P_{dt}$ of a traffic accident, where $d$ is the detector and $t$ is the time.
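The text does not give a closed form for $P_{dt}$, so the following Python sketch shows only one plausible reading of a Gaussian deviation measure, and the clustering step is not reproduced. It assumes that the flows of a cluster within one occupancy interval are summarized by their sample mean and standard deviation, and that the deviation of an observed flow is scored by the probability mass lying closer to the mean than the observation. The function name `deviation_probability` and the toy values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def deviation_probability(flow, reference_flows):
    """Score how strongly `flow` deviates from the reference flows.

    ASSUMPTION: the reference flows (one cluster, one occupancy interval)
    are Gaussian. The score is P(|Z| < |z|) for a standard normal Z,
    so it is 0 at the mean and approaches 1 for strong outliers.
    This is an illustrative stand-in for P_dt, not the paper's definition.
    """
    mu = mean(reference_flows)
    sigma = stdev(reference_flows)
    if sigma == 0:
        raise ValueError("reference flows have zero variance")
    z = abs(flow - mu) / sigma
    return 2.0 * NormalDist().cdf(z) - 1.0


# Toy reference flows for one cluster and occupancy interval (made up).
reference_flows = [100, 110, 90, 105, 95]
p_typical = deviation_probability(100, reference_flows)  # flow at the mean
p_outlier = deviation_probability(130, reference_flows)  # flow far from the mean
print(p_typical, p_outlier)
```

Under this reading, a flow equal to the cluster mean scores 0 and the score grows monotonically with the distance from the mean.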
Figure 2 (a) Comparison between the cluster centers and the original traffic flow and occupancy data in one detector; (b) the traffic flow distribution over a range of occupancy.

To better quantify the traffic influence associated with a tweet post, we mainly study the traffic-related information within certain spatial and temporal ranges of the tweet. Based on these traffic data, two features are then generated for our classification model for each tweet, the mean and the 75th percentile value of $P_{dt}$:

\[
p_{\mathrm{traffic}} = \frac{1}{\mathrm{NUM}} \sum_{t \in \mathrm{dom}(t)} \sum_{d \in \mathrm{dom}(d)} P_{dt},
\qquad
q_{\mathrm{traffic}} = Q_3\left(\{P_{dt} : d \in \mathrm{dom}(d),\ t \in \mathrm{dom}(t)\}\right),
\]

where $\mathrm{dom}(d)$ and $\mathrm{dom}(t)$ are the detectors and time intervals within the spatial and temporal ranges of the tweet, $\mathrm{NUM}$ is the number of $P_{dt}$ values in the sum, and $Q_3$ is the third quartile.

5. Results and Conclusions

Figure 3 Comparisons of accuracy and precision using different features (single token; single token + traffic data; single token + paired token; all features); precision 1 is for accident, precision 2 is for non-accident.

The results indicate that paired tokens, which can capture the association rules, can increase the accuracy of traffic accident detection. We further apply our model with the features of single and paired tokens to predict the accident label of all tweets in NOVA. For each accident-related tweet, we compare the prediction results with the traffic management log maintained by VDOT. The time and location differences between the tweets and their nearest accident records are shown in Figure 4.
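The comparison between tweets and accident records described above might be sketched in Python as follows. The matching rule is not specified in the text, so the sketch assumes that the "nearest" record is the spatially closest one among the records within a time window around the tweet; the window `max_minutes`, the dictionary keys and the toy coordinates are hypothetical. Distances use the standard haversine formula with a mean Earth radius of 6,371 km.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearest_record(tweet, records, max_minutes=60.0):
    """Return (time difference in minutes, distance in meters) or None.

    ASSUMPTION: only records within +/- max_minutes of the tweet are
    candidates, and the spatially closest candidate is chosen.
    A negative time difference means the tweet precedes the record.
    """
    best = None
    for rec in records:
        dt_min = (tweet["time"] - rec["time"]).total_seconds() / 60.0
        if abs(dt_min) > max_minutes:
            continue
        dist = haversine_m(tweet["lat"], tweet["lon"], rec["lat"], rec["lon"])
        if best is None or dist < best[1]:
            best = (dt_min, dist)
    return best


# Toy tweet and accident records, made up for illustration.
tweet = {"time": datetime(2014, 3, 1, 8, 10), "lat": 38.90, "lon": -77.20}
records = [
    {"time": datetime(2014, 3, 1, 8, 20), "lat": 38.90, "lon": -77.21},
    {"time": datetime(2014, 3, 1, 8, 5), "lat": 38.95, "lon": -77.20},
    {"time": datetime(2014, 3, 1, 12, 0), "lat": 38.90, "lon": -77.2001},
]
result = nearest_record(tweet, records)
print(result)
```

In this toy case the third record is the closest in space but falls outside the time window, so the first record is matched and the tweet precedes it by ten minutes.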
Figure 4 (a) Time difference (min) and (b) space difference (meter) between the accident-related tweets and the accident records by VDOT, each shown as a density.

We can conclude that, first, tweets sometimes reflect a traffic accident much faster than the traditional methods; second, tweets can sometimes capture mild accidents that do not draw the attention of traffic police, and can therefore make up for the deficiencies of the traffic management log. There are also problems with the tweet-based accident-prediction method: the locations are not very accurate, and the coverage of tweets is not high enough to cover all traffic accidents in the whole area.

References

Cervero, R., 1994. Rail transit and joint development: Land market impacts in Washington, DC and Atlanta. Journal of the American Planning Association 60, 83-94.

Chen, P.-T., Chen, F., Qian, Z., 2014. Road traffic congestion monitoring in social media with hinge-loss Markov random fields, Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, pp. 80-89.

Gu, Y., Qian, Z.S., Chen, F., 2016. From Twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies 67, 321-342.

Hall, D.L., Llinas, J., 1997. An introduction to multisensor data fusion. Proceedings of the IEEE 85, 6-23.