On-site Traffic Accident Detection with Both Social Media and Traffic Data


Zhenhua Zhang
Civil, Structural and Environmental Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA

Qing He¹
Civil, Structural and Environmental Engineering and Industrial and Systems Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA
Email: qinghe@buffalo.edu

1. Introduction

Social media is receiving increasing attention as a source of crowdsourced information for traffic operations and management. One recent line of study uses social media (e.g. tweets) to detect on-site traffic accidents (Gu et al., 2016). However, the shortcomings of using tweets directly as a traffic accident detector are almost as obvious as the merits. Two major challenges must be addressed before tweets can be used for traffic accident detection.

First, compared with events that arouse enormous public concern, such as key basketball games, extreme weather or traditional festivals, the influence of a traffic accident is comparatively small. From our observation, tweets related to traffic accidents are therefore few in number. What's more, most of them are confined to a small area and to a relatively short time interval.

Second, tweet data are inherently complex and unstructured, and their language is ambiguous (Chen et al., 2014). A tweet is limited to 140 characters, which is not long enough for accurate automatic language processing based on a few keyword pairs. For example, "internet traffic is slow" and "internet shows traffic is slow" may deliver totally different information. In addition, it remains unknown how effective social-media-based detection methods are compared with traditional loop-detector-based methods.

To address the above challenges, we propose a method that combines traffic-related metrics and tweet information for accurate real-time detection of on-site traffic accidents.
In principle, the fusion of multi-source data provides significant advantages over single-source data (Hall and Llinas, 1997), and the integration of data sources is expected to produce more synthetic and informative results.

2. Data Description and Models

The study area is the vast road network of Northern Virginia (NOVA), an area long known for its heavy traffic (Cervero, 1994). It is a typical rural road network with more than 1,200 signalized intersections. Each intersection is equipped with an average of 12 lane-based loop detectors, for a total of nearly 15,000 detectors in NOVA.

(1) Traffic data, including traffic flow and occupancy, were collected by these loop detectors at 15-minute intervals for 12 months, from January 2014 to December 2014.

(2) Tweet data were collected through the Twitter Streaming API with a geo-location filter. Filtering by coordinates, we extracted only tweets posted from the NOVA region. There are more than 584,000 tweets for the year 2014, all of which carry specific date and location information.

(3) Accident data were extracted from the traffic incident database maintained by the Virginia Department of Transportation (VDOT), with detailed time and location information. The accidents include collisions, disabled vehicles, vehicles on fire, etc.

¹ Corresponding author.

We employ support vector machines (SVMs) as our classification model. We first label the tweets manually and then select features separately from the traffic data and the tweet data. In training the model, we further apply 5-fold cross-validation to improve the accuracy of the fitted model.

3. Feature selection based on tweet data

Tweet features are extracted from the keywords of candidate tweets. From the whole tweet database, we obtain nearly 1,500 eligible tweets according to the following rules:

- Include tweets that contain any of: accident, incident, crash, collision, head on, damage, pile up, rear end, rear-end, sideswipe, lost control, rolled over, roll over, tailgating.
- Include words that are relevant to accidents but apparently misspelled or personally modified, such as acident, incdent, etc.
- Include other variations of accident-related words, such as hyphenated word pairs (roll-over, etc.).
- Exclude words related to transportation authorities or news media.

In single feature selection, each tweet is further decomposed into separate words, which we call tokens. We then select the useful token features in three steps: stop-word filtering, keyword stemming and correlated-word filtering. The process is shown in Figure 1.

Figure 1 Steps of token filtering and stemming. [Flow diagram: tokenization, stop-word filtering and stemming applied to a tweet database T, with example tweets T1: "I saw a traffic accident in front." and T2: "Car damage on Route 1."]

In correlated-word filtering, we select the tokens that may correlate with our traffic accident label. The correlation benchmark we choose is the phi coefficient.
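As a rough illustration of correlated-word filtering, the sketch below scores each token by its phi coefficient against the manual label, using the standard 2x2 contingency-table definition for two binary variables. The tokenizer, the function names and the toy tweets are hypothetical and are not taken from the paper; stop-word filtering and stemming are omitted for brevity.

```python
import math
import re


def tokenize(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs, one entry per tweet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient from a 2x2 table; n_ab counts cases with x = a and y = b."""
    n1_, n0_ = n11 + n10, n01 + n00  # totals for x = 1 and x = 0
    n_1, n_0 = n11 + n01, n10 + n00  # totals for y = 1 and y = 0
    denom = math.sqrt(n1_ * n0_ * n_1 * n_0)
    if denom == 0:
        return 0.0  # degenerate table: the variable or the label never varies
    return (n11 * n00 - n10 * n01) / denom


def token_phi(tweets, labels):
    """Score every token: x = token present in tweet, y = accident label (0/1)."""
    token_sets = [tokenize(t) for t in tweets]
    vocab = set().union(*token_sets)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for tok in vocab:
        n11 = sum(1 for s, y in zip(token_sets, labels) if tok in s and y == 1)
        n10 = sum(1 for s, y in zip(token_sets, labels) if tok in s and y == 0)
        scores[tok] = phi_coefficient(n11, n10, n_pos - n11, n_neg - n10)
    return scores


def select_tokens(scores, threshold=0.1):
    """Keep tokens whose phi exceeds the threshold (0.1 is the value in the text)."""
    return {tok for tok, phi in scores.items() if phi > threshold}


# Toy data, made up for illustration only.
toy_tweets = [
    "traffic accident on route 1",
    "accident near exit",
    "glad the game is on",
    "nice game today",
]
toy_labels = [1, 1, 0, 0]
toy_scores = token_phi(toy_tweets, toy_labels)
```

On this toy data a token that appears only in accident tweets scores 1, one that appears only in other tweets scores -1, and one split evenly between the two classes scores 0.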
The coefficient (usually denoted as $\phi$) between two binary variables $x$ and $y$ is calculated as:

$$\phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1\bullet}\, n_{0\bullet}\, n_{\bullet 0}\, n_{\bullet 1}}},$$

where $n_{ab}$ is the count of cases with $x = a$ and $y = b$; when $a$ or $b$ is replaced by $\bullet$, the count covers both values of $x$ or $y$. Tokens whose $\phi$ is higher than 0.1 are selected. Following this rule, 46 tokens are selected, part of which are shown in Table 1.

Table 1 Part of the selected tokens and their correlations

Features   Correlation   Features     Correlation   Features   Correlation
traffic    0.246         66           0.169         virginia   0.146
near       0.208         accidently   -0.159        glad       0.146
car        0.194         closed       0.158         95         0.146
bad        0.172         major        0.158         270        0.146
495        0.169         accident     0.154         lanes      0.135

From Table 1, some of the tokens can be accounted for by geographic uniqueness, such as "virginia"; some refer to road names, such as "95" and "270"; others are topic-relevant words, such as "traffic" and "accident".

Potentially, the co-occurrence of certain keyword pairs in a tweet may indicate the existence of a traffic accident. We further select features from paired tokens by studying the association rules between the paired tokens and the manual label. The association rules can be unveiled by the Apriori algorithm, which finds regularities in large-scale binary data using two major probabilities, confidence and support:

$$\mathrm{conf}(l_i \mid t_{j_1} t_{j_2} \cdots t_{j_m}) = \frac{\mathrm{supp}(l_i\, t_{j_1} t_{j_2} \cdots t_{j_m})}{\mathrm{supp}(t_{j_1} t_{j_2} \cdots t_{j_m})}; \qquad \mathrm{supp}(t_j) = \frac{\mathrm{sizeof}(\{T_i : t_j \in T_i\})}{\mathrm{sizeof}(\{T_i\})},$$

where $l_i$ is a label, $t_j$ is a token and $T_i$ is a tweet. By setting the support equal to 0.1 and the confidence equal to 0.5, our results show that most paired tokens contain "accident". A paired feature of a given tweet equals 1 if the tweet contains the corresponding paired tokens and 0 otherwise. Part of the features are shown in the following list:

loop inner; accident mile; accident road; accident exit; accident major; accident close; accident south; accident left; accident bad; loop accident; accident involve; accident block lane

4. Feature selection based on traffic data

For each detector, we evenly divide the traffic occupancy into $N$ separate groups. For each occupancy group, we take the median of the corresponding traffic flow values as the traffic signature. We use the median because it is less affected by outliers than the mean. The traffic signature of a detector $d$ is defined as the vector of these traffic flow values, that is:

$$F_d = (F_1^d, F_2^d, \ldots, F_o^d, \ldots, F_N^d),$$

where $F_o^d$ is the median value of traffic flow for occupancy range $o$ at detector $d$. For each detector, the traffic signature is thus a vector of $N$ traffic flow values.
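The traffic signature can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes occupancy is a percentage in [0, 100] split into equal-width groups, and it returns None for a group with no observations. The function name, the parameter names and the sample readings are hypothetical.

```python
from statistics import median


def traffic_signature(occupancy, flow, n_groups=10, occ_max=100.0):
    """Median traffic flow per equal-width occupancy group for one detector.

    Assumption: occupancy values lie in [0, occ_max]. Groups without any
    observation yield None instead of a median.
    """
    width = occ_max / n_groups
    groups = [[] for _ in range(n_groups)]
    for occ, f in zip(occupancy, flow):
        # Clamp so that occ == occ_max falls into the last group.
        idx = min(int(occ // width), n_groups - 1)
        groups[idx].append(f)
    return [median(g) if g else None for g in groups]


# Made-up readings for one detector, for illustration only.
sample_occupancy = [5, 10, 20, 30, 40, 100]
sample_flow = [100, 120, 300, 400, 420, 50]
sample_signature = traffic_signature(sample_occupancy, sample_flow, n_groups=4)
```

With four groups of width 25, the first three readings fall into the first group, the next two into the second, none into the third, and the last reading into the fourth.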
We assume that the relationship between traffic flow and occupancy at a given location does not change over time. However, the traffic signatures of different detectors may be quite different, and detectors with similar traffic signatures should be clustered into the same group. To cluster the traffic signatures of the detectors, we employ the K-means algorithm without predefining the cluster centers or the number of clusters. We finally cluster the nearly 15,000 detectors into 15 different clusters (groups).

For each cluster, the traffic flows over a specified occupancy interval are distributed around their cluster center. Further, outliers can be quantified by a probabilistic method that measures their degree of deviation. Our empirical examination shows that the traffic flow in a particular cluster and occupancy interval follows a Gaussian distribution, as shown in Figure 2. Therefore, traffic outliers can be quantified by the probability $P_{dt}$ of a traffic accident, where $d$ is the detector and $t$ is the time.
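The text does not give the exact formula for $P_{dt}$, so the following is only one possible reading, sketched under stated assumptions: a Gaussian is fitted to the flows of one cluster and occupancy interval, and the deviation of a new observation is scored as the probability mass lying closer to the mean than that observation, i.e. $\mathrm{erf}(|z|/\sqrt{2})$. The function names and this scoring rule are hypothetical.

```python
import math
from statistics import mean, stdev


def fit_gaussian(flows):
    """Sample mean and standard deviation of the flows in one
    (cluster, occupancy interval) cell. Needs at least two values."""
    return mean(flows), stdev(flows)


def deviation_probability(flow, mu, sigma):
    """Assumed score: P(|X - mu| < |flow - mu|) for X ~ N(mu, sigma^2).

    Values near 0 mean the flow is typical for the cell; values near 1
    mean it lies far in the tails. This is an illustrative choice, not a
    formula taken from the paper.
    """
    if sigma == 0:
        return 0.0 if flow == mu else 1.0
    z = abs(flow - mu) / sigma
    return math.erf(z / math.sqrt(2))


# Made-up flows for one cell, for illustration only.
cell_mu, cell_sigma = fit_gaussian([10, 12, 14])
```

Under this rule an observation one standard deviation from the mean scores about 0.68 and one two standard deviations away scores about 0.95, matching the usual Gaussian coverage figures.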

Figure 2 (a) Comparisons between clustered centers and the original traffic flow and occupancy data in one detector; (b) The traffic flow distribution over a range of occupancy.

To better quantify the traffic influence associated with a tweet post, we mainly study the traffic-related information within certain spatial and temporal ranges. Based on these traffic data, two features are generated for our model for each tweet, the mean and the 75th percentile of $P_{dt}$:

$$p_{\mathrm{traffic}} = \frac{1}{NUM} \sum_{t \in dom(t)} \sum_{d \in dom(d)} P_{dt}; \qquad q_{\mathrm{traffic}} = Q3\left(\{P_{dt},\ d \in dom(d),\ t \in dom(t)\}\right),$$

where $dom(t)$ and $dom(d)$ are the temporal and spatial ranges around the tweet and $NUM$ is the number of summed terms.

5. Results and Conclusions

Figure 3 Comparisons of accuracy and precision using different features (single token; single token + traffic data; single token + paired token; all features); precision 1 is for accident, precision 2 is for non-accident. [Bar chart; bar labels as extracted, with their assignment to bars not recoverable: 0.839, 0.839, 0.681, 0.683, 0.829, 0.839, 0.788, 0.793, 0.76, 0.76, 0.526, 0.526.]

The results indicate that paired tokens, which can capture the association rules, can increase the accuracy of traffic accident detection. We further apply our model, with the features of single and paired tokens, to predict the accident label of all tweets in NOVA. For each accident-related tweet, we compare the prediction results with the traffic management log maintained by VDOT. The time and location differences between the tweets and their nearest accident records are shown in Figure 4.
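The comparison against the accident log can be sketched as below. The matching rule is not specified in the text, so this sketch makes its own assumptions: the nearest record is the one closest in great-circle (haversine) distance, and the time difference is tweet time minus record time in minutes, so a negative value means the tweet came first. The field names `lat`, `lon` and `time` and the sample coordinates are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_record(tweet, records):
    """Return (record, time difference in minutes, distance in meters).

    Assumption: "nearest" means smallest spatial distance; the time
    difference is tweet time minus record time.
    """
    def dist(r):
        return haversine_m(tweet["lat"], tweet["lon"], r["lat"], r["lon"])

    best = min(records, key=dist)
    dt_min = (tweet["time"] - best["time"]).total_seconds() / 60.0
    return best, dt_min, dist(best)


# Made-up tweet and accident records, for illustration only.
sample_tweet = {"lat": 38.9, "lon": -77.2, "time": datetime(2014, 5, 1, 8, 10)}
sample_records = [
    {"id": "A", "lat": 38.9, "lon": -77.2, "time": datetime(2014, 5, 1, 8, 30)},
    {"id": "B", "lat": 39.5, "lon": -77.2, "time": datetime(2014, 5, 1, 8, 0)},
]
match, minutes, meters = nearest_record(sample_tweet, sample_records)
```

A matching rule that also restricts candidates to a time window around the tweet would be a natural variation; the sketch keeps the spatial criterion only so that the two differences stay easy to read.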

Figure 4 (a) Time and (b) space difference between the accident-related tweets and the accident records by VDOT. [Density plots; x-axes: time difference (min) and space difference (meter).]

We can conclude that, first, tweets sometimes reflect a traffic accident much faster than traditional methods; second, tweets can sometimes capture mild accidents that do not attract the attention of traffic police, and can thus make up for deficiencies in the traffic management log. However, the tweet-based accident-prediction method also has problems: the locations are not very accurate, and the coverage of tweets is not high enough to cover all traffic accidents in the whole area.

References

Cervero, R., 1994. Rail transit and joint development: Land market impacts in Washington, DC and Atlanta. Journal of the American Planning Association 60, 83-94.

Chen, P.-T., Chen, F., Qian, Z., 2014. Road traffic congestion monitoring in social media with hinge-loss Markov random fields, Data Mining (ICDM), 2014 IEEE International Conference on. IEEE, pp. 80-89.

Gu, Y., Qian, Z.S., Chen, F., 2016. From Twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies 67, 321-342.

Hall, D.L., Llinas, J., 1997. An introduction to multisensor data fusion. Proceedings of the IEEE 85, 6-23.