Optimizing Public Transit

Mindy Huang, Christopher Ling
CS229 with Andrew Ng

1 Introduction

Most applications of machine learning deal with technical challenges; the social sciences have seen much less use of it, and its results there tend to be more surprising. For this project we therefore chose to study what makes an optimal light rail transit system. We chose light rail systems (subways, bus rapid transit, etc.) because they are generally not tied to traffic, are costly to implement and therefore restricted to a limited number of stops at key destinations, and are the current focus in transit planning. First, we model the ridership of a system in a city. Then, we use our model to generate a light rail transit system that optimizes for ridership.

2 Model

We use linear regression to predict the ridership of a transit system. First, we look at each individual station within the city and its features (demographics of the area, points of interest nearby, etc.). Let $x_{ij}$ be the j-th feature of the i-th station. Then we represent the i-th station as:

$$\theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_m x_{im} = \sum_j \theta_j x_{ij}$$

To get a "score" for how good the transit system of the city is as a whole, we sum over all stations:

$$\text{score} = \sum_i \sum_j \theta_j x_{ij} = \sum_j \theta_j \sum_i x_{ij}$$

We then add features of the transit system as a whole (average distance between stations, etc.). Let $y_i$ be the i-th feature of the transit system as a whole (indexed separately from the stations). The model of the ridership of the entire system is then:

$$\text{ridership} = \underbrace{\sum_j \theta_j \sum_i x_{ij}}_{\text{features of stations}} + \underbrace{\sum_i \phi_i y_i}_{\text{features of system}}$$
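To make the model concrete, the following is a minimal sketch of how the ridership prediction could be computed from the station features and the system-wide features. The function names, the parameter values, and the toy feature values are illustrative assumptions, not the fitted values from this project.

```python
def station_score(theta, station_features):
    """Weighted sum of one station's features: sum_j theta_j * x_ij."""
    return sum(t * x for t, x in zip(theta, station_features))


def predict_ridership(theta, phi, stations, system_features):
    """Sum the station scores, then add the weighted system-wide features."""
    station_term = sum(station_score(theta, x_i) for x_i in stations)
    system_term = sum(p * y for p, y in zip(phi, system_features))
    return station_term + system_term


# Hypothetical toy values: two station features, one system feature.
theta = [2.0, 0.5]                    # weights for the station features
phi = [-1.0]                          # weight for the system feature
stations = [[3.0, 4.0], [1.0, 2.0]]   # one row of features per station
system_features = [1.5]               # e.g. average distance between stations

print(predict_ridership(theta, phi, stations, system_features))
```

Because the station weights are shared across stations, summing the scores station by station gives the same result as first summing each feature over all stations and then applying the weights.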

2.1 Features of each station

At each station, we have the following kinds of features:

Demographics - we obtained demographic data on the area around each station from the 2010 Census. Features include median age, race, employment status, average transit duration to work, etc.

Points of interest - we have a feature for the count of each place type within a 10 mile radius of each station. In other words, $x_1$ is the number of parks within 10 miles, $x_2$ is the number of sports stadiums within 10 miles, etc.

Some interesting findings: having airports, parks, and universities nearby correlates with high ridership, while working further away correlates with low ridership. We also found that having art galleries near stations correlates with higher ridership, possibly due to the influence of New York City.

2.2 Features of the system as a whole

Currently, we have only one feature for the system as a whole: the average distance between stations. However, this feature is extremely important for the second step of our project, when we generate our own transit system that maximizes ridership. Without a parameter that monitors the distance between stations, maximizing the ridership of the system is trivial: put an infinite number of stations on the same point.

Features we plan on implementing in the future include: the number of stations in the system per unit of area that the transit system covers; the coverage of the city the system serves; and a cost function to control the jaggedness of station placement, to mimic the routes of an actual transit system.

2.3 Fitting the model

To determine our parameters θ and φ, we implemented both the normal equations and gradient descent. Because the problem was ill-conditioned, the normal equations failed to yield a usable solution, so we continued with gradient descent.

3 Optimizing the model

Our greatest obstacle was the lack of training data: only 25 cities in the US have implemented light rail systems. As such, we applied several techniques to reduce our error.
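The gradient descent fit mentioned in Section 2.3 can be sketched as follows. This is a minimal illustration of batch gradient descent for least-squares linear regression, not the project's actual code; the function name, the toy data, the step size, and the iteration count are all assumptions chosen so the example converges quickly.

```python
def gradient_descent(X, y, alpha, iterations):
    """Fit theta by batch gradient descent on the mean squared error.

    X is a list of feature rows, y the list of targets, alpha the step size.
    """
    n, m = len(X), len(X[0])
    theta = [0.0] * m
    for _ in range(iterations):
        # Residual of the current prediction for every training example.
        residuals = [
            sum(t * x for t, x in zip(theta, row)) - target
            for row, target in zip(X, y)
        ]
        # Gradient of (1 / 2n) * sum(residual^2) with respect to each theta_j.
        grad = [
            sum(r * row[j] for r, row in zip(residuals, X)) / n
            for j in range(m)
        ]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data generated from y = 1 + 2 * x (first column is a bias).
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

# Should approach theta = [1, 2], the exact fit for this toy data.
print(gradient_descent(X, y, alpha=0.1, iterations=5000))
```

When features differ by several orders of magnitude, as they do for raw census and point-of-interest counts, a stable step size can become very small and convergence correspondingly slow.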

Regularization - Since we had such a small data set, it was imperative to regularize to prevent overfitting. We added a penalty to our model to smooth it out:

$$\text{ridership} = \theta x - \lambda \lVert \theta \rVert^2$$

Regularizing brought down our error by about 10 percent.

Leave-one-out cross validation - we used this to get a more accurate estimate of the generalization error of our model, as well as to tune our regularization and gradient descent coefficients. We found that $\alpha = 1.38 \times 10^{-8}$ and $\lambda = 0.021$ minimized our estimated generalization error. This improved our error by about 20 percent.

Feature selection - we used feature selection to remove features that increased error. We found that including Hindu temples, locksmiths, and taxi stands in our algorithm increased generalization error, most likely because they occur randomly with no real correlation to ridership. This improved our error by about 10 percent.

Logging - Our data set was extremely jagged: values ranged from less than 1 for racial percentages to tens of thousands for median income. To smooth it out, we took the log of the values. This worked surprisingly well, and brought our error down by over 100 percent.

With all of our optimization techniques, our estimated generalization error decreased to 30 percent.

4 Generating a Transit System

After fitting a reasonably accurate model, we moved on to the second step of our project: given data on a city, find the locations that optimize the ridership of the transit system if stations were built there. To do this, we discretized a city into block groups (a unit of geography used by the Census) and implemented a greedy algorithm that selects the block groups maximizing ridership according to our model. The algorithm runs until adding more stations begins to decrease ridership.

Below are comparisons between the actual transit systems and the optimal station locations according to our model. We generated systems for The Bronx in New York City, and for Austin. Since New York City has high ridership, we would expect our algorithm to output something similar to the existing system. And since the system in Austin has low ridership, running our model on Austin should output a system that is fairly different.
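The greedy selection described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_ridership` is a hypothetical stand-in for the fitted model (a sum of block values minus a penalty for stations placed too close together), and the block names, positions, and values are invented for the example.

```python
def greedy_select(candidates, ridership):
    """Add one candidate at a time, always the one that raises predicted
    ridership the most, and stop once no addition raises it further."""
    chosen = []
    best = ridership(chosen)
    remaining = list(candidates)
    while remaining:
        scored = [(ridership(chosen + [c]), c) for c in remaining]
        top_value, top_candidate = max(scored, key=lambda pair: pair[0])
        if top_value <= best:
            break  # every remaining station would not increase ridership
        chosen.append(top_candidate)
        remaining.remove(top_candidate)
        best = top_value
    return chosen, best


# Hypothetical block groups on a line: name -> (position, ridership value).
BLOCKS = {"A": (0.0, 5.0), "B": (0.5, 4.0), "C": (3.0, 3.0), "D": (3.2, 1.0)}


def toy_ridership(chosen):
    """Toy objective: block values minus 4.5 per pair of stations < 1.0 apart."""
    total = sum(BLOCKS[name][1] for name in chosen)
    for i in range(len(chosen)):
        for j in range(i + 1, len(chosen)):
            if abs(BLOCKS[chosen[i]][0] - BLOCKS[chosen[j]][0]) < 1.0:
                total -= 4.5
    return total


print(greedy_select(list(BLOCKS), toy_ridership))
```

The closeness penalty in the toy objective plays the role that the average-distance feature plays in the model: without it, every candidate with a positive value would be selected.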

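Returning to the error-reduction techniques of Section 3, the sketch below combines a log transform, an L2 penalty, and leave-one-out cross validation. It is an illustration only: the `log(1 + v)` offset, the use of mean relative error as the error measure, the function names, and the toy data are all assumptions rather than details taken from the project.

```python
import math


def log_features(rows):
    """Apply log(1 + v) to every value; the +1 offset (an assumption)
    keeps zero counts finite."""
    return [[math.log1p(v) for v in row] for row in rows]


def predict(theta, row):
    return sum(t * v for t, v in zip(theta, row))


def fit_ridge(X, y, alpha, lam, iterations=2000):
    """Batch gradient descent on (1 / 2n) * sum(residual^2) + lam * ||theta||^2."""
    n, m = len(X), len(X[0])
    theta = [0.0] * m
    for _ in range(iterations):
        residuals = [predict(theta, row) - target for row, target in zip(X, y)]
        grad = [
            sum(r * row[j] for r, row in zip(residuals, X)) / n
            + 2.0 * lam * theta[j]
            for j in range(m)
        ]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


def loocv_error(X, y, alpha, lam):
    """Hold out each example once, fit on the rest, and average the
    relative prediction error (targets must be nonzero)."""
    errors = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        theta = fit_ridge(X_train, y_train, alpha, lam)
        errors.append(abs(predict(theta, X[i]) - y[i]) / abs(y[i]))
    return sum(errors) / len(errors)


# Hypothetical toy data generated from y = 2 * x1 + 3 * x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [2.0, 3.0, 5.0, 7.0, 8.0]

# Compare candidate penalties by their leave-one-out error.
for lam in (0.0, 0.1, 1.0):
    print(lam, loocv_error(X, y, alpha=0.1, lam=lam))
```

On this noise-free toy data the unpenalized fit does best; the penalty is expected to help only when the training set is small and noisy, as in the project.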
Figure 1: Comparison of The Bronx. (a) Actual system in New York, approximately 320 rides/person/year. (b) Our generated system in New York.

Analysis of New York: As seen in Figure 1, our generated system creates stations in similar locations and hot spots to the current system, as expected. However, our stations are relatively clumped and jagged compared to the actual station locations, for two reasons: (1) we did not factor in a cost function to smooth the stations into distinct routes, as a real system would have; and (2) since our discretization was based on block groups, the locations to choose from were not uniform in shape and not fine-grained enough.

Figure 2: Comparison of Austin. (a) Actual system in Austin, approximately 0.34 rides/person/year. (b) Our generated system in Austin.

Analysis of Austin: Our generated system for Austin had several more stations than the current system. Most of them are along the heart of the city, and the outliers of the original system were done away with. Of particular interest is the station circled in red - note that it is next to several parks. This shows that in our model, parks correlate highly with ridership, which is fairly intuitive.

5 Assumptions and Restrictions

Due to the inherently fuzzy nature of the subject, we make a few assumptions to simplify our model.

Transit culture - we assume that transit culture is the same across all the cities we train on (that is, that Angelenos are just as willing to take public transit as New Yorkers if the transit system is optimal).

City boundaries - we assume that transit systems are self-enclosed within each city. In reality, they often cross boundaries.

Cost - when we create our own transit system at the end, we do not take into account the cost, monetary or political, of erecting a station at a given point. Political battles are extremely hard to quantify, and are probably the reason a station was not built at that point in the first place.

6 In Conclusion

We were pleasantly surprised at the outcome of our project: we did not expect such a good model, given the fuzziness of the problem and the simplicity of our model. Moreover, this project demonstrates the potential applicability of machine learning to the social sciences. In the future, we hope to further improve the model, and perhaps provide new insights into public transit systems.

7 Sources

The American Public Transportation Association provides information on revenue and ridership per transportation district.

Google's public transit feed provides the location and type of each station.

Google Places provides the points of interest around each station.

The 2010 Census and American Community Survey provide up-to-date information on demographics, population, age, and income. Retrieved from the National Historical Geographic Information System at the University of Minnesota.

Special thanks to Jeffrey Barrera, TA for Urban Design, and Peter Brownell, Ph.D., Research Director at the Center on Policy Initiatives. And Dave, a super helpful TA.