Problem Statements


A wide range of urban activities, including mobility, can be modeled as networks that evolve over time. These networks can capture changes in urban dynamics caused by events such as strikes and extreme weather, but identifying such events from temporal networks is a challenging problem, and it is the one we address in this research. Our approach is a topological aggregation of the network followed by dimensionality reduction using representation learning, which enables standard outlier detection in the resulting low-dimensional representation space. We will evaluate the methodology by its ability to identify specific urban events. We expect our research to produce a methodology for anomaly detection in temporal networks of urban mobility that outperforms legacy techniques and generalizes to other types of temporal networks. Our motivation for pursuing this problem is the belief that such a system can support early detection of potentially unsafe developments and enable a timely response.

The project aims to develop a generic approach for anomaly detection in temporal networks and evaluate it for event detection in urban mobility networks.
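
The three stages named above (topological aggregation, dimensionality reduction, outlier detection) can be illustrated with a minimal sketch. It is not the project's implementation: it assumes community labels are already available from some community detection step, uses plain PCA as a stand-in for the representation learning stage, and scores days by a variance-scaled distance in the embedding. All function names are hypothetical.

```python
import numpy as np

def aggregate_by_community(daily_od, labels):
    """Sum origin-destination flows between communities.

    daily_od: array (n_days, n_nodes, n_nodes) of daily trip counts.
    labels:   integer array (n_nodes,) of community ids 0..k-1
              (assumed to come from a separate community detection step).
    Returns an array (n_days, k * k) of flattened community-level flows.
    """
    k = int(labels.max()) + 1
    membership = np.eye(k)[labels]  # (n_nodes, k) one-hot membership matrix
    # M^T A M collapses node-level flows into community-level flows per day.
    agg = np.einsum("ik,dij,jl->dkl", membership, daily_od, membership)
    return agg.reshape(len(daily_od), k * k)

def pca_embed(features, n_components):
    """Project mean-centered features onto the top principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def outlier_scores(embedding):
    """Distance of each day from the center, scaled per component variance."""
    centered = embedding - embedding.mean(axis=0)
    var = centered.var(axis=0)
    var[var == 0] = 1.0  # guard against constant components
    return np.sqrt((centered ** 2 / var).sum(axis=1))
```

Days with the largest scores are the anomaly candidates; any standard outlier detector could replace `outlier_scores` once the data is in the low-dimensional space.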


Data

The urban mobility datasets were collected for multiple cities, including taxi ridership datasets for New York (USA), Washington DC (USA) and Chicago (USA), and subway ridership data for Taipei (Taiwan). We aggregated all of these datasets at the day level and transformed them into a convenient uniform format. The summary of the aggregated mobility datasets is listed below.

Chicago

943 Days
77 Taxi Zones

New York

729 Days
263 Taxi Zones

Washington

730 Days
96 Taxi Zones

Taipei

637 Days
108 Subway Stations

Data Summary

City | No. of Temporal Points | Total Ridership Collected | Average Ridership (per station per day) | Total Number of Nodes (Stations/Taxi Zones) | Total Number of Edges
Chicago Taxi | 943 | 14919326 | 282 | 77 | 1015
Washington DC Taxi | 730 | 19062827 | 508 | 96 | 3668
Taipei Subway | 637 | 1307013573 | 18969 | 108 | 11664
New York Taxi | 729 | 540472732 | 2815 | 263 | 65792
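
The day-level aggregation behind these figures can be sketched as follows. This is only an illustration: the record layout (an ISO 8601 pickup timestamp plus origin and destination zone ids) is a hypothetical uniform format, not the actual schema of any of the source datasets.

```python
from collections import Counter
from datetime import datetime

def aggregate_daily(trips):
    """Count trips per (day, origin zone, destination zone).

    trips: iterable of (pickup_timestamp, origin, destination) tuples, where
    pickup_timestamp is an ISO 8601 string (hypothetical record layout).
    Returns a Counter keyed by (date, origin, destination).
    """
    counts = Counter()
    for timestamp, origin, destination in trips:
        day = datetime.fromisoformat(timestamp).date()
        counts[(day, origin, destination)] += 1
    return counts
```

Each key with a nonzero count corresponds to one weighted edge of that day's network, so the sequence of days forms the temporal network.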

Data Source

City | Timeframe | Website | No. of Records Obtained
Chicago | 2015-01-01 to 2017-08-01 | https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew/data | 170717
Washington DC | 2016-01-01 to 2017-12-31 | http://opendata.dc.gov/search?q=taxi | 678485
Taipei | 2017-01-01 to 2018-09-30 | https://data.taipei/dataset/detail/metadata?id=63f31c7e-7fc3-418b-bd82-b95158755b4d | 7374816
New York City | 2017-01-01 to 2018-12-3 | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page | 21380658

Events that are global in nature can be identified relatively easily with legacy methods such as aggregated time series analysis, because their impact is visible across the entire network. The challenging problem we address in this study is detecting events that are local in nature yet significant enough to affect ridership in the overall network. To benchmark our method on events where legacy methods perform well, and to test it on the local events of interest, we selected a set of global and significant local events for this study. The types of events we considered are National Holidays, Cultural Events, Parades, Protests, and Extreme Weather.

The weather datasets were further processed to detect extreme weather conditions from weather readings. Days with temperature or precipitation above or below the threshold (1%) were marked as extreme weather conditions. For temperature, we also marked days that are more than 2 standard deviations above or below the rolling average of the last 10 days as local extreme weather conditions. The summary of the aggregated events data is presented below.

National Holidays

Cultural Events

Extreme Weather

Data Summary

City | Extreme Weather | National Holiday | Cultural Event
Chicago | 42 | 10 | 4
Washington DC | 43 | 41 | 46
Taipei | 63 | 30 | 5
New York City | 49 | 21 | 18

Weather Data Source

City | Timeframe | Website
Chicago | 2015-01-01 to 2017-08-01 | https://www.wunderground.com/history/monthly/us/il/chicago/KORD/date
Washington DC | 2016-01-01 to 2017-12-31 | https://www.wunderground.com/history/monthly/us/dc/washington/KDCA/date
Taipei | 2017-01-01 to 2018-09-30 | https://www.wunderground.com/history/monthly/tw/songshan-district/RCSS/date
New York City | 2017-01-01 to 2018-12-31 | https://www.wunderground.com/history/daily/us/ny/new-york-city/KLGA/date
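
The two extreme-weather marking rules described earlier can be sketched as below. Several details are assumptions made for illustration: the 1% threshold is read as the lowest and highest 1% of all readings, the rolling window covers the 10 days before the current day, and the standard deviation is computed over that same window. The function names are hypothetical.

```python
import numpy as np

def global_extremes(values, tail=0.01):
    """Flag days in the lowest or highest `tail` fraction of all readings."""
    values = np.asarray(values, dtype=float)
    low, high = np.quantile(values, [tail, 1.0 - tail])
    return (values < low) | (values > high)

def local_extremes(values, window=10, n_std=2.0):
    """Flag days deviating more than n_std standard deviations from the
    mean of the preceding `window` days. The first `window` days are never
    flagged because they lack a full history."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for t in range(window, len(values)):
        past = values[t - window:t]
        flags[t] = abs(values[t] - past.mean()) > n_std * past.std()
    return flags
```

A day would then count as extreme weather if either rule flags it, with `local_extremes` applied to temperature only.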

We created synthetic data for two main reasons:

► To inject artificial anomalies of different kinds for diagnosing the models.

► To generate large volumes of data to enable training of a deep autoencoder.

Since the data has strong correlations, we could not fit distributions and sample independently for each column. Instead, we performed PCA to extract uncorrelated latent variables and fitted Gaussian distributions to these variables. Data was then created by sampling from these distributions and applying the inverse PCA transform to map the samples back into the network domain. Finally, the following types of anomalies were injected in the network domain:

► Global anomalies where the entire network witnesses a shift in the ridership.

► Balanced anomalies where some portions of the network experience shifts in ridership but the aggregated ridership is not affected on average.
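
The generation and injection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: PCA is done with an SVD, each latent variable gets a zero-mean Gaussian with its empirical standard deviation, a global anomaly is modeled as a uniform scaling, and a balanced anomaly moves a fixed number of trips between two hypothetical sets of edges. Negative sampled counts are not clipped here.

```python
import numpy as np

def fit_pca_gaussian(data):
    """Fit PCA via SVD; return the mean, components, and latent std devs.

    data: array (n_days, n_features), one flattened network per day.
    """
    mean = data.mean(axis=0)
    _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    latent_std = s / np.sqrt(len(data) - 1)  # std of each latent variable
    return mean, vt, latent_std

def sample_synthetic(mean, components, latent_std, n_samples, rng):
    """Sample independent Gaussian latents and apply the inverse PCA map."""
    latents = rng.normal(0.0, latent_std, size=(n_samples, len(latent_std)))
    return latents @ components + mean

def inject_global(day, factor):
    """Global anomaly: scale ridership on every edge by the same factor."""
    return day * factor

def inject_balanced(day, edges_up, edges_down, amount):
    """Balanced anomaly: move `amount` trips from edges_down to edges_up,
    leaving the total ridership of the day unchanged."""
    shifted = day.copy()
    shifted[edges_up] += amount / len(edges_up)
    shifted[edges_down] -= amount / len(edges_down)
    return shifted
```

Because the latent variables are sampled independently and mapped back linearly, the synthetic data reproduces only the linear correlations of the original data.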



Conclusion


Experiments on real-world data showed that community detection outperforms spatial aggregation because it also accounts for the topological structure and connectivity of the networks. Furthermore, time series analysis of daily aggregates isolates anomalies with a global impact well but fails to isolate localized anomalies. Further experiments will yield a deeper diagnosis of the performance of these techniques.

Experiments on synthetic data show that decomposition approaches (PCA and autoencoder) perform better than crude network aggregation, but they have not revealed any advantage of the autoencoder over PCA. This is plausible: autoencoders offer an advantage in modeling complex nonlinear relationships, whereas the data generation process was based on PCA and contained only linear correlations between features. We will further refine the synthetic data generation process and inject different types of anomalies that disrupt distinctive spatial and temporal patterns at different scales. This will provide detailed diagnostics of the comparative capabilities of the different methodologies in isolating different kinds of anomalies.



Team Members

Urwa Muaz

Writeup Owner

Prof. Stan

Sponsor and head coach

Shivam Pathak

Algorithms Pipeline Owner

Mingyi He

Logistic Leader

Jingtian Zhou

Data Wizard

Saloni Saini

Codebase Manager


Acknowledgement

We'd like to thank Prof. Stanislav Sobolevsky, who provided guidance and assistance throughout the whole project.
