The project aims to develop a generic approach for anomaly detection in temporal networks and evaluate it for event detection in urban mobility networks.
|City||No. of Temporal Points||Total Ridership Collected||Average Ridership (per station per day)||Total Number of Nodes (Stations/Taxi Zones)||Total Number of Edges|
|Washington DC Taxi||730||19062827||508||96||3668|
|New York Taxi||729||540472732||2815||263||65792|
|City||Timeframe||Website||No. of Records Obtained|
|Chicago||2015-01-01 to 2017-08-01||https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew/data||170717|
|Washington DC||2016-01-01 to 2017-12-31||http://opendata.dc.gov/search?q=taxi||678485|
|Taipei||2017-01-01 to 2018-09-30||https://data.taipei/dataset/detail/metadata?id=63f31c7e-7fc3-418b-bd82-b95158755b4d||7374816|
|New York City||2017-01-01 to 2018-12-31||https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page||21380658|
|City||Extreme Weather||National Holiday||Culture Event|
|New York City||49||21||18|
|City||Timeframe||Website|
|Chicago||2015-01-01 to 2017-08-01||https://www.wunderground.com/history/monthly/us/il/chicago/KORD/date|
|Washington DC||2016-01-01 to 2017-12-31||https://www.wunderground.com/history/monthly/us/dc/washington/KDCA/date|
|Taipei||2017-01-01 to 2018-09-30||https://www.wunderground.com/history/monthly/tw/songshan-district/RCSS/date|
|New York City||2017-01-01 to 2018-12-31||https://www.wunderground.com/history/daily/us/ny/new-york-city/KLGA/date|
We created synthetic data primarily for two reasons.
► To inject artificial anomalies of different kinds for diagnosing the models.
► To generate large volumes of data to enable training of a deep autoencoder.
Since the data has strong correlations between columns, we could not fit a distribution to each column and sample from it independently. Instead, we performed PCA to extract uncorrelated latent variables and fitted a Gaussian distribution to each of them. Synthetic data was then created by sampling from these distributions and applying the inverse PCA transform to map the samples back into the network domain. Finally, the following types of anomalies were injected in the network domain:
► Global anomalies where the entire network witnesses a shift in the ridership.
► Balanced anomalies where some portions of the network experience shifts in ridership but the aggregated ridership is not affected on average.
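The generation and injection steps described above can be sketched roughly as follows. This is a minimal illustration using NumPy only, not the project's actual code: the function names, the toy data, the number of components, and the exact way shifts are applied (a multiplicative shift for global anomalies, an exact zero-sum transfer for balanced ones) are all assumptions made for the example.

```python
import numpy as np

def fit_pca_gaussian(X, n_components):
    """Fit PCA via SVD and one Gaussian (mean, std) per latent component."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    Z = Xc @ components.T  # latent scores, uncorrelated by construction
    return mean, components, Z.mean(axis=0), Z.std(axis=0, ddof=1)

def sample_synthetic(mean, components, z_mean, z_std, n_samples, rng):
    """Sample each latent variable independently, then apply inverse PCA."""
    Z_new = rng.normal(z_mean, z_std, size=(n_samples, len(z_mean)))
    return Z_new @ components + mean

def inject_global(X, days, shift):
    """Global anomaly (assumed form): scale every edge on the given days."""
    out = X.copy()
    out[days] *= 1.0 + shift
    return out

def inject_balanced(X, days, up_edges, down_edges, amount):
    """Balanced anomaly (assumed form): move `amount` of ridership from
    down_edges to up_edges so the network total stays unchanged."""
    out = X.copy()
    out[np.ix_(days, up_edges)] += amount / len(up_edges)
    out[np.ix_(days, down_edges)] -= amount / len(down_edges)
    return out

rng = np.random.default_rng(0)
# Toy stand-in for real ridership: 200 days x 6 edges with correlated columns.
base = rng.normal(100, 10, size=(200, 1))
X_real = base + rng.normal(0, 2, size=(200, 6))

params = fit_pca_gaussian(X_real, n_components=3)
X_syn = sample_synthetic(*params, n_samples=1000, rng=rng)
X_glob = inject_global(X_syn, days=[10, 11], shift=0.3)
X_bal = inject_balanced(X_syn, days=[20], up_edges=[0, 1],
                        down_edges=[2, 3, 4], amount=30.0)
```

Because every latent variable is sampled from a Gaussian and mapped back linearly, data generated this way contains only linear correlations between features. The sketch also does not clip negative values, which a real ridership generator would probably need to handle.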
Experiments on real-world data showed that community detection outperforms spatial aggregation because it also considers the topological structure and connectivity of the network. Furthermore, time series analysis of daily aggregated ridership does well at isolating anomalies with a global impact but fails to isolate localized anomalies. Further experiments will yield a deeper diagnosis of the performance of these techniques.
Experiments on synthetic data show that decomposition approaches (PCA and autoencoder) perform better than crude network aggregation. However, these experiments have not revealed any advantage of the autoencoder over PCA. This is plausible: autoencoders provide an advantage in modeling complex nonlinear relationships, but the data generation process was based on PCA and only had linear correlations between features. We will further refine the synthetic data generation process and try to inject different types of anomalies that disrupt distinctive spatial and temporal patterns at different scales. This will provide detailed diagnostics of the comparative capabilities of the different methodologies in isolating different kinds of anomalies.
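To illustrate why a decomposition approach can catch anomalies that network-wide aggregation misses, the sketch below compares two simple anomaly scores on toy data. It is an assumed setup, not the scoring used in the project: the aggregate score is an absolute z-score of total daily ridership, the PCA score is the squared reconstruction error per day, and the data, the anomaly size, and the single retained component are all hypothetical.

```python
import numpy as np

def aggregate_scores(X):
    """Score each day by the absolute z-score of total network ridership."""
    total = X.sum(axis=1)
    return np.abs(total - total.mean()) / total.std()

def pca_residual_scores(X, n_components):
    """Score each day by its squared PCA reconstruction error."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]
    residual = Xc - (Xc @ W.T) @ W  # part not explained by the top components
    return (residual ** 2).sum(axis=1)

rng = np.random.default_rng(1)
# Toy network: 300 days x 8 edges driven by one shared city-wide factor.
base = rng.normal(100, 10, size=(300, 1))
X = base + rng.normal(0, 2, size=(300, 8))

# Balanced anomaly on day 50: two edges gain what two other edges lose.
X_anom = X.copy()
X_anom[50, [0, 1]] += 40.0
X_anom[50, [2, 3]] -= 40.0

agg = aggregate_scores(X_anom)
pca = pca_residual_scores(X_anom, n_components=1)
print("most anomalous day by PCA residual:", int(np.argmax(pca)))
```

In this toy case the daily totals are identical with and without the anomaly, so the aggregate score cannot react to it, while the redistribution across edges shows up in the PCA residual.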
We'd like to thank Prof. Stanislav Sobolevsky, who provided guidance and assistance throughout the whole project.