GTFS Transit Delay Prediction
End-to-end machine learning solution for predicting public transport delays using GTFS-Realtime data, weather features, and a serverless AWS ETL pipeline.
Project Information
- Category: Data
- Status: In Progress
- Type: Data Project
About This Project
Problem & Context
The project predicts public transport delays in the Stockholm transit system from weather conditions and real-time GTFS data, combining streaming transit feeds with meteorological features.
Architecture Overview
End-to-end, cloud-native pipeline that moves from raw GTFS-Realtime feeds to ML-ready datasets and model inference:
- Local orchestrator downloads GTFS-RT feeds, extracts .pb files, and uploads them to S3.
- AWS Glue PySpark job performs ETL, joins weather data, and engineers temporal & lag features.
- Trained models and metrics are stored for downstream inference and evaluation.
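The upload step in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the project's actual orchestrator: the key layout, the "raw/gtfs-rt" prefix, and the function names build_s3_key and upload_pb_files are all assumptions made for the example. Only the boto3 S3 client call upload_file is a real library API.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_s3_key(feed_name: str, fetched_at: datetime, filename: str,
                 prefix: str = "raw/gtfs-rt") -> str:
    """Return a date-partitioned S3 key for one extracted .pb file.

    The partition layout (feed/year/month/day/hour) is an assumed convention.
    """
    ts = fetched_at.astimezone(timezone.utc)  # normalize to UTC before partitioning
    return (
        f"{prefix}/feed={feed_name}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/{filename}"
    )


def upload_pb_files(local_dir: str, bucket: str, feed_name: str,
                    fetched_at: datetime) -> list[str]:
    """Upload every .pb file found in local_dir and return the keys written."""
    import boto3  # imported lazily so the key logic can run without boto3 installed

    s3 = boto3.client("s3")
    keys = []
    for path in sorted(Path(local_dir).glob("*.pb")):
        key = build_s3_key(feed_name, fetched_at, path.name)
        s3.upload_file(str(path), bucket, key)  # (Filename, Bucket, Key)
        keys.append(key)
    return keys
```

Partitioning keys by feed and hour is one common way to let a downstream Glue job read only the slices it needs; the real bucket layout may differ.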
Key Components
- ETL Pipeline with AWS Glue (PySpark) for large-scale data processing.
- AWS Lambda functions for ETL orchestration, inference, and automated retraining.
- Infrastructure as Code using CloudFormation templates for Glue, S3, and related IAM roles.
- Monitoring via CloudWatch dashboards and alarms.
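As a rough sketch of how a Lambda function could orchestrate the ETL step, the handler below starts one Glue job run when it receives an S3 event notification. The S3 trigger, the make_handler factory, the job name, and the "--input_path" job argument are assumptions for illustration; the source only states that Lambda handles orchestration. The start_job_run call and the S3 event shape follow the documented AWS interfaces.

```python
from urllib.parse import unquote_plus


def make_handler(glue_client, job_name: str):
    """Build a Lambda-style handler that starts a Glue job run for an S3 event.

    glue_client is injected so the handler can be exercised with a stub.
    """

    def handler(event, context=None):
        # S3 event notifications list the uploaded objects under "Records".
        records = event.get("Records", [])
        if not records:
            return {"started": False, "reason": "no records in event"}

        s3_info = records[0]["s3"]
        bucket = s3_info["bucket"]["name"]
        key = unquote_plus(s3_info["object"]["key"])  # keys arrive URL-encoded

        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--input_path": f"s3://{bucket}/{key}"},  # hypothetical argument
        )
        return {"started": True, "job_run_id": response["JobRunId"]}

    return handler
```

In a deployed function the client would typically come from boto3.client("glue"); passing it in keeps the handler testable without AWS credentials.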
Machine Learning
Binary classification task: predict whether a vehicle will be delayed more than 180 seconds.
- Models: Logistic Regression, Random Forest, Neural Network (MLP), XGBoost.
- Feature engineering: temporal features, weather indicators, lagged delays, and moving averages.
- Results: the best models reach a high AUC-ROC (about 0.94 for XGBoost) when trained with lag-enhanced features.
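The target and the lag features described above can be illustrated with a small plain-Python sketch. The project itself engineers these features in a Glue PySpark job, so this is a stand-in, not that code; the row schema (route_id, timestamp, delay_s), the column names, and the grouping by route are assumptions. Only the 180-second threshold comes from the task definition.

```python
from collections import defaultdict, deque

DELAY_THRESHOLD_S = 180  # "delayed" means more than 180 seconds late


def add_lag_features(rows, window=3):
    """Add a binary label, the previous delay, and a trailing mean per route.

    rows: dicts with 'route_id', 'timestamp', 'delay_s' (hypothetical schema).
    Only earlier observations feed the features, so the current delay
    never leaks into its own predictors.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for row in sorted(rows, key=lambda r: (r["route_id"], r["timestamp"])):
        past = history[row["route_id"]]
        enriched = dict(row)
        enriched["is_delayed"] = int(row["delay_s"] > DELAY_THRESHOLD_S)
        enriched["lag_1_delay_s"] = past[-1] if past else None
        enriched["rolling_mean_delay_s"] = sum(past) / len(past) if past else None
        out.append(enriched)
        past.append(row["delay_s"])  # update history only after computing features
    return out
```

The first observation on each route has no history, so its lag and rolling mean are left as None; how such rows are imputed or dropped before training is a separate choice.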