GTFS Transit Delay Prediction

End-to-end machine learning solution for predicting public transport delays using GTFS-Realtime data, weather features, and a serverless AWS ETL pipeline.

Project Information

Category: Data
Status: In Progress
Type: Data Project

Technologies Used

Python, Scikit-learn, TensorFlow, XGBoost, PySpark, AWS S3, AWS Glue, AWS Lambda, CloudFormation, GTFS-Realtime

About This Project

Problem & Context

This project predicts public transport delays in the Stockholm transit system from weather conditions and real-time GTFS data, combining streaming transit data with rich meteorological features.

Architecture Overview

End-to-end, cloud-native pipeline that moves from raw GTFS-Realtime feeds to ML-ready datasets and model inference:

Local orchestrator downloads GTFS-RT feeds, extracts .pb files, and uploads them to S3.
AWS Glue PySpark job performs ETL, joins weather data, and engineers temporal & lag features.
Trained models and metrics are stored for downstream inference and evaluation.
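
The upload step in the pipeline above could be sketched roughly as follows. The date-partitioned key layout, the function names, and the feed name used in the usage note are assumptions made for illustration and are not taken from the project; `upload_file` is the standard boto3 S3 client method.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_s3_key(feed_name: str, fetched_at: datetime, filename: str) -> str:
    """Build a date-partitioned S3 key for one raw .pb file (hypothetical layout)."""
    day = fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/gtfs-rt/feed={feed_name}/date={day}/{filename}"


def upload_pb_files(s3_client, bucket: str, feed_name: str,
                    local_dir: Path, fetched_at: datetime) -> list[str]:
    """Upload every .pb file in local_dir and return the keys that were written.

    s3_client is expected to behave like a boto3 S3 client,
    e.g. boto3.client("s3").
    """
    keys = []
    for pb_file in sorted(local_dir.glob("*.pb")):
        key = build_s3_key(feed_name, fetched_at, pb_file.name)
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(str(pb_file), bucket, key)
        keys.append(key)
    return keys
```

A call such as `upload_pb_files(boto3.client("s3"), "my-bucket", "TripUpdates", Path("downloads"), datetime.now(timezone.utc))` would then place each file under a per-day prefix, which keeps later Glue reads restricted to the dates they need. The bucket and directory names here are placeholders.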

Key Components

ETL Pipeline with AWS Glue (PySpark) for large-scale data processing.
AWS Lambda functions for ETL orchestration, inference, and automated retraining.
Infrastructure as Code using CloudFormation templates for Glue, S3, and related IAM roles.
Monitoring via CloudWatch dashboards and alarms.
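
As an illustration of the Lambda-based ETL orchestration listed above, the following is a minimal sketch of a handler that starts a Glue job when new raw files arrive. It assumes the function is triggered by S3 event notifications; the job name `gtfs-delay-etl` and the `--input_keys` argument are hypothetical. `start_job_run` is the real boto3 Glue client call, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS access.

```python
import json


def lambda_handler(event, context, glue_client=None):
    """Start the Glue ETL job for newly uploaded raw files (S3 trigger assumed)."""
    if glue_client is None:
        import boto3  # imported lazily so the module loads without AWS dependencies
        glue_client = boto3.client("glue")

    # Collect object keys from the S3 event notification structure.
    keys = [record["s3"]["object"]["key"] for record in event.get("Records", [])]
    if not keys:
        return {"statusCode": 200, "body": json.dumps({"started": False})}

    response = glue_client.start_job_run(
        JobName="gtfs-delay-etl",                    # hypothetical job name
        Arguments={"--input_keys": ",".join(keys)},  # hypothetical job argument
    )
    body = {"started": True, "job_run_id": response["JobRunId"]}
    return {"statusCode": 200, "body": json.dumps(body)}
```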

Machine Learning

Binary classification task: predict whether a vehicle will be delayed by more than 180 seconds.
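
The target definition can be made concrete with a small sketch. The function and constant names are illustrative assumptions; the only detail taken from the description is the 180-second threshold, read here as a strict "greater than", so a delay of exactly 180 seconds counts as on time.

```python
DELAY_THRESHOLD_SECONDS = 180  # threshold stated in the task definition


def is_delayed(delay_seconds: float) -> int:
    """Return 1 if the delay exceeds the threshold, otherwise 0."""
    return int(delay_seconds > DELAY_THRESHOLD_SECONDS)


# Label a few example delays given in seconds; early arrivals are negative.
labels = [is_delayed(d) for d in (-30, 0, 180, 181, 600)]
```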

Models: Logistic Regression, Random Forest, Neural Network (MLP), XGBoost.
Feature engineering: temporal features, weather indicators, lagged delays, and moving averages.
Results: with lag-enhanced features, the best models reach a high AUC-ROC (about 0.94 for XGBoost).
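
To illustrate what lagged delays and moving averages mean here, the following plain-Python sketch computes both per route. It is not the project's Glue PySpark job: the field names (`route_id`, `timestamp`, `delay_seconds`), the grouping by route, and the window size are assumptions chosen for the example.

```python
from collections import defaultdict, deque


def add_lag_features(rows, window=3):
    """Add 'delay_lag_1' and 'delay_moving_avg' to each observation.

    rows: iterable of dicts with 'route_id', 'timestamp', 'delay_seconds'.
    Both features use only earlier observations on the same route, so the
    current delay (the prediction target) never leaks into its own features.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    enriched = []
    for row in sorted(rows, key=lambda r: (r["route_id"], r["timestamp"])):
        past = history[row["route_id"]]
        enriched.append({
            **row,
            "delay_lag_1": past[-1] if past else None,
            "delay_moving_avg": sum(past) / len(past) if past else None,
        })
        # Record the current delay only after its features are computed.
        past.append(row["delay_seconds"])
    return enriched
```

The first observation on each route has no history, so both features are `None` there; a real pipeline would have to decide whether to drop or impute such rows.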
