GTFS Transit Delay Prediction

End-to-end machine learning solution for predicting public transport delays using GTFS-Realtime data, weather features, and a serverless AWS ETL pipeline.

Project Information
  • Category: Data
  • Status: In Progress
  • Type: Data Project
Technologies Used
Python, Scikit-learn, TensorFlow, XGBoost, PySpark, AWS S3, AWS Glue, AWS Lambda, CloudFormation, GTFS-Realtime

About This Project

Problem & Context

The project predicts public transport delays in the Stockholm transit system from weather conditions and real-time GTFS data, combining streaming transit data with meteorological features.

Architecture Overview

The pipeline is end-to-end and cloud-native, moving from raw GTFS-Realtime feeds to ML-ready datasets and model inference:

  • Local orchestrator downloads GTFS-RT feeds, extracts .pb files, and uploads them to S3.
  • AWS Glue PySpark job performs ETL, joins weather data, and engineers temporal & lag features.
  • Trained models and metrics are stored for downstream inference and evaluation.
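The weather join and temporal features in the ETL step can be illustrated with a small dependency-free sketch. It is not the project's Glue job, which would express the same idea in PySpark. It assumes an "as-of" join, where each transit event receives the latest weather reading taken at or before it. The field names (`ts`, `trip_id`, `delay_s`, `temp_c`, `precip_mm`) are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def attach_weather(events, weather):
    """As-of join: give each event the latest weather reading at or before it.

    events  -- list of dicts with a "ts" key (Unix seconds); hypothetical schema
    weather -- list of dicts with "ts" plus weather fields, in any order
    """
    weather = sorted(weather, key=lambda w: w["ts"])
    weather_ts = [w["ts"] for w in weather]
    joined = []
    for event in events:
        # Index of the last weather reading whose timestamp is <= the event's.
        idx = bisect_right(weather_ts, event["ts"]) - 1
        row = dict(event)
        if idx >= 0:
            # Copy the weather fields but keep the event's own timestamp.
            row.update({k: v for k, v in weather[idx].items() if k != "ts"})
        # Temporal features derived from the event time (UTC assumed here).
        when = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        row["hour"] = when.hour
        row["weekday"] = when.weekday()  # Monday == 0
        joined.append(row)
    return joined


weather = [
    {"ts": 1_700_000_000, "temp_c": 2.5, "precip_mm": 0.0},
    {"ts": 1_700_003_600, "temp_c": 1.0, "precip_mm": 1.2},
]
events = [
    {"trip_id": "t1", "ts": 1_700_000_900, "delay_s": 45},
    {"trip_id": "t2", "ts": 1_700_004_000, "delay_s": 240},
]
rows = attach_weather(events, weather)
```

Events that precede every weather reading are left without weather fields, so a downstream step would need to decide whether to drop or impute them.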

Key Components

  • ETL Pipeline with AWS Glue (PySpark) for large-scale data processing.
  • AWS Lambda functions for ETL orchestration, inference, and automated retraining.
  • Infrastructure as Code using CloudFormation templates for Glue, S3, and related IAM roles.
  • Monitoring via CloudWatch dashboards and alarms.
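One way the Lambda-based ETL orchestration could look is sketched below. It assumes the function is triggered by S3 object-created events and starts a Glue job run for each uploaded .pb file. The trigger, the job name `gtfs-delay-etl`, and the `--input_path` argument are assumptions for illustration, not details taken from the project. The `glue_client` parameter is a hypothetical addition that lets the handler be exercised without AWS access.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "gtfs-delay-etl"  # hypothetical job name


def handler(event, context, glue_client=None):
    """Start one Glue job run per .pb object referenced in an S3 event."""
    if glue_client is None:
        import boto3  # imported lazily so the module loads without AWS deps

        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".pb"):
            continue  # ignore anything that is not a GTFS-RT protobuf file
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Glue job arguments are passed as "--name" keys.
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

Starting one run per file keeps the sketch short. A batch-oriented setup would more likely collect keys and start a single run per time window.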

Machine Learning

Binary classification task: predict whether a vehicle will be delayed more than 180 seconds.
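The target definition can be made concrete with a minimal sketch. The function name `is_delayed` is hypothetical, and the strict inequality follows the "more than 180 seconds" wording, so a delay of exactly 180 seconds counts as on time.

```python
DELAY_THRESHOLD_S = 180  # threshold from the task definition, in seconds


def is_delayed(delay_seconds: float) -> int:
    """Return 1 if the delay exceeds the threshold, else 0."""
    return int(delay_seconds > DELAY_THRESHOLD_S)


# Early arrivals (negative delays) and the boundary value map to class 0.
labels = [is_delayed(d) for d in (-30, 0, 180, 181, 600)]
```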

  • Models: Logistic Regression, Random Forest, Neural Network (MLP), XGBoost.
  • Feature engineering: temporal features, weather indicators, lagged delays, and moving averages.
  • Results: with lag-enhanced features, the best models reach a high AUC-ROC (about 0.94 for XGBoost).
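The lagged-delay and moving-average features can be sketched in plain Python as follows. This is an illustration under assumptions rather than the project's implementation: grouping by `route_id`, the window size of 3, and the field names are all hypothetical, and the Glue job would more naturally use PySpark window functions. Each feature is computed only from earlier observations on the same route, so the current row's own delay never leaks into its features.

```python
from collections import defaultdict, deque


def add_lag_features(rows, window=3):
    """Add the previous delay and a trailing moving average per route.

    rows -- dicts with "route_id", "ts", and "delay_s" (hypothetical schema)
    """
    # One bounded history of past delays per route.
    history = defaultdict(lambda: deque(maxlen=window))
    enriched_rows = []
    for row in sorted(rows, key=lambda r: (r["route_id"], r["ts"])):
        past = history[row["route_id"]]
        enriched = dict(row)
        # None marks rows that have no earlier observation on their route.
        enriched["delay_lag_1"] = past[-1] if past else None
        enriched["delay_ma"] = sum(past) / len(past) if past else None
        enriched_rows.append(enriched)
        # Record the current delay only after its features are computed.
        past.append(row["delay_s"])
    return enriched_rows


observations = [
    {"route_id": "4", "ts": 1, "delay_s": 60},
    {"route_id": "4", "ts": 2, "delay_s": 120},
    {"route_id": "4", "ts": 3, "delay_s": 300},
    {"route_id": "4", "ts": 4, "delay_s": 240},
]
features = add_lag_features(observations, window=3)
```

The first row of each route has no history, so a model pipeline would need to impute or drop those missing values before training.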
