ugcodrr/WMATA-R

WMATA Daily Ridership Forecasting

This repository contains the WMATA daily ridership forecasting workflow for MetroBus and MetroRail. It builds clean modeling tables from WMATA ridership exports, trains and evaluates forecast models, and produces 30-day forecast tables and presentation-ready graphs.

A detailed project overview, modeling scope, and design decisions are documented in the OUTLINE.MD file.

Quick Start

Run these commands from the repository root.

Rscript setup.R
Rscript -e 'targets::tar_make()'

The main pipeline writes tables, figures, and diagnostics under outputs/.

To refresh the curated production graph set after the main pipeline finishes:

Rscript wmata_prod_graphs.R

To create a simple 30-day Bus and Rail forecast text summary (written to outputs/tables/future_forecast_summary.txt):

Rscript future_forecast.R

Requirements

  • R 4.5 or newer is recommended.
  • The setup script restores packages from renv.lock when available and installs any missing runtime packages.

Input Data

Download the ridership exports from the WMATA Daily Ridership Portal:

https://www.wmata.com/initiatives/ridership-portal/daily-summary.cfm

On the portal, use the table export controls:

  1. Select Download Data.
  2. Select Full Data.
  3. Set the row limit to 100,000 rows.
  4. Download both the full ridership detail export and the daily summary totals export.

Place the two WMATA ridership exports in data/raw/:

  • Daily Ridership - Bar Chart View_Full Data_data (8).csv
  • Daily Ridership - Bar Chart View_data (5).csv

The pipeline expects both raw exports in data/raw/.

Training exclusions are managed in:

data/training_exclusions.csv

Use that file for known closures, shutdowns, weather disruptions, or one-off operating anomalies that should remain in history but be excluded from model training.

Expected format:

date,mode,station_name,exclude_from_training,reason
YYYY-MM-DD,Bus,,TRUE,Short reason for excluding this day
YYYY-MM-DD,Rail,Station Name,TRUE,Short reason for excluding this station-day

Simple example:

date,mode,station_name,exclude_from_training,reason
2026-01-26,Bus,,TRUE,WMATA closure due to excessive snow
2026-01-26,Rail,,TRUE,WMATA closure due to excessive snow

Use a blank station_name for a mode-wide exclusion. Fill station_name only when a Rail station-specific row should be excluded.
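
The matching rule above can be sketched in base R. This is only an illustration, not the pipeline's actual code: the `is_excluded()` helper, the `ridership` table, its column names, and the "Example Station" row are all hypothetical. Excluded rows are flagged rather than dropped, so they remain in history.

```r
# Hypothetical stand-in for data/training_exclusions.csv (same header as documented).
# station_name is forced to character so blank cells stay "" instead of NA.
exclusions <- read.csv(text = "date,mode,station_name,exclude_from_training,reason
2026-01-26,Bus,,TRUE,WMATA closure due to excessive snow
2026-01-26,Rail,,TRUE,WMATA closure due to excessive snow
2026-02-10,Rail,Example Station,TRUE,Illustrative station-specific row",
  colClasses = c(station_name = "character"))

# Returns TRUE for rows of `data` matched by an active exclusion.
# Blank station_name = mode-wide exclusion; otherwise match the station too.
is_excluded <- function(data, exclusions) {
  active    <- exclusions[exclusions$exclude_from_training, ]
  mode_wide <- active[active$station_name == "", ]
  station   <- active[active$station_name != "", ]
  key <- function(d, m, s) paste(d, m, s, sep = "|")
  paste(data$date, data$mode) %in% paste(mode_wide$date, mode_wide$mode) |
    key(data$date, data$mode, data$station_name) %in%
      key(station$date, station$mode, station$station_name)
}

# Hypothetical modeling table (assumed columns: date, mode, station_name, ridership).
ridership <- data.frame(
  date = c("2026-01-25", "2026-01-26", "2026-01-26", "2026-02-10", "2026-02-10"),
  mode = c("Bus", "Bus", "Rail", "Rail", "Rail"),
  station_name = c("", "", "Another Station", "Example Station", "Another Station"),
  ridership = c(100, 5, 3, 40, 60),
  stringsAsFactors = FALSE
)

# Flag rows instead of deleting them so history stays complete.
ridership$use_for_training <- !is_excluded(ridership, exclusions)
```

In this sketch the mode-wide Rail row for 2026-01-26 excludes every Rail station on that date, while the station-specific row affects only the named station.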

Project Structure

.
├── _targets.R                  # Pipeline definition
├── setup.R                     # Package and directory setup
├── functions.R                 # Shared helpers
├── data_prep.R                 # Raw, bronze, and silver data preparation
├── feature_engineering.R       # Forecast-safe feature engineering
├── model_spec.R                # Model definitions
├── model_fit.R                 # Training and model selection
├── evaluation.R                # Metrics and backtesting helpers
├── forecasting.R               # Future forecast generation
├── future_forecast.R           # Simple 30-day forecast text summary
├── graph_pipeline.R            # Main pipeline figures
├── discovery_layer.R           # Discovery diagnostics and insight tables
├── wmata_prod_graphs.R         # Curated production graph export script
├── data/
│   ├── raw/
│   ├── processed/
│   │   ├── bronze/
│   │   ├── silver/
│   │   └── gold/
│   └── training_exclusions.csv
├── outputs/
│   ├── diagnostics/
│   ├── figures/
│   └── tables/
├── slideshowGraphs/
└── docs/

Pipeline Commands

Install or restore dependencies:

Rscript setup.R

Run the full pipeline:

Rscript -e 'targets::tar_make()'

Reset pipeline state when inputs or code have changed substantially:

Rscript -e 'targets::tar_destroy(); targets::tar_make()'

Modeling Scope

  • MetroBus: one systemwide daily ridership forecast.
  • MetroRail: station-level forecasts for the main station cohort, fallback forecasts for incomplete or newer stations, and an aggregated systemwide rail forecast.
  • Unassigned Rail rows: excluded from station-level model training and evaluation, tracked separately in QA outputs, and forecast separately for complete system totals.

Forecast-Safe Features

The model uses only information that would be available at prediction time:

  • Calendar fields: trend, year, month, week of year, day of week, and weekend flag
  • WMATA context: holiday, service type, and weekday/Saturday/Sunday grouping
  • History: lags at 1, 7, 14, 21, and 28 days
  • Rolling means: trailing 7-, 14-, and 28-day means, each shifted by one day so the current day is never included
  • Same-weekday history: prior same-weekday averages
  • Rail-only fields: station identifier and station age flags
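
As a rough illustration of the history features, the base R sketch below builds the listed lags and the shifted trailing means on a toy series. The helper names (`lag_by`, `trailing_mean`), the `daily` table, and the output column names are assumptions, not the repository's feature_engineering.R.

```r
# Lag a numeric vector by k positions, padding the start with NA.
lag_by <- function(x, k) c(rep(NA_real_, k), x)[seq_along(x)]

# Mean of the previous n days, excluding the current day.
trailing_mean <- function(x, n) {
  # sides = 1 averages the current value and the n - 1 values before it ...
  current_inclusive <- as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
  # ... so shift by one day to keep today's value out of today's feature.
  lag_by(current_inclusive, 1)
}

# Toy daily series (synthetic values, for illustration only).
daily <- data.frame(
  date = as.Date("2025-01-01") + 0:29,
  ridership = as.numeric(1:30)
)

for (k in c(1, 7, 14, 21, 28)) {
  daily[[paste0("lag_", k)]] <- lag_by(daily$ridership, k)
}
for (n in c(7, 14, 28)) {
  daily[[paste0("roll_mean_", n)]] <- trailing_mean(daily$ridership, n)
}
```

With this construction, the 7-day rolling mean on day 8 averages days 1 through 7 only, which is the property the one-day shift is meant to guarantee.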

Excluded from the production feature set:

  • Weather
  • Gas prices
  • Economic indicators
  • Unknown future disruptions
  • Future actuals or leaked rolling statistics

Each of these is either unavailable at prediction time or would introduce data leakage that overstates model performance.

Model Ladder

The project evaluates:

  1. Annual seasonal naive benchmark
  2. 7-day lag benchmark
  3. Linear regression benchmark
  4. GLMNET regularized regression
  5. XGBoost challenger

Leakage prevention

  • All train, validation, holdout, and forecast splits are chronological
  • Rolling features are shifted by one day before use
  • Lagged features only reference prior observations
  • Holdout reporting is reserved for January 1, 2026 through March 31, 2026
  • The final 30-day forecast is generated after the latest available historical date
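
A minimal sketch of the date logic above, in base R. The holdout boundaries come from this section. The split labels, the `assign_split()` helper, and the toy date range are assumptions made for illustration.

```r
holdout_start <- as.Date("2026-01-01")
holdout_end   <- as.Date("2026-03-31")

# Label each date purely by its position in time (no random shuffling).
# "pre_holdout" covers both training and validation data in this sketch.
assign_split <- function(dates) {
  ifelse(dates < holdout_start, "pre_holdout",
         ifelse(dates <= holdout_end, "holdout", "post_holdout"))
}

# Toy history ending on 2026-04-02 (hypothetical latest available date).
dates <- seq(as.Date("2025-12-30"), as.Date("2026-04-02"), by = "day")
split <- assign_split(dates)

# The 30-day forecast starts the day after the latest historical date.
forecast_dates <- max(dates) + seq_len(30)
```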

Bus modeling

  • MetroBus is modeled as one systemwide daily series
  • It follows the same model ladder and validation design as MetroRail

Rail modeling

  • Rail production v1 forecasts at the station level first, then aggregates up to the systemwide total
  • Main cohort: stations with at least 90% overall coverage and at least 2 years of pre-holdout history
  • Newer or incomplete stations are forecast with a deterministic fallback hierarchy and flagged in outputs
  • Unassigned Rail rows are excluded from station-level modeling and documented separately
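
The cohort rule could be sketched as follows. The thresholds (90% coverage, 2 years) come from this section. Everything else is an assumption: coverage is taken here as observed days divided by calendar days between a station's first observation and the holdout start, and 2 years is approximated as 730 days. The pipeline may define both differently.

```r
# Classify one station from its vector of observed dates.
# ASSUMPTIONS: coverage definition and the 730-day reading of "2 years".
classify_station <- function(obs_dates,
                             holdout_start = as.Date("2026-01-01"),
                             min_coverage = 0.90,
                             min_history_days = 730) {
  pre <- sort(unique(obs_dates[obs_dates < holdout_start]))
  if (length(pre) == 0) return("fallback")
  span_days <- as.numeric(difftime(holdout_start, min(pre), units = "days"))
  coverage  <- length(pre) / span_days
  if (coverage >= min_coverage && span_days >= min_history_days) "main" else "fallback"
}

# Synthetic stations, for illustration only.
full_history <- seq(as.Date("2023-01-01"), as.Date("2025-12-31"), by = "day")
new_station  <- seq(as.Date("2025-06-01"), as.Date("2025-12-31"), by = "day")
gappy        <- full_history[seq(1, length(full_history), by = 2)]

cohorts <- c(
  full_history = classify_station(full_history),
  new_station  = classify_station(new_station),
  gappy        = classify_station(gappy)
)
```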

Fallback strategy

  • Primary fallback: 7-day lag
  • Secondary fallback: rolling same-weekday average
  • Annual seasonal naive is kept as a benchmark, not the fallback default
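
One way to read this hierarchy is sketched below in base R. It is an assumption-laden illustration, not the repository's forecasting.R: the `fallback_forecast()` helper is hypothetical, the same-weekday average uses the 4 most recent observed same-weekday values (an arbitrary window), and the returned `source` label stands in for the output flag.

```r
# Forecast one target date for a station with short or incomplete history.
fallback_forecast <- function(history, target_date, n_weeks = 4) {
  value_on <- function(d) {
    v <- history$ridership[history$date == d]
    if (length(v) == 1 && !is.na(v)) v else NA_real_
  }

  # Primary fallback: the observation exactly 7 days earlier.
  primary <- value_on(target_date - 7)
  if (!is.na(primary)) return(list(value = primary, source = "lag_7"))

  # Secondary fallback: mean of the most recent observed same-weekday values.
  same_weekday <- format(history$date, "%u") == format(target_date, "%u")
  prior <- history[history$date < target_date & same_weekday &
                     !is.na(history$ridership), ]
  prior <- prior[order(prior$date, decreasing = TRUE), ]
  if (nrow(prior) > 0) {
    return(list(value = mean(head(prior$ridership, n_weeks)),
                source = "same_weekday_mean"))
  }
  list(value = NA_real_, source = "none")
}

# Synthetic 28-day history with one missing day.
history <- data.frame(
  date = as.Date("2025-03-01") + 0:27,
  ridership = as.numeric(101:128)
)
history$ridership[history$date == as.Date("2025-03-24")] <- NA

f1 <- fallback_forecast(history, as.Date("2025-03-29"))  # 7-day lag is observed
f2 <- fallback_forecast(history, as.Date("2025-03-31"))  # 7-day lag is missing
f3 <- fallback_forecast(history, as.Date("2025-04-10"))  # 7-day lag is beyond history
```

Under these assumptions, any target more than 7 days past the end of history falls through to the same-weekday average, because its 7-day lag has not been observed yet.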

Validation design

  • Monthly rolling-origin backtests across calendar year 2025
  • 30-day forecast window for each origin
  • Horizon reporting for day 1, days 2 to 7, and days 8 to 30
  • 90-day evaluation is optional and not the core selection criterion in v1
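
The backtest layout can be sketched as a plan table in base R. The monthly origins, 30-day window, and horizon groups come from this section. Placing each origin on the first of the month, treating the origin date as forecast day 1, and the bucket labels are assumptions.

```r
# One origin per month of 2025 (ASSUMPTION: first day of each month).
origins <- seq(as.Date("2025-01-01"), as.Date("2025-12-01"), by = "month")

# Map a horizon in days to the reporting groups: 1, 2 to 7, 8 to 30.
horizon_bucket <- function(h) {
  cut(h, breaks = c(0, 1, 7, 30),
      labels = c("day_1", "days_2_7", "days_8_30"))
}

# One row per (origin, horizon); a model for a given origin would be trained
# only on data dated strictly before that origin.
backtest_plan <- do.call(rbind, lapply(seq_along(origins), function(i) {
  h <- 1:30
  data.frame(
    origin      = origins[i],
    horizon     = h,
    target_date = origins[i] + h - 1,
    bucket      = horizon_bucket(h)
  )
}))
```

Errors computed per row of such a table can then be averaged by `bucket` to produce the horizon reporting. Under these assumptions every target date falls inside 2025, so the backtests never touch the 2026 holdout.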

Future improvements

  • Add published service-planning calendars if WMATA provides them
  • Add aggregate-only SARIMA benchmarks as an appendix if stakeholders want classic time-series comparisons
  • Expand forecast horizon reporting to 90-day production outputs when runtime and forecast quality justify it
