This repository contains the WMATA daily ridership forecasting workflow for MetroBus and MetroRail. It builds clean modeling tables from WMATA ridership exports, trains and evaluates forecast models, and produces 30-day forecast tables and presentation-ready graphs.
A detailed project overview, modeling scope, and design decisions are documented in the OUTLINE.MD file.
Run these commands from the repository root.
Rscript setup.R
Rscript -e 'targets::tar_make()'
The main pipeline writes tables, figures, and diagnostics under outputs/.
To refresh the curated production graph set after the main pipeline finishes:
Rscript wmata_prod_graphs.R
To create a simple 30-day Bus and Rail forecast text summary (written to outputs/tables/future_forecast_summary.txt):
Rscript future_forecast.R
- R 4.5 or newer is recommended.
- The setup script restores packages from renv.lock when available and installs any missing runtime packages.
Download the ridership exports from the WMATA Daily Ridership Portal:
https://www.wmata.com/initiatives/ridership-portal/daily-summary.cfm
On the portal, use the table export controls:
- Select Download Data.
- Select Full Data.
- Set the row limit to 100,000 rows.
- Download both the full ridership detail export and the daily summary totals export.
Place the two WMATA ridership exports in data/raw/:
Daily Ridership - Bar Chart View_Full Data_data (8).csv
Daily Ridership - Bar Chart View_data (5).csv
The pipeline expects both raw exports in data/raw/.
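A quick pre-flight check can confirm that both exports are in place before running the pipeline. The sketch below is not part of the repository: `check_raw_exports` is a hypothetical helper, and it assumes that the detail export is the file whose name contains "Full Data" while the summary export is the other CSV.

```r
# Hypothetical helper (not in the repository): report whether both kinds of
# export appear in a vector of file names.
# Assumption: the detail export name contains "Full Data"; the summary does not.
check_raw_exports <- function(files) {
  is_csv <- grepl("\\.csv$", files, ignore.case = TRUE)
  is_full <- grepl("Full Data", files, fixed = TRUE)
  c(
    full_data = any(is_csv & is_full),
    summary   = any(is_csv & !is_full)
  )
}

# list.files() returns an empty vector if data/raw/ does not exist yet.
status <- check_raw_exports(list.files(file.path("data", "raw")))
if (!all(status)) {
  message("Missing raw export(s): ", paste(names(status)[!status], collapse = ", "))
}
```

The check matches on name patterns rather than exact file names because the numeric suffix on a portal download, such as `(8)`, may differ from one download to the next.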
Training exclusions are managed in:
data/training_exclusions.csv
Use that file for known closures, shutdowns, weather disruptions, or one-off operating anomalies that should remain in history but be excluded from model training.
Expected format:
date,mode,station_name,exclude_from_training,reason
YYYY-MM-DD,Bus,,TRUE,Short reason for excluding this day
YYYY-MM-DD,Rail,Station Name,TRUE,Short reason for excluding this station-day
Simple example:
date,mode,station_name,exclude_from_training,reason
2026-01-26,Bus,,TRUE,WMATA closure due to excessive snow
2026-01-26,Rail,,TRUE,WMATA closure due to excessive snow
Use a blank station_name for a mode-wide exclusion. Fill station_name only when a Rail station-specific row should be excluded.
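The format rules above can be checked before a pipeline run. The following is a minimal base R sketch, not the repository's own loader: `validate_exclusions` is a hypothetical name, and the checks are one reading of the rules stated here (ISO dates, mode of Bus or Rail, station_name filled only on Rail rows).

```r
# Example rows copied from the format above; in practice, read
# data/training_exclusions.csv instead of inline text.
csv_text <- "date,mode,station_name,exclude_from_training,reason
2026-01-26,Bus,,TRUE,WMATA closure due to excessive snow
2026-01-26,Rail,,TRUE,WMATA closure due to excessive snow"

# Fix column types so a blank station_name stays "" rather than becoming NA.
exclusions <- read.csv(
  text = csv_text,
  colClasses = c("character", "character", "character", "logical", "character")
)

# Hypothetical validator: stops with an error if a rule is violated.
validate_exclusions <- function(x) {
  expected <- c("date", "mode", "station_name", "exclude_from_training", "reason")
  stopifnot(identical(names(x), expected))
  parsed <- as.Date(x$date, format = "%Y-%m-%d")
  stopifnot(!anyNA(parsed))                        # dates must be YYYY-MM-DD
  stopifnot(all(x$mode %in% c("Bus", "Rail")))     # only the two modes
  stopifnot(all(x$station_name == "" | x$mode == "Rail"))  # stations are Rail-only
  x$date <- parsed
  x
}

exclusions <- validate_exclusions(exclusions)
```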
.
├── _targets.R # Pipeline definition
├── setup.R # Package and directory setup
├── functions.R # Shared helpers
├── data_prep.R # Raw, bronze, and silver data preparation
├── feature_engineering.R # Forecast-safe feature engineering
├── model_spec.R # Model definitions
├── model_fit.R # Training and model selection
├── evaluation.R # Metrics and backtesting helpers
├── forecasting.R # Future forecast generation
├── future_forecast.R # Simple 30-day forecast text summary
├── graph_pipeline.R # Main pipeline figures
├── discovery_layer.R # Discovery diagnostics and insight tables
├── wmata_prod_graphs.R # Curated production graph export script
├── data/
│ ├── raw/
│ ├── processed/
│ │ ├── bronze/
│ │ ├── silver/
│ │ └── gold/
│ └── training_exclusions.csv
├── outputs/
│ ├── diagnostics/
│ ├── figures/
│ └── tables/
├── slideshowGraphs/
└── docs/
Install or restore dependencies:
Rscript setup.R
Run the full pipeline:
Rscript -e 'targets::tar_make()'
Reset pipeline state when inputs or code have changed substantially:
Rscript -e 'targets::tar_destroy(); targets::tar_make()'
- MetroBus: one systemwide daily ridership forecast.
- MetroRail: station-level forecasts for the main station cohort, fallback forecasts for incomplete or newer stations, and an aggregated systemwide rail forecast.
- Unassigned Rail rows: excluded from station-level model training and evaluation, tracked separately in QA outputs, and forecast separately for complete system totals.
The model uses only information that would be available at prediction time:
- Calendar fields: trend, year, month, week of year, day of week, and weekend flag
- WMATA context: holiday, service type, and weekday/Saturday/Sunday grouping
- History: lags at 1, 7, 14, 21, and 28 days
- Rolling means: trailing 7, 14, and 28 days shifted by one day
- Same-weekday history: prior same-weekday averages
- Rail-only fields: station identifier and station age flags
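To illustrate what "shifted by one day" means for the history features, here is a small base R sketch. It is not the code in feature_engineering.R: the helper names `lag_vec` and `trailing_mean_shifted` and the ridership values are made up for illustration.

```r
# Shift a series back by k days; the first k positions have no history (NA).
lag_vec <- function(x, k) {
  c(rep(NA_real_, k), x)[seq_along(x)]
}

# Trailing mean over the `window` days that end the day BEFORE day t,
# so the value for day t never includes day t itself.
trailing_mean_shifted <- function(x, window) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (t in seq_len(n)) {
    if (t > window) {
      out[t] <- mean(x[(t - window):(t - 1)])
    }
  }
  out
}

rides <- c(100, 120, 110, 130, 125, 60, 55, 105, 118, 112)  # made-up values

features <- data.frame(
  rides       = rides,
  lag_1       = lag_vec(rides, 1),
  lag_7       = lag_vec(rides, 7),
  roll_mean_7 = trailing_mean_shifted(rides, 7)
)
```

In this example, the day 8 rolling mean averages days 1 through 7 only. An unshifted rolling mean would include day 8's own ridership, which is exactly the leakage the feature set is designed to avoid.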
Excluded from the production feature set:
- Weather
- Gas prices
- Economic indicators
- Unknown future disruptions
- Future actuals or leaked rolling statistics
Each of these is either unavailable at prediction time or would introduce data leakage that overstates model performance.
The project evaluates:
- Annual seasonal naive benchmark
- 7-day lag benchmark
- Linear regression benchmark
- GLMNET regularized regression
- XGBoost challenger
- All train, validation, holdout, and forecast splits are chronological
- Rolling features are shifted by one day before use
- Lagged features only reference prior observations
- Holdout reporting is reserved for January 1, 2026 through March 31, 2026
- The final 30-day forecast is generated after the latest available historical date
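As a sketch of the chronological split, dates can be labeled relative to the holdout window stated above. `assign_split` is a hypothetical helper, and the label names are illustrative. How the pre-holdout period is further divided into train and validation is not specified here, so the sketch leaves it as one block.

```r
# Hypothetical helper: label each date relative to the reserved holdout window.
assign_split <- function(dates,
                         holdout_start = as.Date("2026-01-01"),
                         holdout_end   = as.Date("2026-03-31")) {
  dates <- as.Date(dates)
  ifelse(
    dates < holdout_start, "pre_holdout",
    ifelse(dates <= holdout_end, "holdout", "post_holdout")
  )
}

example_dates <- c("2025-12-31", "2026-01-01", "2026-03-31", "2026-04-01")
split_labels <- assign_split(example_dates)
```

Because the labels depend only on the date, no row can land in an earlier split than a row that precedes it in time.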
- MetroBus is modeled as one systemwide daily series
- It follows the same model ladder and validation design as MetroRail
- Rail production v1 forecasts at the station level first, then aggregates up to the systemwide rail total
- Main cohort: stations with at least 90% overall coverage and at least 2 years of pre-holdout history
- Newer or incomplete stations are forecast with a deterministic fallback hierarchy and flagged in outputs
- Unassigned Rail rows are excluded from station-level modeling and documented separately
- Primary fallback: 7-day lag
- Secondary fallback: rolling same-weekday average
- Annual seasonal naive is kept as a benchmark, not the fallback default
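The fallback order can be sketched for a single next-day forecast as follows. This is an illustration, not the repository's implementation: `fallback_forecast` is a hypothetical name, and the four-week window for the same-weekday average is an assumed parameter.

```r
# Hypothetical sketch: forecast the day after the end of `history`.
# `history` is a numeric vector of daily ridership, oldest first.
# Assumption: the same-weekday average looks back n_weeks = 4 weeks.
fallback_forecast <- function(history, n_weeks = 4) {
  n <- length(history)
  # Positions that fall on the same weekday as the target day, newest first.
  idx <- n + 1 - 7 * seq_len(n_weeks)
  idx <- idx[idx >= 1]
  same_weekday <- history[idx]

  # Primary fallback: the value observed exactly 7 days before the target.
  if (length(same_weekday) >= 1 && !is.na(same_weekday[1])) {
    return(list(value = same_weekday[1], method = "lag_7"))
  }
  # Secondary fallback: average of the available prior same-weekday values.
  if (any(!is.na(same_weekday))) {
    return(list(value = mean(same_weekday, na.rm = TRUE),
                method = "same_weekday_mean"))
  }
  list(value = NA_real_, method = "none")
}

history <- as.numeric(1:28)           # made-up series
primary <- fallback_forecast(history)

history_gap <- history
history_gap[22] <- NA                 # the 7-day lag is missing
secondary <- fallback_forecast(history_gap)
```

Returning the method name alongside the value is one way to support flagging fallback forecasts in the outputs.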
- Monthly rolling-origin backtests across calendar year 2025
- 30-day forecast window for each origin
- Horizon reporting for day 1, days 2 to 7, and days 8 to 30
- 90-day evaluation is optional and not the core selection criterion in v1
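The backtest layout can be sketched as a grid of origins and horizons. The sketch makes two assumptions that are not stated above: each origin falls on the first day of a month, and horizon 1 is the origin date itself. The name `horizon_bucket` is hypothetical.

```r
# Assumption: one origin on the first day of each month in 2025.
origins <- seq(as.Date("2025-01-01"), as.Date("2025-12-01"), by = "month")

# One row per origin and forecast horizon (30-day window per origin).
grid <- data.frame(
  origin  = rep(origins, each = 30),
  horizon = rep(1:30, times = length(origins))
)
# Assumption: horizon 1 corresponds to the origin date itself.
grid$target_date <- grid$origin + grid$horizon - 1

# Map a horizon to the reporting buckets: day 1, days 2 to 7, days 8 to 30.
horizon_bucket <- function(h) {
  cut(h, breaks = c(0, 1, 7, 30),
      labels = c("day 1", "days 2-7", "days 8-30"))
}
grid$bucket <- horizon_bucket(grid$horizon)

# Rows per bucket across all origins.
bucket_counts <- table(grid$bucket)
```

Forecast errors joined onto this grid by origin and target date can then be summarized per bucket.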
- Add published service-planning calendars if WMATA provides them
- Add appendix aggregate-only SARIMA benchmarks if stakeholders want classic time-series comparisons
- Expand forecast horizon reporting to 90-day production outputs when runtime and forecast quality justify it