WMATA Daily Ridership Forecasting

This repository contains the WMATA daily ridership forecasting workflow for MetroBus and MetroRail. It builds clean modeling tables from WMATA ridership exports, trains and evaluates forecast models, and produces 30-day forecast tables and presentation-ready graphs.

The project overview, modeling scope, and design decisions are documented in detail in OUTLINE.md.

Quick Start

Run these commands from the repository root.

Rscript setup.R
Rscript -e 'targets::tar_make()'

The main pipeline writes tables, figures, and diagnostics under outputs/.

To refresh the curated production graph set after the main pipeline finishes:

Rscript wmata_prod_graphs.R

To create a simple 30-day Bus and Rail forecast text summary (written to outputs/tables/future_forecast_summary.txt):

Rscript future_forecast.R

Requirements

R 4.5 or newer is recommended.
The setup script restores packages from renv.lock when available and installs any missing runtime packages.

Input Data

Download the ridership exports from the WMATA Daily Ridership Portal:

https://www.wmata.com/initiatives/ridership-portal/daily-summary.cfm

On the portal, use the table export controls:

Select Download Data.
Select Full Data.
Set the row limit to 100,000 rows.
Download both the full ridership detail export and the daily summary totals export.

Place the two WMATA ridership exports in data/raw/:

Daily Ridership - Bar Chart View_Full Data_data (8).csv
Daily Ridership - Bar Chart View_data (5).csv

The pipeline expects both raw exports in data/raw/.

Training exclusions are managed in:

data/training_exclusions.csv

Use that file for known closures, shutdowns, weather disruptions, or one-off operating anomalies that should remain in history but be excluded from model training.

Expected format:

date,mode,station_name,exclude_from_training,reason
YYYY-MM-DD,Bus,,TRUE,Short reason for excluding this day
YYYY-MM-DD,Rail,Station Name,TRUE,Short reason for excluding this station-day

Simple example:

date,mode,station_name,exclude_from_training,reason
2026-01-26,Bus,,TRUE,WMATA closure due to excessive snow
2026-01-26,Rail,,TRUE,WMATA closure due to excessive snow

Use a blank station_name for a mode-wide exclusion. Fill station_name only when a Rail station-specific row should be excluded.
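The exclusion rules above can be sketched in base R as follows. This is a minimal illustration, not the pipeline's actual code: the function name flag_training_exclusions, the history table, its ridership values, and the "Station A" row are hypothetical, and it assumes Bus rows carry a blank station_name.

```r
# Exclusions in the documented format; the third row is a hypothetical
# station-specific example.
exclusions <- read.csv(text = "date,mode,station_name,exclude_from_training,reason
2026-01-26,Bus,,TRUE,WMATA closure due to excessive snow
2026-01-26,Rail,,TRUE,WMATA closure due to excessive snow
2026-02-10,Rail,Station A,TRUE,Hypothetical station closure")

# Flag history rows that match a mode-wide or station-specific exclusion.
# Rows are flagged, not dropped, so they remain in history.
flag_training_exclusions <- function(history, exclusions) {
  active <- exclusions[exclusions$exclude_from_training, , drop = FALSE]
  mode_wide <- active[active$station_name == "", , drop = FALSE]
  station_only <- active[active$station_name != "", , drop = FALSE]

  key_mode <- paste(history$date, history$mode, sep = "|")
  key_station <- paste(history$date, history$mode, history$station_name, sep = "|")

  history$excluded <-
    key_mode %in% paste(mode_wide$date, mode_wide$mode, sep = "|") |
    key_station %in% paste(station_only$date, station_only$mode,
                           station_only$station_name, sep = "|")
  history
}

# Hypothetical history rows (ridership values are made up).
history <- data.frame(
  date = c("2026-01-25", "2026-01-26", "2026-01-26", "2026-02-10", "2026-02-10"),
  mode = c("Bus", "Bus", "Rail", "Rail", "Rail"),
  station_name = c("", "", "Station A", "Station A", "Station B"),
  ridership = c(300000, 1000, 50, 0, 4200)
)

flagged <- flag_training_exclusions(history, exclusions)
training_rows <- flagged[!flagged$excluded, ]
```

A blank station_name in the exclusions file matches every row for that mode and date, while a filled station_name matches only that station-day.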

Project Structure

.
├── _targets.R                  # Pipeline definition
├── setup.R                     # Package and directory setup
├── functions.R                 # Shared helpers
├── data_prep.R                 # Raw, bronze, and silver data preparation
├── feature_engineering.R       # Forecast-safe feature engineering
├── model_spec.R                # Model definitions
├── model_fit.R                 # Training and model selection
├── evaluation.R                # Metrics and backtesting helpers
├── forecasting.R               # Future forecast generation
├── future_forecast.R           # Simple 30-day forecast text summary
├── graph_pipeline.R            # Main pipeline figures
├── discovery_layer.R           # Discovery diagnostics and insight tables
├── wmata_prod_graphs.R         # Curated production graph export script
├── data/
│   ├── raw/
│   ├── processed/
│   │   ├── bronze/
│   │   ├── silver/
│   │   └── gold/
│   └── training_exclusions.csv
├── outputs/
│   ├── diagnostics/
│   ├── figures/
│   └── tables/
├── slideshowGraphs/
└── docs/

Pipeline Commands

Install or restore dependencies:

Rscript setup.R

Run the full pipeline:

Rscript -e 'targets::tar_make()'

Reset pipeline state when inputs or code have changed substantially:

Rscript -e 'targets::tar_destroy(); targets::tar_make()'

Modeling Scope

MetroBus: one systemwide daily ridership forecast.
MetroRail: station-level forecasts for the main station cohort, fallback forecasts for incomplete or newer stations, and an aggregated systemwide rail forecast.
Unassigned Rail rows: excluded from station-level model training and evaluation, tracked separately in QA outputs, and forecast separately for complete system totals.
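As a rough illustration of how the Rail components might roll up into a complete system total, the base R sketch below sums three forecast components by date. The data frame, its column names, and all values are hypothetical and are not taken from the pipeline.

```r
# Hypothetical forecast components for two dates (all values are made up).
rail_parts <- data.frame(
  date = as.Date(rep(c("2026-04-01", "2026-04-02"), each = 3)),
  component = rep(c("main_cohort", "fallback", "unassigned"), times = 2),
  forecast = c(400000, 15000, 2000, 410000, 16000, 2500)
)

# Sum every component by date to get the systemwide Rail forecast.
rail_system <- aggregate(forecast ~ date, data = rail_parts, FUN = sum)
print(rail_system)
```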

Forecast-Safe Features

The model uses only information that would be available at prediction time:

Calendar fields: trend, year, month, week of year, day of week, and weekend flag
WMATA context: holiday, service type, and weekday/Saturday/Sunday grouping
History: lags at 1, 7, 14, 21, and 28 days
Rolling means: trailing 7, 14, and 28 days shifted by one day
Same-weekday history: prior same-weekday averages
Rail-only fields: station identifier and station age flags
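The history features above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the helper names lag_vec and shifted_roll_mean, the daily table, and its values are hypothetical, and the input is assumed to be one row per day with no gaps.

```r
# Lag a numeric vector by k days; the first k values become NA.
lag_vec <- function(x, k) c(rep(NA_real_, k), x)[seq_along(x)]

# Trailing mean over the previous `window` days, excluding the current day,
# so the feature for day i never sees the value observed on day i.
shifted_roll_mean <- function(x, window) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    if (i > window) out[i] <- mean(x[(i - window):(i - 1)])
  }
  out
}

# Hypothetical gap-free daily series (values are made up).
daily <- data.frame(
  date = seq(as.Date("2025-01-01"), by = "day", length.out = 35),
  ridership = as.numeric(1:35)
)

for (k in c(1, 7, 14, 21, 28)) {
  daily[[paste0("lag_", k)]] <- lag_vec(daily$ridership, k)
}
for (w in c(7, 14, 28)) {
  daily[[paste0("roll_mean_", w)]] <- shifted_roll_mean(daily$ridership, w)
}

# Calendar fields known in advance for any future date.
daily$month <- as.integer(format(daily$date, "%m"))
daily$is_weekend <- format(daily$date, "%u") %in% c("6", "7")
```

Because the rolling window ends at the previous day, the first non-missing 7-day rolling mean appears on the eighth row.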

Excluded from the production feature set:

Weather
Gas prices
Economic indicators
Unknown future disruptions
Future actuals or leaked rolling statistics

Each of these is either unavailable at prediction time or risks leaking information that would overstate model performance.

Model Ladder

The project evaluates:

Annual seasonal naive benchmark
7-day lag benchmark
Linear regression benchmark
GLMNET regularized regression
XGBoost challenger

Leakage prevention

All train, validation, holdout, and forecast splits are chronological
Rolling features are shifted by one day before use
Lagged features only reference prior observations
Holdout reporting is reserved for January 1, 2026 through March 31, 2026
The final 30-day forecast is generated after the latest available historical date
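The chronological rules above can be sketched in base R as follows. The holdout dates come from this README; the function names assign_split and forecast_dates, the split labels, and the example dates are hypothetical.

```r
holdout_start <- as.Date("2026-01-01")
holdout_end   <- as.Date("2026-03-31")

# Label each date by the period it falls in; nothing on or after
# holdout_start is allowed into training or validation.
assign_split <- function(dates) {
  ifelse(dates < holdout_start, "train_validation",
         ifelse(dates <= holdout_end, "holdout", "post_holdout"))
}

# The forecast dates start the day after the latest observed date.
forecast_dates <- function(latest_date, horizon = 30) {
  seq(latest_date + 1, by = "day", length.out = horizon)
}

example_dates <- as.Date(c("2025-12-31", "2026-01-01", "2026-03-31", "2026-04-01"))
splits <- assign_split(example_dates)

# Hypothetical latest historical date, for illustration only.
future <- forecast_dates(as.Date("2026-04-15"))
```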

Bus modeling

MetroBus is modeled as one systemwide daily series
It follows the same model ladder and validation design as MetroRail

Rail modeling

Rail production v1 forecasts at the station level first, then aggregates up to the system total
Main cohort: stations with at least 90% overall coverage and at least 2 years of pre-holdout history
Newer or incomplete stations are forecast with a deterministic fallback hierarchy and flagged in outputs
Unassigned Rail rows are excluded from station-level modeling and documented separately
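The main-cohort rule can be sketched in base R as follows. This README does not define how coverage or history length is computed, so the sketch makes two labeled assumptions: coverage is the share of all calendar dates in the data on which the station has a row, and 2 years means at least 730 days between the station's first observation and the holdout start. The function name assign_cohort and the example stations are hypothetical.

```r
# Assign a station to the "main" or "fallback" cohort.
# ASSUMPTION: coverage = station days observed / all calendar days in the data.
# ASSUMPTION: 2 years of history = at least 730 days before the holdout start.
assign_cohort <- function(station_dates, all_dates,
                          holdout_start = as.Date("2026-01-01"),
                          min_coverage = 0.90, min_history_days = 730) {
  pre <- station_dates[station_dates < holdout_start]
  if (length(pre) == 0) return("fallback")
  coverage <- length(unique(station_dates)) / length(unique(all_dates))
  history_days <- as.numeric(holdout_start - min(pre))
  if (coverage >= min_coverage && history_days >= min_history_days) {
    "main"
  } else {
    "fallback"
  }
}

# Hypothetical stations: one complete, one with gaps, one recently opened.
all_dates <- seq(as.Date("2023-01-01"), as.Date("2025-12-31"), by = "day")
established <- all_dates
gappy <- all_dates[seq_along(all_dates) %% 5 != 0]
newer <- all_dates[all_dates >= as.Date("2025-06-01")]

cohorts <- c(
  established = assign_cohort(established, all_dates),
  gappy = assign_cohort(gappy, all_dates),
  newer = assign_cohort(newer, all_dates)
)
```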

Fallback strategy

Primary fallback: 7-day lag
Secondary fallback: rolling same-weekday average
Annual seasonal naive is kept as a benchmark, not the fallback default
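The fallback order can be sketched in base R as follows for a single target date. This is an illustration, not the pipeline's implementation: the function name fallback_forecast, the four-week window for the same-weekday average, and the example history are assumptions, and the sketch does not cover targets more than 7 days past the last observation.

```r
# Forecast one date using the fallback hierarchy:
#   1. the value observed 7 days earlier, if available
#   2. otherwise the mean of earlier same-weekday values
# ASSUMPTION: the same-weekday average looks back n_weeks further weeks.
fallback_forecast <- function(history, target_date, n_weeks = 4) {
  # Return the observed ridership on date d, or NA if it is missing.
  value_on <- function(d) {
    v <- history$ridership[history$date == d]
    if (length(v) == 1 && !is.na(v)) v else NA_real_
  }

  lag7 <- value_on(target_date - 7)
  if (!is.na(lag7)) {
    return(list(value = lag7, method = "lag_7"))
  }

  earlier <- vapply(2:(n_weeks + 1),
                    function(k) value_on(target_date - 7 * k), numeric(1))
  if (any(!is.na(earlier))) {
    return(list(value = mean(earlier, na.rm = TRUE),
                method = "same_weekday_mean"))
  }
  list(value = NA_real_, method = "none")
}

# Hypothetical station history with one missing day (values are made up).
history <- data.frame(
  date = seq(as.Date("2025-11-01"), as.Date("2025-11-30"), by = "day"),
  ridership = as.numeric(1:30)
)
history <- history[history$date != as.Date("2025-11-24"), ]

first <- fallback_forecast(history, as.Date("2025-12-01"))
second <- fallback_forecast(history, as.Date("2025-12-02"))
```

Returning the method alongside the value makes it straightforward to flag fallback forecasts in outputs.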

Validation design

Monthly rolling-origin backtests across calendar year 2025
30-day forecast window for each origin
Horizon reporting for day 1, days 2 to 7, and days 8 to 30
90-day evaluation is optional and not the core selection criterion in v1
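The backtest layout can be sketched in base R as follows. It assumes each origin is the first day of a month in 2025 and is itself the first forecast date (horizon 1), which this README does not specify. The names origins, horizon_bucket, and backtest_grid are hypothetical.

```r
# One origin per month across calendar year 2025.
# ASSUMPTION: the origin is the first forecast date, and training uses
# only data strictly before it.
origins <- seq(as.Date("2025-01-01"), as.Date("2025-12-01"), by = "month")

# Map a horizon in days (1 to 30) to the reporting buckets.
horizon_bucket <- function(h) {
  cut(h, breaks = c(0, 1, 7, 30),
      labels = c("day 1", "days 2-7", "days 8-30"))
}

# One row per origin and forecast day.
backtest_grid <- do.call(rbind, lapply(seq_along(origins), function(i) {
  data.frame(origin = origins[i], horizon = 1:30,
             target_date = origins[i] + 0:29)
}))
backtest_grid$bucket <- horizon_bucket(backtest_grid$horizon)

bucket_counts <- table(backtest_grid$bucket)
```

Under these assumptions the last backtest target date is December 30, 2025, so no backtest window overlaps the 2026 holdout period.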

Future improvements

Add published service-planning calendars if WMATA provides them
Add aggregate-only SARIMA benchmarks as an appendix if stakeholders want classic time-series comparisons
Expand forecast horizon reporting to 90-day production outputs when runtime and forecast quality justify it
