- A modular machine learning framework for budget execution and data analysis, built in Python with scikit-learn, XGBoost, PyTorch, and TensorFlow. Designed for rapid experimentation, visualization, and benchmarking of both classification and regression models, it provides a structured yet extensible workflow that is equally useful for teaching, prototyping, and real-world application development.
- File A (Account Balances) is published monthly by agencies on USAspending.gov.
- Required by the DATA Act.
- Pulled automatically from data in the Governmentwide Treasury Account Symbol Adjusted Trial Balance System (GTAS).
- Contains budgetary resources, obligation, and outlay data for all the relevant Treasury Account Symbols (TAS) in a reporting agency.
- Includes both award and non-award spending (grouped together) and crosswalks with the SF 133 report.
- Train models
- Cross-validate with nested k-fold
- Generate predictions
- Output evaluation plots & performance metrics
- Store results & timings for meta-analysis
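
The workflow steps above can be sketched with scikit-learn's own utilities. This is a minimal illustration of nested k-fold cross-validation, not the notebook's actual code: the synthetic data, the `Ridge` estimator, the `alpha` grid, and the fold counts are all assumptions chosen to keep the example small.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# Synthetic stand-in for the real feature matrix and target (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Inner folds tune hyperparameters; outer folds estimate generalization.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},  # illustrative grid
    cv=inner_cv,
    scoring="r2",
)

# Each outer fold refits the whole grid search on its own training split.
start = time.perf_counter()
results = cross_validate(
    search, X, y, cv=outer_cv, scoring=("r2", "neg_mean_absolute_error")
)
elapsed = time.perf_counter() - start

nested_r2 = results["test_r2"]
nested_mae = -results["test_neg_mean_absolute_error"]
print(f"Nested CV R^2: {nested_r2.mean():.3f} +/- {nested_r2.std():.3f}")
print(f"Nested CV MAE: {nested_mae.mean():.2f}")
print(f"Total wall time: {elapsed:.2f} s")
```

Because the hyperparameter search never sees the outer test fold, the outer scores are a less optimistic estimate than the best inner-fold score.
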
```bash
git clone https://github.com/your-username/balance-projector.git
cd balance-projector
pip install -r requirements.txt
jupyter notebook balances.ipynb
```
1. Click the **Open In Colab** badge above.
2. Upload your CSV or mount Google Drive.
3. Set `DATA_PATH` near the top of the notebook.
4. **Runtime → Run all**.
```bash
# 1) Create environment
conda create -n sake python=3.11 -y
conda activate sake

# 2) Install dependencies
pip install -U pip wheel setuptools
pip install pandas numpy scipy matplotlib seaborn scikit-learn jupyter

# 3) Launch Jupyter
jupyter notebook
```
Open `ipynb/schedule-x.ipynb` and run the cells top to bottom.
- Confusion Matrix Heatmaps 🔥
- ROC & Precision-Recall Curves 📈
- Actual vs. Predicted Scatterplots 🎯
- Residual Analysis & Error Distribution 🎭
- Feature Importance Charts 📊
- Automatically logs `fit` and `predict` durations
- Model performance rankings across tasks
- Output available in tabular format for export
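
As a rough sketch of how `fit` and `predict` durations might be logged and ranked, the helper below times each call with `time.perf_counter` and returns a sortable table. The function name `timed_benchmark`, the column names, and the synthetic data are hypothetical and are not taken from the notebook.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def timed_benchmark(models, X_train, X_test, y_train, y_test):
    """Fit and score each model, recording wall-clock durations (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        t0 = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - t0

        t0 = time.perf_counter()
        predictions = model.predict(X_test)
        predict_seconds = time.perf_counter() - t0

        rows.append({
            "model": name,
            "accuracy": float((predictions == y_test).mean()),
            "fit_seconds": fit_seconds,
            "predict_seconds": predict_seconds,
        })
    # Rank best-first; the resulting table can be exported with .to_csv().
    ranked = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
    return ranked.reset_index(drop=True)


# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
report = timed_benchmark(
    {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
    },
    X_train, X_test, y_train, y_test,
)
print(report)
```
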
- Accepts CSVs, Excel files, or Pandas DataFrames
- Label encoding, numeric coercion, missing data handling
- Drop-in replacement for datasets via parameter injection
- Benchmark dozens of models easily
- Plug-in architecture for testing experimental models
- Use in classrooms to demo interpretability, overfitting, and variance
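
The input handling listed above (CSV, Excel, or DataFrame sources; label encoding; numeric coercion; missing-data handling) could look roughly like the sketch below. The helper names `load_table` and `prepare`, the median and `"missing"` fill strategies, and the sample values are assumptions made for illustration; the notebook may handle these steps differently.

```python
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_table(source):
    """Return a DataFrame from a CSV path, an Excel path, or an existing DataFrame."""
    if isinstance(source, pd.DataFrame):
        return source.copy()
    path = Path(source)
    if path.suffix.lower() in {".xlsx", ".xls"}:
        return pd.read_excel(path)  # .xlsx files require openpyxl
    return pd.read_csv(path)


def prepare(df, numeric_cols, categorical_cols):
    """Coerce numeric columns, fill gaps, and label-encode categorical columns."""
    out = df.copy()
    for col in numeric_cols:
        # Strip thousands separators; unparseable entries such as "n/a" become NaN.
        as_text = out[col].astype(str).str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(as_text, errors="coerce")
        out[col] = out[col].fillna(out[col].median())  # assumed fill strategy
    for col in categorical_cols:
        filled = out[col].fillna("missing").astype(str)
        out[col] = LabelEncoder().fit_transform(filled)
    return out


# Illustrative values only; column names echo those used elsewhere in this README.
raw = pd.DataFrame({
    "Obligations": ["1,200", "n/a", "300"],
    "Availability": ["Annual", None, "Multi-Year"],
})
clean = prepare(load_table(raw), ["Obligations"], ["Availability"])
print(clean)
```

Fitting a separate `LabelEncoder` per column keeps the example short; for a train/test workflow the encoders would need to be fitted on the training data and reused.
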
Statistic | Description | Use in Budget Analysis |
---|---|---|
Mean | Average value | Avg. Outlays, Obligations, etc., across accounts |
Median | Middle value | Robust central tendency in skewed financial data |
Mode | Most frequent value | Identify common MainAccountCodes or Availability categories |
Standard Deviation | Spread around the mean | Indicates variability in execution rates or balances |
Variance | Square of standard deviation | Used in statistical tests and model diagnostics |
Range | Difference between max and min | Measures total spread of financial metrics |
Interquartile Range (IQR) | Spread of middle 50% of data | Identifies budget outliers and extreme accounts |
Skewness | Asymmetry of distribution | Skewed obligations suggest few accounts dominate totals |
Kurtosis | Tail heaviness ("tailedness") of distribution | High values indicate outlier-prone financial data |
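
The statistics in the table above map directly onto pandas methods. The sketch below uses a small made-up series of outlay amounts (the values are illustrative, not real File A data) in which one large account skews the distribution.

```python
import pandas as pd

# Hypothetical outlay amounts for seven accounts; the last one dominates.
outlays = pd.Series(
    [120.0, 135.0, 150.0, 150.0, 160.0, 180.0, 950.0], name="Outlays"
)

q1, q3 = outlays.quantile([0.25, 0.75])
summary = pd.Series({
    "mean": outlays.mean(),
    "median": outlays.median(),
    "mode": outlays.mode().iloc[0],
    "std": outlays.std(),        # sample standard deviation (ddof=1)
    "variance": outlays.var(),   # sample variance (ddof=1)
    "range": outlays.max() - outlays.min(),
    "iqr": q3 - q1,
    "skewness": outlays.skew(),
    "kurtosis": outlays.kurt(),  # excess kurtosis (normal distribution = 0)
})
print(summary.round(2))
```

Here the mean sits well above the median, which is the pattern the table describes for skewed financial data.
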
Metric | Description | Use in Budget Analysis |
---|---|---|
Pearson Correlation | Linear relationship between variables | E.g., TotalResources vs. Obligations |
Spearman Correlation | Monotonic (rank-based) relationship | More robust to non-linear trends in financial execution |
t-test | Compare means between 2 groups | Discretionary vs. Mandatory accounts' execution rates |
ANOVA | Compare means across multiple groups | Obligations across availability periods or account types |
Chi-square Test | Categorical independence | Are Main Account Codes related to availability or a specific agency? |
Confidence Intervals | Estimate range of a population mean | Upper and lower bounds on expected obligations or recoveries |
Regression Coefficients (p-values) | Test variable significance | Are Recoveries a significant predictor of UnobligatedBalance? |
F-statistic (overall regression) | Test whole model fit | Determines the combined influence of all predictors |
Z-score / Outlier Tests | Distance from the mean in standard deviations | Identify abnormal balances or lapse rates |
Boxplots | Visual outlier detection | Discover obligation anomalies within agencies |
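
Several of the tests in the table above are available in `scipy.stats`. The following sketch runs them on synthetic account-level data; the generated values, the `AccountType` and `ExecutionRate` columns, and the z-score threshold of 3 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 200

# Synthetic data; column names follow the examples in the table.
resources = rng.lognormal(mean=10, sigma=1, size=n)
df = pd.DataFrame({
    "TotalResources": resources,
    "Obligations": resources * rng.uniform(0.6, 0.95, size=n),
    "AccountType": rng.choice(["Discretionary", "Mandatory"], size=n),
    "Availability": rng.choice(["Annual", "Multi-Year", "No-Year"], size=n),
})
df["ExecutionRate"] = df["Obligations"] / df["TotalResources"]

# Linear and rank-based correlation.
pearson_r, pearson_p = stats.pearsonr(df["TotalResources"], df["Obligations"])
spearman_rho, spearman_p = stats.spearmanr(df["TotalResources"], df["Obligations"])

# Welch's t-test: execution rates of two account types.
disc = df.loc[df["AccountType"] == "Discretionary", "ExecutionRate"]
mand = df.loc[df["AccountType"] == "Mandatory", "ExecutionRate"]
t_stat, t_p = stats.ttest_ind(disc, mand, equal_var=False)

# One-way ANOVA: execution rates across availability periods.
groups = [g["ExecutionRate"].to_numpy() for _, g in df.groupby("Availability")]
f_stat, anova_p = stats.f_oneway(*groups)

# Chi-square test of independence between two categorical columns.
contingency = pd.crosstab(df["AccountType"], df["Availability"])
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency)

# 95% confidence interval for the mean execution rate.
mean_rate = df["ExecutionRate"].mean()
ci_low, ci_high = stats.t.interval(
    0.95, df=n - 1, loc=mean_rate, scale=stats.sem(df["ExecutionRate"])
)

# Flag observations more than 3 standard deviations from the mean.
z_scores = stats.zscore(df["Obligations"])
outliers = df[np.abs(z_scores) > 3]

print(f"Pearson r={pearson_r:.3f}, Spearman rho={spearman_rho:.3f}")
print(f"t-test p={t_p:.3f}, ANOVA p={anova_p:.3f}, chi-square p={chi2_p:.3f}")
print(f"95% CI for mean execution rate: ({ci_low:.3f}, {ci_high:.3f})")
print(f"Outlier rows flagged: {len(outliers)}")
```
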
Model | Module |
---|---|
Logistic Regression | sklearn.linear_model.LogisticRegression |
SVM | sklearn.svm.SVC |
Decision Tree | sklearn.tree.DecisionTreeClassifier |
Random Forest | sklearn.ensemble.RandomForestClassifier |
XGBoost Classifier | xgboost.XGBClassifier |
K-Nearest Neighbors | sklearn.neighbors.KNeighborsClassifier |
Gaussian Naive Bayes | sklearn.naive_bayes.GaussianNB |
Extra Trees | sklearn.ensemble.ExtraTreesClassifier |
Bagging | sklearn.ensemble.BaggingClassifier |
AdaBoost | sklearn.ensemble.AdaBoostClassifier |
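
One way to benchmark a handful of the classifiers listed above is to keep them in a dictionary and score each with the same cross-validation splits. This is a sketch under assumptions: the data is synthetic, only a subset of the models is shown, and the scaling pipelines are a choice made here rather than something the source specifies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
classifiers = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Mean 5-fold accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in classifiers.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {score:.3f}")
```

Adding another model, including `xgboost.XGBClassifier` if that package is installed, only requires one more dictionary entry.
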
Model | Module |
---|---|
Linear Regression | sklearn.linear_model.LinearRegression |
Ridge Regression | sklearn.linear_model.Ridge |
Lasso Regression | sklearn.linear_model.Lasso |
ElasticNet | sklearn.linear_model.ElasticNet |
Support Vector Regressor | sklearn.svm.SVR |
Decision Tree Regressor | sklearn.tree.DecisionTreeRegressor |
Random Forest Regressor | sklearn.ensemble.RandomForestRegressor |
Gradient Boosting Regressor | sklearn.ensemble.GradientBoostingRegressor |
XGBoost Regressor | xgboost.XGBRegressor |
K-Nearest Neighbors | sklearn.neighbors.KNeighborsRegressor |
AdaBoost Regressor | sklearn.ensemble.AdaBoostRegressor |
Extra Trees Regressor | sklearn.ensemble.ExtraTreesRegressor |
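
The regressors above can be compared in the same dictionary-driven way, reporting R², MAE, and MSE on a held-out split. The synthetic data, the model subset, and the 75/25 split are assumptions made to keep this sketch self-contained.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regressors = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
    "Random Forest Regressor": RandomForestRegressor(
        n_estimators=100, random_state=0
    ),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=0),
}

rows = []
for name, model in regressors.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": name,
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
    })

# Best R^2 first.
metrics = pd.DataFrame(rows).set_index("model").sort_values("r2", ascending=False)
print(metrics.round(3))
```

The residuals `y_test - y_pred` from any of these models are the input for the residual and predicted-vs-actual plots.
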
Package | Description | Link |
---|---|---|
numpy | Numerical computing library | numpy.org |
pandas | Data manipulation and DataFrames | pandas.pydata.org |
matplotlib | Plotting and visualization | matplotlib.org |
seaborn | Statistical data visualization | seaborn.pydata.org |
scikit-learn | ML modeling and metrics | scikit-learn.org |
xgboost | Gradient boosting framework (optional) | xgboost.readthedocs.io |
torch | PyTorch deep learning library | pytorch.org |
tensorflow | End-to-end ML platform | tensorflow.org |
openai | OpenAI’s Python API client | openai-python |
requests | HTTP requests for API and web access | requests.readthedocs.io |
PySimpleGUI | GUI framework for desktop apps | pysimplegui.readthedocs.io |
typing | Type hinting standard library | typing Docs |
pyodbc | ODBC database connector | pyodbc GitHub |
fitz | PDF document parser via PyMuPDF | pymupdf |
pillow | Image processing library | python-pillow.org |
openpyxl | Excel file processing | openpyxl Docs |
soundfile | Read/write sound file formats | pysoundfile |
sounddevice | Audio I/O interface | sounddevice Docs |
loguru | Structured, elegant logging | loguru GitHub |
statsmodels | Statistical tests and regression diagnostics | statsmodels.org |
dotenv | Load environment variables from .env | python-dotenv GitHub |
python-dotenv | Same as above (modern usage) | python-dotenv |
Replace the dataset ingestion cell with:

```python
import pandas as pd

# Load your own data, then split it into features (X) and target (y).
df = pd.read_csv("your_dataset.csv")
X = df.drop(columns="target_column")
y = df["target_column"]
```
- R², MAE, MSE for each model
- Bar plots of performance scores
- Predicted vs. actual scatter plots
- Residual error analysis
Disclaimer: This is for analytical exploration, research, and educational purposes.
This is not an official government product; validate against authoritative sources before use.
Sake is published under the MIT General Public License v3.