GitHub - is-leeroy-jenkins/Sake: A modular machine learning framework for budget execution & data analysis built in Python with Scikit, XGBoost, PyTorch, and TensorFlow.

Sake

Sake is a modular machine learning framework for budget execution and data analysis, built in Python with scikit-learn, XGBoost, PyTorch, and TensorFlow. Designed for rapid experimentation, visualization, and benchmarking of both classification and regression models, it provides a structured yet extensible workflow suited to teaching, prototyping, and real-world application development.

🔬 Data Source

File A (Account Balances) is published monthly by agencies on USAspending.
It is required by the DATA Act.
It is pulled automatically from data in the Governmentwide Treasury Account Symbol Adjusted Trial Balance System (GTAS).
It contains budgetary resource, obligation, and outlay data for all relevant Treasury Account Symbols (TAS) in a reporting agency.
It includes both award and non-award spending (grouped together) and crosswalks with the SF 133 report.

🔄 Unified Evaluation Pipeline

A single interface `train_and_evaluate()` to:

Train models
Cross-validate with nested k-fold
Generate predictions
Output evaluation plots & performance metrics
Store results & timings for meta-analysis
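
The README does not show the signature of `train_and_evaluate()`, so the following is only a minimal sketch of how the steps above could fit together. The parameter names, the returned dictionary, and the choice of `GridSearchCV` inside `cross_val_score` for the nested k-fold step are assumptions made for illustration, not the project's actual implementation.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    cross_val_predict,
    cross_val_score,
)


def train_and_evaluate(model, param_grid, X, y, inner_splits=3, outer_splits=5, seed=0):
    """Hypothetical signature: nested k-fold evaluation of one regressor."""
    inner = KFold(n_splits=inner_splits, shuffle=True, random_state=seed)
    outer = KFold(n_splits=outer_splits, shuffle=True, random_state=seed)

    # Inner loop: hyperparameter search. Outer loop: unbiased score estimate.
    search = GridSearchCV(model, param_grid, cv=inner, scoring="r2")

    start = time.perf_counter()
    outer_scores = cross_val_score(search, X, y, cv=outer, scoring="r2")
    predictions = cross_val_predict(search, X, y, cv=outer)
    elapsed = time.perf_counter() - start

    return {
        "model": type(model).__name__,
        "outer_r2_mean": float(outer_scores.mean()),
        "outer_r2_std": float(outer_scores.std()),
        "n_outer_folds": len(outer_scores),
        "predictions": predictions,
        "seconds": elapsed,
    }


# Synthetic data stands in for the budget dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
result = train_and_evaluate(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, X, y)
print(result["model"], round(result["outer_r2_mean"], 3), f"{result['seconds']:.2f}s")
```

Returning a plain dictionary keeps the results easy to collect into a DataFrame for later comparison across models.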

🧪 How to Run

git clone https://github.com/your-username/balance-projector.git
cd balance-projector
pip install -r requirements.txt
jupyter notebook balances.ipynb

🎯 Quickstart

Option A — Google Colab (no local setup)

1. Click the **Open In Colab** badge above.
2. Upload your CSV or mount Google Drive.
3. Set `DATA_PATH` near the top of the notebook.
4. **Runtime → Run all**.

Option B — Local (conda or venv)

```bash
# 1) Create environment
conda create -n sake python=3.11 -y
conda activate sake

# 2) Install dependencies
pip install -U pip wheel setuptools
pip install pandas numpy scipy matplotlib seaborn scikit-learn jupyter

# 3) Launch Jupyter
jupyter notebook
```

Open ipynb/schedule-x.ipynb and run cells top-to-bottom.

📊 Rich Visualization Toolkit

Confusion Matrix Heatmaps 🔥
ROC & Precision-Recall Curves 📈
Actual vs. Predicted Scatterplots 🎯
Residual Analysis & Error Distribution 🎭
Feature Importance Charts 📊
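
As an illustration of two of the plots listed above, here is a small sketch of an actual vs. predicted scatterplot next to a residual histogram. The data is synthetic and the variable names are hypothetical; the notebook's own plotting helpers may look quite different.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for model output.
rng = np.random.default_rng(0)
actual = rng.uniform(0, 100, size=80)
predicted = actual + rng.normal(0, 8, size=80)
residuals = actual - predicted

fig, (ax_scatter, ax_resid) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs. predicted, with the y = x reference line for a perfect model.
ax_scatter.scatter(actual, predicted, alpha=0.7)
limits = [actual.min(), actual.max()]
ax_scatter.plot(limits, limits, linestyle="--", color="gray")
ax_scatter.set_xlabel("Actual")
ax_scatter.set_ylabel("Predicted")
ax_scatter.set_title("Actual vs. Predicted")

# Residual distribution: ideally centered on zero and roughly symmetric.
ax_resid.hist(residuals, bins=15)
ax_resid.set_xlabel("Residual (actual - predicted)")
ax_resid.set_title("Error Distribution")

fig.tight_layout()
```

In a notebook, the figure renders inline when the cell finishes; in a script, call `fig.savefig(...)` with a path of your choice.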

⏱️ Timing & Benchmarking

Automatically logs fit and predict durations
Model performance rankings across tasks
Output available in tabular format for export
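
One way such timing logs could be collected is sketched below. The helper name `time_model` and the column names are assumptions for illustration; the sketch simply wraps `fit` and `predict` with `time.perf_counter()` and gathers the results in a DataFrame that can be exported.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def time_model(name, model, X_train, y_train, X_test):
    """Hypothetical helper: return fit and predict durations in seconds."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - t0

    t0 = time.perf_counter()
    model.predict(X_test)
    predict_seconds = time.perf_counter() - t0

    return {"model": name, "fit_seconds": fit_seconds, "predict_seconds": predict_seconds}


X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train = X[:240], X[240:], y[:240]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

timings = pd.DataFrame(
    [time_model(name, model, X_train, y_train, X_test) for name, model in models.items()]
)
print(timings.sort_values("fit_seconds").to_string(index=False))
```

Because the result is an ordinary DataFrame, `timings.to_csv(...)` or `timings.to_excel(...)` covers the export step.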

💡 Custom Dataset Support

Accepts CSVs, Excel files, or Pandas DataFrames
Label encoding, numeric coercion, missing data handling
Drop-in replacement for datasets via parameter injection
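
The preprocessing steps named above (label encoding, numeric coercion, missing data handling) might look roughly like the sketch below. The function `prepare_frame`, the 50% threshold for treating a column as numeric, and the sample column values are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def prepare_frame(df, target):
    """Hypothetical helper: coerce, impute, and encode feature columns."""
    df = df.copy()
    for col in df.columns:
        if col == target:
            continue
        coerced = pd.to_numeric(df[col], errors="coerce")
        if coerced.notna().mean() >= 0.5:
            # Mostly numeric: keep the coerced values, fill gaps with the median.
            df[col] = coerced.fillna(coerced.median())
        else:
            # Mostly text: mark gaps explicitly, then label-encode.
            filled = df[col].fillna("missing").astype(str)
            df[col] = LabelEncoder().fit_transform(filled)
    return df.drop(columns=[target]), df[target]


raw = pd.DataFrame(
    {
        "Obligations": ["100", "250", None, "bad"],
        "Availability": ["Annual", "Multi-Year", None, "Annual"],
        "Outlays": [90.0, 200.0, 30.0, 10.0],
    }
)

X, y = prepare_frame(raw, target="Outlays")
print(X)
```

Median imputation is used here only because it is robust to the skew typical of financial data; other strategies may suit a given dataset better.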

🧪 Research Ready

Benchmark dozens of models easily
Plug-in architecture for testing experimental models
Use in classrooms to demo interpretability, overfitting, and variance

📊 Descriptive Statistics

| Statistic | Description | Use in Budget Analysis |
| --- | --- | --- |
| Mean | Average value | Avg. Outlays, Obligations, etc., across accounts |
| Median | Middle value | Robust central tendency in skewed financial data |
| Mode | Most frequent value | Identify common MainAccountCodes or Availability categories |
| Standard Deviation | Spread around the mean | Indicates variability in execution rates or balances |
| Variance | Square of standard deviation | Used in statistical tests and model diagnostics |
| Range | Difference between max and min | Measures total spread of financial metrics |
| Interquartile Range (IQR) | Spread of middle 50% of data | Identifies budget outliers and extreme accounts |
| Skewness | Asymmetry of distribution | Skewed obligations suggest few accounts dominate totals |
| Kurtosis | Tail heaviness of distribution (often called "peakedness") | High values indicate outlier-prone financial data |
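
For reference, each statistic in the table can be computed with pandas as in the short sketch below. The `obligations` values are made up for illustration; note that `Series.kurt()` reports excess kurtosis, so a normal distribution scores 0.

```python
import pandas as pd

# Made-up obligation amounts with one dominant account.
obligations = pd.Series([10.0, 12.0, 12.0, 15.0, 18.0, 95.0], name="Obligations")

q1, q3 = obligations.quantile([0.25, 0.75])

summary = {
    "mean": obligations.mean(),
    "median": obligations.median(),
    "mode": obligations.mode().iloc[0],
    "std": obligations.std(),
    "variance": obligations.var(),
    "range": obligations.max() - obligations.min(),
    "iqr": q3 - q1,
    "skewness": obligations.skew(),
    "kurtosis": obligations.kurt(),  # excess kurtosis (normal = 0)
}

for name, value in summary.items():
    print(f"{name:<10} {value:10.3f}")
```

The single large value pulls the mean well above the median and produces positive skewness, which is the pattern the table describes for accounts that dominate totals.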

🔍 Inferential Statistics

| Metric | Description | Use in Budget Analysis |
| --- | --- | --- |
| Pearson Correlation | Linear relationship between variables | E.g., TotalResources vs. Obligations |
| Spearman Correlation | Monotonic (rank-based) relationship | More robust to non-linear trends in financial execution |
| t-test | Compare means between 2 groups | Discretionary vs. Mandatory accounts' execution rates |
| ANOVA | Compare means across multiple groups | Obligations across availability periods or account types |
| Chi-square Test | Categorical independence | Are Main Account Codes related to availability or a specific agency? |
| Confidence Intervals | Estimate range of a population mean | Upper and lower bounds on expected obligations or recoveries |
| Regression Coefficients (p-values) | Test variable significance | Are Recoveries a significant predictor of UnobligatedBalance? |
| F-statistic (overall regression) | Test whole model fit | Determines the combined influence of all predictors |
| Z-score / Outlier Tests | Distance from the mean in standard deviations | Identify abnormal balances or lapse rates |
| Boxplots | Visual outlier detection | Discover obligation anomalies within agencies |
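
A few of the tests in the table can be run with SciPy as sketched below. The data is randomly generated, and the variable names only echo the examples in the table; they are not columns confirmed to exist in the project's dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: obligations track total resources with some noise.
total_resources = rng.uniform(100, 1000, size=60)
obligations = 0.8 * total_resources + rng.normal(0, 20, size=60)

# Linear and rank-based correlation.
r, r_p = stats.pearsonr(total_resources, obligations)
rho, rho_p = stats.spearmanr(total_resources, obligations)

# Welch's t-test on execution rates of two account groups.
discretionary = rng.normal(0.85, 0.05, size=30)
mandatory = rng.normal(0.95, 0.05, size=30)
t_stat, t_p = stats.ttest_ind(discretionary, mandatory, equal_var=False)

# 95% confidence interval for mean obligations (Student's t).
mean = obligations.mean()
sem = stats.sem(obligations)
ci_low, ci_high = stats.t.interval(0.95, df=len(obligations) - 1, loc=mean, scale=sem)

print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}")
print(f"t={t_stat:.2f}, p={t_p:.4f}")
print(f"95% CI for mean obligations: [{ci_low:.1f}, {ci_high:.1f}]")
```

Welch's variant (`equal_var=False`) is used because budget groups rarely share the same variance; the classic pooled t-test is the default if that argument is omitted.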

✅ Classification:

| Model | Module |
| --- | --- |
| Logistic Regression | `sklearn.linear_model.LogisticRegression` |
| SVM | `sklearn.svm.SVC` |
| Decision Tree | `sklearn.tree.DecisionTreeClassifier` |
| Random Forest | `sklearn.ensemble.RandomForestClassifier` |
| XGBoost Classifier | `xgboost.XGBClassifier` |
| K-Nearest Neighbors | `sklearn.neighbors.KNeighborsClassifier` |
| Gaussian Naive Bayes | `sklearn.naive_bayes.GaussianNB` |
| Extra Trees | `sklearn.ensemble.ExtraTreesClassifier` |
| Bagging | `sklearn.ensemble.BaggingClassifier` |
| AdaBoost | `sklearn.ensemble.AdaBoostClassifier` |
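
To show how several of these classifiers could be benchmarked side by side, here is a small sketch that keeps them in a dictionary and scores each with 5-fold cross-validation. The registry layout and the synthetic dataset are assumptions for illustration; only a subset of the table is included to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry: display name -> unfitted estimator.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
}

X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Mean accuracy over 5 folds for each model.
scores = {
    name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}

for name, accuracy in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {accuracy:.3f}")
```

Adding another model, such as `xgboost.XGBClassifier` when that package is installed, only requires one more entry in the dictionary.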

📉 Regression:

| Model | Module |
| --- | --- |
| Linear Regression | `sklearn.linear_model.LinearRegression` |
| Ridge Regression | `sklearn.linear_model.Ridge` |
| Lasso Regression | `sklearn.linear_model.Lasso` |
| ElasticNet | `sklearn.linear_model.ElasticNet` |
| Support Vector Regressor | `sklearn.svm.SVR` |
| Decision Tree Regressor | `sklearn.tree.DecisionTreeRegressor` |
| Random Forest Regressor | `sklearn.ensemble.RandomForestRegressor` |
| Gradient Boosting Regressor | `sklearn.ensemble.GradientBoostingRegressor` |
| XGBoost Regressor | `xgboost.XGBRegressor` |
| K-Nearest Neighbors | `sklearn.neighbors.KNeighborsRegressor` |
| AdaBoost Regressor | `sklearn.ensemble.AdaBoostRegressor` |
| Extra Trees Regressor | `sklearn.ensemble.ExtraTreesRegressor` |

📦 Dependencies

| Package | Description | Link |
| --- | --- | --- |
| numpy | Numerical computing library | numpy.org |
| pandas | Data manipulation and DataFrames | pandas.pydata.org |
| matplotlib | Plotting and visualization | matplotlib.org |
| seaborn | Statistical data visualization | seaborn.pydata.org |
| scikit-learn | ML modeling and metrics | scikit-learn.org |
| xgboost | Gradient boosting framework (optional) | xgboost.readthedocs.io |
| torch | PyTorch deep learning library | pytorch.org |
| tensorflow | End-to-end ML platform | tensorflow.org |
| openai | OpenAI's Python API client | openai-python |
| requests | HTTP requests for API and web access | requests.readthedocs.io |
| PySimpleGUI | GUI framework for desktop apps | pysimplegui.readthedocs.io |
| typing | Type hinting (standard library) | typing Docs |
| pyodbc | ODBC database connector | pyodbc GitHub |
| fitz | PDF document parser via PyMuPDF | pymupdf |
| pillow | Image processing library | python-pillow.org |
| openpyxl | Excel file processing | openpyxl Docs |
| soundfile | Read/write sound file formats | pysoundfile |
| sounddevice | Audio I/O interface | sounddevice Docs |
| loguru | Structured, elegant logging | loguru GitHub |
| statsmodels | Statistical tests and regression diagnostics | statsmodels.org |
| dotenv | Load environment variables from `.env` | python-dotenv GitHub |
| python-dotenv | Same as above (the installable package, imported as `dotenv`) | python-dotenv |

📁 Customize Dataset

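Replace the dataset ingestion cell with:

import pandas as pd

# Load your own data; swap in your file name and target column
df = pd.read_csv("your_dataset.csv")
X = df.drop(columns=["target_column"])  # feature matrix
y = df["target_column"]                 # prediction target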

📊 Outputs

R², MAE, MSE for each model
Bar plots of performance scores
Visual predicted vs. actual scatter charts
Residual error analysis
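
The per-model metrics listed above can be produced with scikit-learn's metric functions, as in the sketch below. The synthetic data, the three regressors chosen, and the `report` column names are assumptions for illustration rather than the notebook's actual output format.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

regressors = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

rows = []
for name, regressor in regressors.items():
    predicted = regressor.fit(X_train, y_train).predict(X_test)
    rows.append(
        {
            "model": name,
            "r2": r2_score(y_test, predicted),
            "mae": mean_absolute_error(y_test, predicted),
            "mse": mean_squared_error(y_test, predicted),
        }
    )

# Rank models from best to worst R² on the held-out split.
report = pd.DataFrame(rows).sort_values("r2", ascending=False).reset_index(drop=True)
print(report.to_string(index=False))
```

Calling `report.plot.bar(x="model", y="r2")` on this DataFrame would give a bar plot of performance scores.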

Disclaimer: This is for analytical exploration, research, and education purposes.
This is not an official government product; validate against authoritative sources before use.

📝 License

Sake is published under the MIT General Public License v3.

