is-leeroy-jenkins/Sake

Sake

  • A modular machine learning framework for budget execution and data analysis, built in Python with scikit-learn, XGBoost, PyTorch, and TensorFlow. It is designed for rapid experimentation, visualization, and benchmarking of both classification and regression models, and it provides a structured yet extensible workflow suited to teaching, prototyping, and real-world application development.
Open In Colab

🔬 Data Source

🔄 Unified Evaluation Pipeline

A single interface, `train_and_evaluate()`, that can:

  • Train models
  • Cross-validate with nested k-fold
  • Generate predictions
  • Output evaluation plots & performance metrics
  • Store results & timings for meta-analysis
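The steps listed above can be sketched with scikit-learn alone. The signature, parameter names, and returned keys below are assumptions made for illustration; the README does not show the actual `train_and_evaluate()` definition, so treat this as one possible shape rather than the project's implementation.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split


def train_and_evaluate(model, param_grid, X, y, outer_folds=5, inner_folds=3, seed=0):
    """Hypothetical sketch: nested k-fold CV plus a timed fit/predict on a hold-out split."""
    inner = KFold(n_splits=inner_folds, shuffle=True, random_state=seed)
    outer = KFold(n_splits=outer_folds, shuffle=True, random_state=seed)

    # Inner loop tunes hyperparameters; outer loop scores the tuned estimator.
    search = GridSearchCV(model, param_grid, cv=inner)
    nested_scores = cross_val_score(search, X, y, cv=outer)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )

    start = time.perf_counter()
    search.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = search.predict(X_test)
    predict_seconds = time.perf_counter() - start

    # A plain dict keeps results and timings easy to collect for later comparison.
    return {
        "nested_cv_mean": nested_scores.mean(),
        "nested_cv_std": nested_scores.std(),
        "test_accuracy": accuracy_score(y_test, predictions),
        "fit_seconds": fit_seconds,
        "predict_seconds": predict_seconds,
        "predictions": predictions,
    }


# Synthetic data stands in for a real budget dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
results = train_and_evaluate(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, X, y
)
print({k: v for k, v in results.items() if k != "predictions"})
```

Plotting is left out of this sketch so the function stays small; the returned predictions can be passed to any plotting helper.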

🧪 How to Run

git clone https://github.com/your-username/balance-projector.git
cd balance-projector
pip install -r requirements.txt
jupyter notebook balances.ipynb

🎯 Quickstart

Option A — Google Colab (no local setup)

1. Click the **Open In Colab** badge above.
2. Upload your CSV or mount Google Drive.
3. Set `DATA_PATH` near the top of the notebook.
4. **Runtime → Run all**.

Option B — Local (conda or venv)

# 1) Create environment
conda create -n sake python=3.11 -y
conda activate sake

# 2) Install dependencies
pip install -U pip wheel setuptools
pip install pandas numpy scipy matplotlib seaborn scikit-learn jupyter

# 3) Launch Jupyter
jupyter notebook

Open ipynb/schedule-x.ipynb and run cells top-to-bottom.

📊 Rich Visualization Toolkit

  • Confusion Matrix Heatmaps 🔥
  • ROC & Precision-Recall Curves 📈
  • Actual vs. Predicted Scatterplots 🎯
  • Residual Analysis & Error Distribution 🎭
  • Feature Importance Charts 📊

⏱️ Timing & Benchmarking

  • Automatically logs fit and predict durations
  • Model performance rankings across tasks
  • Output available in tabular format for export
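A minimal sketch of this kind of benchmark, assuming `time.perf_counter()` for timing and a pandas DataFrame for the tabular output. The column names and the choice to rank by R² are illustrative assumptions, not the project's actual schema.

```python
import time

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rows.append(
        {
            "model": name,
            "fit_seconds": fit_seconds,
            "predict_seconds": predict_seconds,
            "r2": model.score(X_test, y_test),
        }
    )

# Rank models by test-set R², best first.
timings = pd.DataFrame(rows).sort_values("r2", ascending=False).reset_index(drop=True)
timings["rank"] = timings.index + 1
print(timings)
# timings.to_csv("timings.csv", index=False)  # optional export
```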

💡 Custom Dataset Support

  • Accepts CSVs, Excel files, or Pandas DataFrames
  • Label encoding, numeric coercion, missing data handling
  • Drop-in replacement for datasets via parameter injection
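The preprocessing steps above could look roughly like the helper below. `load_dataset` and the column names in the example are hypothetical; the sketch assumes median imputation for missing values and label encoding for any column that cannot be read as numeric, which may differ from what the notebooks actually do.

```python
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_dataset(source, target):
    """Hypothetical helper: accept a DataFrame or a CSV/Excel path, return (X, y)."""
    if isinstance(source, pd.DataFrame):
        df = source.copy()
    elif Path(source).suffix.lower() in {".xlsx", ".xls"}:
        df = pd.read_excel(source)  # .xlsx files require openpyxl
    else:
        df = pd.read_csv(source)

    X = df.drop(columns=target)
    y = df[target]

    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            continue
        coerced = pd.to_numeric(X[col], errors="coerce")
        if coerced.notna().sum() == X[col].notna().sum():
            # Every non-missing value parsed as a number: keep the numeric version.
            X[col] = coerced
        else:
            # Otherwise treat the column as categorical and label-encode it.
            X[col] = LabelEncoder().fit_transform(X[col].astype(str))

    # Assumption: fill remaining gaps with each column's median.
    X = X.fillna(X.median())
    return X, y


# Made-up rows; the column names are illustrative only.
raw = pd.DataFrame(
    {
        "Agency": ["A", "B", "A", "C"],
        "Obligations": ["100", "250", None, "75"],
        "Outlays": [90.0, 200.0, 50.0, None],
        "Lapsed": [0, 1, 0, 1],
    }
)
X, y = load_dataset(raw, target="Lapsed")
print(X)
```

Note that `astype(str)` turns a missing category into its own label, which is a simplification worth revisiting for real data.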

🧪 Research Ready

  • Benchmark dozens of models easily
  • Plug-in architecture for testing experimental models
  • Use in classrooms to demo interpretability, overfitting, and variance
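The README does not describe how the plug-in mechanism works. One plausible reading, assumed here, is that any object following the scikit-learn estimator interface (`fit`/`predict`) can be benchmarked alongside the built-in models. `MedianRegressor` below is a hypothetical experimental model written to that interface.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score


class MedianRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical experimental model: always predicts the training median."""

    def fit(self, X, y):
        self.median_ = float(np.median(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.median_)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)

# Because it follows the estimator interface, it works with standard tooling.
scores = cross_val_score(MedianRegressor(), X, y, cv=5)  # R² per fold
print(scores)
```

A constant predictor like this is also a handy classroom baseline: any real model should beat its R², which can never exceed zero.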

📊 Descriptive Statistics

| Statistic | Description | Use in Budget Analysis |
|---|---|---|
| Mean | Average value | Avg. Outlays, Obligations, etc., across accounts |
| Median | Middle value | Robust central tendency in skewed financial data |
| Mode | Most frequent value | Identify common MainAccountCodes or Availability categories |
| Standard Deviation | Spread around the mean | Indicates variability in execution rates or balances |
| Variance | Square of standard deviation | Used in statistical tests and model diagnostics |
| Range | Difference between max and min | Measures total spread of financial metrics |
| Interquartile Range (IQR) | Spread of middle 50% of data | Identifies budget outliers and extreme accounts |
| Skewness | Asymmetry of distribution | Skewed obligations suggest few accounts dominate totals |
| Kurtosis | "Peakedness" of distribution | High values indicate outlier-prone financial data |
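Each statistic in the table maps to a single pandas call. The sketch below uses a handful of made-up obligation amounts; the values are illustrative and not drawn from any real account.

```python
import pandas as pd

# Made-up obligation amounts for illustration only.
obligations = pd.Series([120.0, 80.0, 95.0, 110.0, 80.0, 400.0], name="Obligations")

q1, q3 = obligations.quantile([0.25, 0.75])

summary = pd.Series(
    {
        "mean": obligations.mean(),
        "median": obligations.median(),
        "mode": obligations.mode().iloc[0],
        "std": obligations.std(),  # sample standard deviation (ddof=1)
        "variance": obligations.var(),  # sample variance (ddof=1)
        "range": obligations.max() - obligations.min(),
        "iqr": q3 - q1,
        "skewness": obligations.skew(),
        "kurtosis": obligations.kurt(),  # excess kurtosis (normal distribution = 0)
    }
)
print(summary.round(2))
```

The single large value (400) pulls the mean well above the median, which is the pattern the table describes for skewed financial data.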

🔍 Inferential Statistics

| Metric | Description | Use in Budget Analysis |
|---|---|---|
| Pearson Correlation | Linear relationship between variables | E.g., TotalResources vs. Obligations |
| Spearman Correlation | Monotonic (rank-based) relationship | More robust to non-linear trends in financial execution |
| t-test | Compare means between 2 groups | Discretionary vs. Mandatory accounts' execution rates |
| ANOVA | Compare means across multiple groups | Obligations across availability periods or account types |
| Chi-square Test | Categorical independence | Are Main Account Codes related to availability or a specific agency? |
| Confidence Intervals | Estimate range of a population mean | Upper and lower bounds on expected obligations or recoveries |
| Regression Coefficients (p-values) | Test variable significance | Are Recoveries a significant predictor of UnobligatedBalance? |
| F-statistic (overall regression) | Test whole model fit | Determines the combined influence of all predictors |
| Z-score / Outlier Tests | Deviation from the mean in standard-deviation units | Identify abnormal balances or lapse rates |
| Boxplots | Visual outlier detection | Discover obligation anomalies within agencies |
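Several of these tests can be run with `scipy.stats`, as sketched below. The data are randomly generated, and the `AccountType` and `Availability` columns are illustrative names chosen to match the table's examples; they are not taken from the project's actual schema.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Randomly generated stand-in data; column names are illustrative.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame(
    {
        "TotalResources": rng.gamma(shape=2.0, scale=500.0, size=n),
        "AccountType": rng.choice(["Discretionary", "Mandatory"], size=n),
        "Availability": rng.choice(["Annual", "Multi-year", "No-year"], size=n),
    }
)
data["Obligations"] = 0.8 * data["TotalResources"] + rng.normal(0, 50, size=n)

# Correlation: linear (Pearson) and rank-based (Spearman).
pearson_r, pearson_p = stats.pearsonr(data["TotalResources"], data["Obligations"])
spearman_rho, spearman_p = stats.spearmanr(data["TotalResources"], data["Obligations"])

# Two-group comparison of means (Welch's t-test, unequal variances).
disc = data.loc[data["AccountType"] == "Discretionary", "Obligations"]
mand = data.loc[data["AccountType"] == "Mandatory", "Obligations"]
t_stat, t_p = stats.ttest_ind(disc, mand, equal_var=False)

# One-way ANOVA across availability periods.
groups = [g["Obligations"].to_numpy() for _, g in data.groupby("Availability")]
f_stat, anova_p = stats.f_oneway(*groups)

# Chi-square test of independence between two categorical columns.
table = pd.crosstab(data["AccountType"], data["Availability"])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

# 95% confidence interval for the mean of Obligations (t distribution).
mean = data["Obligations"].mean()
sem = stats.sem(data["Obligations"])
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

print(f"Pearson r={pearson_r:.3f}, Spearman rho={spearman_rho:.3f}")
print(f"t-test p={t_p:.3f}, ANOVA p={anova_p:.3f}, chi-square p={chi_p:.3f}")
print(f"95% CI for mean Obligations: ({ci_low:.1f}, {ci_high:.1f})")
```

Regression coefficient p-values and the overall F-statistic are not covered here; `statsmodels`, listed under Dependencies, reports both in its OLS summary.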

✅ Classification:

| Model | Module |
|---|---|
| Logistic Regression | `sklearn.linear_model.LogisticRegression` |
| SVM | `sklearn.svm.SVC` |
| Decision Tree | `sklearn.tree.DecisionTreeClassifier` |
| Random Forest | `sklearn.ensemble.RandomForestClassifier` |
| XGBoost Classifier | `xgboost.XGBClassifier` |
| K-Nearest Neighbors | `sklearn.neighbors.KNeighborsClassifier` |
| Gaussian Naive Bayes | `sklearn.naive_bayes.GaussianNB` |
| Extra Trees | `sklearn.ensemble.ExtraTreesClassifier` |
| Bagging | `sklearn.ensemble.BaggingClassifier` |
| AdaBoost | `sklearn.ensemble.AdaBoostClassifier` |
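One way to work with this list is a dictionary that maps display names to estimator instances, so every model can be scored in a single loop. The sketch below is an assumption about how such a registry might look, run on synthetic data; the hyperparameters are arbitrary, and XGBoost is left as a comment so the example runs without that optional package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry: display name -> unfitted estimator.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
# With xgboost installed, it can be added the same way:
# classifiers["XGBoost Classifier"] = xgboost.XGBClassifier()

# Synthetic data for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Mean 5-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in classifiers.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {score:.3f}")
```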

📉 Regression:

| Model | Module |
|---|---|
| Linear Regression | `sklearn.linear_model.LinearRegression` |
| Ridge Regression | `sklearn.linear_model.Ridge` |
| Lasso Regression | `sklearn.linear_model.Lasso` |
| ElasticNet | `sklearn.linear_model.ElasticNet` |
| Support Vector Regressor | `sklearn.svm.SVR` |
| Decision Tree Regressor | `sklearn.tree.DecisionTreeRegressor` |
| Random Forest Regressor | `sklearn.ensemble.RandomForestRegressor` |
| Gradient Boosting Regressor | `sklearn.ensemble.GradientBoostingRegressor` |
| XGBoost Regressor | `xgboost.XGBRegressor` |
| K-Nearest Neighbors | `sklearn.neighbors.KNeighborsRegressor` |
| AdaBoost Regressor | `sklearn.ensemble.AdaBoostRegressor` |
| Extra Trees Regressor | `sklearn.ensemble.ExtraTreesRegressor` |

📦 Dependencies

| Package | Description | Link |
|---|---|---|
| numpy | Numerical computing library | numpy.org |
| pandas | Data manipulation and DataFrames | pandas.pydata.org |
| matplotlib | Plotting and visualization | matplotlib.org |
| seaborn | Statistical data visualization | seaborn.pydata.org |
| scikit-learn | ML modeling and metrics | scikit-learn.org |
| xgboost | Gradient boosting framework (optional) | xgboost.readthedocs.io |
| torch | PyTorch deep learning library | pytorch.org |
| tensorflow | End-to-end ML platform | tensorflow.org |
| openai | OpenAI's Python API client | openai-python |
| requests | HTTP requests for API and web access | requests.readthedocs.io |
| PySimpleGUI | GUI framework for desktop apps | pysimplegui.readthedocs.io |
| typing | Type hints (part of the Python standard library) | typing Docs |
| pyodbc | ODBC database connector | pyodbc GitHub |
| fitz | PDF document parsing (import name of PyMuPDF) | pymupdf |
| pillow | Image processing library | python-pillow.org |
| openpyxl | Excel file processing | openpyxl Docs |
| soundfile | Read/write sound file formats | pysoundfile |
| sounddevice | Audio I/O interface | sounddevice Docs |
| loguru | Structured, elegant logging | loguru GitHub |
| statsmodels | Statistical tests and regression diagnostics | statsmodels.org |
| dotenv | Load environment variables from .env (import name of python-dotenv) | python-dotenv GitHub |
| python-dotenv | Package that provides the dotenv module | python-dotenv |

📁 Customize Dataset

Replace the dataset ingestion cell with:

import pandas as pd
df = pd.read_csv("your_dataset.csv")
X = df.drop("target_column", axis=1)
y = df["target_column"]

📊 Outputs

  • R², MAE, MSE for each model
  • Bar plots of performance scores
  • Visual predicted vs. actual scatter charts
  • Residual error analysis
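The outputs listed above can be reproduced for a single model roughly as follows. This is a sketch on synthetic data using a random forest as a stand-in; the variable names and plot layout are assumptions, not the notebook's actual code.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred

metrics = {
    "R2": r2_score(y_test, y_pred),
    "MAE": mean_absolute_error(y_test, y_pred),
    "MSE": mean_squared_error(y_test, y_pred),
}
print(metrics)

fig, (ax_scatter, ax_resid) = plt.subplots(1, 2, figsize=(10, 4))

# Predicted vs. actual, with a dashed line marking perfect predictions.
ax_scatter.scatter(y_test, y_pred, alpha=0.6)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax_scatter.plot(lims, lims, linestyle="--", color="gray")
ax_scatter.set_xlabel("Actual")
ax_scatter.set_ylabel("Predicted")

# Residual distribution; a roughly symmetric shape around zero is the goal.
ax_resid.hist(residuals, bins=20)
ax_resid.set_xlabel("Residual (actual - predicted)")

fig.tight_layout()
```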

Disclaimer: This is for analytical exploration, research, and educational purposes.
This is not an official government product; validate against authoritative sources before use.

📝 License

Sake is published under the MIT General Public License v3.

