This is a personal MLOps project based on a Kaggle dataset for credit default prediction.
It was developed as part of the End-to-end MLOps with Databricks course.
Feel free to ⭐ and clone this repo 😉
The project has been structured with the following folders and files:
```
.
├── .github/workflows          # CI/CD configuration files
│   ├── cd.yml
│   └── ci.yml
├── data                       # raw data
│   └── data.csv
├── notebooks                  # notebooks for various stages of the project
│   ├── create_source_data     # notebook for generating synthetic data
│   │   └── create_source_data_notebook.py
│   ├── feature_engineering    # feature engineering and MLflow experiments
│   │   ├── basic_mlflow_experiment_notebook.py
│   │   ├── combined_mlflow_experiment_notebook.py
│   │   ├── custom_mlflow_experiment_notebook.py
│   │   └── prepare_data_notebook.py
│   ├── model_feature_serving  # notebooks for serving models and features
│   │   ├── AB_test_model_serving_notebbok.py
│   │   ├── feature_serving_notebook.py
│   │   ├── model_serving_feat_lookup_notebook.py
│   │   └── model_serving_notebook.py
│   └── monitoring             # monitoring and alerts setup
│       ├── create_alert.py
│       ├── create_inference_data.py
│       ├── lakehouse_monitoring.py
│       └── send_request_to_endpoint.py
├── src/credit_default         # source code for the project
│   ├── data_cleaning.py
│   ├── data_cleaning_spark.py
│   ├── data_preprocessing.py
│   ├── data_preprocessing_spark.py
│   └── utils.py
├── tests                      # unit tests for the project
│   ├── test_data_cleaning.py
│   └── test_data_preprocessor.py
├── workflows                  # workflows for the Databricks asset bundle
│   ├── deploy_model.py
│   ├── evaluate_model.py
│   ├── preprocess.py
│   ├── refresh_monitor.py
│   └── train_model.py
├── .pre-commit-config.yaml    # configuration for pre-commit hooks
├── Makefile                   # helper commands for installing requirements, formatting, testing, linting, and cleaning
├── project_config.yml         # configuration settings for the project
├── databricks.yml             # Databricks asset bundle configuration
└── bundle_monitoring.yml      # monitoring settings for the Databricks asset bundle
```
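With the `tests` folder and pre-commit hooks in place, a quick local check might look like the sketch below. This assumes `pytest` is the test runner (the `test_*.py` naming suggests it, but the Makefile is the authoritative entry point for these tasks):

```bash
# Run the unit tests from the repo root
# (assumes pytest is installed via the project's dev dependencies)
pytest tests/

# Run all hooks defined in .pre-commit-config.yaml
pre-commit run --all-files
```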
The Python version used for this project is 3.11.
- Clone the repo:

```bash
git clone https://github.com/satel33/mlops-databricks-credit-default.git
```

- Create the virtual environment using `uv` with Python version 3.11 and install the requirements:

```bash
uv venv -p 3.11.0 .venv
source .venv/bin/activate
uv pip install -r pyproject.toml --all-extras
uv lock
```
- Build the wheel package:

```bash
# Build
uv build
```
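If the build succeeds, the wheel ends up in `dist/`. A quick way to confirm the artifact and its exact filename (the `0.0.1` version used in the volume-copy step later comes from this filename):

```bash
# List the build artifacts; the wheel filename encodes the version from pyproject.toml
ls dist/
```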
- Install the Databricks extension for VS Code and Databricks CLI:

```bash
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```
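To confirm the CLI landed on your `PATH`, you can check the version (the exact output varies by release):

```bash
# Verify the Databricks CLI installation
databricks --version
```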
- Authenticate on Databricks:

```bash
# Authentication
databricks auth login --configure-cluster --host <workspace-url>

# Profiles
databricks auth profiles
cat ~/.databrickscfg
```
After you enter your information, the CLI prompts you to save it as a Databricks configuration profile in `~/.databrickscfg`.
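For orientation, the stored profile looks roughly like the sketch below; the values are placeholders, and the exact keys depend on the options you passed to `databricks auth login`:

```bash
cat ~/.databrickscfg
# [DEFAULT]
# host       = https://<workspace-url>
# auth_type  = databricks-cli
# cluster_id = <cluster-id>   # present when --configure-cluster was used
```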
Once the project is set up, you need to create the volumes that will store the data and the wheel package you will later install on the cluster:
- catalog name: `credit`
- schema name: `default`
- volume names: `data` and `packages`
```bash
# Create volumes
databricks volumes create credit default data MANAGED
databricks volumes create credit default packages MANAGED

# Push to volumes
databricks fs cp data/data.csv dbfs:/Volumes/credit/default/data/data.csv
databricks fs cp dist/credit_default_databricks-0.0.1-py3-none-any.whl dbfs:/Volumes/credit/default/packages

# Show volumes
databricks fs ls dbfs:/Volumes/credit/default/data
databricks fs ls dbfs:/Volumes/credit/default/packages
```
Some project files require a Databricks authentication token. This token allows secure access to Databricks resources and APIs:
- Create a token in the Databricks UI:
  - Navigate to Settings → User → Developer → Access tokens
  - Generate a new personal access token
- Create a secret scope for securely storing the token:
```bash
# Create scope
databricks secrets create-scope secret-scope

# Add the secret when prompted after running the command
databricks secrets put-secret secret-scope databricks-token

# List secrets
databricks secrets list-secrets secret-scope
```
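To verify the secret landed in the scope, you can read it back from the CLI; a sketch, assuming a recent CLI version that provides `get-secret`. Inside Databricks itself, notebooks and jobs would read it with `dbutils.secrets.get(scope="secret-scope", key="databricks-token")` instead of handling a plaintext token.

```bash
# Read the secret back (the value is returned base64-encoded;
# requires READ permission on the scope)
databricks secrets get-secret secret-scope databricks-token
```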