Customer Segmentation Data Pipeline


🎯 Project Overview

A comprehensive end-to-end data pipeline for customer segmentation analysis using machine learning techniques. This project implements multiple pipeline approaches for data processing, database modeling, and automated ETL workflows. The pipeline processes raw customer data through multiple stages of validation, cleaning, transformation, and modeling to create both analytical datasets and a structured database schema.

📊 Dataset

  • Source: Customer churn dataset (Customer_churn4.csv)
  • Records: 7,043 customer records
  • Features: 21 attributes including demographics, services, and billing information
  • Target: Customer churn prediction and segmentation

πŸ—οΈ Architecture

Customer Segmentation Data Pipeline/
├── data/
│   ├── raw/                    # Original datasets
│   ├── processed/              # Cleaned datasets
│   └── fact_dim_tables/        # Dimensional model tables
├── database/                   # SQLite database files
├── notebooks/                  # Jupyter analysis notebooks
├── etl/                        # Automated ETL pipeline
├── scripts/                    # Utility scripts
└── documentation/              # Additional documentation

🔄 Pipeline Approaches

1. Interactive Analysis Pipeline (Jupyter Notebook)

  • Location: notebooks/Customer_Segmentation_Data_Pipeline.ipynb
  • Purpose: Data exploration, validation, and manual processing
  • Output: Processed datasets and fact/dimension tables

2. Database Modeling Pipeline

  • Purpose: Create star schema with fact and dimension tables
  • Output: SQLite database with normalized tables
  • Documentation: See database/README.md

3. Automated ETL Pipeline

  • Location: etl/ directory
  • Purpose: Production-ready automated data processing
  • Documentation: See etl/README.md

πŸ“ Logging Steps & Responsibilities

This project uses Python's built-in logging module for robust, centralized logging across all ETL pipeline steps. Logging is configured in run_etl_pipeline.py and is used in every ETL module for consistent tracking and debugging.

Logging Responsibilities:

  • Configure logging in the main pipeline script (run_etl_pipeline.py) to output logs to both console and a log file (logs/etl_pipeline.log).
  • Use module-level loggers (e.g., logger = logging.getLogger(__name__)) in each ETL Python file (etl/ingest.py, etl/clean.py, etl/transform.py, etl/utils.py).
  • Replace all print statements with appropriate logging calls (logger.info, logger.error, etc.).
  • Log key events: data ingestion, cleaning, transformation, profiling, errors, and pipeline completion.
  • Ensure logs provide enough detail for debugging and monitoring pipeline health.

Example Logging Configuration (in run_etl_pipeline.py):

import logging
import os

# Ensure the log directory exists before attaching the file handler
os.makedirs('logs', exist_ok=True)

logging.basicConfig(
  level=logging.INFO,
  format='%(asctime)s | %(levelname)s | %(name)s | %(message)s',
  handlers=[
    logging.FileHandler('logs/etl_pipeline.log', mode='a', encoding='utf-8'),
    logging.StreamHandler()
  ]
)
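
With that configuration in place, each ETL module only needs a module-level logger. The sketch below shows how a module such as etl/ingest.py might log its work; the function name and messages are illustrative, not the exact ones in this repository.

# Illustrative module-level logger usage; function name and messages are assumptions
import logging

import pandas as pd

logger = logging.getLogger(__name__)  # inherits the configuration set in run_etl_pipeline.py

def ingest_csv(path: str) -> pd.DataFrame:
    """Read a raw CSV file and log basic facts about it."""
    logger.info("Ingesting raw data from %s", path)
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        logger.error("Raw data file not found: %s", path)
        raise
    logger.info("Loaded %d rows and %d columns", len(df), df.shape[1])
    return df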

🚀 Quick Start

Prerequisites

# Clone the repository
git clone https://github.com/DHANA5982/Customer-Segmentation-Data-Pipeline.git
cd Customer-Segmentation-Data-Pipeline

# Create virtual environment
python -m venv .venv

# Activate virtual environment (Windows)
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

Run the Pipeline

Option 1: Interactive Analysis

# Start Jupyter notebook
jupyter notebook notebooks/Customer_Segmentation_Data_Pipeline.ipynb

Option 2: Automated ETL Pipeline

# Run the complete automated pipeline
python run_etl_pipeline.py

Option 3: Database Creation Only

# Run database modeling pipeline
python scripts/data_ingestion.py
python scripts/data_processing.py
python scripts/data_stadardizing.py
python scripts/data_modeling.py
python scripts/load_to_sqlite.py
python scripts/query.py

📋 Pipeline Stages

1. Data Ingestion

  • Read CSV files from raw data directory
  • Validate schema (column names, data types)
  • Log basic statistics and data quality metrics
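
A minimal sketch of these ingestion checks, using the raw file path from the Dataset section; the expected column set here is an assumed, partial subset of the 21 attributes.

# Illustrative ingestion check; EXPECTED_COLUMNS is an assumed subset of the real schema
import pandas as pd

EXPECTED_COLUMNS = {"customerID", "gender", "tenure", "Contract",
                    "MonthlyCharges", "TotalCharges", "Churn"}

df = pd.read_csv("data/raw/Customer_churn4.csv")

missing = EXPECTED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"Schema validation failed, missing columns: {sorted(missing)}")

print(f"Rows: {len(df)}, columns: {df.shape[1]}")   # basic statistics
print(df.dtypes)                                    # data types for validation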

2. Data Validation & Profiling

  • Missing Values: Check for null/empty values
  • Duplicates: Identify and handle duplicate records
  • Data Types: Validate and convert inconsistent types
  • Categorical Analysis: Analyze unique values in categorical fields
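
A short profiling sketch covering the four checks above; output handling is simplified to print statements for illustration.

# Profiling sketch: missing values, duplicates, dtypes, and categorical cardinality
import pandas as pd

df = pd.read_csv("data/raw/Customer_churn4.csv")

print(df.isna().sum().sort_values(ascending=False))       # missing values per column
print("Duplicate records:", df.duplicated().sum())        # duplicate rows
print(df.dtypes)                                          # data types to validate
for col in df.select_dtypes(include="object").columns:
    print(col, "->", df[col].nunique(), "unique values")  # categorical analysis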

3. Data Cleaning

  • Handle missing values with appropriate strategies
  • Convert data types (e.g., TotalCharges to numeric)
  • Normalize string formats and column names
  • Flag or remove invalid rows
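
The TotalCharges conversion is called out above; the sketch below pairs it with one possible missing-value strategy and the column-name normalization. The fill-with-zero choice and output path are assumptions for illustration, not necessarily what etl/clean.py does.

# Cleaning sketch; the fill strategy and output path are assumptions
import pandas as pd

df = pd.read_csv("data/raw/Customer_churn4.csv")

# TotalCharges is stored as text; blank strings become NaN after conversion
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# One possible strategy for the handful of missing values: treat them as zero charges
df["TotalCharges"] = df["TotalCharges"].fillna(0)

# Normalize column names to lowercase with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

df.to_csv("data/processed/cleaned_df.csv", index=False)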

4. Feature Standardization

  • Encode binary values (Yes/No → 1/0)
  • Map categorical values to consistent formats
  • Normalize column names (lowercase, underscore format)
  • Create derived features as needed
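
A sketch of the binary encoding and categorical mapping; the column list is an assumption and relies on the lowercase names produced by the cleaning stage.

# Standardization sketch; the binary column list is an assumption
import pandas as pd

df = pd.read_csv("data/processed/cleaned_df.csv")

binary_cols = ["partner", "dependents", "phoneservice", "paperlessbilling", "churn"]
for col in binary_cols:
    if col in df.columns:
        df[col] = df[col].map({"Yes": 1, "No": 0})   # Yes/No -> 1/0

# Map remaining categorical values to a consistent lowercase format
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip().str.lower()

df.to_csv("data/processed/processed_df.csv", index=False)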

5. Data Modeling (Star Schema)

  • Fact Table: fact_customer_activity - transactional/measurable data
  • Dimension Tables:
    • dim_customer - customer demographics
    • dim_services - service subscriptions
    • dim_subscription - billing and contract information
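
The split below illustrates how the processed dataset might be carved into the fact and dimension tables named above; the exact column assignment is an assumption, while the output file names match the Output Datasets section.

# Star-schema sketch; column assignments are assumptions, file names follow this README
import pandas as pd

df = pd.read_csv("data/processed/processed_df.csv")

tables = {
    "dim_customer_df": df[["customerid", "gender", "seniorcitizen", "partner", "dependents"]],
    "dim_service_df": df[["customerid", "phoneservice", "internetservice", "streamingtv"]],
    "dim_subscription_df": df[["customerid", "contract", "paperlessbilling", "paymentmethod"]],
    "fact_df": df[["customerid", "tenure", "monthlycharges", "totalcharges", "churn"]],
}

for name, table in tables.items():
    table.to_csv(f"data/fact_dim_tables/{name}.csv", index=False)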

6. Database Loading

  • Create SQLite database with optimized schema
  • Load fact and dimension tables
  • Implement referential integrity constraints
  • Create indexes for query performance
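
A loading sketch using pandas and the standard-library sqlite3 module; the table names follow the star schema above, while the key column and index are assumptions (scripts/load_to_sqlite.py may enforce constraints differently).

# Loading sketch; the customer key column and the index are assumptions
import sqlite3

import pandas as pd

conn = sqlite3.connect("database/telco_churn.db")

csv_to_table = {
    "fact_df.csv": "fact_customer_activity",
    "dim_customer_df.csv": "dim_customer",
    "dim_service_df.csv": "dim_services",
    "dim_subscription_df.csv": "dim_subscription",
}
for csv_name, table_name in csv_to_table.items():
    pd.read_csv(f"data/fact_dim_tables/{csv_name}").to_sql(
        table_name, conn, if_exists="replace", index=False
    )

# Example index on the shared customer key to speed up joins
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_fact_customerid "
    "ON fact_customer_activity (customerid)"
)
conn.commit()
conn.close()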

📈 Key Features

  • ✅ Data Quality Validation: Comprehensive data profiling and validation
  • ✅ Automated Cleaning: Intelligent handling of missing values and data types
  • ✅ Star Schema Design: Optimized dimensional modeling for analytics
  • ✅ Multiple Pipeline Options: Interactive, automated, and database-focused approaches
  • ✅ Error Handling: Robust error handling and logging
  • ✅ Scalable Architecture: Modular design for easy extension

πŸ› οΈ Technology Stack

  • Python 3.13+: Core programming language
  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computing
  • SQLAlchemy: Database toolkit and ORM
  • SQLite: Lightweight database engine
  • Jupyter: Interactive development environment
  • Matplotlib/Seaborn: Data visualization

📊 Output Datasets

Processed Data Files

  • cleaned_df.csv - Basic cleaned dataset
  • processed_df.csv - Fully processed and standardized dataset

Fact and Dimension Tables

  • fact_df.csv - Customer activity metrics
  • dim_customer_df.csv - Customer demographics
  • dim_service_df.csv - Service subscriptions
  • dim_subscription_df.csv - Billing and contract details

Database

  • telco_churn.db - Complete SQLite database with all tables
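
Once the database exists, the star schema can be queried directly. A hedged example is shown below; the table and column names follow the assumptions in the sketches above, and the real schema is documented in database/README.md.

# Example query; table and column names are assumptions, see database/README.md
import sqlite3

conn = sqlite3.connect("database/telco_churn.db")
rows = conn.execute("""
    SELECT d.contract,
           AVG(f.monthlycharges) AS avg_monthly_charges,
           AVG(f.churn)          AS churn_rate
    FROM fact_customer_activity AS f
    JOIN dim_subscription AS d ON d.customerid = f.customerid
    GROUP BY d.contract
""").fetchall()
print(rows)
conn.close()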

πŸ” Data Quality Metrics

  • Data Completeness: 99.8% (11 missing values in TotalCharges)
  • Data Uniqueness: 100% (no duplicate records)
  • Data Consistency: All categorical values standardized
  • Data Validity: All data types validated and converted

📚 Additional Documentation

  • Database modeling pipeline: database/README.md
  • Automated ETL pipeline: etl/README.md

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -m 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Open a Pull Request

👨‍💻 Author

DHANA5982

πŸ™ Acknowledgments

  • Data source: Telco Customer Churn Dataset
  • Inspired by modern data engineering best practices
  • Built with open-source tools and libraries