A comprehensive end-to-end data pipeline for customer segmentation analysis using machine learning techniques. The project implements several pipeline approaches for data processing, database modeling, and automated ETL workflows, taking raw customer data through successive stages of validation, cleaning, transformation, and modeling to produce both analytical datasets and a structured database schema.
- Source: Customer churn dataset (`Customer_churn4.csv`)
- Records: 7,043 customer records
- Features: 21 attributes including demographics, services, and billing information
- Target: Customer churn prediction and segmentation
```
Customer Segmentation Data Pipeline/
├── data/
│   ├── raw/                 # Original datasets
│   ├── processed/           # Cleaned datasets
│   └── fact_dim_tables/     # Dimensional model tables
├── database/                # SQLite database files
├── notebooks/               # Jupyter analysis notebooks
├── etl/                     # Automated ETL pipeline
├── scripts/                 # Utility scripts
└── documentation/           # Additional documentation
```
- Location: `notebooks/Customer_Segmentation_Data_Pipeline.ipynb`
- Purpose: Data exploration, validation, and manual processing
- Output: Processed datasets and fact/dimension tables
- Purpose: Create star schema with fact and dimension tables
- Output: SQLite database with normalized tables
- Documentation: See `database/README.md`
- Location: `etl/` directory
- Purpose: Production-ready automated data processing
- Documentation: See `etl/README.md`
This project uses Python's built-in `logging` module for robust, centralized logging across all ETL pipeline steps. Logging is configured in `run_etl_pipeline.py` and is used in every ETL module for consistent tracking and debugging.
Logging Responsibilities:
- Configure logging in the main pipeline script (`run_etl_pipeline.py`) to output logs to both the console and a log file (`logs/etl_pipeline.log`).
- Use module-level loggers (e.g., `logger = logging.getLogger(__name__)`) in each ETL Python file (`etl/ingest.py`, `etl/clean.py`, `etl/transform.py`, `etl/utils.py`).
- Replace all print statements with appropriate logging calls (`logger.info`, `logger.error`, etc.).
- Log key events: data ingestion, cleaning, transformation, profiling, errors, and pipeline completion.
- Ensure logs provide enough detail for debugging and monitoring pipeline health.
Example Logging Configuration (in `run_etl_pipeline.py`):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(name)s | %(message)s',
    handlers=[
        logging.FileHandler('logs/etl_pipeline.log', mode='a', encoding='utf-8'),
        logging.StreamHandler()
    ]
)
```
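As a complement to the configuration above, a module-level logger in one of the ETL files might look like the following minimal sketch; the function and its behaviour are illustrative, not the project's actual `etl/clean.py` API.

```python
# Illustrative excerpt of a module that relies on the central logging config.
import logging

import pandas as pd

logger = logging.getLogger(__name__)  # e.g. "etl.clean" when imported as a package module


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove completely empty rows and log how many were dropped."""
    before = len(df)
    cleaned = df.dropna(how="all")
    logger.info("Dropped %d completely empty rows", before - len(cleaned))
    return cleaned
```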
```
# Clone the repository
git clone https://github.com/DHANA5982/Customer-Segmentation-Data-Pipeline.git
cd Customer-Segmentation-Data-Pipeline

# Create virtual environment
python -m venv .venv

# Activate virtual environment (Windows)
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt
```

```
# Start Jupyter notebook
jupyter notebook notebooks/Customer_Segmentation_Data_Pipeline.ipynb
```

```
# Run the complete automated pipeline
python run_etl_pipeline.py
```

```
# Run database modeling pipeline
python scripts/data_ingestion.py
python scripts/data_processing.py
python scripts/data_stadardizing.py
python scripts/data_modeling.py
python scripts/load_to_sqlite.py
python scripts/query.py
```

The pipeline processes the raw customer data through the following stages.

- Read CSV files from the raw data directory
- Validate schema (column names, data types)
- Log basic statistics and data quality metrics
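A minimal sketch of this ingestion step, assuming pandas and a hard-coded expected schema; the file path is taken from the project layout above, while the column set is abridged and the function name is illustrative.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Abridged schema check; the full dataset has 21 columns.
EXPECTED_COLUMNS = {"customerID", "gender", "tenure", "Contract",
                    "MonthlyCharges", "TotalCharges", "Churn"}


def ingest_raw_data(path: str = "data/raw/Customer_churn4.csv") -> pd.DataFrame:
    """Read the raw CSV, validate the schema, and log basic statistics."""
    df = pd.read_csv(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    logger.info("Ingested %d rows x %d columns from %s", len(df), df.shape[1], path)
    logger.info("Null counts per column:\n%s", df.isna().sum().to_string())
    return df
```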
- Missing Values: Check for null/empty values
- Duplicates: Identify and handle duplicate records
- Data Types: Validate and convert inconsistent types
- Categorical Analysis: Analyze unique values in categorical fields
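These checks can be bundled into a small profiling helper; the sketch below assumes pandas and returns a plain dictionary, with the function name chosen for illustration.

```python
import pandas as pd


def profile_data_quality(df: pd.DataFrame) -> dict:
    """Summarise missing values, duplicates, dtypes, and categorical cardinality."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "categorical_uniques": {
            col: sorted(df[col].dropna().unique().tolist())
            for col in df.select_dtypes(include="object").columns
        },
    }
```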
- Handle missing values with appropriate strategies
- Convert data types (e.g., TotalCharges to numeric)
- Normalize string formats and column names
- Flag or remove invalid rows
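A sketch of these cleaning rules, assuming pandas; filling missing `TotalCharges` with 0 (the blanks correspond to brand-new customers in the Telco dataset) is one possible strategy, not necessarily the one used in the notebook.

```python
import pandas as pd


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the basic cleaning rules described above."""
    df = df.copy()
    # TotalCharges is read as text because of blank entries; coerce to numeric.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Assumption: blanks map to customers with zero tenure, so fill with 0.
    df["TotalCharges"] = df["TotalCharges"].fillna(0)
    # Remove exact duplicates and strip stray whitespace in string columns.
    df = df.drop_duplicates()
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    return df
```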
- Encode binary values (Yes/No → 1/0)
- Map categorical values to consistent formats
- Normalize column names (lowercase, underscore format)
- Create derived features as needed
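The standardization step might look like this sketch, assuming pandas; the binary column list and the derived `avg_charges_per_month` feature are illustrative choices.

```python
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Encode binary fields, normalize column names, and add a derived feature."""
    df = df.copy()
    # Encode binary Yes/No fields as 1/0 (illustrative column list).
    for col in ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]:
        df[col] = df[col].map({"Yes": 1, "No": 0})
    # Normalize column names: CamelCase -> lowercase_with_underscores.
    df.columns = (
        df.columns.str.strip()
        .str.replace(r"(?<=[a-z0-9])(?=[A-Z])", "_", regex=True)
        .str.lower()
    )
    # Example derived feature: average spend per month of tenure.
    df["avg_charges_per_month"] = df["total_charges"] / df["tenure"].where(df["tenure"] > 0)
    return df
```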
- Fact Table: `fact_customer_activity` - transactional/measurable data
- Dimension Tables:
  - `dim_customer` - customer demographics
  - `dim_services` - service subscriptions
  - `dim_subscription` - billing and contract information
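After standardization, the star schema can be produced by slicing the processed dataset into one fact table and three dimension DataFrames keyed on the customer identifier; the column assignments below are an illustrative subset, not the full schema documented in `database/README.md`.

```python
import pandas as pd


def build_star_schema(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Split the processed dataset into fact and dimension tables (illustrative columns)."""
    return {
        "dim_customer": df[["customer_id", "gender", "senior_citizen",
                            "partner", "dependents"]],
        "dim_services": df[["customer_id", "phone_service", "internet_service",
                            "streaming_tv", "streaming_movies"]],
        "dim_subscription": df[["customer_id", "contract", "paperless_billing",
                                "payment_method"]],
        "fact_customer_activity": df[["customer_id", "tenure", "monthly_charges",
                                      "total_charges", "churn"]],
    }
```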
- Create SQLite database with optimized schema
- Load fact and dimension tables
- Implement referential integrity constraints
- Create indexes for query performance
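A minimal loading sketch using the standard library's `sqlite3` together with pandas `to_sql`; note that `to_sql` alone does not declare foreign keys, so full referential-integrity constraints would need explicit `CREATE TABLE` DDL. The table, column, and index names here are illustrative.

```python
import sqlite3

import pandas as pd


def load_to_sqlite(tables: dict[str, pd.DataFrame],
                   db_path: str = "database/telco_churn.db") -> None:
    """Write each DataFrame to SQLite and index the fact table's join key."""
    with sqlite3.connect(db_path) as conn:
        for name, frame in tables.items():
            frame.to_sql(name, conn, if_exists="replace", index=False)
        # Index the foreign key used to join the fact table to the dimensions.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_fact_customer_id "
            "ON fact_customer_activity (customer_id)"
        )
```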
- Data Quality Validation: Comprehensive data profiling and validation
- Automated Cleaning: Intelligent handling of missing values and data types
- Star Schema Design: Optimized dimensional modeling for analytics
- Multiple Pipeline Options: Interactive, automated, and database-focused approaches
- Error Handling: Robust error handling and logging
- Scalable Architecture: Modular design for easy extension
- Python 3.13+: Core programming language
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing
- SQLAlchemy: Database toolkit and ORM
- SQLite: Lightweight database engine
- Jupyter: Interactive development environment
- Matplotlib/Seaborn: Data visualization
- `cleaned_df.csv` - Basic cleaned dataset
- `processed_df.csv` - Fully processed and standardized dataset
- `fact_df.csv` - Customer activity metrics
- `dim_customer_df.csv` - Customer demographics
- `dim_service_df.csv` - Service subscriptions
- `dim_subscription_df.csv` - Billing and contract details
- `telco_churn.db` - Complete SQLite database with all tables
- Data Completeness: 99.8% (11 missing values in TotalCharges)
- Data Uniqueness: 100% (no duplicate records)
- Data Consistency: All categorical values standardized
- Data Validity: All data types validated and converted
- `database/README.md` - Database schema and modeling details
- `etl/README.md` - Automated ETL pipeline documentation
- `diagrams/schema.md` - Complete data schema documentation
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
DHANA5982
- GitHub: @DHANA5982
- Data source: Telco Customer Churn Dataset
- Inspired by modern data engineering best practices
- Built with open-source tools and libraries