A comprehensive end-to-end data pipeline for customer segmentation analysis using machine learning techniques. The project implements several pipeline approaches for data processing, database modeling, and automated ETL workflows, taking raw customer data through successive stages of validation, cleaning, transformation, and modeling to produce both analytical datasets and a structured database schema.
- Source: Customer churn dataset (`Customer_churn4.csv`)
- Records: 7,043 customer records
- Features: 21 attributes including demographics, services, and billing information
- Target: Customer churn prediction and segmentation
```
Customer Segmentation Data Pipeline/
├── data/
│   ├── raw/                 # Original datasets
│   ├── processed/           # Cleaned datasets
│   └── fact_dim_tables/     # Dimensional model tables
├── database/                # SQLite database files
├── notebooks/               # Jupyter analysis notebooks
├── etl/                     # Automated ETL pipeline
├── scripts/                 # Utility scripts
└── documentation/           # Additional documentation
```
- Location: `notebooks/Customer_Segmentation_Data_Pipeline.ipynb`
- Purpose: Data exploration, validation, and manual processing
- Output: Processed datasets and fact/dimension tables
- Purpose: Create star schema with fact and dimension tables
- Output: SQLite database with normalized tables
- Documentation: See `database/README.md`
- Location: `etl/` directory
- Purpose: Production-ready automated data processing
- Documentation: See `etl/README.md`
This project uses Python's built-in `logging` module for robust, centralized logging across all ETL pipeline steps. Logging is configured in `run_etl_pipeline.py` and is used in every ETL module for consistent tracking and debugging.
Logging Responsibilities:
- Configure logging in the main pipeline script (`run_etl_pipeline.py`) to output logs to both the console and a log file (`logs/etl_pipeline.log`).
- Use module-level loggers (e.g., `logger = logging.getLogger(__name__)`) in each ETL Python file (`etl/ingest.py`, `etl/clean.py`, `etl/transform.py`, `etl/utils.py`).
- Replace all print statements with appropriate logging calls (`logger.info`, `logger.error`, etc.).
- Log key events: data ingestion, cleaning, transformation, profiling, errors, and pipeline completion.
- Ensure logs provide enough detail for debugging and monitoring pipeline health.
Example Logging Configuration (in `run_etl_pipeline.py`):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(name)s | %(message)s',
    handlers=[
        logging.FileHandler('logs/etl_pipeline.log', mode='a', encoding='utf-8'),
        logging.StreamHandler()
    ]
)
```
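As a complement to the configuration above, a module-level logger in one of the ETL files might look like the following minimal sketch; the function and its behaviour are illustrative, not the project's actual `etl/clean.py` API.

```python
# Illustrative excerpt of a module that relies on the central logging config.
import logging

import pandas as pd

logger = logging.getLogger(__name__)  # e.g. "etl.clean" when imported as a package module


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove completely empty rows and log how many were dropped."""
    before = len(df)
    cleaned = df.dropna(how="all")
    logger.info("Dropped %d completely empty rows", before - len(cleaned))
    return cleaned
```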
```
# Clone the repository
git clone https://github.com/DHANA5982/Customer-Segmentation-Data-Pipeline.git
cd Customer-Segmentation-Data-Pipeline

# Create virtual environment
python -m venv .venv

# Activate virtual environment (Windows)
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt
```

```
# Start Jupyter notebook
jupyter notebook notebooks/Customer_Segmentation_Data_Pipeline.ipynb
```

```
# Run the complete automated pipeline
python run_etl_pipeline.py
```

```
# Run database modeling pipeline
python scripts/data_ingestion.py
python scripts/data_processing.py
python scripts/data_stadardizing.py
python scripts/data_modeling.py
python scripts/load_to_sqlite.py
python scripts/query.py
```

The pipeline processes the raw customer data through the following stages.

- Read CSV files from the raw data directory
- Validate schema (column names, data types)
- Log basic statistics and data quality metrics
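A minimal sketch of this ingestion step, assuming pandas and a hard-coded expected schema; the file path is taken from the project layout above, while the column set is abridged and the function name is illustrative.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Abridged schema check; the full dataset has 21 columns.
EXPECTED_COLUMNS = {"customerID", "gender", "tenure", "Contract",
                    "MonthlyCharges", "TotalCharges", "Churn"}


def ingest_raw_data(path: str = "data/raw/Customer_churn4.csv") -> pd.DataFrame:
    """Read the raw CSV, validate the schema, and log basic statistics."""
    df = pd.read_csv(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    logger.info("Ingested %d rows x %d columns from %s", len(df), df.shape[1], path)
    logger.info("Null counts per column:\n%s", df.isna().sum().to_string())
    return df
```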
- Missing Values: Check for null/empty values
- Duplicates: Identify and handle duplicate records
- Data Types: Validate and convert inconsistent types
- Categorical Analysis: Analyze unique values in categorical fields
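These checks can be bundled into a small profiling helper; the sketch below assumes pandas and returns a plain dictionary, with the function name chosen for illustration.

```python
import pandas as pd


def profile_data_quality(df: pd.DataFrame) -> dict:
    """Summarise missing values, duplicates, dtypes, and categorical cardinality."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "categorical_uniques": {
            col: sorted(df[col].dropna().unique().tolist())
            for col in df.select_dtypes(include="object").columns
        },
    }
```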
- Handle missing values with appropriate strategies
- Convert data types (e.g., TotalCharges to numeric)
- Normalize string formats and column names
- Flag or remove invalid rows
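A sketch of these cleaning rules, assuming pandas; filling missing `TotalCharges` with 0 (the blanks correspond to brand-new customers in the Telco dataset) is one possible strategy, not necessarily the one used in the notebook.

```python
import pandas as pd


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the basic cleaning rules described above."""
    df = df.copy()
    # TotalCharges is read as text because of blank entries; coerce to numeric.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Assumption: blanks map to customers with zero tenure, so fill with 0.
    df["TotalCharges"] = df["TotalCharges"].fillna(0)
    # Remove exact duplicates and strip stray whitespace in string columns.
    df = df.drop_duplicates()
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    return df
```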
- Encode binary values (Yes/No → 1/0)
- Map categorical values to consistent formats
- Normalize column names (lowercase, underscore format)
- Create derived features as needed
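The standardization step might look like this sketch, assuming pandas; the binary column list and the derived `avg_charges_per_month` feature are illustrative choices.

```python
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Encode binary fields, normalize column names, and add a derived feature."""
    df = df.copy()
    # Encode binary Yes/No fields as 1/0 (illustrative column list).
    for col in ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]:
        df[col] = df[col].map({"Yes": 1, "No": 0})
    # Normalize column names: CamelCase -> lowercase_with_underscores.
    df.columns = (
        df.columns.str.strip()
        .str.replace(r"(?<=[a-z0-9])(?=[A-Z])", "_", regex=True)
        .str.lower()
    )
    # Example derived feature: average spend per month of tenure.
    df["avg_charges_per_month"] = df["total_charges"] / df["tenure"].where(df["tenure"] > 0)
    return df
```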
- Fact Table: `fact_customer_activity` - transactional/measurable data
- Dimension Tables:
  - `dim_customer` - customer demographics
  - `dim_services` - service subscriptions
  - `dim_subscription` - billing and contract information
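After standardization, the star schema can be produced by slicing the processed dataset into one fact table and three dimension DataFrames keyed on the customer identifier; the column assignments below are an illustrative subset, not the full schema documented in `database/README.md`.

```python
import pandas as pd


def build_star_schema(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Split the processed dataset into fact and dimension tables (illustrative columns)."""
    return {
        "dim_customer": df[["customer_id", "gender", "senior_citizen",
                            "partner", "dependents"]],
        "dim_services": df[["customer_id", "phone_service", "internet_service",
                            "streaming_tv", "streaming_movies"]],
        "dim_subscription": df[["customer_id", "contract", "paperless_billing",
                                "payment_method"]],
        "fact_customer_activity": df[["customer_id", "tenure", "monthly_charges",
                                      "total_charges", "churn"]],
    }
```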
- Create SQLite database with optimized schema
- Load fact and dimension tables
- Implement referential integrity constraints
- Create indexes for query performance
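A minimal loading sketch using the standard library's `sqlite3` together with pandas `to_sql`; note that `to_sql` alone does not declare foreign keys, so full referential-integrity constraints would need explicit `CREATE TABLE` DDL. The table, column, and index names here are illustrative.

```python
import sqlite3

import pandas as pd


def load_to_sqlite(tables: dict[str, pd.DataFrame],
                   db_path: str = "database/telco_churn.db") -> None:
    """Write each DataFrame to SQLite and index the fact table's join key."""
    with sqlite3.connect(db_path) as conn:
        for name, frame in tables.items():
            frame.to_sql(name, conn, if_exists="replace", index=False)
        # Index the foreign key used to join the fact table to the dimensions.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_fact_customer_id "
            "ON fact_customer_activity (customer_id)"
        )
```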
- Data Quality Validation: Comprehensive data profiling and validation
- Automated Cleaning: Intelligent handling of missing values and data types
- Star Schema Design: Optimized dimensional modeling for analytics
- Multiple Pipeline Options: Interactive, automated, and database-focused approaches
- Error Handling: Robust error handling and logging
- Scalable Architecture: Modular design for easy extension
- Python 3.13+: Core programming language
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing
- SQLAlchemy: Database toolkit and ORM
- SQLite: Lightweight database engine
- Jupyter: Interactive development environment
- Matplotlib/Seaborn: Data visualization
- `cleaned_df.csv` - Basic cleaned dataset
- `processed_df.csv` - Fully processed and standardized dataset
- `fact_df.csv` - Customer activity metrics
- `dim_customer_df.csv` - Customer demographics
- `dim_service_df.csv` - Service subscriptions
- `dim_subscription_df.csv` - Billing and contract details
- `telco_churn.db` - Complete SQLite database with all tables
- Data Completeness: 99.8% (11 missing values in TotalCharges)
- Data Uniqueness: 100% (no duplicate records)
- Data Consistency: All categorical values standardized
- Data Validity: All data types validated and converted
- `database/README.md` - Database schema and modeling details
- `etl/README.md` - Automated ETL pipeline documentation
- `diagrams/schema.md` - Complete data schema documentation
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
DHANA5982
- GitHub: @DHANA5982
- Data source: Telco Customer Churn Dataset
- Inspired by modern data engineering best practices
- Built with open-source tools and libraries