This project predicts customer churn for a telecom company by analyzing customer behavior over a four-month period. The goal is to identify high-value customers who are at risk of churning and to recommend actionable retention strategies. The project was originally developed as part of a graduate data science course.
In the highly competitive telecom industry, customer churn is a significant challenge: acquiring a new customer costs 5-10 times more than retaining an existing one. Predicting churn is especially critical for prepaid customers, who can leave without formal notice. This project focuses on the Indian and Southeast Asian markets, where the prepaid model dominates.
- Revenue-based churn: Customers generating minimal or no revenue.
- Usage-based churn: Customers with no incoming/outgoing calls or internet usage over a defined period.
- Predict churn in the 9th month based on data from months 6, 7, and 8.
- Identify key indicators of churn.
The dataset contains customer-level information over four months, and is stored in the dataset folder of the repository:
- Columns: 226 features, including usage patterns, recharge amounts, and customer demographics.
- Rows: ~100,000 customer records.
- Monthly Encodings: June (6), July (7), August (8), and September (9).
- High-Value Customers: Defined as those whose average recharge amount across the first two months (June and July) is above the 70th percentile.
- Target Variable: Churn (1 if the customer stopped using services in month 9, 0 otherwise).
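The high-value filter described above can be sketched with pandas as follows. This is a minimal illustration on made-up data, not the repository's code: the column names `total_rech_amt_6` and `total_rech_amt_7` and the derived `avg_rech_amt_6_7` are assumptions and should be checked against the Data Dictionary.

```python
import pandas as pd

# Toy data; the column names are assumed for illustration only.
df = pd.DataFrame({
    "total_rech_amt_6": [100, 500, 50, 900, 300, 0, 250, 700, 120, 400],
    "total_rech_amt_7": [120, 450, 0, 1100, 280, 10, 300, 650, 80, 420],
})

# Average recharge amount across the first two months (June and July).
df["avg_rech_amt_6_7"] = df[["total_rech_amt_6", "total_rech_amt_7"]].mean(axis=1)

# The 70th percentile of that average is the high-value cutoff.
cutoff = df["avg_rech_amt_6_7"].quantile(0.70)

# Keep only customers strictly above the cutoff.
high_value = df[df["avg_rech_amt_6_7"] > cutoff].copy()
```

Whether the comparison should be strict (`>`) or inclusive (`>=`) is a modeling choice; the sketch follows the "above the 70th percentile" wording.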
Refer to the included Data Dictionary for details on column abbreviations and meanings.
- Initial Cleanup:
- Removed non-predictive columns (e.g., `mobile_number`, `circle_id`).
- Addressed missing values through imputation or column removal.
- Feature Engineering:
- Derived metrics such as average revenue, total usage, and recharge patterns.
- Tagged high-value customers.
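As a rough illustration of the derived metrics mentioned above, the sketch below computes an average over months 6 and 7 and the change seen in month 8. The input column names and the derived names `avg_rech_amt_6_7` and `rech_amt_diff_8` are hypothetical and may not match the actual notebook.

```python
import pandas as pd

# Toy recharge amounts for months 6, 7, and 8 (column names assumed).
df = pd.DataFrame({
    "total_rech_amt_6": [300, 200, 500],
    "total_rech_amt_7": [320, 180, 500],
    "total_rech_amt_8": [310, 40, 0],
})

# Average recharge over months 6 and 7.
df["avg_rech_amt_6_7"] = df[["total_rech_amt_6", "total_rech_amt_7"]].mean(axis=1)

# Change in month 8 relative to that average; a large negative value
# signals a drop in recharge activity.
df["rech_amt_diff_8"] = df["total_rech_amt_8"] - df["avg_rech_amt_6_7"]
```

The same pattern could be applied to call minutes or data volume to capture declining usage.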
- Analyzed data distributions and correlations.
- Segmented customers by usage and revenue metrics.
- Churn is defined by the absence of calls and internet usage in the churn phase (month 9).
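The churn definition above could be turned into a target column roughly as follows. This is a sketch under assumptions: the four month-9 usage columns (`total_ic_mou_9`, `total_og_mou_9`, `vol_2g_mb_9`, `vol_3g_mb_9`) are assumed names for incoming minutes, outgoing minutes, and data volume, and should be confirmed in the Data Dictionary.

```python
import pandas as pd

# Assumed month-9 usage columns: calls (minutes of usage) and data (MB).
usage_cols_9 = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# Toy data for four customers.
df = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0, 3.0],
    "total_og_mou_9": [0.0, 40.0, 0.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 5.0, 0.0],
    "vol_3g_mb_9":    [0.0, 100.0, 0.0, 0.0],
})

# Churn = 1 only when every month-9 usage column is zero.
df["churn"] = df[usage_cols_9].eq(0).all(axis=1).astype(int)

# Drop the month-9 columns so the model only sees months 6 to 8.
features = df.drop(columns=usage_cols_9)
```

Dropping the month-9 columns after tagging avoids leaking the churn-phase data into the predictors.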
- Applied Principal Component Analysis (PCA) to reduce the feature set (226 columns in the raw data) to a smaller set of components while retaining most of the variance.
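A minimal sketch of the PCA step with scikit-learn is shown below. It runs on synthetic data, and both the standardization step and the 95% explained-variance target are assumptions made for illustration; the project's actual settings may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 20 correlated columns
# generated from 5 latent factors plus a little noise.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

# Standardize first (PCA is scale-sensitive), then keep as many
# components as needed to explain 95% of the variance (assumed target).
pca_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = pca_pipeline.fit_transform(X)

explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_.sum()
```

Passing a float between 0 and 1 as `n_components` lets scikit-learn choose the number of components from the variance target rather than fixing it in advance.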
- Algorithms:
- Logistic Regression (baseline).
- Random Forest, Gradient Boosting (e.g., XGBoost).
- Class Imbalance Handling:
- Used SMOTE and class weighting techniques.
- Evaluation Metrics:
- Precision, Recall, F1-Score, and AUC-ROC.
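The class weighting and evaluation metrics listed above can be illustrated with the baseline model. The sketch below uses a synthetic imbalanced dataset and arbitrary hyperparameters, so the numbers it produces say nothing about this project's results. SMOTE from imbalanced-learn would be an alternative to class weighting, applied to the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives, standing in for churners.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Stratify so both splits keep the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42,
)

# class_weight="balanced" reweights errors inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the churn class

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc_roc": roc_auc_score(y_test, y_prob),
}
```

AUC-ROC is computed from predicted probabilities, while precision, recall, and F1 depend on the 0.5 decision threshold used by `predict`.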
- Selected the best-performing model based on evaluation metrics.
- Identified significant predictors of churn using feature importances from tree-based models and logistic regression coefficients.
- Best Model: Gradient Boosting with an AUC-ROC of 0.85.
- Key Indicators of Churn:
- Decline in recharge amounts.
- Reduced call and internet usage in the action phase (month 8, the month before the churn phase).
- Recommendations:
- Offer retention incentives to high-risk customers.
- Monitor declining usage patterns for early intervention.
- Data Collection: Load and clean telecom customer data.
- EDA and Feature Engineering: Derive meaningful metrics and segment customers.
- Model Building:
- Apply dimensionality reduction.
- Train and tune predictive models.
- Evaluation: Assess models based on business-centric metrics.
- Insights and Recommendations: Use findings to propose retention strategies.
- Python Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`, `imbalanced-learn`