A comprehensive exploratory data analysis (EDA) and machine learning project for predicting insurance costs using demographic and health factors.
This project demonstrates a complete data science workflow for analyzing and predicting insurance costs. Using a dataset of 1,338 insurance records, we explore relationships between demographic factors, health indicators, and insurance charges through correlation analysis, chi-square testing, and linear regression.
- Complete EDA Pipeline: From data exploration to feature engineering
- Statistical Analysis: Correlation analysis, Chi-square testing, and hypothesis testing
- Machine Learning Model: Linear regression with an R-squared of 0.804 (80.4% of variance explained)
- Feature Engineering: BMI categorization and advanced feature selection
- Data Visualization: Professional plots using Matplotlib and Seaborn
| Feature | Description | Type |
|---|---|---|
| Age | Age of primary beneficiary | Numerical (18-64) |
| Sex | Insurance contractor gender | Categorical (male/female) |
| BMI | Body mass index | Numerical (15.96-53.13) |
| Children | Number of dependents | Numerical (0-5) |
| Smoker | Smoking status | Categorical (yes/no) |
| Region | Beneficiary's residential area | Categorical (4 regions) |
| Charges | Medical costs billed by insurance | Target Variable |
Dataset Stats: 1,338 records • 7 features • No missing values • 1 duplicate removed
Python 3.9+
Jupyter Notebook or JupyterLab
- Clone this repository
git clone https://github.com/zyna-b/Insurance-Cost-Analysis-EDA.git
cd Insurance-Cost-Analysis-EDA
- Create virtual environment
python -m venv venv_py39
# Windows
venv_py39\Scripts\activate
# macOS/Linux
source venv_py39/bin/activate
- Install dependencies
pip install -r requirements.txt
- Launch analysis
jupyter notebook insurance.ipynb
- Data Inspection: Shape, types, missing values, duplicates
- Descriptive Statistics: Central tendencies and distributions
- Univariate Analysis: Individual feature distributions
- Bivariate Analysis: Feature relationships with target variable
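The inspection steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame is a made-up stand-in for `insurance.csv`, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for insurance.csv; the values are invented for illustration.
df = pd.DataFrame({
    "age": [19, 18, 28, 28],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 33.0],
    "children": [0, 1, 3, 3],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "southeast"],
    "charges": [16884.92, 1725.55, 4449.46, 4449.46],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # column types

# Missing values per column and number of fully duplicated rows
missing_per_column = df.isnull().sum()
n_duplicates = int(df.duplicated().sum())

# Drop duplicated rows, then summarize the numerical columns
df_clean = df.drop_duplicates().reset_index(drop=True)
print(df_clean.describe())
```

With the real dataset, the same calls would be run on the frame returned by `pd.read_csv("insurance.csv")`.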
- Distribution Plots: Age, BMI, children, charges histograms with KDE
- Count Plots: Categorical variable frequencies
- Box Plots: Outlier detection and quartile analysis
- Correlation Heatmap: Feature relationship visualization
- Data Cleaning: Duplicate removal and type conversion
- Feature Encoding:
- Binary encoding for gender and smoker status
- One-hot encoding for region variables
- Feature Engineering: BMI categorization (Underweight, Normal, Overweight, Obesity)
- Standardization: StandardScaler for numerical features
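As a rough sketch of the encoding and BMI categorization steps above: the sample rows are invented, the `is_female` and `bmi_category` column names are assumptions, and the BMI cut-offs (18.5, 25, 30) are the commonly used WHO thresholds, which the notebook may or may not follow exactly.

```python
import pandas as pd

# Hypothetical sample rows; not taken from the dataset.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "bmi": [17.5, 22.0, 27.3, 35.1],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Binary encoding for gender and smoker status (1 = female / smoker)
df["is_female"] = (df["sex"] == "female").astype(int)
df["is_smoker"] = (df["smoker"] == "yes").astype(int)

# BMI categorization; right=False makes each bin closed on the left,
# e.g. [18.5, 25) is Normal (assumed thresholds).
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obesity"],
    right=False,
)

# One-hot encoding for region: one 0/1 column per region
df = pd.get_dummies(df, columns=["region"], dtype=int)
print(df)
```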
# Key findings from Pearson correlation analysis
correlations = {
    'is_smoker': 0.787,  # Strongest predictor
    'age': 0.299,        # Moderate positive correlation
    'bmi': 0.198,        # Weak positive correlation
    'children': 0.068,   # Very weak correlation
    # ... additional features
}
- Purpose: Test independence between categorical variables and charges
- Significance Level: α = 0.05
- Results: Identified significant features for model inclusion
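from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# X (feature matrix) and Y (charges) come from the preprocessing steps
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Model training and evaluation
model = LinearRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
- R-squared Score: 0.804 (80.4% variance explained)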
- Adjusted R-squared: 0.799
- Train-Test Split: 80% training, 20% testing
- Random State: 42 (reproducible results)
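The following self-contained sketch runs the same split-fit-score sequence end to end. It uses synthetic data generated with NumPy, not the insurance dataset, so the printed R-squared has no relation to the 0.804 reported here; the row and feature counts are chosen only to mirror the project's sample size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,337 rows and 7 features (assumed shapes).
rng = np.random.default_rng(0)
n_rows, n_features = 1337, 7
X = rng.normal(size=(n_rows, n_features))
true_coefs = rng.normal(size=n_features)
Y = X @ true_coefs + rng.normal(scale=0.5, size=n_rows)

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Fit on the training rows, score on the held-out rows
model = LinearRegression()
model.fit(X_train, Y_train)
r2 = r2_score(Y_test, model.predict(X_test))

print(len(X_train), len(X_test), round(r2, 3))
```

Note that with 1,337 rows a 20% test fraction is rounded up, so the held-out set has 268 rows and the training set 1,069.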
- Smoking Impact: Smoking status is the strongest predictor of insurance costs
- Age Factor: Older individuals tend to have higher insurance charges
- BMI Influence: Higher BMI correlates with increased medical costs
- Regional Variations: Geographic location affects insurance pricing
- Family Size: Number of children has minimal impact on costs
- Correlation Strength: Smoking status shows 0.787 correlation with charges
- Feature Importance: Age, BMI, and smoking status are primary cost drivers
- Data Distribution: Charges show right-skewed distribution (typical for insurance data)
- Gender Impact: Minimal difference in average costs between males and females
import pandas as pd # Data manipulation and analysis
import numpy as np # Numerical computing
import matplotlib.pyplot as plt # Data visualization
import seaborn as sns # Statistical data visualization
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from scipy.stats import pearsonr, chi2_contingency
Insurance-Cost-Analysis-EDA/
├── insurance.ipynb      # Main analysis notebook
├── insurance.csv        # Dataset (1,338 records)
├── requirements.txt     # Python dependencies
├── README.md            # Project documentation
└── venv_py39/           # Virtual environment
    ├── Scripts/         # Environment executables
    ├── Lib/             # Installed packages
    └── Include/         # Header files
- Distribution Analysis: Histograms with KDE for numerical variables
- Categorical Analysis: Count plots for sex, smoker, region, children
- Correlation Matrix: Heatmap showing feature relationships
- Box Plots: Outlier detection for numerical features
- Feature Engineering: BMI categorization visualization
- Pearson Correlation: Measures linear relationship strength (-1 to +1)
- Interpretation: Values closer to ±1 indicate stronger linear relationships
- Application: Identifying features most correlated with insurance charges
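To make the coefficient concrete, here is a minimal from-scratch sketch of Pearson's r using only the standard library. The `pearson_r` helper and the toy numbers are illustrative assumptions; in the project itself, `scipy.stats.pearsonr` computes the same coefficient and also returns a p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (invented): charges that rise exactly linearly with age
ages = [20, 30, 40, 50, 60]
charges_linear = [2000 + 250 * a for a in ages]
charges_inverse = [-c for c in charges_linear]

r_pos = pearson_r(ages, charges_linear)   # perfectly increasing relationship
r_neg = pearson_r(ages, charges_inverse)  # perfectly decreasing relationship
print(r_pos, r_neg)
```

The two toy series sit at the extremes of the range; real features such as `bmi` (0.198) fall much closer to zero.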
- Purpose: Tests independence between categorical variables and target
- Null Hypothesis: Variables are independent
- Decision Rule: Reject H0 if p-value < 0.05
- Business Value: Validates which categorical features significantly impact costs
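The decision rule above can be sketched with `scipy.stats.chi2_contingency`. The contingency table below is entirely hypothetical (invented counts of smokers and non-smokers in "high" versus "low" charge groups); how the notebook bins charges, if at all, is not stated here.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts, not from the dataset.
# Rows: smoker yes / no. Columns: high charges / low charges.
observed = [
    [30, 10],
    [10, 30],
]

# Returns the test statistic, p-value, degrees of freedom,
# and the counts expected under independence.
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
reject_null = p_value < alpha  # True means: reject "variables are independent"

print(f"chi2={chi2:.2f}, p={p_value:.5f}, dof={dof}, reject H0: {reject_null}")
```

For a 2x2 table SciPy applies Yates' continuity correction by default, which makes the test slightly more conservative.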
- BMI Categories: Medical standard classifications
- Dummy Variables: Binary encoding for categorical features
- Standardization: Zero mean, unit variance for numerical features
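Standardization can be illustrated in a few lines of plain Python. The `standardize` helper and the sample BMI values are assumptions for illustration; the project uses scikit-learn's `StandardScaler`, which likewise divides by the population standard deviation (ddof=0).

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit variance: z = (x - mean) / std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, matching StandardScaler
    return [(v - mean) / std for v in values]

# Invented BMI values
bmi = [18.0, 22.0, 26.0, 30.0, 34.0]
z = standardize(bmi)

print([round(v, 3) for v in z])
print(statistics.fmean(z), statistics.pstdev(z))
```

After scaling, the middle value (equal to the mean) maps to zero and the others are expressed in standard-deviation units.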
| Metric | Value | Interpretation |
|---|---|---|
| R-squared | 0.804 | Model explains 80.4% of variance |
| Adjusted R-squared | 0.799 | Accounts for number of predictors |
| Features Used | 7 | Optimal feature subset selected |
| Sample Size | 1,337 | After duplicate removal |
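The relationship between the two R-squared rows can be checked with the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below is an assumption-laden check, not the notebook's code: this README does not say which sample size was used, so both the 268-row test split (20% of 1,337, rounded up) and the full 1,337 rows are tried.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Assumption: evaluated on the 268-row test split with 7 predictors
adj_test = adjusted_r2(0.804, n_samples=268, n_predictors=7)

# Alternative assumption: evaluated on all 1,337 rows
adj_full = adjusted_r2(0.804, n_samples=1337, n_predictors=7)

print(round(adj_test, 3), round(adj_full, 3))  # 0.799 0.803
```

Under these assumptions, the reported 0.799 is consistent with the formula being applied to the test split rather than the full sample.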
- Advanced Models: Random Forest, Gradient Boosting, Neural Networks
- Cross-Validation: K-fold validation for robust performance metrics
- Feature Engineering: Polynomial features, interaction terms
- Hyperparameter Tuning: Grid search for optimal parameters
- Interactive Dashboard: Streamlit or Dash implementation
- Model Deployment: Flask API for real-time predictions
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset Source: Kaggle Insurance Dataset
- Statistical Methods: Scipy documentation and best practices
- Visualization Inspiration: Seaborn gallery and matplotlib examples
- Machine Learning Techniques: Scikit-learn documentation
Zainab Hamid
- GitHub: @zyna-b
- LinkedIn: Zainab Hamid
- Email: zainabhamid2468@gmail.com
insurance-analysis data-science machine-learning exploratory-data-analysis python pandas scikit-learn statistical-analysis data-visualization linear-regression feature-engineering correlation-analysis chi-square-testing jupyter-notebook healthcare-analytics
Found this project helpful? Please consider starring the repository!
Looking for specific analysis techniques? Check out the detailed Jupyter notebook for complete implementation.
Interested in similar projects? Follow for more data science content!