zyna-b/Insurance-Cost-Analysis-and-Prediction

🏥💰 Insurance Cost Analysis & Prediction - Complete Data Science Project

Python Pandas Scikit-Learn Jupyter License: MIT

A comprehensive exploratory data analysis (EDA) and machine learning project for predicting insurance costs using demographic and health factors.

🎯 Project Overview

This project demonstrates a complete data science workflow for analyzing and predicting insurance costs. Using a dataset of 1,338 insurance records, we explore relationships between demographic factors, health indicators, and insurance charges through advanced statistical analysis and machine learning.

🔍 Key Features

  • Complete EDA Pipeline: From data exploration to feature engineering
  • Statistical Analysis: Correlation analysis, Chi-square testing, and hypothesis testing
  • Machine Learning Model: Linear regression with an R-squared of 0.804 (80.4% of variance explained)
  • Feature Engineering: BMI categorization and advanced feature selection
  • Data Visualization: Professional plots using Matplotlib and Seaborn

📊 Dataset Information

| Feature | Description | Type |
| --- | --- | --- |
| Age | Age of primary beneficiary | Numerical (18-64) |
| Sex | Insurance contractor gender | Categorical (male/female) |
| BMI | Body mass index | Numerical (15.96-53.13) |
| Children | Number of dependents | Numerical (0-5) |
| Smoker | Smoking status | Categorical (yes/no) |
| Region | Beneficiary's residential area | Categorical (4 regions) |
| Charges | Medical costs billed by insurance | Target Variable |

Dataset Stats: 1,338 records • 7 features • No missing values • 1 duplicate removed

🚀 Quick Start

Prerequisites

Python 3.9+
Jupyter Notebook or JupyterLab

Installation & Setup

  1. Clone this repository
git clone https://github.com/zyna-b/Insurance-Cost-Analysis-EDA.git
cd Insurance-Cost-Analysis-EDA
  2. Create virtual environment
python -m venv venv_py39
# Windows
venv_py39\Scripts\activate
# macOS/Linux
source venv_py39/bin/activate
  3. Install dependencies
pip install -r requirements.txt
  4. Launch analysis
jupyter notebook insurance.ipynb

📈 Analysis Workflow

1. 🔍 Exploratory Data Analysis

  • Data Inspection: Shape, types, missing values, duplicates
  • Descriptive Statistics: Central tendencies and distributions
  • Univariate Analysis: Individual feature distributions
  • Bivariate Analysis: Feature relationships with target variable
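
The inspection steps above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's actual code: the `inspect` helper is hypothetical, and the four-row `sample` frame is made-up data that only borrows the dataset's column names (one row is repeated on purpose to show duplicate detection).

```python
import pandas as pd


def inspect(df: pd.DataFrame) -> dict:
    """Summarize shape, dtypes, missing values, and duplicate rows (hypothetical helper)."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# Made-up sample using the README's column names; NOT the real dataset.
sample = pd.DataFrame({
    "age": [19, 18, 28, 19],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 33.77, 33.0, 27.9],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "southwest"],
    "charges": [16884.92, 1725.55, 4449.46, 16884.92],
})

report = inspect(sample)

# Drop exact duplicate rows, keeping the first occurrence.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(report)
print(deduped.describe())  # descriptive statistics for numerical columns
```

On the real data, `sample` would be replaced by something like `pd.read_csv("insurance.csv")`.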

2. 📊 Data Visualization

  • Distribution Plots: Age, BMI, children, charges histograms with KDE
  • Count Plots: Categorical variable frequencies
  • Box Plots: Outlier detection and quartile analysis
  • Correlation Heatmap: Feature relationship visualization
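
As a rough sketch of the plot types listed above, the snippet below draws a histogram, a box plot, and a correlation heatmap. It makes several assumptions: it uses plain Matplotlib rather than Seaborn to keep dependencies small (so there is no KDE overlay), and the `demo` frame is synthetic data, not the insurance dataset.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the dataset (values are random, for illustration only).
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    "charges": rng.lognormal(mean=9, sigma=0.8, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution plot: right-skewed data shows a long tail to the right.
axes[0].hist(demo["charges"], bins=30)
axes[0].set_title("Charges distribution")

# Box plot: points beyond the whiskers are candidate outliers.
axes[1].boxplot(demo["bmi"])
axes[1].set_title("BMI box plot")

# Correlation heatmap of the numerical columns.
corr = demo.corr()
image = axes[2].imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Correlation heatmap")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()
```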

3. 🧹 Data Preprocessing

  • Data Cleaning: Duplicate removal and type conversion
  • Feature Encoding:
    • Binary encoding for gender and smoker status
    • One-hot encoding for region variables
  • Feature Engineering: BMI categorization (Underweight, Normal, Overweight, Obesity)
  • Standardization: StandardScaler for numerical features
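
The encoding steps above might look like the following. This is a sketch under assumptions: the column name `is_smoker` appears in the correlation results later in this README, while `is_female` and the `region_` prefix are hypothetical names chosen for illustration, and the three-row frame is made-up data.

```python
import pandas as pd

# Made-up rows using the dataset's categorical columns.
raw = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

encoded = raw.copy()

# Binary encoding: map each two-level category to 0/1.
encoded["is_female"] = (encoded["sex"] == "female").astype(int)  # hypothetical name
encoded["is_smoker"] = (encoded["smoker"] == "yes").astype(int)
encoded = encoded.drop(columns=["sex", "smoker"])

# One-hot encoding: one 0/1 column per region value.
encoded = pd.get_dummies(encoded, columns=["region"], prefix="region", dtype=int)

print(encoded)
```

For a linear model, one of the region columns is often dropped (for example with `drop_first=True`) to avoid perfectly collinear predictors; whether the notebook does this is not stated here.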

4. 📊 Statistical Analysis

Correlation Analysis

# Key findings from Pearson correlation analysis
correlations = {
    'is_smoker': 0.787,      # Strongest predictor
    'age': 0.299,            # Moderate positive correlation
    'bmi': 0.198,            # Weak positive correlation
    'children': 0.068,       # Very weak correlation
    # ... additional features
}

Chi-Square Testing

  • Purpose: Test independence between categorical variables and charges
  • Significance Level: α = 0.05
  • Results: Identified significant features for model inclusion

5. 🤖 Machine Learning Model

Linear Regression Implementation

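from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X: preprocessed feature matrix, y: insurance charges (variable names assumed)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model training and evaluation
model = LinearRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)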

Model Performance

  • R-squared Score: 0.804 (80.4% variance explained)
  • Adjusted R-squared: 0.799
  • Train-Test Split: 80% training, 20% testing
  • Random State: 42 (reproducible results)
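
To show how the two scores above relate, here is a small worked example of the adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1). It rests on assumptions not stated in this README: that the metric was computed on the test split, that the split holds 268 rows (20% of 1,337, rounded up as scikit-learn does), and that p is the 7 features listed in the evaluation table.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


n_test = 268      # assumed test-set size: ceil(0.2 * 1337)
n_features = 7    # feature count taken from the evaluation table

value = adjusted_r2(0.804, n_test, n_features)
print(round(value, 3))  # 0.799
```

Under those assumptions the formula gives about 0.799, which is consistent with the reported adjusted R-squared.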

🔍 Key Insights & Findings

💡 Business Intelligence

  1. Smoking Impact: Smoking status is the strongest predictor of insurance costs
  2. Age Factor: Older individuals tend to have higher insurance charges
  3. BMI Influence: Higher BMI correlates with increased medical costs
  4. Regional Variations: Geographic location affects insurance pricing
  5. Family Size: Number of children has minimal impact on costs

📊 Statistical Discoveries

  • Correlation Strength: Smoking status shows 0.787 correlation with charges
  • Feature Importance: Age, BMI, and smoking status are primary cost drivers
  • Data Distribution: Charges show right-skewed distribution (typical for insurance data)
  • Gender Impact: Minimal difference in average costs between males and females

🛠️ Technologies & Libraries

Core Stack

import pandas as pd              # Data manipulation and analysis
import numpy as np               # Numerical computing
import matplotlib.pyplot as plt  # Data visualization
import seaborn as sns           # Statistical data visualization

Machine Learning & Statistics

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from scipy.stats import pearsonr, chi2_contingency

📁 Project Structure

Insurance-Cost-Analysis-EDA/
├── 📓 insurance.ipynb          # Main analysis notebook
├── 📊 insurance.csv           # Dataset (1,338 records)
├── 📋 requirements.txt        # Python dependencies
├── 📄 README.md              # Project documentation
└── 📁 venv_py39/             # Virtual environment
    ├── Scripts/              # Environment executables
    ├── Lib/                  # Installed packages
    └── Include/              # Header files

📈 Visualizations Included

  • Distribution Analysis: Histograms with KDE for numerical variables
  • Categorical Analysis: Count plots for sex, smoker, region, children
  • Correlation Matrix: Heatmap showing feature relationships
  • Box Plots: Outlier detection for numerical features
  • Feature Engineering: BMI categorization visualization

🔬 Statistical Methods Explained

Correlation Analysis

  • Pearson Correlation: Measures linear relationship strength (-1 to +1)
  • Interpretation: Values closer to ±1 indicate stronger linear relationships
  • Application: Identifying features most correlated with insurance charges
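
A short example of the Pearson coefficient described above, computed with `scipy.stats.pearsonr` and cross-checked against the textbook formula. The two arrays are made-up numbers chosen only to keep the arithmetic easy to follow; they are not taken from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Library result: correlation coefficient and two-sided p-value.
r, p_value = pearsonr(x, y)

# Manual check: covariance term divided by the product of the spreads.
dx = x - x.mean()
dy = y - y.mean()
manual_r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(f"pearsonr: {r:.4f}, manual: {manual_r:.4f}, p-value: {p_value:.4f}")
```

On a DataFrame, `df.corr()` returns the same coefficient for every pair of numerical columns at once.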

Chi-Square Testing

  • Purpose: Tests independence between categorical variables and target
  • Null Hypothesis: Variables are independent
  • Decision Rule: Reject H0 if p-value < 0.05
  • Business Value: Validates which categorical features significantly impact costs
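
The decision rule above can be sketched with `scipy.stats.chi2_contingency`. Because a chi-square test of independence needs two categorical variables, this example assumes charges have been binned into bands first; the `charge_band` column and the counts are made up for illustration, and the README does not say how the notebook handles this step.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 40 people, smoking status vs. a binned charge level.
demo = pd.DataFrame({
    "smoker": ["yes"] * 20 + ["no"] * 20,
    "charge_band": ["high"] * 18 + ["low"] * 2 + ["high"] * 3 + ["low"] * 17,
})

# Contingency table of observed counts (rows: smoker, columns: charge_band).
table = pd.crosstab(demo["smoker"], demo["charge_band"])

chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
reject_h0 = p_value < alpha  # True means the variables look dependent

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.2e}, reject H0: {reject_h0}")
```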

Feature Engineering

  • BMI Categories: Medical standard classifications
  • Dummy Variables: Binary encoding for categorical features
  • Standardization: Zero mean, unit variance for numerical features
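
A minimal sketch of the BMI categorization and standardization steps follows. The cut-offs (18.5, 25, 30) are the commonly used BMI thresholds and are an assumption here, since the exact boundaries used in the notebook are not listed; the four-row frame is made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration.
df = pd.DataFrame({
    "age": [19, 33, 45, 62],
    "bmi": [17.5, 22.0, 27.3, 34.1],
})

# Assumed cut-offs: [0, 18.5), [18.5, 25), [25, 30), [30, inf).
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obesity"]
df["bmi_category"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

# Standardization: subtract the column mean, divide by the column standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["age", "bmi"]])

print(df)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

When a train-test split is used, fitting the scaler on the training rows only and then calling `transform` on the test rows avoids leaking test-set statistics into the model.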

🎯 Model Evaluation Metrics

| Metric | Value | Interpretation |
| --- | --- | --- |
| R-squared | 0.804 | Model explains 80.4% of variance |
| Adjusted R-squared | 0.799 | Accounts for number of predictors |
| Features Used | 7 | Optimal feature subset selected |
| Sample Size | 1,337 | After duplicate removal |

🔮 Future Enhancements

  • Advanced Models: Random Forest, Gradient Boosting, Neural Networks
  • Cross-Validation: K-fold validation for robust performance metrics
  • Feature Engineering: Polynomial features, interaction terms
  • Hyperparameter Tuning: Grid search for optimal parameters
  • Interactive Dashboard: Streamlit or Dash implementation
  • Model Deployment: Flask API for real-time predictions
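
As one possible starting point for the cross-validation item above, the sketch below runs 5-fold validation with scikit-learn's `KFold` and `cross_val_score`. The features and target are synthetic (a noisy linear relationship), so the scores say nothing about the insurance data; the fold count and seed are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic regression problem: 200 rows, 3 features, known linear signal plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Shuffled 5-fold split with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One R-squared score per fold; the spread hints at how stable the estimate is.
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(f"R-squared per fold: {np.round(scores, 3)}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a more robust picture than a single 80/20 split.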

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Dataset Source: Kaggle Insurance Dataset
  • Statistical Methods: Scipy documentation and best practices
  • Visualization Inspiration: Seaborn gallery and matplotlib examples
  • Machine Learning Techniques: Scikit-learn documentation

👨‍💻 Author

Zainab Hamid

📊 Keywords

insurance-analysis data-science machine-learning exploratory-data-analysis python pandas scikit-learn statistical-analysis data-visualization linear-regression feature-engineering correlation-analysis chi-square-testing jupyter-notebook healthcare-analytics


⭐ Found this project helpful? Please consider starring the repository!

🔍 Looking for specific analysis techniques? Check out the detailed Jupyter notebook for complete implementation.

📈 Interested in similar projects? Follow for more data science content!
