The primary objective of this project is to predict medical insurance prices based on customer data using AWS SageMaker. The project involves building, training, and deploying a machine learning model using SageMaker's built-in Linear Learner algorithm.
The dataset used for training and testing was sourced from Kaggle and uploaded to Amazon S3 for easy integration with SageMaker.
- Basic Information of Data: Checked the shape of the dataset, described the columns, verified data types, and performed initial data inspections.
- Data Preprocessing: Cleaned the dataset by handling missing values, duplicate values, and invalid values.
- Exploratory Data Analysis (EDA): Explored the data, analyzed individual features, and visualized key patterns, including a correlation matrix.
- Environment Setup: Set up the AWS SageMaker environment and configured IAM roles and S3 buckets for data storage.
- Data Preparation: Uploaded the dataset to S3 and split it into training and test sets.
- Model Training: Used SageMaker’s Linear Learner algorithm with hyperparameter tuning (epochs, batch size).
- Model Deployment: Deployed the trained model to a SageMaker endpoint for real-time predictions.
- Model Testing: Evaluated the model using test data and handled serialization/deserialization for predictions.
- Model Evaluation: Assessed the model’s performance with relevant metrics, then deleted the SageMaker endpoints to avoid incurring extra costs.
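
The evaluation step above does not name the metrics that were used. As a minimal sketch, assuming RMSE, MAE, and R² (common choices for a regression task, not confirmed by this project), the calculation could look like this. The function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Mean absolute error: average size of the miss, in the target's units.
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: penalizes large misses more heavily than MAE.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R^2: share of the target's variance explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"rmse": rmse, "mae": mae, "r2": r2}
```

Calling `regression_metrics(actual_charges, predicted_charges)` on the held-out test set gives one number per metric. RMSE and MAE are in the same currency units as the insurance price, which makes them easy to interpret.
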
- AWS SageMaker: For building, training, and deploying the model.
- Amazon S3: For storing datasets.
- IAM Roles: For managing permissions in AWS.
- Python & Jupyter Notebook: For coding and model building.
- Boto3: AWS SDK for Python to interact with the SageMaker API.
- Environment Setup: Connected SageMaker to necessary AWS resources like S3 for data storage and IAM roles for permissions.
- Data Upload: Uploaded the insurance data to S3 in CSV format.
- Model Training: Trained the model using SageMaker’s Linear Learner algorithm, optimizing key hyperparameters.
- Model Deployment: Deployed the model for real-time predictions on a SageMaker-hosted endpoint.
- Prediction & Evaluation: Tested the model using unseen data, validating prediction accuracy.
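
The workflow above can be sketched with the SageMaker Python SDK (v2). This is an illustration under stated assumptions, not the project's actual notebook: the instance type, the hyperparameter values, and the helper names `to_linear_learner_csv`, `extract_scores`, and `train_and_deploy` are all hypothetical. Linear Learner expects `text/csv` training data with no header row and the label in the first column, which the first helper takes care of.

```python
import csv
import io


def to_linear_learner_csv(rows, label_key):
    """Serialize dict rows to header-less CSV with the label in the first column."""
    feature_keys = [k for k in rows[0] if k != label_key]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[label_key]] + [row[k] for k in feature_keys])
    return buf.getvalue()


def extract_scores(response):
    """Pull the predicted values out of a Linear Learner regression JSON response."""
    return [item["score"] for item in response["predictions"]]


def train_and_deploy(train_s3_uri, output_s3_uri, role_arn):
    """Train Linear Learner and deploy it to a real-time endpoint.

    Needs AWS credentials and the `sagemaker` package, so the imports are kept
    local to this function. All concrete values below are assumptions.
    """
    import sagemaker
    from sagemaker import image_uris
    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput
    from sagemaker.serializers import CSVSerializer

    session = sagemaker.Session()
    image_uri = image_uris.retrieve("linear-learner", session.boto_region_name)
    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.large",  # assumed instance type
        output_path=output_s3_uri,
        sagemaker_session=session,
    )
    # predictor_type="regressor" selects regression; epochs and batch size are
    # placeholder values, not the ones tuned in this project.
    estimator.set_hyperparameters(
        predictor_type="regressor", epochs=10, mini_batch_size=32
    )
    estimator.fit({"train": TrainingInput(train_s3_uri, content_type="text/csv")})
    return estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",  # assumed instance type
        serializer=CSVSerializer(),
        deserializer=JSONDeserializer(),
    )
```

With the returned predictor, `extract_scores(predictor.predict(feature_rows))` yields the predicted prices, and `predictor.delete_endpoint()` removes the endpoint once testing is finished so that it stops accruing charges.
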
- Identified key features such as age, BMI, smoking status, and number of dependents.
- Handled missing data, outliers, and feature scaling to improve the model's accuracy.
- Applied one-hot encoding to categorical variables like region and smoker status.
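
To make the encoding and scaling steps above concrete, here is a small standard-library sketch. The source does not say which scaling method was used, so standardization (zero mean, unit variance) is an assumption, and the helper names `one_hot` and `standardize` are hypothetical.

```python
import statistics


def one_hot(values, categories=None):
    """Return (categories, rows): each row is a 0/1 list with one slot per category."""
    cats = sorted(set(values)) if categories is None else list(categories)
    index = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        row = [0] * len(cats)
        row[index[v]] = 1  # mark the slot that matches this value
        rows.append(row)
    return cats, rows


def standardize(values):
    """Rescale numbers to zero mean and unit variance (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

For example, `one_hot(["yes", "no", "yes"])` returns the categories `["no", "yes"]` and the rows `[[0, 1], [1, 0], [0, 1]]`. In a notebook, `pandas.get_dummies` is a common way to get the same result on a whole DataFrame column.
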
The project leveraged the Linear Learner algorithm provided by SageMaker. After training and deployment, the model was tested using real-world insurance data.
- Non-Linear Data: The features (age, BMI, etc.) exhibit non-linear relationships with the target (insurance price), making it challenging for the linear model to capture all patterns effectively.
- Sensitivity to Outliers: Linear regression is sensitive to outliers, and the dataset includes critical outliers that significantly impact prices. Removing them would reduce the model’s ability to predict real-world cases accurately.
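
The outlier sensitivity described above is easy to reproduce on toy numbers. The sketch below fits an ordinary least-squares line by hand; the ages and charges are made up for illustration and are not taken from the insurance dataset, and `fit_line` is a hypothetical helper rather than part of SageMaker.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data: charges grow by exactly 100 per year of age.
ages = [20, 30, 40, 50, 60]
charges = [2000, 3000, 4000, 5000, 6000]
clean_slope, clean_intercept = fit_line(ages, charges)
# clean_slope is 100.0, clean_intercept is 0.0

# Replace the last value with one very expensive customer.
charges_with_outlier = charges[:-1] + [30000]
outlier_slope, outlier_intercept = fit_line(ages, charges_with_outlier)
# outlier_slope is 580.0, outlier_intercept is -14400.0
```

A single extreme value moves the fitted slope from 100 to 580, so predictions for every other customer shift as well. This is why the outliers matter so much here, even though removing them would discard real high-cost cases.
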
The AWS SageMaker model for predicting medical insurance prices performed well. This project highlights the potential of cloud-based machine learning for efficient and scalable deployment in real-world applications.