Revanth-144/Heart-Disease-Modeling-Project

Heart Disease Modeling Project: Data Science Fundamentals

This project applies core data science and machine learning techniques to the UCI Heart Disease (Cleveland) dataset. It includes exploratory analysis, data preprocessing, supervised learning, linear regression, PCA-based dimensionality reduction, and clustering. The goal is to understand patterns related to heart disease and build predictive models using real clinical data.


📌 Dataset Reference

The dataset used is the UCI Heart Disease Dataset (Cleveland) with:

  • 303 patient records
  • 13 clinical features
  • 1 target (num), converted to:
    • 0 → No heart disease
    • 1 → Presence of heart disease (combining original 1–4)

📂 Project Structure

├── Final_Project.ipynb
├── data/
│   └── data.csv
└── README.md

1. EDA & Data Preprocessing

✔ Steps Performed

  • Displayed first rows, summary statistics, and dataset info
  • Identified missing values
  • Applied median imputation
  • Converted num to binary classes
  • Scaled all numerical features using StandardScaler

✔ Observations

  • Missing values: ca (4 rows), thal (2 rows)
  • Dataset standardized successfully
  • Final dataset ready for modeling
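The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is made-up stand-in data for `data/data.csv`, and the `target` column name and `feature_cols` list are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for data/data.csv (values are invented for illustration).
df = pd.DataFrame({
    "age":  [63, 67, 41, 56, 57, 44],
    "chol": [233, 286, 204, 236, 354, 263],
    "ca":   [0.0, 3.0, np.nan, 0.0, 1.0, 0.0],
    "thal": [6.0, 3.0, 3.0, np.nan, 3.0, 7.0],
    "num":  [0, 2, 0, 1, 0, 4],
})

# Count missing values per column.
missing = df.isna().sum()

# Median imputation for the columns that contain gaps.
for col in ["ca", "thal"]:
    df[col] = df[col].fillna(df[col].median())

# Collapse the original 0-4 target into binary: 0 stays 0, 1-4 become 1.
# "target" is a hypothetical column name.
df["target"] = (df["num"] > 0).astype(int)

# Standardize the features to zero mean and unit variance.
feature_cols = ["age", "chol", "ca", "thal"]
X = StandardScaler().fit_transform(df[feature_cols])
y = df["target"].to_numpy()
```

When a train/test split is involved, fitting the scaler on the training portion only keeps test-set statistics out of the model.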

2. Heart Disease Prediction (Classification)

Two models were trained and evaluated:

🔹 Logistic Regression

| Metric | Score |
|---|---|
| Accuracy | 0.869 |
| Precision | 0.812 |
| Recall | 0.929 |
| F1-Score | 0.867 |

🔹 Random Forest Classifier

| Metric | Score |
|---|---|
| Accuracy | 0.902 |
| Precision | 0.844 |
| Recall | 0.964 |
| F1-Score | 0.900 |

✔ Conclusion

The Random Forest Classifier outperformed Logistic Regression on every reported metric, most notably accuracy (0.902 vs. 0.869) and recall (0.964 vs. 0.929).
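A comparison of this kind could be set up along the following lines. The sketch uses synthetic data from `make_classification` in place of the real dataset, so its scores will not match the ones reported here; the 80/20 stratified split, `random_state` values, and hyperparameters are all assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset: 303 rows, 13 features.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=42)

# Assumed 80/20 stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Train each model and collect the four metrics used in this README.
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

With only about 60 test rows, a single misclassified patient shifts accuracy by more than a percentage point, so small differences between models are worth confirming with cross-validation.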


3. Cholesterol Level Prediction (Regression)

A Multiple Linear Regression model was built to predict serum cholesterol (chol).

✔ Model Performance

  • Mean Squared Error: 3614.52
  • R² Score: 0.106

An R² of 0.106 means the model explains only about 11% of the variance in chol, so the linear relationship between cholesterol and the available features is weak.
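To make the two regression metrics concrete, here is a small sketch that fits a linear model and computes both scores, including R² from its definition, 1 - SS_res / SS_tot. The data is synthetic (a weak linear signal buried in noise), and the split settings are assumptions, so the numbers it prints are unrelated to the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 predictors, a weak signal in one of them, large noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 12))
y = 245 + 10 * X[:, 0] + rng.normal(scale=50, size=303)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)

# R^2 by hand: share of the target's variance that the predictions account for.
ss_res = np.sum((y_test - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}  R2={r2:.3f}")
```

Because MSE is in squared target units, its square root is often easier to read as a typical prediction error in the target's own units.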

✔ Feature Correlation with Cholesterol

Most positively correlated features:

  • Age (0.209)
  • Resting ECG results (0.171)
  • Resting BP (0.130)

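Most negatively correlated features:

  • Sex (-0.200)
  • Slope (-0.004)
  • Thalach (-0.003)

Only sex shows a meaningful negative correlation; the slope and thalach coefficients are effectively zero. All correlations are weak (absolute value at most 0.21), which is consistent with the low R² of the regression model.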
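A ranking like the one above can be obtained from a pandas correlation matrix. The following is a sketch on randomly generated data, assuming Pearson correlation (the pandas default); the column values are invented, so the coefficients it prints are not the ones listed here.

```python
import numpy as np
import pandas as pd

# Invented stand-in data with a few of the dataset's column names.
rng = np.random.default_rng(1)
n = 303
age = rng.normal(54, 9, n)
sex = rng.integers(0, 2, n)
chol = 200 + 1.0 * age - 20 * sex + rng.normal(0, 45, n)
df = pd.DataFrame({
    "age": age,
    "sex": sex,
    "thalach": rng.normal(150, 23, n),
    "chol": chol,
})

# Pearson correlation of every feature with chol, strongest positive first.
corr = df.corr()["chol"].drop("chol").sort_values(ascending=False)
print(corr.round(3))
```

Reading the sorted series from the top gives the most positively correlated features, and from the bottom the most negatively correlated ones.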

4. Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality while retaining 95% of the variance.

✔ Results

  • Final reduced shape: (303, 12)
  • Explained variance curve plotted for all components
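In scikit-learn, a variance target like this can be expressed by passing a float to `PCA(n_components=...)`, which keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below uses random data of the same shape as the dataset and is only an illustration; whether the notebook used this exact call is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in with 303 rows and 13 features, two of them correlated.
rng = np.random.default_rng(2)
X = rng.normal(size=(303, 13))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks for enough components to explain that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# Cumulative explained variance, the quantity an explained-variance curve plots.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape, cumulative.round(3))
```

Plotting `cumulative` against the component index gives the explained variance curve for the retained components; fitting `PCA()` with no argument gives the curve for all of them.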

5. K-Means Clustering

Clustering was performed on PCA-transformed data.

✔ Optimal Number of Clusters

Using Elbow Method and Silhouette Scores, the optimal cluster count was:

k = 2
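One way to run both checks is to fit K-Means over a range of k and record the inertia (for the elbow plot) and the silhouette score for each. This sketch uses two synthetic, well-separated blobs rather than the PCA-transformed patient data; the candidate range of 2 to 6 and the `n_init` and `random_state` values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs in 12 dimensions (stand-in data).
centers = np.array([[-5.0] * 12, [5.0] * 12])
X, _ = make_blobs(n_samples=303, centers=centers, cluster_std=1.0,
                  random_state=3)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=3).fit(X)
    inertias[k] = km.inertia_                         # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)  # ranges from -1 to 1

# Pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)
```

The elbow method looks for the k after which inertia stops dropping sharply, while the silhouette score gives a single number to maximize, so the two are usually read together.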

✔ Visualization

A 2D scatter plot of the first two principal components shows the patient profiles separating into two groups.


📘 Technologies & Libraries Used

  • Python
  • NumPy, Pandas
  • Scikit-learn
  • Matplotlib, Seaborn
  • Jupyter Notebook

🚀 How to Run the Project

  1. Install the dependencies:
    pip install -r requirements.txt
    
  2. Place the dataset in:
    data/data.csv
    
  3. Launch the notebook:
    jupyter notebook Final_Project.ipynb
    
  4. Run all cells from start to end.
