Heart Disease Modeling Project — Data Science Fundamentals

This project applies core data science and machine learning techniques to the UCI Heart Disease (Cleveland) dataset. It includes exploratory analysis, data preprocessing, supervised learning, linear regression, PCA-based dimensionality reduction, and clustering. The goal is to understand patterns related to heart disease and build predictive models using real clinical data.

📌 Dataset Reference

The dataset used is the UCI Heart Disease Dataset (Cleveland) with:

303 patient records
13 clinical features
1 target (num), converted to:
- 0 → No heart disease
- 1 → Presence of heart disease (combining original values 1–4)
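
The conversion above can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's exact code: the toy values stand in for the real `num` column, and the output column name `target` is a hypothetical choice.

```python
import pandas as pd

# Toy stand-in for the `num` column (original values 0-4); not real patient data.
df = pd.DataFrame({"num": [0, 1, 2, 3, 4, 0]})

# Any value above 0 indicates disease, so collapse 1-4 into one positive class.
# `target` is a hypothetical column name chosen for this sketch.
df["target"] = (df["num"] > 0).astype(int)

print(df["target"].tolist())
```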

📂 Project Structure

├── Final_Project.ipynb
├── data/
│   └── data.csv
└── README.md

1. EDA & Data Preprocessing

✔ Steps Performed

Displayed first rows, summary statistics, and dataset info
Identified missing values
Applied median imputation to the missing values in ca and thal
Converted num to binary classes
Scaled all numerical features using StandardScaler
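
The imputation and scaling steps above might look roughly like this with scikit-learn. It is a sketch under assumptions: the small frame is illustrative data, and using `SimpleImputer` is one possible way to do median imputation (the notebook may instead use pandas directly).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy frame with gaps in `ca` and `thal`; values are illustrative, not real data.
df = pd.DataFrame({
    "age": [63.0, 67.0, 41.0, 56.0],
    "ca": [0.0, 3.0, np.nan, 1.0],
    "thal": [6.0, np.nan, 3.0, 3.0],
})

# Replace each missing entry with the median of its column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Standardize every column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(imputed), columns=df.columns)

print(imputed)
print(scaled.round(3))
```

When a train/test split is involved, fitting the imputer and scaler on the training portion only is a common way to avoid leaking test-set statistics into the model.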

✔ Observations

Missing values: ca (4 rows), thal (2 rows)
Dataset standardized successfully
Final dataset ready for modeling

2. Heart Disease Prediction (Classification)

Two models were trained and evaluated:

🔹 Logistic Regression

| Metric | Score |
|---|---|
| Accuracy | 0.869 |
| Precision | 0.812 |
| Recall | 0.929 |
| F1-Score | 0.867 |

🔹 Random Forest Classifier

| Metric | Score |
|---|---|
| Accuracy | 0.902 |
| Precision | 0.844 |
| Recall | 0.964 |
| F1-Score | 0.900 |

✔ Conclusion

The Random Forest Classifier outperformed Logistic Regression on all four metrics, with the highest recall (0.964) and accuracy (0.902).
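
A comparison of this kind could be set up as sketched below. The data is a synthetic stand-in generated with `make_classification`, and the 80/20 stratified split, `random_state` values, and hyperparameters are assumptions made for the example, so the printed scores will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset (303 rows, 13 features).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)

# Hold out 20% of the rows for evaluation (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and collect the four metrics on the held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```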

3. Cholesterol Level Prediction (Regression)

A Multiple Linear Regression model was built to predict serum cholesterol (chol).

✔ Model Performance

Mean Squared Error: 3614.52
R² Score: 0.106

An R² of 0.106 means the model explains only about 11% of the variance in cholesterol, which indicates weak linear relationships with the available features.
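
For reference, a regression like this can be fit and scored as sketched below. The predictors and target are synthetic (the intercept, slope, and noise level are arbitrary choices for illustration), so only the workflow carries over, not the numbers. The last lines convert the reported MSE into an RMSE, which is easier to read because it is in the same units as chol.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 12 predictors and a noisy target weakly tied to the first one.
X = rng.normal(size=(303, 12))
y = 240 + 10 * X[:, 0] + rng.normal(scale=50, size=303)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")

# RMSE implied by the MSE reported in this README.
reported_rmse = 3614.52 ** 0.5
print(round(reported_rmse, 1))
```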

✔ Feature Correlation with Cholesterol

Most positively correlated features:

Age (0.209)
Resting ECG results (0.171)
Resting BP (0.130)

Negatively correlated features (only Sex is notable; the other two are effectively zero):

Sex (-0.200)
Slope (-0.004)
Thalach (-0.003)
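
A ranking like the one above can be produced from the Pearson correlation matrix. The sketch below assumes the columns carry the UCI names (`age`, `sex`, `trestbps`, `chol`) and uses randomly generated values, so the coefficients it prints are not the ones listed in this section.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 303

# Synthetic stand-in columns; distributions are arbitrary choices for the sketch.
age = rng.normal(54, 9, n)
df = pd.DataFrame({
    "age": age,
    "sex": rng.integers(0, 2, n),
    "trestbps": rng.normal(131, 17, n),
    "chol": 200 + 2.0 * age + rng.normal(0, 50, n),
})

# Pearson correlation of every feature with chol, strongest positive first.
corr_with_chol = df.corr()["chol"].drop("chol").sort_values(ascending=False)
print(corr_with_chol.round(3))
```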

4. Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality while retaining 95% of the variance.

✔ Results

Final reduced shape: (303, 12)
Explained variance curve plotted for all components
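
In scikit-learn, passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on synthetic data follows; the random matrix is a stand-in, so the resulting shape need not match the one reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in: 303 rows, 13 features, with one nearly redundant column.
X = rng.normal(size=(303, 13))
X[:, 12] = X[:, 0] + 0.05 * rng.normal(size=303)

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep the fewest components whose cumulative variance exceeds 95%.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# Cumulative explained variance, the quantity behind an explained-variance curve.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape, round(float(cumulative[-1]), 3))
```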

5. K-Means Clustering

Clustering was performed on PCA-transformed data.

✔ Optimal Number of Clusters

Using the Elbow Method and Silhouette Scores, the optimal cluster count was:

k = 2
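
One way to run such a search is sketched below: fit K-Means for a range of k, record the inertia (for the elbow plot) and the silhouette score, then pick the k with the highest silhouette. The two-blob data from `make_blobs` is a synthetic stand-in for the PCA-transformed features, and the range of k values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-transformed data: two well-separated blobs.
X, _ = make_blobs(
    n_samples=303, centers=2, n_features=12, cluster_std=1.0, random_state=42
)

inertias = {}      # within-cluster sum of squares, plotted for the elbow method
silhouettes = {}   # mean silhouette coefficient, higher is better
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, labels)

# Choose the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k, {k: round(s, 3) for k, s in silhouettes.items()})
```

Inertia always tends to fall as k grows, which is why the elbow is read by eye, while the silhouette score gives a single number to compare across k.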

✔ Visualization

A 2D scatter plot of the first two principal components shows the two clusters as visibly distinct groups of patient profiles.

📘 Technologies & Libraries Used

Python
NumPy, Pandas
Scikit-learn
Matplotlib, Seaborn
Jupyter Notebook

🚀 How to Run the Project

Install the dependencies:
```
pip install -r requirements.txt
```
Place the dataset in:
```
data/data.csv
```
Launch the notebook:
```
jupyter notebook Final_Project.ipynb
```
Run all cells from start to end.
