This project applies core data science and machine learning techniques to the UCI Heart Disease (Cleveland) dataset. It includes exploratory analysis, data preprocessing, supervised learning, linear regression, PCA-based dimensionality reduction, and clustering. The goal is to understand patterns related to heart disease and build predictive models using real clinical data.
The dataset used is the UCI Heart Disease Dataset (Cleveland) with:
- 303 patient records
- 13 clinical features
- 1 target (`num`), converted to:
  - 0 → No heart disease
  - 1 → Presence of heart disease (combining original 1–4)
    ├── Final_Project.ipynb
    ├── data/
    │   └── data.csv
    └── README.md

- Displayed first rows, summary statistics, and dataset info
- Identified missing values
- Applied median imputation
- Converted `num` to binary classes
- Scaled all numerical features using StandardScaler
- Missing values: ca (4 rows), thal (2 rows)
- Dataset standardized successfully
- Final dataset ready for modeling
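The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is made up to stand in for `data/data.csv`, and the use of `SimpleImputer` is an assumption about how the median imputation was done.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame standing in for data/data.csv (values are illustrative only).
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 57],
    "chol": [233, 286, 204, 236, 354],
    "ca": [0.0, 3.0, np.nan, 0.0, 1.0],
    "thal": [6.0, 3.0, 3.0, np.nan, 7.0],
    "num": [0, 2, 0, 1, 4],
})

# Collapse the original 0-4 target into 0 (no disease) vs 1 (disease present).
y = (df["num"] > 0).astype(int)

# Median-impute missing values, then standardize every feature column.
features = df.drop(columns="num")
imputed = SimpleImputer(strategy="median").fit_transform(features)
X = StandardScaler().fit_transform(imputed)

print(y.tolist())
print(np.isnan(X).any())
```

In a train/test workflow, the imputer and scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into the model.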
Two models were trained and evaluated:
| Metric | Score |
|---|---|
| Accuracy | 0.869 |
| Precision | 0.812 |
| Recall | 0.929 |
| F1-Score | 0.867 |
| Metric | Score |
|---|---|
| Accuracy | 0.902 |
| Precision | 0.844 |
| Recall | 0.964 |
| F1-Score | 0.900 |
The Random Forest Classifier (second table) showed the best performance, with the highest recall (0.964) and overall accuracy (0.902).
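As a quick sanity check on the two tables, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported values. The helper name `f1_from` is hypothetical and used only for this illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the first and second tables.
table_1_f1 = f1_from(0.812, 0.929)
table_2_f1 = f1_from(0.844, 0.964)

print(round(table_1_f1, 3), round(table_2_f1, 3))
```

Both recomputed values agree with the reported F1-Scores (0.867 and 0.900) to three decimal places.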
A Multiple Linear Regression model was built to predict serum cholesterol (chol).
- Mean Squared Error: 3614.52
- R² Score: 0.106

The low R² means the available features explain only about 11% of the variance in cholesterol, which indicates a weak linear relationship.
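To make the two regression metrics concrete, the sketch below computes them from their definitions on toy values (not taken from the notebook). It then compares the model's RMSE with the RMSE implied for a predict-the-mean baseline, assuming the reported MSE and R² were computed on the same evaluation set.

```python
import math

def mse_and_r2(y_true, y_pred):
    """Return (mean squared error, R^2) computed from their definitions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return ss_res / n, 1 - ss_res / ss_tot

# Toy values for illustration only.
y_true = [200.0, 240.0, 260.0, 300.0]
y_pred = [230.0, 245.0, 255.0, 270.0]
mse, r2 = mse_and_r2(y_true, y_pred)

# Since R^2 = 1 - MSE / Var(y), the baseline MSE is MSE / (1 - R^2).
model_rmse = math.sqrt(3614.52)
baseline_rmse = math.sqrt(3614.52 / (1 - 0.106))

print(mse, round(r2, 3))
print(round(model_rmse, 1), round(baseline_rmse, 1))
```

Under that assumption, the model's typical error (about 60 mg/dl) is only slightly smaller than the error from always predicting the mean cholesterol (about 64 mg/dl).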
Most positively correlated features:
- Age (0.209)
- Resting ECG results (0.171)
- Resting BP (0.130)
Most negatively correlated features:
- Sex (-0.200)
- Slope (-0.004)
- Thalach (-0.003)
PCA was applied to reduce dimensionality while retaining 95% of the variance.
- Final reduced shape: `(303, 12)`
- Explained variance curve plotted for all components
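One way to retain a target fraction of variance in scikit-learn is to pass a float to `PCA(n_components=...)`. The sketch below assumes that approach and uses synthetic data in place of the standardized feature matrix, so the number of components it keeps will not necessarily match the notebook's.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the standardized 303 x 13 feature matrix.
X = rng.normal(size=(303, 13))
X[:, 12] = X[:, 0] + 0.01 * rng.normal(size=303)  # one nearly redundant column

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape, cumulative[-1])
```

To plot the explained variance curve for all components, the same cumulative sum can be taken from a `PCA()` fitted without `n_components`.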
Clustering was performed on PCA-transformed data.
Using the elbow method and silhouette scores, the optimal cluster count was k = 2.
A 2D scatter plot using the first two PCA components shows clear grouping of patient profiles.
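The cluster-count selection can be sketched as below. The README does not name the clustering algorithm, so K-Means is an assumption here (the elbow method is commonly applied to its inertia), and the two synthetic blobs merely stand in for the PCA-transformed data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for the PCA-transformed data.
X = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(100, 2)),
])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # elbow method: look for the bend in this curve
    silhouettes[k] = silhouette_score(X, model.labels_)  # higher is better

# Pick the k with the highest mean silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Inertia always shrinks as k grows, which is why the elbow is read visually, while the silhouette score gives a direct maximum to compare against.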
- Python
- NumPy, Pandas
- Scikit-learn
- Matplotlib, Seaborn
- Jupyter Notebook
- Install the dependencies: `pip install -r requirements.txt`
- Place the dataset in: `data/data.csv`
- Launch the notebook: `jupyter notebook Final_Project.ipynb`
- Run all cells from start to end.