An end-to-end data science project built using Kepler exoplanet data. This project includes data preprocessing, feature engineering, habitability scoring, exploratory analysis, classification modeling, and a fully interactive Streamlit dashboard.
This project aims to explore and classify Kepler Objects of Interest (KOIs) to determine their potential habitability and disposition using scientific metrics and machine learning.
The complete pipeline includes:
- 🧹 Data Cleaning & Preprocessing
- 🧪 Exploratory Data Analysis (EDA)
- 🛠️ Feature Engineering
- 🌍 Habitability Scoring System
- 🎯 KOI Disposition Classification (Machine Learning)
- 📊 Visualizations
- Loaded `KeplerExoRaw.csv`
- Removed columns with excessive missing values
- Handled missing data via imputation
- Converted types and removed duplicates
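The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `clean_kepler`, the 50% missing-value threshold, and the median/mode imputation strategy are all assumptions, and a tiny synthetic frame stands in for `KeplerExoRaw.csv`.

```python
import numpy as np
import pandas as pd


def clean_kepler(df: pd.DataFrame, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, impute remaining gaps, and remove duplicate rows.

    The threshold and the imputation strategy are illustrative assumptions.
    """
    # Keep only columns whose share of missing values is at or below the threshold.
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    out = df[keep].copy()

    # Impute numeric columns with the median and all other columns with the mode.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])

    return out.drop_duplicates().reset_index(drop=True)


# Synthetic stand-in for the raw file; column names are hypothetical here.
raw = pd.DataFrame({
    "koi_prad": [1.0, np.nan, 2.0, 2.0],
    "koi_teq": [300.0, 250.0, 400.0, 400.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "koi_disposition": ["CONFIRMED", None, "CANDIDATE", "CANDIDATE"],
})
clean = clean_kepler(raw)
```

With the real data, the input would come from `pd.read_csv("KeplerExoRaw.csv")` and the result could be written out with `clean.to_csv("cleankepler.csv", index=False)`.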
- Created features like:
  - `pos_diff_mdec`, `pos_diff_msky`: positional differences
  - `avg_err_mdec`: average error
  - `total_pos_diff`: total spatial noise
- Encoded categorical flags
- Extracted vetting year
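One way the engineered features could be derived is sketched below. The README does not give the exact formulas, so every definition here is an assumption: the input column names (`koi_dicco_*`, `koi_dikco_*`, `koi_vet_date`) are assumed to be present in the raw file, the output column `vet_year` is a hypothetical name, and the arithmetic (absolute differences, a mean of two uncertainties, a sum of differences) is only one plausible reading of the descriptions.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add positional-difference, error, and vetting-year features (assumed definitions)."""
    out = df.copy()
    # Assumption: positional difference = absolute gap between two centroid-offset estimates.
    out["pos_diff_mdec"] = (out["koi_dicco_mdec"] - out["koi_dikco_mdec"]).abs()
    out["pos_diff_msky"] = (out["koi_dicco_msky"] - out["koi_dikco_msky"]).abs()
    # Assumption: average error = mean of the two declination-offset uncertainties.
    out["avg_err_mdec"] = out[["koi_dicco_mdec_err", "koi_dikco_mdec_err"]].mean(axis=1)
    # Assumption: total spatial noise = sum of the two positional differences.
    out["total_pos_diff"] = out["pos_diff_mdec"] + out["pos_diff_msky"]
    # Vetting year taken from a date column; "vet_year" is a hypothetical name.
    out["vet_year"] = pd.to_datetime(out["koi_vet_date"]).dt.year
    return out


# Synthetic rows for illustration only.
sample = pd.DataFrame({
    "koi_dicco_mdec": [0.25, -0.5],
    "koi_dikco_mdec": [0.0, 0.5],
    "koi_dicco_msky": [0.5, 1.0],
    "koi_dikco_msky": [0.25, 0.75],
    "koi_dicco_mdec_err": [0.1, 0.3],
    "koi_dikco_mdec_err": [0.3, 0.5],
    "koi_vet_date": ["2018-08-16", "2015-09-24"],
})
features = engineer_features(sample)
```

Categorical flags could be encoded in the same function, for example with `pd.get_dummies`, depending on which columns the project treats as categorical.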
- Distributions of planetary and stellar features
- Correlation heatmaps
- Trends across vetting years
- Comparison of features across KOI classes
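The correlation and per-class comparisons listed above might be computed along these lines. The column names and the synthetic values are assumptions chosen for illustration; the project's own EDA may use different columns and plots.

```python
import pandas as pd

# Synthetic stand-in for the cleaned dataset; column names are assumed.
df = pd.DataFrame({
    "koi_prad": [1.0, 2.0, 3.0, 4.0],
    "koi_teq": [200.0, 400.0, 600.0, 800.0],
    "koi_disposition": ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE", "FALSE POSITIVE"],
})

# Pairwise Pearson correlations between numeric columns (the input to a heatmap).
corr = df.select_dtypes("number").corr()

# Mean of each feature per KOI class, for a quick class-by-class comparison.
class_means = df.groupby("koi_disposition")[["koi_prad", "koi_teq"]].mean()
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would render it as a correlation heatmap.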
Planets scored from 0 to 3 based on:
Feature | Ideal Range |
---|---|
Radius | 0.5 – 1.5 Earth radii |
Equilibrium Temp | 200 – 320 Kelvin |
Positional Noise | Total diff < 0.1 |
Each flag contributes 1 point to the `habitability_score`.
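The scoring rule in the table can be expressed as three boolean checks that are summed. The sketch below assumes the radius, equilibrium temperature, and positional noise live in columns named `koi_prad`, `koi_teq`, and `total_pos_diff`, and that the range bounds are inclusive; neither detail is stated in the table.

```python
import pandas as pd


def habitability_score(
    df: pd.DataFrame,
    radius_col: str = "koi_prad",        # assumed column name (Earth radii)
    temp_col: str = "koi_teq",           # assumed column name (Kelvin)
    noise_col: str = "total_pos_diff",   # engineered positional-noise feature
) -> pd.Series:
    """Return a 0-3 score: one point per criterion that falls in its ideal range."""
    radius_ok = df[radius_col].between(0.5, 1.5)   # inclusive bounds (assumption)
    temp_ok = df[temp_col].between(200, 320)       # inclusive bounds (assumption)
    noise_ok = df[noise_col] < 0.1
    return radius_ok.astype(int) + temp_ok.astype(int) + noise_ok.astype(int)


# Synthetic rows for illustration only.
planets = pd.DataFrame({
    "koi_prad": [1.0, 2.5, 1.2],
    "koi_teq": [288.0, 250.0, 900.0],
    "total_pos_diff": [0.05, 0.5, 0.02],
})
planets["habitability_score"] = habitability_score(planets)
```

In this example the first row meets all three criteria, the second meets only the temperature criterion, and the third meets the radius and noise criteria.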
- Goal: Predict KOI disposition: `CANDIDATE`, `CONFIRMED`, `FALSE POSITIVE`, `NOT DISPOSITIONED`
- Features used:
  - Engineered features + habitability score
- Algorithms:
  - Random Forest Classifier
- Metrics:
  - Confusion Matrix
  - Classification Report
  - PCA Visualization
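A training and evaluation run of this kind could look like the sketch below. It is not the contents of `model_train.py`: the feature subset, the 75/25 stratified split, and the hyperparameters are assumptions, and the data are random, so the printed scores carry no meaning beyond showing the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Random synthetic features and labels standing in for the engineered dataset.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "total_pos_diff": rng.random(n),
    "avg_err_mdec": rng.random(n),
    "habitability_score": rng.integers(0, 4, n),
})
labels = np.array(["CANDIDATE", "CONFIRMED", "FALSE POSITIVE", "NOT DISPOSITIONED"])
y = labels[rng.integers(0, 4, n)]

# Hold out 25% of rows, keeping class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes and columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_test, y_pred, labels=labels)
report = classification_report(y_test, y_pred, labels=labels, zero_division=0)
print(cm)
print(report)
```

Fixing `labels` explicitly keeps the confusion matrix and the report in the same class order even if a class is missing from the test split.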
- `CANDIDATE`: Probable planet, under review
- `CONFIRMED`: Verified exoplanet
- `FALSE POSITIVE`: Mistaken signal (e.g., stellar noise)
- `NOT DISPOSITIONED`: Unclassified or unreviewed
Class | Precision | Recall | F1-score |
---|---|---|---|
CANDIDATE | 0.35 | 0.19 | 0.25 |
CONFIRMED | 0.75 | 0.87 | 0.81 |
FALSE POSITIVE | 0.39 | 0.19 | 0.26 |
NOT DISPOSITIONED | 0.70 | 0.68 | 0.69 |
- PCA plot for numeric feature space
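A two-component projection of the numeric features, suitable for a scatter plot, might be computed as follows. Standardizing the features before PCA is an assumption (the README does not say whether scaling is applied), and the random matrix stands in for the real numeric columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random synthetic matrix standing in for the numeric feature columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Scale each feature to zero mean and unit variance so no single unit dominates (assumption).
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_
```

The two columns of `coords` can then be drawn with `matplotlib.pyplot.scatter(coords[:, 0], coords[:, 1])`, colored by disposition, and `explained` reports how much variance each axis retains.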
- Language: Python 3.x
- Libraries:
  - Pandas, NumPy, Matplotlib, Seaborn
  - Scikit-learn (modeling + PCA)
📦 exoplanet-habitability
┣ 📄 KeplerExoRaw.csv # Raw data
┣ 📄 cleankepler.csv # Cleaned dataset
┣ 📄 data.csv # Feature-engineered dataset
┣ 📄 featureEngineer.py # Feature engineering script
┣ 📄 model_train.py # ML model training & evaluation
┣ 📄 README.md
Hyperparameter tuning for model
SHAP or feature importance analysis
Deployment via Docker or Streamlit Cloud
Integration with real-time exoplanet APIs
NASA Exoplanet Archive
Kepler Mission