Best Model Accuracy (Random Forest with SMOTE): 87.9%
As someone transitioning into data science with an interest in the banking and finance sector, I sought to take on a project that combines both business relevance and technical depth. This project focuses on predicting startup success using Crunchbase data. I approached it from three key angles:
- Regression to understand funding patterns
- Classification to predict startup outcomes
- Clustering to discover natural groupings among startups
The goal was to determine the factors that increase the likelihood of a startup's success, measured by IPOs, acquisitions, or a combination of both, while incorporating both machine learning techniques and business intelligence.
- Python Version: 3.11
- IDE: Jupyter Notebook
- Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `statsmodels`, `scikit-learn`, `imbalanced-learn`
- Dataset: An excerpt of a dataset released to the public by Crunchbase
- Filtered out companies younger than 3 years or older than 7 years to reduce noise (from 31,000+ to ~9,400 observations)
- Created a new `status` label: `Success` = Acquired, IPO, or both; `Failure` = Closed or no exit
- Engineered `number_degrees` from MBA, PhD, MS, and Other degree counts
- Dropped columns with high missing values (e.g., `acquired_companies`, `products_number`)
- Handled missing values selectively
- Removed funding outliers (above the 99th percentile)
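The preparation steps above can be sketched in pandas. The toy DataFrame and its column names (`company_age`, `status`, the per-degree counts, `funding_total_usd`) are placeholders, not the actual Crunchbase schema:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Crunchbase excerpt; real column names may differ
df = pd.DataFrame({
    "company_age": [2, 4, 6, 8, 5, 3],
    "status": ["acquired", "ipo", "closed", "operating", "closed", "acquired"],
    "mba_degree": [1, 0, 2, 0, 1, 0],
    "phd_degree": [0, 1, 0, 0, 0, 1],
    "ms_degree": [1, 1, 0, 2, 0, 0],
    "other_degree": [0, 0, 1, 0, 1, 0],
    "funding_total_usd": [1e6, 5e7, 2e5, 1e9, 3e5, 8e6],
})

# Keep only companies between 3 and 7 years old to reduce noise
df = df[df["company_age"].between(3, 7)].copy()

# Binary label: Acquired or IPO -> Success, otherwise Failure
df["status_label"] = np.where(
    df["status"].isin(["acquired", "ipo"]), "Success", "Failure"
)

# Engineer number_degrees from the individual degree counts
degree_cols = ["mba_degree", "phd_degree", "ms_degree", "other_degree"]
df["number_degrees"] = df[degree_cols].sum(axis=1)

# Remove funding outliers above the 99th percentile
cap = df["funding_total_usd"].quantile(0.99)
df = df[df["funding_total_usd"] <= cap]
```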
EDA focused on understanding class balance and numerical relationships:
- Status Value Counts: Success vs Failure
- Log-Transformed Funding Distribution
- Correlation Heatmap of numeric variables
- Dependent variable: `average_funded`
- Significant predictors: `average_participants`, `number_degrees`, `ipo`
- `offices` and `is_acquired` were not statistically significant
- Repeated the regression using Scikit-learn's `LinearRegression`
- Log-transformed `average_funded` for normality
- Coefficients confirmed the importance of `average_participants` and `ipo`
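The scikit-learn cross-check looks roughly like this. Again the data is synthetic, generated on a multiplicative scale so that the log transform makes the relationship linear, as in the project:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 300
average_participants = rng.normal(3, 1, n)
ipo = rng.integers(0, 2, n)

# Funding generated multiplicatively, so log(funding) is roughly linear
average_funded = np.exp(
    0.8 * average_participants + 1.2 * ipo + rng.normal(0, 0.3, n)
)

X = np.column_stack([average_participants, ipo])
y = np.log(average_funded)  # log transform for normality

reg = LinearRegression().fit(X, y)
print(reg.coef_)  # coefficients for average_participants and ipo
```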
- Initially trained on imbalanced classes
- High overall accuracy but poor recall for "Success"
- Ineffective at identifying successful startups
- Used SMOTE to balance the dataset
- Improved recall and precision for the "Success" class
- More reliable at flagging high-potential startups
- Used SMOTE-balanced dataset
- Boosted accuracy from 70.7% to 87.9%
- Strong performance across all evaluation metrics
- Tuned `n_estimators`, `max_depth`, and `min_samples_split` using `GridSearchCV`
- Slight performance improvement
- Feature importance rankings remained stable
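A minimal version of that tuning step, searching the same three hyperparameters over a small illustrative grid (the grid values here are assumptions, not the project's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the SMOTE-balanced training data
X, y = make_classification(n_samples=300, random_state=0)

# Grid over the same hyperparameters named in the write-up
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```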
For a final unsupervised learning step, I applied KMeans Clustering to group startups based on:
- `category_code` (encoded)
- `average_funded`
- `average_participants`
- Scaled all features using `StandardScaler`
- Used the elbow method to determine the optimal number of clusters (k = 3)
- Visualized results using 2D scatter plots
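The scale-then-cluster pipeline can be sketched as follows, using synthetic blob data in place of the encoded startup features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Three synthetic groups standing in for the startup feature space
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Scale features so no single column dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: inertia for k = 1..6, looking for the bend in the curve
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in range(1, 7)
]

# Final fit at the chosen k
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_
```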
- Startups in similar sectors tend to attract similar funding amounts
- Higher `average_participants` is associated with higher-funding clusters
- These clusters help spot high-potential startups independent of IPO or acquisition status
This end-to-end machine learning project gave me the chance to:
- Work with a real-world business dataset
- Build interpretable and high-performing models
- Apply data cleaning, feature engineering, and EDA effectively
- Handle class imbalance with SMOTE
- Improve model performance through hyperparameter tuning
- Add unsupervised clustering to surface hidden patterns