Best Model Accuracy (Random Forest with SMOTE): 87.9%
As someone transitioning into data science with an interest in the banking and finance sector, I sought to take on a project that combines both business relevance and technical depth. This project focuses on predicting startup success using Crunchbase data. I approached it from three key angles:
- Regression to understand funding patterns
- Classification to predict startup outcomes
- Clustering to discover natural groupings among startups
The goal was to determine the factors that increase the likelihood of a startup's success, measured by IPOs, acquisitions, or a combination of both, while incorporating both machine learning techniques and business intelligence.
- Python Version: 3.11
- IDE: Jupyter Notebook
- Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `statsmodels`, `scikit-learn`, `imbalanced-learn`
- Dataset: An excerpt of a dataset released to the public by Crunchbase
- Filtered out companies younger than 3 years or older than 7 years to reduce noise (from 31,000+ to ~9,400 observations)
- Created a new `status` label: `Success` = Acquired, IPO, or both; `Failure` = Closed or no exit
- Engineered `number_degrees` from MBA, PhD, MS, and Other degree counts
- Dropped columns with high missing values (e.g., `acquired_companies`, `products_number`)
- Handled missing values selectively
- Removed funding outliers (above the 99th percentile)
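The preparation steps above can be sketched in pandas. The toy DataFrame and its column names (`company_age`, `status`, the per-degree counts, `funding_total_usd`) are placeholders, not the actual Crunchbase schema:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Crunchbase excerpt; real column names may differ
df = pd.DataFrame({
    "company_age": [2, 4, 6, 8, 5, 3],
    "status": ["acquired", "ipo", "closed", "operating", "closed", "acquired"],
    "mba_degree": [1, 0, 2, 0, 1, 0],
    "phd_degree": [0, 1, 0, 0, 0, 1],
    "ms_degree": [1, 1, 0, 2, 0, 0],
    "other_degree": [0, 0, 1, 0, 1, 0],
    "funding_total_usd": [1e6, 5e7, 2e5, 1e9, 3e5, 8e6],
})

# Keep only companies between 3 and 7 years old to reduce noise
df = df[df["company_age"].between(3, 7)].copy()

# Binary label: Acquired or IPO -> Success, otherwise Failure
df["status_label"] = np.where(
    df["status"].isin(["acquired", "ipo"]), "Success", "Failure"
)

# Engineer number_degrees from the individual degree counts
degree_cols = ["mba_degree", "phd_degree", "ms_degree", "other_degree"]
df["number_degrees"] = df[degree_cols].sum(axis=1)

# Remove funding outliers above the 99th percentile
cap = df["funding_total_usd"].quantile(0.99)
df = df[df["funding_total_usd"] <= cap]
```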
EDA focused on understanding class balance and numerical relationships:
- Status Value Counts: Success vs Failure
- Log-Transformed Funding Distribution
- Correlation Heatmap of numeric variables
- Dependent variable: `average_funded`
- Significant predictors: `average_participants`, `number_degrees`, `ipo`
- `offices` and `is_acquired` were not statistically significant
- Repeated the regression using Scikit-learn's `LinearRegression`
- Log-transformed `average_funded` for normality
- Coefficients confirmed the importance of `average_participants` and `ipo`
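The scikit-learn cross-check looks roughly like this. Again the data is synthetic, generated on a multiplicative scale so that the log transform makes the relationship linear, as in the project:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 300
average_participants = rng.normal(3, 1, n)
ipo = rng.integers(0, 2, n)

# Funding generated multiplicatively, so log(funding) is roughly linear
average_funded = np.exp(
    0.8 * average_participants + 1.2 * ipo + rng.normal(0, 0.3, n)
)

X = np.column_stack([average_participants, ipo])
y = np.log(average_funded)  # log transform for normality

reg = LinearRegression().fit(X, y)
print(reg.coef_)  # coefficients for average_participants and ipo
```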
- Initially trained on imbalanced classes
- High overall accuracy but poor recall for "Success"
- Ineffective at identifying successful startups
- Used SMOTE to balance the dataset
- Improved recall and precision for the "Success" class
- More reliable at flagging high-potential startups
- Used SMOTE-balanced dataset
- Boosted accuracy from 70.7% to 87.9%
- Strong performance across all evaluation metrics
- Tuned `n_estimators`, `max_depth`, and `min_samples_split` using `GridSearchCV`
- Slight performance improvement
- Feature importance rankings remained stable
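A minimal version of that tuning step, searching the same three hyperparameters over a small illustrative grid (the grid values here are assumptions, not the project's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the SMOTE-balanced training data
X, y = make_classification(n_samples=300, random_state=0)

# Grid over the same hyperparameters named in the write-up
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```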
For a final unsupervised learning step, I applied KMeans Clustering to group startups based on:
- `category_code` (encoded)
- `average_funded`
- `average_participants`
- Scaled all features using `StandardScaler`
- Used the elbow method to determine the optimal number of clusters (k = 3)
- Visualized results using 2D scatter plots
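The scale-then-cluster pipeline can be sketched as follows, using synthetic blob data in place of the encoded startup features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Three synthetic groups standing in for the startup feature space
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Scale features so no single column dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: inertia for k = 1..6, looking for the bend in the curve
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in range(1, 7)
]

# Final fit at the chosen k
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_
```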
- Startups in similar sectors tend to attract similar funding amounts
- Higher `average_participants` is associated with higher-funding clusters
- These clusters help spot high-potential startups independent of IPO or acquisition status
This end-to-end machine learning project gave me the chance to:
- Work with a real-world business dataset
- Build interpretable and high-performing models
- Apply data cleaning, feature engineering, and EDA effectively
- Handle class imbalance with SMOTE
- Improve model performance through hyperparameter tuning
- Add unsupervised clustering to surface hidden patterns