Predict software industry salaries using machine learning and a user-friendly Flask web app.
- Overview
- Features
- Dataset
- Exploratory Data Analysis (EDA)
- Data Cleaning & Preparation
- Model Building & Evaluation
- Hyperparameter Tuning
- Web App (Flask Deployment)
- How to Run Locally
- Results & Insights
- Contributing
- License
- Acknowledgements
This project predicts the salaries of software professionals based on various features such as company, job title, location, employment status, and more. It combines thorough data analysis, robust machine learning, and an interactive web interface for easy predictions.
- End-to-end ML pipeline: Data cleaning, EDA, feature engineering, model training, and evaluation.
- Multiple regression models: Linear Regression, Decision Tree, Random Forest, XGBoost.
- Hyperparameter tuning: RandomizedSearchCV for XGBoost.
- Interactive Flask web app: User-friendly interface for salary prediction.
- Source: Kaggle - Software Professional Salaries 2022
- Shape: 22,770 rows × 8 columns
- Features:
- Rating
- Company Name
- Job Title
- Salary (target)
- Salaries Reported
- Location
- Employment Status
- Job Roles
- Univariate & Bivariate Analysis:
- Distribution plots for salary and rating
- Job role and location frequency
- Boxplots for salary by employment status and job role
- Pairplots for numerical features
- Key Insights:
- Salary distributions are right-skewed
- Some job roles and companies dominate the dataset
- Outliers and missing values identified and handled
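As a rough illustration of the right-skew finding, the sketch below computes the population skewness of a small list and compares mean to median. The salary figures are made up for the example and are not taken from the dataset; the helper name `skewness` is also hypothetical and assumes the values are not all identical.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population (Fisher-Pearson) skewness: mean of cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes values are not all identical
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Made-up salaries: most are clustered, one is very large.
salaries = [300_000, 350_000, 400_000, 450_000, 500_000, 2_500_000]

skew = skewness(salaries)
# A positive skew and a mean above the median both point to a long right tail.
is_right_skewed = skew > 0 and mean(salaries) > median(salaries)
```

A single very large value pulls the mean well above the median, which is the pattern a right-skewed salary distribution shows.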
- Dropped rows with missing company names
- Removed an extreme outlier in salary
- Grouped rare categories in company and job title as 'Other'
- Standardized numerical features and one-hot encoded categorical features
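The preparation steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names, the `min_count` threshold, and the sample company names and ratings are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def group_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample data, not from the dataset.
companies = ["Acme", "Acme", "Acme", "Globex", "Initech", "Acme"]
grouped = group_rare(companies, min_count=2)
categories, encoded = one_hot(grouped)
scaled_ratings = standardize([3.0, 4.0, 5.0])
```

Grouping rare categories before one-hot encoding keeps the number of encoded columns small, which matters for high-cardinality fields such as company and job title.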
- Models Trained:
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- XGBoost Regressor
- Evaluation Metrics: MAE, MSE, RMSE, R²
- Best Model: XGBoost (after hyperparameter tuning)
Model | MAE | RMSE | R² |
---|---|---|---|
Linear Regression | 356,296 | 539,130 | 0.24 |
Decision Tree | 426,012 | 702,916 | -0.30 |
Random Forest | 378,252 | 590,131 | 0.09 |
XGBoost | 349,637 | 532,842 | 0.25 |
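For readers unfamiliar with the metrics in the table, the following sketch computes MAE, MSE, RMSE, and R² from their standard definitions. The function name and the toy numbers are assumptions for illustration only. It also shows how R² can go negative, as it does for the Decision Tree row: that happens when a model's squared error exceeds that of always predicting the mean of the target.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² for paired lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Toy values, not from the project.
y_true = [100.0, 200.0, 300.0]
close_fit = regression_metrics(y_true, [110.0, 190.0, 310.0])
poor_fit = regression_metrics(y_true, [300.0, 200.0, 100.0])  # R² below zero
```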
- Used RandomizedSearchCV for XGBoost with a wide parameter grid
- Best parameters improved R² to ~0.48 on the test set
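A minimal sketch of a randomized search is shown below. It is not the project's tuning code: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in so that only scikit-learn and NumPy are needed, runs on synthetic data, and the parameter values are assumptions chosen to keep the run short. Since `xgboost.XGBRegressor` follows the scikit-learn estimator interface, it could be passed as `estimator` instead; the four parameter names used here exist on both.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (not the salary dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search space; the project's actual grid is not given.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled parameter combinations
    scoring="r2",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.score(X_test, y_test)  # R² of the refit best model on held-out data
```

Randomized search samples a fixed number of combinations rather than trying every one, which keeps tuning time bounded when the search space is wide.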
- Frontend: Simple forms for user input (company, job title, location, etc.)
- Backend:
- Loads the trained model (Software Industry Salary Prediction.pkl)
- Accepts user input, preprocesses it, and predicts salary
- Displays the predicted salary on a results page
- Templates:
- index.html: Home page
- predict.html: Input form
- result.html: Prediction output
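The request flow described above might look roughly like the sketch below. It is an illustration under several assumptions, not the contents of app.py: the `/predict` route, the form field names, and `StubModel` are hypothetical, the stub replaces the pickled model so the example runs without the .pkl file, and `render_template_string` stands in for rendering result.html so that no template files are required.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)


class StubModel:
    """Hypothetical stand-in for the trained model; returns a constant."""

    def predict(self, rows):
        return [500000.0 for _ in rows]


# The real app would presumably unpickle the trained model instead, e.g.:
#   with open("Software Industry Salary Prediction.pkl", "rb") as f:
#       model = pickle.load(f)
model = StubModel()

# Assumed form field names; the actual form may differ.
FIELDS = ("company", "job_title", "location", "employment_status")

# Inline stand-in for result.html.
RESULT_TEMPLATE = "Predicted salary: {{ salary }}"


@app.route("/predict", methods=["POST"])
def predict():
    # Collect and validate the submitted form fields.
    row = {name: request.form.get(name, "").strip() for name in FIELDS}
    missing = [name for name, value in row.items() if not value]
    if missing:
        return f"Missing fields: {', '.join(missing)}", 400
    # Real preprocessing (encoding, scaling) would happen before this call.
    salary = model.predict([row])[0]
    return render_template_string(RESULT_TEMPLATE, salary=f"{salary:,.0f}")
```

Rejecting incomplete submissions with a 400 response before calling the model keeps malformed input from reaching the preprocessing step.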
- Clone the repository:
  git clone https://github.com/yourusername/software-salary-prediction.git
  cd software-salary-prediction
- Create a virtual environment and activate it:
  python -m venv venv
  # On Windows:
  venv\Scripts\activate
  # On Mac/Linux:
  source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Run the Flask app:
python app.py
- Open your browser and go to: http://127.0.0.1:5000
- XGBoost performed best, but all models had limited R², suggesting salary is influenced by additional factors not in the dataset.
- The app provides quick, accessible salary predictions for various roles and companies.
Contributions are welcome! Please open issues or pull requests for improvements, bug fixes, or new features.
This project is licensed under the MIT License. See LICENSE for details.
- Kaggle Dataset
- Scikit-learn, XGBoost, Flask
- Project by Yuva Yashvin, Yuvan Bharathi, Ritvik Marwah
For questions or feedback, please contact the project maintainers.