This project focuses on cleaning and analyzing a dataset of employee records using Python, Pandas, NumPy, and Matplotlib.
- Contains employee details such as:
  - Name
  - Age
  - City
  - Salary
  - Join Date
  - Employee ID
- Missing Data Handling
  - Filled missing values in `Age`, `Salary`, `City`, and `Name`.
- Standardization
  - Standardized city names (`NY` → `New York`, etc.).
  - Corrected name formatting and removed special characters.
  - Converted `Join Date` to datetime.
- Outliers and Validation
  - Removed unrealistic age and salary values.
  - Identified and flagged invalid emails.
- Duplicates
  - Removed duplicate entries based on `Name` or `Employee ID`.
- Feature Engineering
  - Extracted `Join_Year` from `Join Date`.
  - Created valid email formats using employee names.
  - Added a `Data_Quality_Flag` column for rows with issues.
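
The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows, the `clean_employees` helper, the median and `"Unknown"` fill values, the 18 to 70 age range, the positive-salary rule, the `example.com` domain, and the definition of a flagged row are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; the real dataset's values are not shown in this README.
raw = pd.DataFrame({
    "Employee ID": [1, 2, 2, 3, 4, 5],
    "Name": ["alice smith", "Bob@ Jones", "Bob@ Jones", None, "carol  white", "Dan Brown"],
    "Age": [29, np.nan, np.nan, 41, 230, 35],
    "City": ["NY", "New York", "New York", None, "LA", "Chicago"],
    "Salary": [52000, 61000, 61000, np.nan, 58000, -10],
    "Join Date": ["2019-03-01", "2020-07-15", "2020-07-15",
                  "2021-01-10", "2018-11-30", "not a date"],
    "Email": ["alice@corp.com", "bob.jones[at]corp.com", "bob.jones[at]corp.com",
              None, "carol@corp.com", "dan@corp.com"],
})

CITY_ALIASES = {"NY": "New York"}  # only this alias is named in the README
EMAIL_PATTERN = r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$"  # simple, assumed validity check


def clean_employees(df, min_age=18, max_age=70):
    """Hypothetical helper applying the README's cleaning steps in order."""
    df = df.copy()

    # Flag rows with missing fields or invalid emails before overwriting them.
    flag = df[["Name", "Age", "City", "Salary"]].isna().any(axis=1)
    flag |= ~df["Email"].str.match(EMAIL_PATTERN, na=False)

    # Missing data: medians for numeric columns, a placeholder for text columns.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())
    df["City"] = df["City"].fillna("Unknown")
    df["Name"] = df["Name"].fillna("Unknown")

    # Standardization: city aliases, name formatting, datetime conversion.
    df["City"] = df["City"].str.strip().replace(CITY_ALIASES)
    df["Name"] = (
        df["Name"]
        .str.replace(r"[^A-Za-z\s]", "", regex=True)  # drop special characters
        .str.replace(r"\s+", " ", regex=True)         # collapse repeated spaces
        .str.strip()
        .str.title()
    )
    # The ISO date format is an assumption about the sample; bad values become NaT.
    df["Join Date"] = pd.to_datetime(df["Join Date"], format="%Y-%m-%d", errors="coerce")
    flag |= df["Join Date"].isna()

    # Feature engineering.
    df["Join_Year"] = df["Join Date"].dt.year.astype("Int64")
    df["Email"] = df["Name"].str.lower().str.replace(" ", ".", regex=False) + "@example.com"
    df["Data_Quality_Flag"] = flag

    # Outliers: drop rows with unrealistic age or salary.
    df = df[df["Age"].between(min_age, max_age) & (df["Salary"] > 0)]

    # Duplicates: by Employee ID, then by Name (placeholder names are kept).
    df = df.drop_duplicates(subset="Employee ID")
    named = df["Name"] != "Unknown"
    df = df[~(named & df.duplicated(subset="Name"))]
    return df.reset_index(drop=True)


clean = clean_employees(raw)
print(clean)
```

Column edits run before any rows are dropped so that every assignment happens on the full copy, which avoids chained-assignment warnings. Whether outliers should be removed before the medians are computed is a design choice the README does not settle.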
- Bar chart showing average age and salary.
- Join trends per year with color-coded bars.
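
As a rough sketch of how charts like these might be drawn with Matplotlib, the example below plots the averages and the joins per year. The sample values, the three-panel layout, and the `viridis` colormap are assumptions for illustration; the README does not say how the project's figures are laid out. Age and salary get separate panels here because their scales differ by several orders of magnitude.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up cleaned data standing in for the project's real records.
df = pd.DataFrame({
    "Age": [29, 38, 41, 35],
    "Salary": [52000, 61000, 58000, 64000],
    "Join_Year": [2019, 2020, 2020, 2021],
})

averages = df[["Age", "Salary"]].mean()
joins_per_year = df["Join_Year"].value_counts().sort_index()

fig, (ax_age, ax_salary, ax_join) = plt.subplots(1, 3, figsize=(12, 4))

# Average age and average salary, one bar each.
ax_age.bar(["Age"], [averages["Age"]], color="tab:blue")
ax_age.set_title("Average age")
ax_salary.bar(["Salary"], [averages["Salary"]], color="tab:orange")
ax_salary.set_title("Average salary")

# Joins per year, one color per bar taken from a colormap.
colors = plt.cm.viridis(np.linspace(0, 1, len(joins_per_year)))
ax_join.bar(joins_per_year.index.astype(str), joins_per_year.values, color=colors)
ax_join.set_title("Employees joined per year")
ax_join.set_xlabel("Join_Year")
ax_join.set_ylabel("Count")

fig.tight_layout()
# Call plt.show() or fig.savefig("some_file.png") to render the figure.
```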
- Python
- Pandas
- NumPy
- Matplotlib
Alok Bhateshwar
GitHub: @alokbhateshwar
This project is open-source and available under the MIT License.