
Big Data ETL Pipeline for NYC Job Applications

Project Overview

This project showcases a comprehensive Big Data ETL (Extract, Transform, Load) pipeline that processes job application filings from the NYC Department of Buildings (DOB). The system is built with modern data engineering practices, including an AI-powered data enrichment step using a Large Language Model (LLM).

Key highlights of this project include:

  • Modular ETL Architecture: The ETL logic is separated into distinct extract, transform, and load stages to ensure maintainability and clarity.
  • AI Integration: A Large Language Model (Groq's LLM API) was used for intelligent extraction of job roles from unstructured text descriptions.
  • Automated Testing: The pipeline includes a suite of comprehensive unit tests for each transformation stage.
  • CI/CD Automation: A Jenkins pipeline automatically runs the tests and ETL stages appropriate to each Git branch.
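The stage separation described above can be sketched as follows. This is a minimal, Spark-free illustration of the pattern; the function names (extract_raw, transform_tables, load_tables) are illustrative, not the project's actual module layout, and the real pipeline operates on PySpark DataFrames rather than plain dicts.

```python
# Minimal sketch of the extract/transform/load separation.
# Function names are illustrative, not the project's actual modules.

def extract_raw(path):
    """Extract: read raw records from the source path (stubbed here;
    the real pipeline would use a Spark read)."""
    return [{"job_s1_no": "1001", "borough": " MN "}]

def transform_tables(records):
    """Transform: clean and reshape records into table-ready rows."""
    return [{k.lower(): v.strip() for k, v in r.items()} for r in records]

def load_tables(rows, target):
    """Load: write the transformed rows to the target store."""
    target.extend(rows)
    return len(rows)

if __name__ == "__main__":
    sink = []
    loaded = load_tables(transform_tables(extract_raw("jobs.parquet")), sink)
    print(loaded)  # 1
```

Keeping each stage behind its own function boundary is what lets each transformation be unit-tested in isolation, as described later in the testing section.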

About the Dataset:

The dataset comprises all job applications submitted through the Borough Offices, eFiling, or the HUB with a "Latest Action Date" from January 1, 2000, onwards. It does not include applications submitted via DOB NOW; a separate dataset exists for those. The dataset contains approximately 2.71 million records with 96 columns, providing detailed information for each job application.

Dataset Link: https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/about_data

Technologies Used:

This project leveraged a range of powerful tools and frameworks:

  • Python 3.13.5: The core programming language for the project.
  • PySpark & Apache Spark 4.0.0: Used as the big data and distributed data processing frameworks.
  • Groq API: The LLM service utilized for extracting job roles.
  • Jenkins: For CI/CD automation.
  • Pytest: The unit testing framework.
  • Git & GitHub: For version control and collaborative development with a branch-based strategy.

Branch-Based Development Strategy:

We adopted a branch-based development strategy using GitHub, with each new feature being developed, tested, and reviewed in its own branch before being merged into the main codebase. This approach allowed for incremental and parallel development of the pipeline's features.


Feature Branch Breakdown:

1. feature1-data-extraction-and-table-creation: This branch focused on the initial data ingestion, including reading raw parquet files, standardizing column names, and creating six core tables: Job_Applications, Properties, Applicants, Owners, Job_status, and Boroughs.
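The column-name standardization in this branch might look like the sketch below. It is a pure-Python rendering of the renaming logic; in the project it would be applied across a PySpark DataFrame (e.g. via toDF), and the sample raw headers are hypothetical, not the dataset's exact 96 columns.

```python
import re

def standardize_column(name: str) -> str:
    """Lower-case a raw column name and collapse non-alphanumeric
    characters (spaces, '#', punctuation) into underscores."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()

# Raw DOB headers mix case, spaces, and punctuation (examples hypothetical):
raw_columns = ["Job #", "Doc #", "Borough", "Initial Cost", "Latest Action Date"]
print([standardize_column(c) for c in raw_columns])
# ['job', 'doc', 'borough', 'initial_cost', 'latest_action_date']
```

With names standardized, the six core tables can each be built by selecting their relevant column subset from the one cleaned source DataFrame.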

2. feature2-3-tables-transformation: The focus here was on improving data quality by performing cleaning, renaming columns, converting data types, and handling null values for the Job_Applications, Properties, and Applicants tables.
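Row-level cleaning of this kind (trimming, type conversion, null handling) can be sketched as below. The field names and the dollar-formatted cost are assumptions for illustration; the project applies equivalent logic with PySpark column expressions rather than Python dicts.

```python
def clean_row(row: dict) -> dict:
    """Trim strings, convert a numeric field, and default bad values to None.
    Field names here are illustrative, not the project's exact schema."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip() or None   # empty strings become nulls
        cleaned[key] = value
    # Type conversion with a safe fallback for malformed values.
    try:
        raw_cost = str(cleaned.get("initial_cost")).replace("$", "").replace(",", "")
        cleaned["initial_cost"] = float(raw_cost)
    except (TypeError, ValueError):
        cleaned["initial_cost"] = None
    return cleaned

print(clean_row({"owner_name": "  ACME LLC ", "initial_cost": "$12,500"}))
# {'owner_name': 'ACME LLC', 'initial_cost': 12500.0}
```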

3. feature3-another-3-tables-transformation: This branch extended the data transformations to the Owners, Job_status, and Boroughs datasets, including data cleanup and code mapping.

4. feature4-some-few-transformation: This branch was dedicated to specialized transformations, such as mapping job status codes to descriptive names and formatting owner names.
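The two transformations named here, code mapping and name formatting, can be sketched like this. The status codes and labels below are examples only and should be checked against the dataset's data dictionary; the owner-name fields are likewise assumptions.

```python
# Illustrative subset of a code-to-description mapping; these codes and
# labels are examples, not the dataset's authoritative data dictionary.
STATUS_MAP = {"A": "Pre-Filed", "P": "Approved", "X": "Signed Off"}

def map_status(code):
    """Return the descriptive status name, keeping unknown codes as-is."""
    return STATUS_MAP.get(code, code)

def format_owner_name(first, last):
    """Title-case and join owner name parts, tolerating missing pieces."""
    parts = [p.strip().title() for p in (first, last) if p and p.strip()]
    return " ".join(parts) or None

print(map_status("A"))                      # Pre-Filed
print(format_owner_name("  JOHN ", "doe"))  # John Doe
```

In PySpark the same mapping would typically be expressed with chained when/otherwise column expressions or a broadcast-joined lookup table.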

5. feature5-LLM-usage: The final feature branch, which integrated the Groq API to enrich the data with job roles extracted from unstructured text descriptions.


AI/ML Integration:

The pipeline integrates Groq's LLM API to extract job roles from the free-text Job_Status_Descrp column. This step was crucial for standardizing the job descriptions into clean, structured role labels, significantly improving the data's usability for downstream analysis.

The workflow for this was:

  1. Read the raw Job_Status_Descrp column.
  2. Send each description to the Groq LLM API.
  3. Extract and clean the returned job role.
  4. Create a new Role column with the standardized role labels.
  5. Join the enriched data back with the original PySpark DataFrame.
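Steps 1-4 above can be sketched as follows. The Groq call sits behind an injectable ask_llm function so the cleaning logic runs without network access; the prompt wording and model name in the comment are assumptions, not the project's exact choices.

```python
def clean_role(raw: str) -> str:
    """Step 3: normalize the role text returned by the LLM."""
    return raw.strip().strip('."').title()

def extract_role(description: str, ask_llm) -> str:
    """Steps 2-4 for a single Job_Status_Descrp value. `ask_llm` takes a
    prompt string and returns the model's text answer."""
    answer = ask_llm(
        "Extract the job role from this description, "
        f"answering with the role only: {description}"
    )
    return clean_role(answer)

# Real usage would pass a thin wrapper over the Groq SDK, roughly:
#   from groq import Groq
#   client = Groq()  # reads GROQ_API_KEY from the environment
#   ask_llm = lambda prompt: client.chat.completions.create(
#       model="llama-3.1-8b-instant",  # model name is an assumption
#       messages=[{"role": "user", "content": prompt}],
#   ).choices[0].message.content

# Offline demonstration with a stubbed LLM response:
print(extract_role("APPLICATION PROCESSED - ENTIRE", lambda p: " plan examiner. "))
# Plan Examiner
```

Step 5 would then attach the resulting Role values as a new column and join them back to the source DataFrame on the job identifier.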


Continuous Integration and Testing:

The project's reliability is ensured through a robust testing and CI/CD strategy.

Testing Strategy: We used Pytest with PySpark to create comprehensive unit tests for the transformation scripts. The tests cover everything from schema validation and column renaming to type conversions and null handling, including edge cases.
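A unit test in this style might look like the sketch below. The helper under test is hypothetical; the project's actual tests exercise PySpark DataFrames, but the null-handling and edge-case pattern is the same.

```python
# Hypothetical pytest-style unit test for a null-handling helper.

def fill_nulls(value, default="UNKNOWN"):
    """Replace None or blank strings with a default label."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return default
    return value

def test_fill_nulls_handles_edge_cases():
    assert fill_nulls(None) == "UNKNOWN"
    assert fill_nulls("   ") == "UNKNOWN"
    assert fill_nulls("Queens") == "Queens"
    assert fill_nulls(0) == 0   # falsy non-null values pass through unchanged

test_fill_nulls_handles_edge_cases()
print("ok")
```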

CI/CD Pipeline with Jenkins: The Jenkins pipeline is configured for branch-based automation. When a new commit is pushed, Jenkins automatically detects the active branch and runs the appropriate tests and pipeline for that branch. This includes stages for checking out the code, setting up the environment, running tests, and executing the ETL pipeline.



About

This project builds a data pipeline to process millions of NYC building job applications. It cleans, transforms, and organizes the data using big data tools, and adds smart job role labels using AI. The pipeline is developed step-by-step in different branches, tested thoroughly, and automated using Jenkins for easy updates.
