This project showcases a comprehensive Big Data ETL (Extract, Transform, Load) pipeline that processes job application filings from the NYC Department of Buildings (DOB). The system is built with modern data engineering practices, including an AI-powered data enrichment step using a Large Language Model (LLM).
Key highlights of this project include:
- Modular ETL Architecture: The ETL logic is separated into distinct extract, transform, and load stages to ensure maintainability and clarity.
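The separation into stages can be sketched as follows. This is an illustrative outline only, using plain Python lists of dicts as stand-ins for the PySpark DataFrames the real pipeline operates on; the function names are hypothetical:

```python
# Illustrative extract/transform/load separation (plain Python stand-ins;
# the actual pipeline works on PySpark DataFrames).

def extract(raw_rows):
    """Extract stage: ingest raw records as-is."""
    return list(raw_rows)

def transform(rows):
    """Transform stage: e.g. normalize a borough field."""
    return [{**r, "borough": r["borough"].strip().title()} for r in rows]

def load(rows, sink):
    """Load stage: write transformed records to a sink."""
    sink.extend(rows)
    return len(rows)

def run_pipeline(raw_rows, sink):
    """Compose the three stages; each remains independently testable."""
    return load(transform(extract(raw_rows)), sink)
```

Keeping each stage a pure function of its input is what makes the per-stage unit testing described below practical.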
- AI Integration: A Large Language Model (Groq's LLM API) was used for intelligent extraction of job roles from unstructured text descriptions.
- Automated Testing: The pipeline includes a suite of comprehensive unit tests for each transformation stage.
- CI/CD Automation: A Jenkins pipeline was set up to automatically run the correct pipeline based on the Git branch.
The dataset comprises all job applications submitted through the Borough Offices, eFiling, or the HUB with a "Latest Action Date" from January 1, 2000, onwards. It does not include applications submitted via DOB NOW; a separate dataset exists for those. The dataset contains approximately 2.71 million records with 96 columns, providing detailed information for each job application.
Dataset Link: https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/about_data
This project leveraged a range of powerful tools and frameworks:
- Python 3.13.5: The core programming language for the project.
- PySpark & Apache Spark 4.0.0: The big data and distributed data processing frameworks.
- Groq API: The LLM service used for extracting job roles.
- Jenkins: For CI/CD automation.
- Pytest: The unit testing framework.
- Git & GitHub: For version control and collaborative development with a branch-based strategy.
We adopted a branch-based development strategy using GitHub, with each new feature being developed, tested, and reviewed in its own branch before being merged into the main codebase. This approach allowed for incremental and parallel development of the pipeline's features.
1. feature1-data-extraction-and-table-creation: This branch focused on the initial data ingestion, including reading raw parquet files, standardizing column names, and creating six core tables: Job_Applications, Properties, Applicants, Owners, Job_status, and Boroughs.
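Column standardization in this stage might look like the following sketch. The exact rules the project applies are not documented here, so this is a plausible assumption: lowercase, trim, and snake_case each raw header.

```python
import re

def standardize_column_name(name: str) -> str:
    """Lowercase, trim, and snake_case a raw column name
    (illustrative; the project's exact rules may differ)."""
    clean = name.strip().lower()
    clean = re.sub(r"[^\w]+", "_", clean)  # collapse non-alphanumerics to "_"
    return clean.strip("_")

def standardize_columns(columns):
    """Apply the rule to a whole header row."""
    return [standardize_column_name(c) for c in columns]
```

In PySpark this would typically be applied via `df.toDF(*standardize_columns(df.columns))` before splitting the data into the six core tables.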
2. feature2-3-tables-transformation: This branch focused on improving data quality: cleaning data, renaming columns, converting data types, and handling null values in the Job_Applications, Properties, and Applicants tables.
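The per-value logic behind type conversion and null handling can be sketched like this (shown outside Spark for clarity; in the pipeline the same logic would run inside `withColumn` expressions, and the null sentinels listed are assumptions):

```python
def to_int_or_none(value):
    """Convert a raw cell to int, mapping blanks and common
    null sentinels to None (sentinel list is illustrative)."""
    if value is None:
        return None
    s = str(value).strip()
    if s == "" or s.upper() in {"N/A", "NULL", "NONE"}:
        return None
    try:
        return int(float(s))  # tolerate values like "42.0"
    except ValueError:
        return None
```

Centralizing this rule in one helper keeps the cleaning behavior identical across the Job_Applications, Properties, and Applicants tables.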
3. feature3-another-3-tables-transformation: This branch extended the data transformations to the Owners, Job_status, and Boroughs datasets, including data cleanup and code mapping.
4. feature4-some-few-transformation: This branch was dedicated to specialized transformations, such as mapping job status codes to descriptive names and formatting owner names.
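Both specialized transformations are simple lookups and string normalizations. A minimal sketch, with a hypothetical code-to-name mapping (the real DOB status codes and labels differ):

```python
# Hypothetical status mapping for illustration only; the actual
# DOB code set used by the pipeline is larger and differs.
STATUS_NAMES = {"A": "Pre-Filed", "P": "Approved", "X": "Signed-Off"}

def map_status(code):
    """Map a job status code to a descriptive name."""
    return STATUS_NAMES.get(code, "Unknown")

def format_owner_name(first, last):
    """Normalize owner name casing, e.g. ' john ', 'SMITH' -> 'John Smith'."""
    return f"{first.strip().title()} {last.strip().title()}"
```

In PySpark, `map_status` would typically be expressed as a join against a small lookup DataFrame rather than a Python dict, to avoid a UDF.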
5. feature5-LLM-usage: The final feature branch, this integrated the Groq API to extract and enrich the data with job roles from unstructured text descriptions.
The LLM enrichment step used the Groq API to extract job roles from the free-text Job_Status_Descrp column. This step was crucial for standardizing the free-text job descriptions into clean, structured role labels, significantly improving the data's usability for downstream analysis.
The workflow for this was:
- Read the raw Job_Status_Descrp column.
- Send each description to the Groq LLM API.
- Extract and clean the returned job role.
- Create a new Role column with the standardized role labels.
- Join the enriched data back with the original PySpark DataFrame.
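The steps above can be sketched as follows. The Groq call itself is injected as a callable (`ask_llm`) so the workflow is testable without network access; the cleaning rules shown are assumptions about how raw LLM replies are normalized, not the project's exact code:

```python
import re

def clean_role(raw_response: str) -> str:
    """Normalize a raw LLM reply into a single role label:
    trim whitespace/quotes/trailing period, collapse spaces, title-case."""
    role = raw_response.strip().strip('"').rstrip(".")
    role = re.sub(r"\s+", " ", role)
    return role.title()

def enrich_with_roles(descriptions, ask_llm):
    """ask_llm is a callable wrapping the Groq chat-completions request;
    returns a description -> Role mapping for joining back to the DataFrame."""
    return {d: clean_role(ask_llm(d)) for d in descriptions}
```

In production, `ask_llm` would wrap `groq.Groq().chat.completions.create(...)`, and deduplicating descriptions before calling the API keeps request volume far below the row count.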
Sample output:
The project's reliability is ensured through a robust testing and CI/CD strategy.
Testing Strategy: We used Pytest with PySpark to create comprehensive unit tests for the transformation scripts. The tests cover everything from schema validation and column renaming to type conversions and null handling, including edge cases.
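A representative unit test in this style might look like the sketch below. The helper under test is a stand-in defined here for self-containment (the project's real tests exercise PySpark DataFrames), and the function and test names are illustrative:

```python
def convert_date(value):
    """Convert MM/DD/YYYY strings to ISO YYYY-MM-DD, passing nulls through."""
    if value is None:
        return None
    month, day, year = value.split("/")
    return f"{year}-{month}-{day}"

def test_convert_date_handles_valid_and_null():
    # happy path: a well-formed date is reordered into ISO format
    assert convert_date("01/31/2000") == "2000-01-31"
    # edge case: nulls must survive the transformation untouched
    assert convert_date(None) is None
```

Pytest discovers any `test_*` function automatically, so each transformation helper gets its own small, isolated test.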
CI/CD Pipeline with Jenkins: The Jenkins pipeline is configured for branch-based automation. When a new commit is pushed, Jenkins automatically detects the active branch and runs the appropriate tests and pipeline for that branch. This includes stages for checking out the code, setting up the environment, running tests, and executing the ETL pipeline.
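A declarative Jenkinsfile for such a setup might look like the minimal sketch below. The stage names, file paths (`requirements.txt`, `tests/`, `run_pipeline.py`), and branch condition are assumptions for illustration, not the project's actual configuration:

```groovy
pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps { checkout scm }   // pull the branch that triggered the build
        }
        stage('Setup') {
            steps { sh 'pip install -r requirements.txt' }
        }
        stage('Test') {
            steps { sh 'pytest tests/' }   // unit tests run on every branch
        }
        stage('Run ETL') {
            when { branch 'main' }   // full ETL only on the merged branch
            steps { sh 'python run_pipeline.py' }
        }
    }
}
```

The `when { branch ... }` directive is what lets a single Jenkinsfile behave differently per branch, matching the branch-based automation described above.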
Example workflow: