This project showcases a comprehensive Big Data ETL (Extract, Transform, Load) pipeline that processes job application filings from the NYC Department of Buildings (DOB). The system is built with modern data engineering practices, including an AI-powered data enrichment step using a Large Language Model (LLM).
Key highlights of this project include:
- Modular ETL Architecture: The ETL logic is separated into distinct extract, transform, and load stages to ensure maintainability and clarity.
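The separation into stages can be sketched as follows. This is an illustrative outline only, using plain Python lists of dicts as stand-ins for the PySpark DataFrames the real pipeline operates on; the function names are hypothetical:

```python
# Illustrative extract/transform/load separation (plain Python stand-ins;
# the actual pipeline works on PySpark DataFrames).

def extract(raw_rows):
    """Extract stage: ingest raw records as-is."""
    return list(raw_rows)

def transform(rows):
    """Transform stage: e.g. normalize a borough field."""
    return [{**r, "borough": r["borough"].strip().title()} for r in rows]

def load(rows, sink):
    """Load stage: write transformed records to a sink."""
    sink.extend(rows)
    return len(rows)

def run_pipeline(raw_rows, sink):
    """Compose the three stages; each remains independently testable."""
    return load(transform(extract(raw_rows)), sink)
```

Keeping each stage a pure function of its input is what makes the per-stage unit testing described below practical.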
- AI Integration: A Large Language Model (Groq's LLM API) was used for intelligent extraction of job roles from unstructured text descriptions.
- Automated Testing: The pipeline includes a suite of comprehensive unit tests for each transformation stage.
- CI/CD Automation: A Jenkins pipeline was set up to automatically run the correct pipeline based on the Git branch.
The dataset comprises all job applications submitted through the Borough Offices, eFiling, or the HUB with a "Latest Action Date" from January 1, 2000, onwards. It does not include applications submitted via DOB NOW; a separate dataset exists for those. The dataset contains approximately 2.71 million records with 96 columns, providing detailed information for each job application.
Dataset Link: https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2/about_data
This project leveraged a range of powerful tools and frameworks:
- Python 3.13.5: The core programming language for the project.
- PySpark & Apache Spark 4.0.0: The big data and distributed data processing frameworks.
- Groq API: The LLM service used for extracting job roles.
- Jenkins: For CI/CD automation.
- Pytest: The unit testing framework.
- Git & GitHub: For version control and collaborative development with a branch-based strategy.
We adopted a branch-based development strategy using GitHub, with each new feature being developed, tested, and reviewed in its own branch before being merged into the main codebase. This approach allowed for incremental and parallel development of the pipeline's features.
1. feature1-data-extraction-and-table-creation: This branch focused on the initial data ingestion, including reading raw parquet files, standardizing column names, and creating six core tables: Job_Applications, Properties, Applicants, Owners, Job_status, and Boroughs.
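Column standardization in this stage might look like the following sketch. The exact rules the project applies are not documented here, so this is a plausible assumption: lowercase, trim, and snake_case each raw header.

```python
import re

def standardize_column_name(name: str) -> str:
    """Lowercase, trim, and snake_case a raw column name
    (illustrative; the project's exact rules may differ)."""
    clean = name.strip().lower()
    clean = re.sub(r"[^\w]+", "_", clean)  # collapse non-alphanumerics to "_"
    return clean.strip("_")

def standardize_columns(columns):
    """Apply the rule to a whole header row."""
    return [standardize_column_name(c) for c in columns]
```

In PySpark this would typically be applied via `df.toDF(*standardize_columns(df.columns))` before splitting the data into the six core tables.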
2. feature2-3-tables-transformation: This branch focused on improving data quality: cleaning data, renaming columns, converting data types, and handling null values in the Job_Applications, Properties, and Applicants tables.
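The per-value logic behind type conversion and null handling can be sketched like this (shown outside Spark for clarity; in the pipeline the same logic would run inside `withColumn` expressions, and the null sentinels listed are assumptions):

```python
def to_int_or_none(value):
    """Convert a raw cell to int, mapping blanks and common
    null sentinels to None (sentinel list is illustrative)."""
    if value is None:
        return None
    s = str(value).strip()
    if s == "" or s.upper() in {"N/A", "NULL", "NONE"}:
        return None
    try:
        return int(float(s))  # tolerate values like "42.0"
    except ValueError:
        return None
```

Centralizing this rule in one helper keeps the cleaning behavior identical across the Job_Applications, Properties, and Applicants tables.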
3. feature3-another-3-tables-transformation: This branch extended the data transformations to the Owners, Job_status, and Boroughs datasets, including data cleanup and code mapping.
4. feature4-some-few-transformation: This branch was dedicated to specialized transformations, such as mapping job status codes to descriptive names and formatting owner names.
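Both specialized transformations are simple lookups and string normalizations. A minimal sketch, with a hypothetical code-to-name mapping (the real DOB status codes and labels differ):

```python
# Hypothetical status mapping for illustration only; the actual
# DOB code set used by the pipeline is larger and differs.
STATUS_NAMES = {"A": "Pre-Filed", "P": "Approved", "X": "Signed-Off"}

def map_status(code):
    """Map a job status code to a descriptive name."""
    return STATUS_NAMES.get(code, "Unknown")

def format_owner_name(first, last):
    """Normalize owner name casing, e.g. ' john ', 'SMITH' -> 'John Smith'."""
    return f"{first.strip().title()} {last.strip().title()}"
```

In PySpark, `map_status` would typically be expressed as a join against a small lookup DataFrame rather than a Python dict, to avoid a UDF.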
5. feature5-LLM-usage: The final feature branch, this integrated the Groq API to extract and enrich the data with job roles from unstructured text descriptions.
The LLM enrichment step used the Groq API to extract job roles from the free-text Job_Status_Descrp column. This step was crucial for standardizing the free-text job descriptions into clean, structured role labels, significantly improving the data's usability for downstream analysis.
The workflow for this was:
- Read the raw Job_Status_Descrp column.
- Send each description to the Groq LLM API.
- Extract and clean the returned job role.
- Create a new Role column with the standardized role labels.
- Join the enriched data back with the original PySpark DataFrame.
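The steps above can be sketched as follows. The Groq call itself is injected as a callable (`ask_llm`) so the workflow is testable without network access; the cleaning rules shown are assumptions about how raw LLM replies are normalized, not the project's exact code:

```python
import re

def clean_role(raw_response: str) -> str:
    """Normalize a raw LLM reply into a single role label:
    trim whitespace/quotes/trailing period, collapse spaces, title-case."""
    role = raw_response.strip().strip('"').rstrip(".")
    role = re.sub(r"\s+", " ", role)
    return role.title()

def enrich_with_roles(descriptions, ask_llm):
    """ask_llm is a callable wrapping the Groq chat-completions request;
    returns a description -> Role mapping for joining back to the DataFrame."""
    return {d: clean_role(ask_llm(d)) for d in descriptions}
```

In production, `ask_llm` would wrap `groq.Groq().chat.completions.create(...)`, and deduplicating descriptions before calling the API keeps request volume far below the row count.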
Sample output:
The project's reliability is ensured through a robust testing and CI/CD strategy.
Testing Strategy: We used Pytest with PySpark to create comprehensive unit tests for the transformation scripts. The tests cover everything from schema validation and column renaming to type conversions and null handling, including edge cases.
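A representative unit test in this style might look like the sketch below. The helper under test is a stand-in defined here for self-containment (the project's real tests exercise PySpark DataFrames), and the function and test names are illustrative:

```python
def convert_date(value):
    """Convert MM/DD/YYYY strings to ISO YYYY-MM-DD, passing nulls through."""
    if value is None:
        return None
    month, day, year = value.split("/")
    return f"{year}-{month}-{day}"

def test_convert_date_handles_valid_and_null():
    # happy path: a well-formed date is reordered into ISO format
    assert convert_date("01/31/2000") == "2000-01-31"
    # edge case: nulls must survive the transformation untouched
    assert convert_date(None) is None
```

Pytest discovers any `test_*` function automatically, so each transformation helper gets its own small, isolated test.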
CI/CD Pipeline with Jenkins: The Jenkins pipeline is configured for branch-based automation. When a new commit is pushed, Jenkins automatically detects the active branch and runs the appropriate tests and pipeline for that branch. This includes stages for checking out the code, setting up the environment, running tests, and executing the ETL pipeline.
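A declarative Jenkinsfile for such a setup might look like the minimal sketch below. The stage names, file paths (`requirements.txt`, `tests/`, `run_pipeline.py`), and branch condition are assumptions for illustration, not the project's actual configuration:

```groovy
pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps { checkout scm }   // pull the branch that triggered the build
        }
        stage('Setup') {
            steps { sh 'pip install -r requirements.txt' }
        }
        stage('Test') {
            steps { sh 'pytest tests/' }   // unit tests run on every branch
        }
        stage('Run ETL') {
            when { branch 'main' }   // full ETL only on the merged branch
            steps { sh 'python run_pipeline.py' }
        }
    }
}
```

The `when { branch ... }` directive is what lets a single Jenkinsfile behave differently per branch, matching the branch-based automation described above.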
Example workflow: