
Commit 29386d8

added project description to README.md file and corrected failure in pipeline due to no tests present
1 parent ab13af1 commit 29386d8

File tree

3 files changed (+31, -87 lines)


.github/workflows/CI.yaml

Lines changed: 6 additions & 2 deletions
@@ -31,6 +31,10 @@ jobs:
         run: |
           python -m ruff check src/ tests/
 
-      - name: Test with pytest
+      - name: Run test with pytest
         run: |
-          pytest -s tests/
+          if ls tests/*.py 1> /dev/null 2>&1; then
+            pytest -s tests/
+          else
+            echo "No tests found, skipping..."
+          fi

.gitignore

Lines changed: 4 additions & 18 deletions
@@ -86,30 +86,16 @@ profile_default/
 ipython_config.py
 
 # pyenv
-# For a library or package, you might want to ignore these files since the code is
-# intended to run in multiple environments; otherwise, check them in:
-# .python-version
+.python-version
 
 # pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
+Pipfile.lock
 
 # poetry
-# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
-# This is especially recommended for binary packages to ensure reproducibility, and is more
-# commonly ignored for libraries.
-# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
-#poetry.lock
+poetry.lock
 
 # pdm
-# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
-#pdm.lock
-# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
-# in version control.
-# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+pdm.lock
 .pdm.toml
 .pdm-python
 .pdm-build/

README.md

Lines changed: 21 additions & 67 deletions
@@ -1,75 +1,29 @@
-# Tickit Data Lake : Building a Data Lake Using AWS Resources
+# Tickit Data Lake : Building a Data Lake Using an Orchestrator + AWS Resources
 
-Welcome to the Tickit Data Lake project! This repository demonstrates the creation of a robust, 3-tier data lake
-using AWS resources. This pipeline is designed to handle the extraction, loading, and transformation of batch data.
-This pipeline automates the steps of gathering the data, extracting it, processing it, enriching it, and formatting
-it in a way that can be used by the business in downstream tasks and applications.
+## Overview
+Welcome to the Tickit Data Lake project! The Tickit Data Lake project demonstrates the construction
+of a scalable and robust 3-tier data lake on AWS, leveraging the power of Apache Airflow for orchestration
+and automation. This project provides a practical example of building a modern data pipeline capable of
+handling the extraction, loading, and transformation (ELT) of batch data, specifically designed to support
+the analytical needs of a business using the Tickit Dataset.
 
-## Project Overview
+## Key Features and Technologies:
 
-The data pipeline collects, transforms, and stores raw data files into formats tailored for business unit needs.
-This pipeline can be modified to source data from various external inputs including API endpoints, flat files,
-application logs, databases, and mobile applications. The steps in the pipeline can be performed using either
-the Python shell or Pyspark jobs.
+- Automated Orchestration: Airflow is the core orchestration engine, responsible for scheduling, monitoring,
+and managing the entire data pipeline. It defines the workflow as a Directed Acyclic Graph (DAG), ensuring
+dependencies between tasks are correctly handled. Airflow's robust features enable task retries, logging,
+and alerting, ensuring pipeline reliability.
 
-In this project, raw, untransformed data resides in on-premises NOSQL databases and is initially extracted as .csv files
-into a bronze tier S3 bucket. The pipeline works on the raw data, processing it, and subsequently storing it in
-the appropriate data lake tier as determined by business requirements. The tiers are represented as folders within
-a single S3 bucket for this project. However, each tier should be given a dedicated bucket (as it is in production
-environments).
+- AWS Integration: The project seamlessly integrates with various AWS resources, including:
+1. EC2: Reliable and highly available computing for running the orchestrator.
 
-An orchestrator triggers the extraction of the ingested raw and untransformed data into the bronze tier S3 where
-it is then moved on to the other tiers, getting enriched as it gets refined up the data lake tiers.
+1. S3: Scalable object storage for the Bronze, Silver, and Gold layers.
 
-An understanding of creating and assigning IAM roles is required as the AWS resources used are configured in such
-a way to interact with one another, i.e., a role that grants permission to let AWS Glue access the resources it needs.
-The degree of restrictions can be narrowed down based on the level of security needed. For this pipeline, I assigned
-an IAM role to access AWS Glue resources and read/write to AWS S3 and Redshift.
-
-## Medallion Architecture
-
-The data lake tiers are inspired by the medallion architecture concept from Databricks, this project features the
-three distinct data tiers:
-
-- Bronze: Raw, unprocessed data
-
-- Silver: Validated, cleaned data
-
-- Gold: Enriched, business-ready data
-
-Each tier houses its own schemas and tables, which differ based on data update frequencies and downstream use cases.
-This multi-layered approach ensures data integrity and optimizes its use for business needs.
-
-
-## Pipeline Steps
-
-### Ingesting the Data From Source
-
-The first step in establishing the three distinct data tiers is to have a data source. In production, the raw data is
-usually stored across several sources and pulled into a landing area where the data pipeline kicks off. For example,
-the raw data could be stored in a postgresql database, mysql database, mssql database and a NoSQL database
-like mongodb. In this case, four separate glue crawlers would be needed to catalog each data source.
-
-For this project, all source tables are housed in the same database and are extracted into a source S3 bucket. This
-implies that just one glue crawler is needed to catalog all tables from the data source. The source S3 bucket serves
-as the starting point of the pipeline. The seven tables are crawled and their metadata is saved in the glue data
-catalog.
-
-
-### Creating the Bronze Data Tier
-
-The bronze data tier is simply the raw data ingestion layer of the medallion architecture. There is no data cleanup
-or validation performed in this layer. The process here is simply an Extraction-Load process with no transformation
-performed. For this project, all seven tables are simply moved from the data landing bucket that houses all source
-data, into the bronze layer, in a separate bucket (usually more secure with more access restrictions compared to the
-source bucket).
-
-
-### Creating the Silver Data Tier
-
-The silver data tier takes the data a step further in its refinement as the data passes through extensive cleanup and
-validation. The silver tier sees the datatype standardization, filling and/or removal of null values, creation of
-desirable datatypes, detection and removal of duplicates, and to a certain degree, some data aggregation as some facts
-and dimension tables could be merged in the silver tier to allow downstream users utilize the data.
+1. Redshift: Scalable data warehouse used for providing a high-performance analytical database.
 
+## Value
+This project serves as a valuable example of building a modern data lake on AWS using Airflow, showcasing best
+practices for data ingestion, processing, and transformation. It provides a solid foundation for building a
+robust data platform to support a wide range of analytical needs.
 
+Feel free to fork, clone or zip the contents of this repository for your needs.
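
The new README text describes an Airflow DAG that moves batch data through the Bronze, Silver, and Gold S3 layers and then into Redshift. As a rough illustration of that flow only, here is a minimal sketch of such a DAG; the dag_id, task names, schedule, and placeholder callables are assumptions for this example and are not code from the commit:

# Minimal sketch; names and callables are hypothetical, not taken from this repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_bronze():
    """Placeholder: land raw batch extracts in the Bronze S3 layer."""


def refine_to_silver():
    """Placeholder: clean, deduplicate, and standardize data into Silver."""


def enrich_to_gold():
    """Placeholder: aggregate Silver data into business-ready Gold tables."""


def load_to_redshift():
    """Placeholder: load Gold tables into Redshift for analytics."""


with DAG(
    dag_id="tickit_data_lake",        # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # assumed batch cadence (Airflow 2.4+ argument)
    catchup=False,
    default_args={"retries": 2},      # task retries, as the README highlights
) as dag:
    bronze = PythonOperator(task_id="extract_to_bronze", python_callable=extract_to_bronze)
    silver = PythonOperator(task_id="refine_to_silver", python_callable=refine_to_silver)
    gold = PythonOperator(task_id="enrich_to_gold", python_callable=enrich_to_gold)
    warehouse = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Bronze -> Silver -> Gold -> Redshift, matching the tiers named in the README.
    bronze >> silver >> gold >> warehouse

How each task is actually implemented (Python shell, PySpark, or dedicated AWS operators) is not specified by the diff above.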
