# Tickit Data Lake : Building a Data Lake Using an Orchestrator + AWS Resources

## Overview
Welcome to the Tickit Data Lake project! This project demonstrates the construction of a scalable, robust,
3-tier data lake on AWS, leveraging Apache Airflow for orchestration and automation. It provides a practical
example of a modern data pipeline for handling the extraction, loading, and transformation (ELT) of batch data
from the Tickit dataset: the pipeline automates gathering the data, extracting it, processing it, enriching it,
and formatting it to support the analytical needs of the business in downstream tasks and applications.

## Key Features and Technologies

- Automated Orchestration: Airflow is the core orchestration engine, responsible for scheduling, monitoring,
and managing the entire data pipeline. It defines the workflow as a Directed Acyclic Graph (DAG), so that
dependencies between tasks are handled correctly, and its task retries, logging, and alerting keep the
pipeline reliable.

- AWS Integration: The project integrates with the following AWS resources:
  1. EC2: Reliable, highly available compute for running the orchestrator.
  2. S3: Scalable object storage for the Bronze, Silver, and Gold layers.
  3. Redshift: A scalable data warehouse that provides a high-performance analytical database.

## Project Overview

The data pipeline collects, transforms, and stores raw data files in formats tailored to business unit needs.
The pipeline can be modified to source data from various external inputs, including API endpoints, flat files,
application logs, databases, and mobile applications. The steps in the pipeline can be performed using either
Python shell or PySpark jobs.

In this project, raw, untransformed data resides in on-premises NoSQL databases and is initially extracted as
.csv files into a bronze tier S3 bucket. The pipeline processes the raw data and stores it in the appropriate
data lake tier as determined by business requirements. For this project, the tiers are represented as folders
within a single S3 bucket; in production environments, each tier should be given a dedicated bucket.

An orchestrator triggers the extraction of the ingested raw, untransformed data into the bronze tier, from
where the data moves on to the other tiers, becoming more refined and enriched as it climbs the data lake tiers.
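
Airflow expresses this bronze-to-silver-to-gold flow as a DAG. The sketch below is only a minimal illustration
of such a DAG: the task ids, Glue job names, region, and schedule are placeholder assumptions, not the project's
actual definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="tickit_data_lake",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",      # assumed: one batch run per day
    catchup=False,
) as dag:
    # Each task hands one tier-building step to an AWS Glue job.
    # The job names and region are placeholders.
    load_bronze = GlueJobOperator(
        task_id="load_bronze",
        job_name="tickit_bronze_job",
        region_name="us-east-1",
    )
    build_silver = GlueJobOperator(
        task_id="build_silver",
        job_name="tickit_silver_job",
        region_name="us-east-1",
    )
    build_gold = GlueJobOperator(
        task_id="build_gold",
        job_name="tickit_gold_job",
        region_name="us-east-1",
    )

    # Dependencies: bronze must load before silver is built, silver before gold.
    load_bronze >> build_silver >> build_gold
```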

An understanding of creating and assigning IAM roles is required, as the AWS resources are configured to
interact with one another, e.g., a role that grants AWS Glue permission to access the resources it needs. How
tightly the permissions are restricted depends on the level of security needed. For this pipeline, I assigned
an IAM role that gives AWS Glue access to its own resources and read/write access to S3 and Redshift.
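
As a rough illustration of that setup (not the exact role used here), the boto3 sketch below creates a Glue
service role and attaches broad AWS-managed policies for Glue, S3, and Redshift; the role name and the policy
choices are assumptions and should be narrowed down for real use.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Glue service assume the role.
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="tickit-glue-pipeline-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Broad managed policies for Glue, S3, and Redshift access; scope these down as needed.
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
]:
    iam.attach_role_policy(RoleName="tickit-glue-pipeline-role", PolicyArn=policy_arn)
```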

## Medallion Architecture

The data lake tiers are inspired by the medallion architecture concept from Databricks; this project features
three distinct data tiers:

- Bronze: Raw, unprocessed data

- Silver: Validated, cleaned data

- Gold: Enriched, business-ready data

Each tier houses its own schemas and tables, which differ based on data update frequencies and downstream use
cases. This multi-layered approach ensures data integrity and optimizes the data for business needs.

## Pipeline Steps

### Ingesting the Data From Source

The first step in establishing the three data tiers is to have a data source. In production, the raw data is
usually stored across several sources and pulled into a landing area where the data pipeline kicks off. For
example, the raw data could be stored in a PostgreSQL database, a MySQL database, an MSSQL database, and a
NoSQL database like MongoDB; in that case, four separate Glue crawlers would be needed to catalog the four
data sources.

For this project, all source tables are housed in the same database and are extracted into a source S3 bucket,
so a single Glue crawler can catalog every table from the data source. The source S3 bucket serves as the
starting point of the pipeline. The seven tables are crawled and their metadata is saved in the Glue Data
Catalog.
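
As an illustration, a single crawler like the one described above could be defined with boto3 roughly as
follows; the crawler name, catalog database, role, and S3 path are placeholders rather than the project's
real values.

```python
import boto3

glue = boto3.client("glue")

# One crawler is enough because all seven source tables live under the same bucket prefix.
glue.create_crawler(
    Name="tickit-source-crawler",                       # hypothetical crawler name
    Role="tickit-glue-pipeline-role",                   # role from the IAM step above
    DatabaseName="tickit_source",                       # catalog database for the seven tables
    Targets={"S3Targets": [{"Path": "s3://<source-bucket>/tickit/"}]},
)

# Run it once the raw .csv files have landed; table metadata ends up in the Glue Data Catalog.
glue.start_crawler(Name="tickit-source-crawler")
```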

### Creating the Bronze Data Tier

The bronze data tier is simply the raw data ingestion layer of the medallion architecture. No data cleanup or
validation is performed in this layer; the process is a plain extract-and-load with no transformation. For
this project, all seven tables are simply moved from the data landing bucket that houses the source data into
the bronze layer in a separate bucket (which is usually more secure, with tighter access restrictions than the
source bucket).
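
Because this step is a pure extract-and-load, a Glue Python shell job for it can stay very small. The boto3
sketch below copies every object under a prefix from the landing bucket to the bronze bucket; the bucket names
and prefix are placeholder assumptions.

```python
import boto3

s3 = boto3.resource("s3")
SOURCE_BUCKET = "<landing-bucket>"   # placeholder bucket names
BRONZE_BUCKET = "<bronze-bucket>"

# Copy each raw object as-is into the bronze layer: no cleanup, no transformation.
for obj in s3.Bucket(SOURCE_BUCKET).objects.filter(Prefix="tickit/"):
    s3.Object(BRONZE_BUCKET, obj.key).copy(
        {"Bucket": SOURCE_BUCKET, "Key": obj.key}
    )
```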

### Creating the Silver Data Tier

The silver data tier takes the data a step further in its refinement: the data passes through extensive
cleanup and validation. The silver tier covers datatype standardization, filling and/or removing null values,
casting columns to the desired datatypes, detecting and removing duplicates, and, to a certain degree, some
aggregation, since some fact and dimension tables may be merged in the silver tier to make the data easier for
downstream users to work with.
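
As a concrete (but hypothetical) example of this kind of silver-tier work, the PySpark sketch below
standardizes datatypes, handles nulls, and deduplicates the sales table; the column names, timestamp format,
and paths are assumptions based on the public Tickit schema, not the exact job in this repository.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tickit-silver").getOrCreate()

# Read the raw bronze-tier CSVs (placeholder path).
sales = spark.read.option("header", True).csv("s3://<bronze-bucket>/sales/")

silver_sales = (
    sales
    # Standardize datatypes.
    .withColumn("qtysold", F.col("qtysold").cast("int"))
    .withColumn("pricepaid", F.col("pricepaid").cast("decimal(10,2)"))
    .withColumn("saletime", F.to_timestamp("saletime", "M/d/yyyy HH:mm:ss"))  # assumed source format
    # Fill or drop nulls and remove duplicate rows.
    .fillna({"qtysold": 0})
    .dropna(subset=["salesid"])
    .dropDuplicates(["salesid"])
)

# Write the validated, cleaned table to the silver layer (placeholder path).
silver_sales.write.mode("overwrite").parquet("s3://<silver-bucket>/sales/")
```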

## Value
This project serves as a practical example of building a modern data lake on AWS using Airflow, showcasing
best practices for data ingestion, processing, and transformation. It provides a solid foundation for building
a robust data platform that supports a wide range of analytical needs.

Feel free to fork, clone, or download a zip of the contents of this repository for your own needs.