
Commit 42b231e

Drafting the Readme file for the repository (#81)
* description has been added
* pipeline architecture has been updated
* tech stack has been added
* draft has been added
* part has been added
* file has been updated
* line has been added
1 parent 59bfc52 commit 42b231e

File tree

1 file changed (+105 −39 lines)


README.md

Lines changed: 105 additions & 39 deletions

# 🏏 T20I Data Pipeline for Kaggle with AWS

![Made with Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
![Built on AWS](https://img.shields.io/badge/AWS-Lambda%20%7C%20S3%20%7C%20SQS-orange)
![MongoDB Atlas](https://img.shields.io/badge/MongoDB-Atlas-green)
![DynamoDB](https://img.shields.io/badge/DynamoDB-NoSQL-informational)
![Kaggle Dataset](https://img.shields.io/badge/Kaggle-T20I%20Cricket%20Dataset-blue)
![CDK](https://img.shields.io/badge/IaC-AWS%20CDK-informational)

This repository automates the end-to-end extraction, processing, and publishing of Men’s T20 International (T20I) cricket data to [Kaggle](https://www.kaggle.com/datasets/nishanthmuruganantha/mens-t20i-cricket-complete-dataset/data), sourced from [Cricsheet](https://cricsheet.org/), using AWS serverless services.

This data pipeline leverages AWS Lambda functions, EventBridge, and SQS to orchestrate an event-driven, fully serverless processing flow. Data is stored in MongoDB Atlas and AWS S3, converted into CSV format, and then automatically uploaded to Kaggle for public access and analysis.
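
As an illustration of one hop in that flow, the sketch below shows an SQS-triggered Lambda that pulls a match file from S3 and upserts it into MongoDB Atlas. The environment variables, database and collection names, and message shape are assumptions made for the example, not the repository's actual handler code.

```python
import json
import os

import boto3                     # provided by the Lambda runtime
from pymongo import MongoClient  # shipped to the function via the Lambda layer

s3 = boto3.client("s3")
mongo = MongoClient(os.environ["MONGODB_URI"])  # hypothetical environment variable
matches = mongo["t20i"]["matches"]              # hypothetical database/collection names


def handler(event, context):
    """For each SQS record, fetch the referenced match JSON from S3 and upsert it into MongoDB."""
    for record in event["Records"]:
        message = json.loads(record["body"])  # assumed shape: {"bucket": ..., "key": ...}
        obj = s3.get_object(Bucket=message["bucket"], Key=message["key"])
        match = json.loads(obj["Body"].read())

        # Use the Cricsheet file name (e.g. "1234567.json") as the match identifier.
        match_id = message["key"].rsplit("/", 1)[-1].removesuffix(".json")
        matches.replace_one({"match_id": match_id}, {"match_id": match_id, **match}, upsert=True)

    return {"processed": len(event["Records"])}
```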

The dataset is kept current with automated weekly updates, delivering up-to-date and reliable cricket data without manual effort.
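
Each weekly run ends by publishing a refreshed dataset version to Kaggle. A minimal sketch using the `kaggle` package's `KaggleApi` client is shown below; the output folder name and version notes are illustrative.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads KAGGLE_USERNAME / KAGGLE_KEY (or ~/.kaggle/kaggle.json)

# The folder must contain the refreshed CSV files plus a dataset-metadata.json
# describing the Kaggle dataset; "csv_output" is an illustrative name.
api.dataset_create_version(
    folder="csv_output",
    version_notes="Automated weekly refresh from Cricsheet",
    quiet=True,
)
```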

All critical steps in the workflow send real-time execution status updates via a Telegram bot.
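
Such a notification can be a single call to the Telegram Bot API's `sendMessage` endpoint. The sketch below assumes the bot token and chat ID reach the Lambda as environment variables; the real pipeline may source them from Parameter Store or Secrets Manager instead.

```python
import os

import requests  # shipped to the function via the Lambda layer

TELEGRAM_SEND_MESSAGE = "https://api.telegram.org/bot{token}/sendMessage"


def notify(message: str) -> None:
    """Send a pipeline status update to the configured Telegram chat."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]  # hypothetical environment variable
    chat_id = os.environ["TELEGRAM_CHAT_ID"]  # hypothetical environment variable

    response = requests.post(
        TELEGRAM_SEND_MESSAGE.format(token=token),
        json={"chat_id": chat_id, "text": message},
        timeout=10,
    )
    response.raise_for_status()


# Example usage at the end of a pipeline step:
# notify("✅ Weekly Cricsheet download completed")
```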

---
## Pipeline Architecture Overview ⚙️

The data pipeline is designed using a fully serverless, event-driven architecture on AWS, ensuring scalability, efficiency, and automation throughout the data lifecycle.

Here’s how the workflow operates:

![Pipeline Architecture](pipeline_architecture.svg)

---
## Tech Stack 🧰

| Category                | Tools & Services                         |
|-------------------------|------------------------------------------|
| Programming Language    | Python                                   |
| AWS Services            | Lambda, CloudWatch, EventBridge, SQS, S3, DynamoDB, Parameter Store, Secrets Manager |
| Database                | MongoDB Atlas                            |
| Infrastructure as Code  | AWS CDK (Python)                         |
| Data Publishing         | Kaggle API                               |
| Notifications           | Telegram Bot API                         |
| Documentation           | draw.io (diagrams.net)                   |

---

## 🧱 Infrastructure as Code (IaC)

This project embraces the **Infrastructure as Code (IaC)** philosophy, using **[AWS CDK](https://docs.aws.amazon.com/cdk/)** (in Python) to provision and manage cloud resources.

### 💡 Advantages of leveraging IaC

- **Version Control**: All infrastructure is declared in code and tracked in Git.
- **Centralized Control**: All AWS resources are organized and deployed under a **single CDK stack**, making them easier to maintain, modify, and tear down.
- **Automation**: No manual clicks in the AWS console—deployment is fully automated.

### 🧰 Resources Defined via CDK

With AWS CDK, the following resources are created and configured programmatically (a condensed example follows the table):

| Resource Type                | Purpose                                                        |
|------------------------------|----------------------------------------------------------------|
| 🪣 **S3 Buckets**            | For storing downloaded and processed files                     |
| 🧮 **DynamoDB Table**        | To track processed match files                                 |
| 🧠 **Lambda Functions**      | For each pipeline task (download, extract, transform, upload)  |
| 🔁 **SQS & EventBridge**     | To trigger Lambdas asynchronously                              |
| 🔐 **IAM Roles**             | With scoped permissions for security                           |
| 🧾 **SSM Parameters**        | For storing API keys, tokens, and config                       |
| 📆 **CloudWatch Schedulers** | To run jobs on a weekly basis                                  |
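
For orientation, here is a condensed sketch of how such a stack can be declared with AWS CDK v2 in Python. Construct IDs, asset paths, and the exact schedule are illustrative and do not reproduce the repository's actual stack definition.

```python
from aws_cdk import (
    Duration,
    Stack,
    aws_dynamodb as dynamodb,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_lambda_event_sources as event_sources,
    aws_s3 as s3,
    aws_sqs as sqs,
)
from constructs import Construct


class T20IPipelineStack(Stack):
    """Illustrative stack: weekly download trigger feeding an SQS-driven transform Lambda."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Storage for downloaded and processed match files.
        match_bucket = s3.Bucket(self, "MatchFilesBucket")

        # Tracks which Cricsheet match files have already been processed.
        dynamodb.Table(
            self,
            "ProcessedMatchesTable",
            partition_key=dynamodb.Attribute(name="match_id", type=dynamodb.AttributeType.STRING),
        )

        # Queue that fans individual match files out to the transform Lambda.
        match_queue = sqs.Queue(self, "MatchFileQueue", visibility_timeout=Duration.minutes(5))

        transform_fn = _lambda.Function(
            self,
            "TransformFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="transform_handler.handler",
            code=_lambda.Code.from_asset("dist/transform_handler.zip"),
            timeout=Duration.minutes(5),
        )
        transform_fn.add_event_source(event_sources.SqsEventSource(match_queue))
        match_bucket.grant_read(transform_fn)

        download_fn = _lambda.Function(
            self,
            "DownloadFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="download_handler.handler",
            code=_lambda.Code.from_asset("dist/download_handler.zip"),
            timeout=Duration.minutes(5),
        )
        match_bucket.grant_write(download_fn)

        # Weekly schedule that kicks off the download step.
        events.Rule(
            self,
            "WeeklyDownloadSchedule",
            schedule=events.Schedule.rate(Duration.days(7)),
            targets=[targets.LambdaFunction(download_fn)],
        )
```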

---
## 📦 Code Packaging

For every code change, this project uses the `build_packages` and `cdk deploy` commands to package and deploy the code, respectively.

### 🛠 Lambda Packaging Utility (`src\build\build_packages.py`)

This utility script automates the packaging process for both:

- 📦 **AWS Lambda Layers** (for dependencies like `pymongo`, `kaggle`, `requests`)
- 🧾 **Lambda Handler Zips** (one per Lambda function's handler code)

#### Purpose

The `build_packages.py` script streamlines the deployment workflow through the following steps (a simplified sketch follows the list):

1. **Building a Lambda Layer:**
   - Creates a source distribution (`.tar.gz`) using your `setup.py`
   - Extracts it into a `site-packages` directory
   - Installs dependencies listed in `requirements.txt` using a Lambda-compatible environment
   - Zips the `site-packages` directory for deployment as an AWS Lambda Layer

2. **Zipping Lambda Code:**
   - Each handler file is compressed into its own zip archive, ready for deployment

3. **Cleaning Up:**
   - Removes temporary folders and tarballs to keep your workspace clean
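
A simplified sketch of those steps is shown below. The folder layout is assumed, and the sdist-and-extract step is approximated with a direct `pip install` of the project; it is not the script's exact implementation.

```python
import shutil
import subprocess
from pathlib import Path

DIST = Path("dist")                    # illustrative output folder
LAYER_ROOT = DIST / "layer"
SITE_PACKAGES = LAYER_ROOT / "python"  # Lambda layers expect a top-level "python/" directory


def build_layer() -> None:
    """Install the project and its requirements into python/ and zip the result as a layer."""
    SITE_PACKAGES.mkdir(parents=True, exist_ok=True)

    # Project's own source code (stands in for the sdist-and-extract step).
    subprocess.run(["pip", "install", ".", "--target", str(SITE_PACKAGES), "--no-deps"], check=True)

    # Dependencies from requirements.txt, pinned to wheels compatible with the Lambda runtime.
    subprocess.run(
        ["pip", "install", "-r", "requirements.txt", "--target", str(SITE_PACKAGES),
         "--platform", "manylinux2014_x86_64", "--only-binary=:all:",
         "--python-version", "3.9", "--implementation", "cp"],
        check=True,
    )

    shutil.make_archive(str(DIST / "lambda_layer"), "zip", root_dir=LAYER_ROOT)


def zip_handlers(handlers_dir: str = "src/lambda_functions") -> None:
    """Compress each handler file into its own archive, ready for `cdk deploy`."""
    for handler in Path(handlers_dir).glob("*.py"):
        staging = DIST / handler.stem
        staging.mkdir(parents=True, exist_ok=True)
        shutil.copy(handler, staging)
        shutil.make_archive(str(DIST / handler.stem), "zip", root_dir=staging)
        shutil.rmtree(staging)  # clean up the temporary staging folder


if __name__ == "__main__":
    build_layer()
    zip_handlers()
```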

---

## 🚀 Deployment

After packaging, all the resources are deployed with the CDK command:

```bash
cdk deploy
```

- This command deploys S3 buckets, Lambda functions, DynamoDB tables, EventBridge rules, SQS queues, IAM roles, SSM parameters and CloudWatch schedulers.
- Everything is deployed in one go via a single CDK stack, making the infrastructure highly repeatable and version-controlled.

---

## 🤝 Contribution

This project is a solo build by me, but if you'd like to raise issues, fork, or explore, feel free to open a discussion or submit a pull request.

## 📄 License

This project is licensed under the MIT License – see the [LICENSE](LICENSE) file for details.

© <a href="https://github.com/NishanthMuruganantham">Nishanth</a>

---
