|
| 1 | +# 🏏 T20I Data Pipeline for Kaggle with AWS |
1 | 2 |
|
2 |
| -# Welcome to your CDK Python project! |
| 3 | + |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | + |
| 8 | + |
3 | 9 |
|
4 |
| -This is a blank project for CDK development with Python. |
| 10 | +This repository automates the end-to-end extraction, processing, and publishing of Men’s T20 International (T20I) cricket data to [Kaggle](https://www.kaggle.com/datasets/nishanthmuruganantha/mens-t20i-cricket-complete-dataset/data), sourced from [Cricsheet](https://cricsheet.org/), using AWS serverless services. |
5 | 11 |
|
6 |
| -The `cdk.json` file tells the CDK Toolkit how to execute your app. |
| 12 | +This data pipeline leverages AWS Lambda functions, EventBridge, and SQS to orchestrate an event-driven, fully serverless processing flow. Data is stored in MongoDB Atlas and AWS S3, converted into CSV format, and then automatically uploaded to Kaggle for public access and analysis. |
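As a rough illustration of the CSV conversion step mentioned above, the sketch below flattens a few match records into CSV text using only the Python standard library. The `records_to_csv` helper and the field names (`match_id`, `team_1`, `team_2`, `winner`) are hypothetical placeholders and are not taken from this repository.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # silently drop keys not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Placeholder record; a MongoDB document would also carry an `_id` key,
# which is dropped here because it is not listed in `fieldnames`.
sample = [
    {"_id": "abc", "match_id": 1, "team_1": "Team A", "team_2": "Team B", "winner": "Team A"},
]
csv_text = records_to_csv(sample, ["match_id", "team_1", "team_2", "winner"])
```

Writing to an in-memory buffer keeps the helper independent of where the CSV ends up, whether that is a local file or an S3 object.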
7 | 13 |
|
8 |
| -This project is set up like a standard Python project. The initialization |
9 |
| -process also creates a virtualenv within this project, stored under the `.venv` |
10 |
| -directory. To create the virtualenv it assumes that there is a `python3` |
11 |
| -(or `python` for Windows) executable in your path with access to the `venv` |
12 |
| -package. If for any reason the automatic creation of the virtualenv fails, |
13 |
| -you can create the virtualenv manually. |
| 14 | +The dataset is kept current with automated weekly updates, delivering up-to-date and reliable cricket data without manual effort. |
14 | 15 |
|
15 |
| -To manually create a virtualenv on MacOS and Linux: |
| 16 | +All critical steps in the workflow send real-time execution status updates via a Telegram bot. |
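A status update of this kind could be built roughly as follows. This is a minimal sketch, assuming the Telegram Bot API `sendMessage` method is called with a JSON body; the `build_status_request` helper, the token, the chat ID, and the message text are all hypothetical and do not come from this repository.

```python
import json
import urllib.request


def build_status_request(bot_token, chat_id, text):
    """Build (but do not send) a Telegram Bot API `sendMessage` request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder credentials; real values would come from a secret store.
request = build_status_request("TOKEN", "CHAT_ID", "download step: success")
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

Separating request construction from sending makes the notification logic easy to check without network access.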
16 | 17 |
|
17 |
| -``` |
18 |
| -$ python -m venv .venv |
19 |
| -``` |
| 18 | +--- |
| 19 | +## Pipeline Architecture Overview ⚙️ |
20 | 20 |
|
21 |
| -After the init process completes and the virtualenv is created, you can use the following |
22 |
| -step to activate your virtualenv. |
| 21 | +The data pipeline is designed using a fully serverless, event-driven architecture on AWS, ensuring scalability, efficiency, and automation throughout the data lifecycle. |
23 | 22 |
|
24 |
| -``` |
25 |
| -$ source .venv/bin/activate |
26 |
| -``` |
| 23 | +Here’s how the workflow operates: |
27 | 24 |
|
28 |
| -If you are a Windows platform, you would activate the virtualenv like this: |
| 25 | + |
29 | 26 |
|
30 |
| -``` |
31 |
| -% .venv\Scripts\activate.bat |
32 |
| -``` |
| 27 | +--- |
33 | 28 |
|
34 |
| -Once the virtualenv is activated, you can install the required dependencies. |
| 29 | +## Tech Stack 🧰 |
35 | 30 |
|
36 |
| -``` |
37 |
| -$ pip install -r requirements.txt |
38 |
| -``` |
39 | 31 |
|
40 |
| -At this point you can now synthesize the CloudFormation template for this code. |
| 32 | +| Category | Tools & Services | |
| 33 | +|------------------------|------------------------------------------| |
| 34 | +| Programming Language | Python | |
| 35 | +| AWS Services | Lambda, CloudWatch, EventBridge, SQS, S3, DynamoDB, Parameter Store, Secrets Manager | |
| 36 | +| Database | MongoDB Atlas | |
| 37 | +| Infrastructure as Code | AWS CDK (Python) | |
| 38 | +| Data Publishing | Kaggle API | |
| 39 | +| Notifications | Telegram Bot API | |
| 40 | +| Documentation | draw.io (diagrams.net) | |
41 | 41 |
|
| 42 | +--- |
| 43 | + |
| 44 | +## 🧱 Infrastructure as Code (IaC) |
| 45 | + |
| 46 | +This project follows the **Infrastructure as Code (IaC)** approach, using **[AWS CDK](https://docs.aws.amazon.com/cdk/)** (in Python) to provision and manage cloud resources. |
| 47 | + |
| 48 | + |
| 49 | +### 💡 Advantages of leveraging IaC |
| 50 | + |
| 51 | + |
| 52 | +- **Version Control**: All infrastructure is declared in code and tracked in Git. |
| 53 | +- **Centralized Control**: All AWS resources are organized and deployed under a **single CDK stack**, making them easier to maintain, modify, and tear down. |
| 54 | +- **Automation**: No manual clicks in the AWS console; deployment is fully automated. |
| 55 | + |
| 56 | +### 🧰 Resources Defined via CDK |
| 57 | + |
| 58 | +With AWS CDK, the following resources are created and configured programmatically: |
| 59 | + |
| 60 | +| Resource Type | Purpose | |
| 61 | +|------------------------|--------------------------------------------------------| |
| 62 | +| 🪣 **S3 Buckets** | For storing downloaded and processed files | |
| 63 | +| 🧮 **DynamoDB Table** | To track processed match files | |
| 64 | +| 🧠 **Lambda Functions** | For each pipeline task (download, extract, transform, upload) | |
| 65 | +| 🔁 **SQS & EventBridge** | To trigger Lambdas asynchronously | |
| 66 | +| 🔐 **IAM Roles** | With scoped permissions for security | |
| 67 | +| 🧾 **SSM Parameters** | For storing API keys, tokens, and config | |
| 68 | +| 📆 **CloudWatch Schedulers** | To run jobs on a weekly basis | |
| 69 | + |
| 70 | + |
| 71 | +--- |
| 72 | +## 📦 Code Packaging |
| 73 | + |
| 74 | +For every code change, this project uses the `build_packages` script and the `cdk deploy` command to package and deploy the code, respectively. |
| 75 | + |
| 76 | +### 🛠 Lambda Packaging Utility (`src\build\build_packages.py`) |
| 77 | + |
| 78 | +This utility script automates the packaging process for both: |
| 79 | + |
| 80 | +- 📦 **AWS Lambda Layers** (for dependencies like `pymongo`, `kaggle`, `requests`) |
| 81 | +- 🧾 **Lambda Handler Zips** (one archive per Lambda function code file) |
| 82 | + |
| 83 | +#### Purpose |
| 84 | + |
| 85 | +The `build_packages.py` script streamlines the deployment workflow by: |
| 86 | + |
| 87 | +1. **Building a Lambda Layer:** |
| 88 | + - Creates a source distribution (`.tar.gz`) using your `setup.py` |
| 89 | + - Extracts it into a `site-packages` directory |
| 90 | + - Installs dependencies listed in `requirements.txt` using a Lambda-compatible environment |
| 91 | + - Zips the `site-packages` directory for deployment as an AWS Lambda Layer |
| 92 | + |
| 93 | +2. **Zipping Lambda Code:** |
| 94 | + - Each handler file is compressed into its own zip archive, ready for deployment |
| 95 | + |
| 96 | +3. **Cleaning Up:** |
| 97 | + - Removes temporary folders and tarballs to keep your workspace clean |
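The zipping steps above can be sketched with the standard library alone. This is not the repository's `build_packages.py`; the `zip_handlers` and `zip_layer` helpers are hypothetical, and placing layer contents under a `python/` prefix is an assumption based on the usual layout for Python Lambda layers, not something confirmed by this project.

```python
import zipfile
from pathlib import Path


def zip_handlers(handlers_dir, output_dir):
    """Write one zip archive per top-level .py handler file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for handler in sorted(Path(handlers_dir).glob("*.py")):
        archive = output_dir / f"{handler.stem}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            # Store the file at the archive root so the module is importable.
            zf.write(handler, arcname=handler.name)
        archives.append(archive)
    return archives


def zip_layer(site_packages_dir, archive_path):
    """Zip a site-packages tree under a `python/` prefix (assumed layer layout)."""
    site_packages_dir = Path(site_packages_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(site_packages_dir.rglob("*")):
            if path.is_file():
                relative = path.relative_to(site_packages_dir)
                zf.write(path, arcname=(Path("python") / relative).as_posix())
    return Path(archive_path)
```

A call such as `zip_handlers("src/handlers", "dist")` would then produce one archive per handler; both directory names here are illustrative only.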
| 98 | + |
| 99 | +--- |
| 100 | + |
| 101 | +## 🚀 Deployment |
| 102 | +After packaging, all resources are deployed with a single CDK command: |
| 103 | + |
| 104 | +```bash |
| 105 | +cdk deploy |
42 | 106 | ```
|
43 |
| -$ cdk synth |
44 |
| -``` |
45 | 107 |
|
46 |
| -To add additional dependencies, for example other CDK libraries, just add |
47 |
| -them to your `setup.py` file and rerun the `pip install -r requirements.txt` |
48 |
| -command. |
| 108 | +- This command deploys S3 buckets, Lambda functions, DynamoDB tables, EventBridge rules, SQS queues, IAM roles, SSM parameters, and CloudWatch schedulers. |
| 109 | + |
| 110 | +- Everything is deployed in one go via a single CDK stack, making the infrastructure highly repeatable and version-controlled. |
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +## 🤝 Contribution |
| 115 | + |
| 116 | +This project is a solo build by me, but if you'd like to raise issues, fork, or explore, feel free to open a discussion or submit a pull request. |
| 117 | + |
| 118 | +## 📄 License |
49 | 119 |
|
50 |
| -## Useful commands |
| 120 | +This project is licensed under the MIT License – see the [LICENSE](LICENSE) file for details. |
51 | 121 |
|
52 |
| - * `cdk ls` list all stacks in the app |
53 |
| - * `cdk synth` emits the synthesized CloudFormation template |
54 |
| - * `cdk deploy` deploy this stack to your default AWS account/region |
55 |
| - * `cdk diff` compare deployed stack with current state |
56 |
| - * `cdk docs` open CDK documentation |
| 122 | +©<a href="https://github.com/NishanthMuruganantham">Nishanth</a> |
57 | 123 |
|
58 |
| -Enjoy! |
| 124 | +--- |