
🏏 T20I Data Pipeline for Kaggle with AWS


This repository automates the end-to-end extraction, processing, and publishing of Men’s T20 International (T20I) cricket data to Kaggle, sourced from Cricsheet, using AWS serverless services.

This data pipeline leverages AWS Lambda functions, EventBridge, and SQS to orchestrate an event-driven, fully serverless processing flow. Data is stored in MongoDB Atlas and AWS S3, converted into CSV format, and then automatically uploaded to Kaggle for public access and analysis.
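To make the hand-off concrete, here is a minimal sketch of an SQS-triggered Lambda handler in such a flow; the message fields and processing step are illustrative assumptions, not code from this repository:

```python
import json

def handler(event, context):
    """Process a batch of SQS messages delivered by the Lambda trigger."""
    for record in event["Records"]:           # standard SQS event shape
        body = json.loads(record["body"])
        match_id = body["match_id"]           # hypothetical message field
        # ... transform the match JSON, then persist it to MongoDB Atlas / S3
        print(f"Processed match {match_id}")
    return {"processed": len(event["Records"])}
```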

The dataset is kept current with automated weekly updates, delivering up-to-date and reliable cricket data without manual effort.
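For reference, the publishing step can be as small as the sketch below, using the official Kaggle API client; the folder path and version note are assumptions, and the folder must contain a dataset-metadata.json identifying the dataset:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

def publish_dataset(folder: str = "output/csv") -> None:
    """Push the refreshed CSVs as a new version of an existing Kaggle dataset."""
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json or env vars
    api.dataset_create_version(folder, version_notes="Weekly automated refresh")
```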

All critical steps in the workflow send real-time execution status updates via a Telegram bot.
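Such a notification boils down to a single call to the Telegram Bot API's sendMessage endpoint. Below is a minimal sketch of the pattern, assuming the token and chat ID live in environment variables (hypothetical names):

```python
import os

import requests

def notify(message: str) -> None:
    """Send a pipeline status update to a Telegram chat via the Bot API."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]   # hypothetical variable names
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    response = requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        json={"chat_id": chat_id, "text": message},
        timeout=10,
    )
    response.raise_for_status()

# e.g. notify("✅ Download step completed: new match files fetched")
```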


Pipeline Architecture Overview ⚙️

The data pipeline is designed using a fully serverless, event-driven architecture on AWS, ensuring scalability, efficiency, and automation throughout the data lifecycle.

Here’s how the workflow operates:

(Pipeline architecture diagram)


Tech Stack 🧰

| Category | Tools & Services |
| --- | --- |
| Programming Language | Python |
| AWS Services | Lambda, CloudWatch, EventBridge, SQS, S3, DynamoDB, Parameter Store, Secrets Manager |
| Database | MongoDB Atlas |
| Infrastructure as Code | AWS CDK (Python) |
| Data Publishing | Kaggle API |
| Notifications | Telegram Bot API |
| Documentation | draw.io (diagrams.net) |

🧱 Infrastructure as Code (IaC)

This project embraces the Infrastructure as Code (IaC) philosophy, using AWS CDK (in Python) to provision and manage cloud resources.

💡 Advantages of leveraging IaC

  • Version Control: All infrastructure is declared in code and tracked in Git.
  • Centralized Control: All AWS resources are organized and deployed under a single CDK stack, making them easier to maintain, modify, and tear down.
  • Automation: No manual clicks in the AWS console—deployment is fully automated.

🧰 Resources Defined via CDK

With AWS CDK, the following resources are created and configured programmatically:

| Resource Type | Purpose |
| --- | --- |
| 🪣 S3 Buckets | Store downloaded and processed files |
| 🧮 DynamoDB Table | Track processed match files |
| 🧠 Lambda Functions | One per pipeline task (download, extract, transform, upload) |
| 🔁 SQS & EventBridge | Trigger Lambdas asynchronously |
| 🔐 IAM Roles | Scoped permissions for security |
| 🧾 SSM Parameters | Store API keys, tokens, and config |
| 📆 CloudWatch Schedulers | Run jobs on a weekly basis |
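For orientation, here is a condensed sketch of how such a stack can be declared with AWS CDK v2 in Python; construct IDs, handler names, and asset paths are illustrative, not the repository's actual values:

```python
from aws_cdk import (
    Duration, Stack,
    aws_dynamodb as dynamodb,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_s3 as s3,
)
from constructs import Construct

class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "DataBucket")        # downloaded + processed files
        table = dynamodb.Table(
            self, "ProcessedMatches",                 # tracks processed match files
            partition_key=dynamodb.Attribute(
                name="match_id", type=dynamodb.AttributeType.STRING
            ),
        )
        download_fn = _lambda.Function(
            self, "DownloadFn",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="download_handler.handler",       # illustrative handler path
            code=_lambda.Code.from_asset("dist/download_handler.zip"),
            timeout=Duration.minutes(5),
        )
        bucket.grant_read_write(download_fn)          # scoped IAM permissions
        table.grant_read_write_data(download_fn)

        # Weekly EventBridge schedule kicks off the pipeline
        events.Rule(
            self, "WeeklySchedule",
            schedule=events.Schedule.rate(Duration.days(7)),
            targets=[targets.LambdaFunction(download_fn)],
        )
```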

📦 Code Packaging

For every code change, this project uses the build_packages script to package the code and the cdk deploy command to deploy it.

🛠 Lambda Packaging Utility (src\build\build_packages.py)

This utility script automates the packaging process for both:

  • 📦 AWS Lambda Layers (for dependencies like pymongo, kaggle, requests)
  • 🧾 Lambda Handler Zips (one zip archive per Lambda function's handler code)

Purpose

The build_packages.py script streamlines the deployment workflow (a condensed sketch follows this list) by:

  1. Building a Lambda Layer:

    • Creates a source distribution (.tar.gz) using your setup.py
    • Extracts it into a site-packages directory
    • Installs dependencies listed in requirements.txt using a Lambda-compatible environment
    • Zips the site-packages directory for deployment as an AWS Lambda Layer
  2. Zipping Lambda Code:

    • Each handler file is compressed into its own zip archive, ready for deployment
  3. Cleaning Up:

    • Removes temporary folders and tarballs to keep your workspace clean
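The sketch below captures the spirit of that workflow; directory names and the handler layout are assumptions, not the script's exact implementation:

```python
import shutil
import subprocess
from pathlib import Path

BUILD_DIR = Path("build")
LAYER_DIR = BUILD_DIR / "layer" / "python"   # Lambda layers expect a top-level python/ dir
HANDLERS = Path("src/handlers")              # hypothetical handler location

def build_layer() -> None:
    # Install dependencies into the layer directory
    subprocess.run(
        ["pip", "install", "-r", "requirements.txt", "--target", str(LAYER_DIR)],
        check=True,
    )
    # Produces build/layer.zip with python/ at the archive root
    shutil.make_archive(str(BUILD_DIR / "layer"), "zip", root_dir=LAYER_DIR.parent)

def zip_handlers() -> None:
    # Compress each handler module into its own deployment archive
    for handler in HANDLERS.glob("*.py"):
        staging = BUILD_DIR / f"stage_{handler.stem}"
        staging.mkdir(parents=True, exist_ok=True)
        shutil.copy(handler, staging)
        shutil.make_archive(str(BUILD_DIR / handler.stem), "zip", root_dir=staging)
        shutil.rmtree(staging)               # remove temporary folders

if __name__ == "__main__":
    BUILD_DIR.mkdir(exist_ok=True)
    build_layer()
    zip_handlers()
```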

🚀 Deployment

After packaging, all resources are deployed with a single CDK command:

cdk deploy
  • This command deploys S3 buckets, Lambda functions, DynamoDB tables, EventBridge rules, SQS queues, IAM roles, SSM parameters, and CloudWatch schedulers.

  • Everything is deployed in one go via a single CDK stack, making the infrastructure highly repeatable and version-controlled.


🤝 Contribution

This project is a solo build, but you're welcome to raise issues, fork the repo, or explore the code; feel free to open a discussion or submit a pull request.

📄 License

This project is licensed under the MIT License – see the LICENSE file for details.

©Nishanth

