This repository automates the end-to-end extraction, processing, and publishing of Men’s T20 International (T20I) cricket data to Kaggle, sourced from Cricsheet, using AWS serverless services.
This data pipeline leverages AWS Lambda functions, EventBridge, and SQS to orchestrate an event-driven, fully serverless processing flow. Data is stored in MongoDB Atlas and AWS S3, converted into CSV format, and then automatically uploaded to Kaggle for public access and analysis.
The dataset is kept current with automated weekly updates, delivering up-to-date and reliable cricket data without manual effort.
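The Kaggle upload itself comes down to publishing a new version of the dataset through the Kaggle API. A minimal sketch using the official `kaggle` package is shown below; the folder path and version note are illustrative assumptions, not the pipeline's actual values:

```python
# Minimal sketch of the Kaggle publishing step, using the official `kaggle`
# package. Folder path and version note are illustrative assumptions;
# credentials are expected via KAGGLE_USERNAME / KAGGLE_KEY (or kaggle.json).
from kaggle.api.kaggle_api_extended import KaggleApi

def publish_weekly_update(csv_folder: str) -> None:
    api = KaggleApi()
    api.authenticate()
    # Creates a new version of an existing dataset from the folder contents;
    # the folder must contain a dataset-metadata.json identifying the dataset.
    api.dataset_create_version(
        folder=csv_folder,
        version_notes="Automated weekly T20I update",
    )

if __name__ == "__main__":
    publish_weekly_update("/tmp/t20i_csv")  # hypothetical local path
```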
All critical steps in the workflow send real-time execution status updates via a Telegram bot.
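Each such notification is a single HTTP call to the Telegram Bot API. A minimal sketch follows; in the actual pipeline the bot token and chat ID would come from Parameter Store / Secrets Manager rather than being passed around as plain arguments:

```python
# Minimal sketch of a pipeline status notification via the Telegram Bot API.
import requests

def notify(bot_token: str, chat_id: str, text: str) -> None:
    # sendMessage is the standard Bot API method for posting a text message
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    response = requests.post(url, json={"chat_id": chat_id, "text": text}, timeout=10)
    response.raise_for_status()

# Example: notify(token, chat_id, "✅ Cricsheet download step completed")
```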
The data pipeline is designed using a fully serverless, event-driven architecture on AWS, ensuring scalability, efficiency, and automation throughout the data lifecycle.
The workflow is orchestrated end to end with the following tools and services:
Category | Tools & Services |
---|---|
Programming Language | Python |
AWS Services | Lambda, CloudWatch, EventBridge, SQS, S3, DynamoDB, Parameter Store, Secrets Manager |
Database | MongoDB Atlas |
Infrastructure as Code | AWS CDK (Python) |
Data Publishing | Kaggle API |
Notifications | Telegram Bot API |
Documentation | draw.io (diagrams.net) |
This project follows the Infrastructure as Code (IaC) philosophy, using AWS CDK (in Python) to provision and manage cloud resources.
- Version Control: All infrastructure is declared in code and tracked in Git.
- Centralized Control: All AWS resources are organized and deployed under a single CDK stack, making them easier to maintain, modify, and tear down.
- Automation: No manual clicks in the AWS console—deployment is fully automated.
With AWS CDK, the following resources are created and configured programmatically:
Resource Type | Purpose |
---|---|
🪣 S3 Buckets | For storing downloaded and processed files |
🧮 DynamoDB Table | To track processed match files |
🧠 Lambda Functions | For each pipeline task (download, extract, transform, upload) |
🔁 SQS & EventBridge | To trigger Lambdas asynchronously |
🔐 IAM Roles | With scoped permissions for security |
🧾 SSM Parameters | For storing API keys, tokens, and config |
📆 CloudWatch Schedulers | To run jobs on a weekly basis |
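To give a feel for how the resources in the table above are declared, here is a trimmed-down CDK (Python) sketch. The construct IDs, handler names, and asset paths are illustrative assumptions, not the stack's actual definitions:

```python
# Trimmed-down sketch of a CDK (Python) stack wiring a few of the resources
# listed above. Construct IDs, handler names, and asset paths are assumptions.
from aws_cdk import (
    Duration,
    Stack,
    aws_dynamodb as dynamodb,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_s3 as s3,
)
from constructs import Construct

class CricketPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # S3 bucket for downloaded and processed match files
        raw_bucket = s3.Bucket(self, "MatchDataBucket")

        # DynamoDB table that tracks which match files have been processed
        tracker = dynamodb.Table(
            self, "ProcessedMatchesTable",
            partition_key=dynamodb.Attribute(
                name="match_id", type=dynamodb.AttributeType.STRING
            ),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
        )

        # Lambda that downloads the weekly Cricsheet archive
        downloader = _lambda.Function(
            self, "DownloadFn",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="download_handler.handler",
            code=_lambda.Code.from_asset("dist/download_handler.zip"),
            timeout=Duration.minutes(5),
        )

        # Scoped grants instead of broad IAM policies
        raw_bucket.grant_read_write(downloader)
        tracker.grant_read_write_data(downloader)

        # Weekly EventBridge rule that kicks off the pipeline
        events.Rule(
            self, "WeeklySchedule",
            schedule=events.Schedule.rate(Duration.days(7)),
            targets=[targets.LambdaFunction(downloader)],
        )
```

Grant helpers such as `grant_read_write` are what produce the scoped IAM roles mentioned in the table.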
For every code change, this project uses the `build_packages.py` script and the `cdk deploy` command to package and deploy the code, respectively.
This utility script automates the packaging process for both:

- 📦 AWS Lambda Layers (for dependencies like `pymongo`, `kaggle`, and `requests`)
- 🧾 Lambda Handler Zips (the code files of each Lambda function)
The `build_packages.py` script streamlines the deployment workflow as follows (a simplified sketch of these steps appears after the list):

- Building a Lambda Layer:
  - Creates a source distribution (`.tar.gz`) using your `setup.py`
  - Extracts it into a `site-packages` directory
  - Installs dependencies listed in `requirements.txt` using a Lambda-compatible environment
  - Zips the `site-packages` directory for deployment as an AWS Lambda Layer
- Zipping Lambda Code:
  - Each handler file is compressed into its own zip archive, ready for deployment
- Cleaning Up:
  - Removes temporary folders and tarballs to keep your workspace clean
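Below is a simplified sketch of those steps. The directory layout, file names, and exact pip invocation are assumptions for illustration and may differ from the real script:

```python
# Illustrative sketch of build_packages.py; directory layout, file names, and
# exact pip flags are assumptions and may differ from the real script.
import glob
import shutil
import subprocess
import tarfile
from pathlib import Path

SITE_PACKAGES = Path("build/site-packages")

def build_layer() -> None:
    """sdist -> site-packages -> zip, ready to publish as a Lambda Layer."""
    shutil.rmtree("build", ignore_errors=True)
    SITE_PACKAGES.mkdir(parents=True)

    # 1. Create a source distribution (.tar.gz) from setup.py
    subprocess.run(["python", "setup.py", "sdist"], check=True)

    # 2. Extract the tarball into the site-packages directory
    sdist = sorted(Path("dist").glob("*.tar.gz"))[-1]
    with tarfile.open(sdist) as tar:
        tar.extractall(SITE_PACKAGES)

    # 3. Install dependencies (pymongo, kaggle, requests, ...) into the same dir
    subprocess.run(
        ["pip", "install", "-r", "requirements.txt", "--target", str(SITE_PACKAGES)],
        check=True,
    )

    # 4. Zip the directory for upload as a Lambda Layer
    #    (a published layer normally nests its contents under a python/ prefix)
    shutil.make_archive("dist/lambda_layer", "zip", SITE_PACKAGES)

def zip_handlers() -> None:
    """Compress each handler file into its own deployment archive."""
    for handler in glob.glob("lambdas/*_handler.py"):  # hypothetical layout
        name = Path(handler).stem
        staging = Path("build") / name
        staging.mkdir(parents=True, exist_ok=True)
        shutil.copy(handler, staging)
        shutil.make_archive(f"dist/{name}", "zip", staging)

if __name__ == "__main__":
    build_layer()
    zip_handlers()
    shutil.rmtree("build", ignore_errors=True)  # remove temporary folders
```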
After packaging, all resources are deployed with the single CDK command `cdk deploy`:

- This command deploys the S3 buckets, Lambda functions, DynamoDB tables, EventBridge rules, SQS queues, IAM roles, SSM parameters, and CloudWatch schedulers.
- Everything is deployed in one go via a single CDK stack, making the infrastructure highly repeatable and version-controlled.
This project is a solo build by me, but if you'd like to raise issues, fork, or explore, feel free to open a discussion or submit a pull request.
This project is licensed under the MIT License – see the LICENSE file for details.