Foodmandu Scraper extracts restaurant information from Foodmandu using Scrapy and Playwright. It captures details such as restaurant URLs, images, names, addresses, and cuisines, and stores the extracted data in a SQLite database.
To set up the project, follow these steps:

```shell
git clone https://github.com/rheaacharya77/foodmandu-scraper.git
cd foodmandu-scraper
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Here's an overview of the key components in the Foodmandu Scraper project:

- `/.github/workflows/`: Contains the GitHub Actions workflow files for automation.
- `/foodmandu/`: The main project directory with all the Scrapy components.
  - `/spiders/`: Contains the spider `restaurants.py` that defines the scraping logic.
  - `items.py`: Defines the data structure for scraped data.
  - `middlewares.py`: Manages custom middleware for Scrapy.
  - `pipelines.py`: Processes and stores data items after scraping.
  - `settings.py`: Configures settings for Scrapy.
- `foodmandu.db`: The SQLite database where scraped data is stored.
- `requirements.txt`: Lists all the dependencies required to run the project.
- `scrapy.cfg`: Configuration file for Scrapy projects.
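The data structure declared in `items.py` can be pictured roughly as follows. This is an illustrative sketch using a dataclass so it runs standalone; the actual project would declare the same field names as `scrapy.Field()` entries on a `scrapy.Item`, and the class name `RestaurantItem` is an assumption:

```python
from dataclasses import dataclass

# Sketch of the fields captured for each restaurant (names mirror the
# database schema; the real items.py would use scrapy.Item instead).
@dataclass
class RestaurantItem:
    url: str      # link to the restaurant's page on Foodmandu
    image: str    # URL of the restaurant's image
    name: str     # restaurant name
    address: str  # restaurant address
    cuisine: str  # type of cuisine offered
```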
Modify the scraping settings in `settings.py` as needed, then run the scraper with:

```shell
scrapy crawl restaurants
```
The scraper employs two main pipelines in `pipelines.py` for processing and storing scraped data:

- A SQLite pipeline that manages database interactions by establishing a connection, creating a fresh `restaurants` table, and inserting scraped data.
- A deduplication pipeline that eliminates duplicate data by checking against a set of visited URLs and dropping any repeats during the scraping session.
These pipelines ensure efficient data storage and integrity by managing database operations and eliminating duplicate data.
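In outline, the two pipelines behave like the sketch below. The class names are illustrative, the code operates on plain dicts so it runs standalone, and the signatures are simplified (Scrapy's pipeline hooks also receive a `spider` argument, and duplicates would normally be dropped by raising `DropItem`):

```python
import sqlite3

class DuplicatesPipeline:
    """Drops any item whose URL was already seen this session."""
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item):
        if item["url"] in self.seen_urls:
            return None  # real Scrapy code would raise scrapy.exceptions.DropItem
        self.seen_urls.add(item["url"])
        return item

class SQLitePipeline:
    """Connects to SQLite, recreates the restaurants table, inserts items."""
    def __init__(self, db_path="foodmandu.db"):
        self.db_path = db_path

    def open_spider(self):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("DROP TABLE IF EXISTS restaurants")
        self.conn.execute(
            """CREATE TABLE restaurants (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   url TEXT, image TEXT, name TEXT,
                   address TEXT, cuisine TEXT)"""
        )

    def process_item(self, item):
        self.conn.execute(
            "INSERT INTO restaurants (url, image, name, address, cuisine) "
            "VALUES (?, ?, ?, ?, ?)",
            (item["url"], item["image"], item["name"],
             item["address"], item["cuisine"]),
        )
        return item

    def close_spider(self):
        self.conn.commit()
        self.conn.close()
```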
The scraped restaurant data is stored in a SQLite database, utilizing a table with the following schema:

- `id`: An auto-incrementing integer that serves as the primary key.
- `url`: Text field storing the restaurant's URL.
- `image`: Text field storing the URL of the restaurant's image.
- `name`: Text field for the restaurant's name.
- `address`: Text field for the restaurant's address.
- `cuisine`: Text field describing the type of cuisine offered by the restaurant.
This schema is designed to capture essential details about each restaurant, facilitating easy access and analysis of the collected data.
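Because the data lands in an ordinary SQLite file, analysis takes only a few lines of standard-library Python. The snippet below uses an in-memory database with sample rows so it runs standalone; against the real data you would connect to `foodmandu.db` instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "foodmandu.db" for the real data
conn.execute(
    """CREATE TABLE restaurants (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           url TEXT, image TEXT, name TEXT, address TEXT, cuisine TEXT)"""
)
conn.executemany(
    "INSERT INTO restaurants (url, image, name, address, cuisine) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("https://example.com/r/1", "a.jpg", "Momo House", "Kathmandu", "Nepali"),
        ("https://example.com/r/2", "b.jpg", "Pizza Hub", "Lalitpur", "Italian"),
        ("https://example.com/r/3", "c.jpg", "Thakali Kitchen", "Kathmandu", "Nepali"),
    ],
)

# Count restaurants per cuisine, most common first.
counts = conn.execute(
    "SELECT cuisine, COUNT(*) FROM restaurants "
    "GROUP BY cuisine ORDER BY COUNT(*) DESC"
).fetchall()
print(counts)  # [('Nepali', 2), ('Italian', 1)]
```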
GitHub Actions is used to automate the scraping process and ensure our data is always up to date. The workflow, defined in `.github/workflows/actions.yml`, performs the following tasks:

- Trigger: It's set to run automatically every Saturday at 1:45 PM UTC. Additionally, it can be manually triggered via GitHub's `workflow_dispatch` event.
- Environment Setup: Prepares an Ubuntu environment, sets up Python 3.10, and installs all necessary dependencies from `requirements.txt`.
- Data Scraping: Executes our Scrapy spider named `restaurants` to scrape the latest restaurant data.
- Commit: Any changes in the data are committed to the repository with a timestamp.
- Push: Updates the main branch with the latest data.
This automated workflow minimizes manual effort and keeps our data fresh with scheduled and on-demand runs.
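A workflow implementing the steps above might look like the following. This is an illustrative reconstruction, not the repository's actual `actions.yml`; action versions and step names are assumptions:

```yaml
# Illustrative shape of .github/workflows/actions.yml (the real file may differ).
name: Scrape Foodmandu

on:
  schedule:
    - cron: "45 13 * * 6"   # every Saturday at 13:45 UTC
  workflow_dispatch:         # allow manual runs from the Actions tab

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: scrapy crawl restaurants
      - name: Commit and push updated data
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@github.com"
          git add foodmandu.db
          git commit -m "Update data: $(date -u)" || echo "No changes to commit"
          git push
```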
Contributions are welcome! Here's how to contribute:
- Fork the repo and clone your fork.
- Create a branch for your changes.
- Make your changes and test them.
- Commit your changes with clear messages.
- Submit a pull request (PR) with a detailed description of your changes.
Thank you for helping improve the Foodmandu Scraper!