
Releases: Togeee12/web-scraper-project


18 Aug 09:14


v1.0.0 – Initial Release

Overview

First public release of the Web Scraping Tool: a simple, Python-based CLI for extracting links, email addresses, social media profiles, author names, phone numbers, images, documents, tables, and page metadata.

Features

  • Extracts:
    • Links
    • Email addresses
    • Social media profiles (Facebook, Twitter, Instagram, etc.)
    • Author names
    • Phone numbers (country-specific)
    • Images (with optional download)
    • Documents (PDF, DOCX, XLSX, etc.)
    • Tables (with optional CSV export)
    • Metadata (title, meta tags)
  • Output to terminal (with colors) or file
  • Supports TXT, JSON, CSV, Markdown, Excel, and SQLite formats
  • Recursive and parallel scraping
  • Live preview mode
  • Scheduled scraping
  • Data filtering and processing (deduplication, sorting)
  • Modular codebase for easy extension
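To illustrate the kind of extraction the features above describe, here is a minimal, self-contained sketch of link and email extraction using only the standard library. This is not the repository's actual implementation (the real tool likely uses richer parsing); the function names and regex are illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Simplified email pattern for illustration; real-world matching is messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags as the parser walks the HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract(html):
    """Return (links, emails) found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    emails = EMAIL_RE.findall(html)
    return parser.links, emails

html = '<a href="https://example.com">site</a> contact: admin@example.com'
links, emails = extract(html)
print(links)   # ['https://example.com']
print(emails)  # ['admin@example.com']
```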

Usage

  1. Clone the repo:
git clone https://github.com/Togeee12/web-scraper-project.git
    cd web-scraper-project
  2. Install dependencies:
    pip install -r requirements.txt
  3. Run the script:
    python main.py --url <website_url> --output <terminal|file> [options]
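The command-line interface above can be sketched with `argparse`. This is an assumption about how `main.py` might wire up the documented flags, not the repository's actual code; flag names and choices are taken from the argument list in this release note.

```python
import argparse

def build_parser():
    """Minimal argparse sketch mirroring the documented flags;
    the real main.py may define these differently."""
    p = argparse.ArgumentParser(description="Web Scraping Tool (sketch)")
    p.add_argument("--url", required=True, help="Website URL to scrape")
    p.add_argument("--output", choices=["terminal", "file"], default="terminal")
    p.add_argument("--format", choices=["txt", "json", "csv", "md", "xlsx", "sqlite"],
                   default="txt")
    p.add_argument("--country", default="US", help="Country code for phone numbers")
    return p

# Example invocation: scrape a URL and write JSON output to a file.
args = build_parser().parse_args(
    ["--url", "https://example.com", "--output", "file", "--format", "json"]
)
print(args.format)  # json
```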

Key Arguments:

  • --url (required): Website URL to scrape.
  • --output: Output mode (terminal or file).
  • --format: File format (txt, json, csv, md, xlsx, sqlite).
  • --filename: Output filename.
  • --country: Country code for phone numbers (default: US).
  • --depth: Depth for recursive scraping.
  • --recursive: Enable recursive scraping.
  • --parallel: Enable parallel scraping.
  • --urls: List of URLs for parallel scraping.
  • --max-workers: Number of parallel workers.
  • --schedule: Schedule scraping every X hours.
  • --schedule-output: Output file for scheduled scraping.
  • --filter-keyword: Filter results by keyword.
  • --filter-regex: Filter results by regex pattern.
  • --process: Deduplicate and sort data.
  • --download-images: Download images locally.
  • --live-preview: Enable live preview mode.
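The filtering and processing flags (`--filter-keyword`, `--filter-regex`, `--process`) can be sketched as a small pipeline. The `process` function below is a hypothetical stand-in for whatever the tool does internally, shown only to make the flags' semantics concrete.

```python
import re

def process(items, keyword=None, pattern=None, dedupe=True, sort=True):
    """Apply the steps the CLI flags describe: keyword filter,
    regex filter, deduplication, then sorting."""
    if keyword:
        items = [i for i in items if keyword.lower() in i.lower()]
    if pattern:
        rx = re.compile(pattern)
        items = [i for i in items if rx.search(i)]
    if dedupe:
        items = list(dict.fromkeys(items))  # drop repeats, keep first-seen order
    if sort:
        items = sorted(items)
    return items

data = ["b@x.com", "a@x.com", "b@x.com", "c@y.org"]
print(process(data, keyword="x.com"))  # ['a@x.com', 'b@x.com']
```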

🛠️ Dependencies

Install all dependencies with:

pip install -r requirements.txt

🙏 Acknowledgments

  • Created by Togeee12
  • Thanks to the developers of the open-source Python libraries this project depends on