A flexible, Python-based web scraping utility to extract data from a curated list of URLs. This project logs the success and failure of requests, handles exceptions gracefully, and outputs results to a JSON file. Designed for beginners and experienced developers alike.
- Scrapes a list of URLs from a file (`urls.txt`)
- Automatically logs:
  - ✅ Successful scrapes
  - ❌ Failed requests (with error reasons)
- Saves the successfully scraped data into `scraped_data.json`
- Provides a cleaned list of valid URLs via `urls_clean.txt`
- Modular and easily extendable
```
Web-Scraper/
├── web-scraper.py       # Main scraping logic
├── urls.txt             # Input URLs to scrape
├── urls_clean.txt       # Output of working URLs (auto-generated)
├── scraped_data.json    # Final scraped content (auto-generated)
├── requirements.txt     # List of dependencies
└── README.md            # Project documentation
```
- Python 3.7+
- `requests`
- `beautifulsoup4`
- `urllib3`
- `logging` (Python standard library)
Install dependencies:
```bash
pip install -r requirements.txt
```
- Add the URLs you want to scrape into `urls.txt`, one per line.
- Run the scraper:

  ```bash
  python web-scraper.py
  ```

- Check your results:
  - `scraped_data.json`: Scraped HTML or textual content
  - `urls_clean.txt`: Filtered URLs that worked
  - Logs in the console will tell you which URLs failed
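Under the hood, the workflow above boils down to: read `urls.txt`, fetch each URL, log the outcome, and write the two output files. The sketch below illustrates that flow; it is not the literal contents of `web-scraper.py`, and names like `scrape_all` and the use of `raise_for_status()` are assumptions.

```python
import json
import logging

import requests

# Log format chosen to match the console output shown in this README
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(levelname)s - %(message)s")


def scrape_all(url_file="urls.txt"):
    """Illustrative sketch: fetch every URL in url_file, logging successes and failures."""
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]

    results, working = [], []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            logging.error("Failed to fetch %s: %s", url, exc)
            continue
        logging.info("Scraped %s", url)
        results.append({"url": url, "content": response.text})
        working.append(url)

    # Persist the outputs described above
    with open("scraped_data.json", "w") as f:
        json.dump(results, f, indent=2)
    with open("urls_clean.txt", "w") as f:
        f.write("\n".join(working) + "\n")


if __name__ == "__main__":
    scrape_all()
```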
Console Log
```
2025-04-22 23:35:10,039 - INFO - Scraped https://example.com/
2025-04-22 23:35:21,373 - ERROR - Failed to fetch https://www.amazon.com/s?k=laptops: 503 Server Error
```
scraped_data.json

```json
[
  {
    "url": "https://example.com",
    "content": "<!doctype html>..."
  },
  ...
]
```

Want to scrape specific elements or parse structured data like tables or product listings? Just extend the logic in `web-scraper.py` using BeautifulSoup!
```python
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.title.string
```
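For example, here is a hedged sketch of pulling rows out of the first `<table>` on a page. The target URL and table layout are made up for illustration; adapt the selectors to the site you are actually scraping.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target URL -- replace with one of your own from urls.txt
response = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every row of the first <table> as a list of cell texts
rows = []
table = soup.find("table")
if table:
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

print(rows)
```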
- Some pages (like Amazon) actively block bots and may require headers, user-agent spoofing, or Selenium (see the sketch below).
- API URLs that need authentication (e.g. NYT, Coindesk) may return `401 Unauthorized` or `403 Forbidden`.
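If you want to try the header approach mentioned above, a minimal sketch looks like this. The header values are just examples, and sending browser-like headers is no guarantee that a site such as Amazon will serve the request.

```python
import requests

headers = {
    # Example desktop-browser user-agent; swap in any realistic value
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://www.amazon.com/s?k=laptops",
                        headers=headers, timeout=10)
print(response.status_code)
```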
Built with curiosity, Python, and lots of trial & error 🚀