# Arachnida

A powerful web scraping and image metadata analysis toolkit for cybersecurity professionals. Part of the 42 Beirut Cybersecurity Piscine.
## Table of Contents

- About
- Features
- Installation
- Usage
- Examples
- Project Structure
- Technical Details
- Security Considerations
- Bonus Features
- License
## About

Arachnida is a comprehensive web security toolkit consisting of two powerful tools:
- 🕸️ Spider - An intelligent web crawler that recursively downloads images from websites
- 🦂 Scorpion - A sophisticated image metadata viewer and scrubber
This project demonstrates advanced concepts in web scraping, HTTP protocols, metadata extraction, and secure file handling - essential skills for cybersecurity professionals.
Achievement: 125/100 (Perfect Score + Bonus)
## Features

### 🕸️ Spider

- ✅ Recursive website crawling with depth control
- ✅ Multi-format image support (JPG, PNG, GIF, BMP)
- ✅ Smart URL handling and normalization
- ✅ Duplicate detection and prevention
- ✅ Progress tracking and statistics
- ✅ Robust error handling
- ✅ Clean, object-oriented architecture
- β Comprehensive metadata display (EXIF, GPS, IPTC, XMP)
- β Complete metadata removal
- β File size comparison before/after
- β Batch processing support
- β TRC tag filtering (cleaner output)
- β Clean SOLID architecture
## Installation

### Prerequisites

- Java 11+ (JDK)
- Make (build tool)
- Internet connection (for Spider)
### Build

```bash
# Clone the repository
git clone https://github.com/ITAXBOX/arachnida.git
cd arachnida

# Build Spider
cd spider
make
cd ..

# Build Scorpion
cd scorpion
make
cd ..
```

## Usage

### 🕸️ Spider

Download images recursively from a website.
```bash
./spider [-rlp] URL
```

| Option | Description | Default |
|---|---|---|
| `-r` | Enable recursive download | Off |
| `-l N` | Set maximum recursion depth | 5 |
| `-p PATH` | Set download directory | `./data/` |
Basic download:

```bash
./spider https://example.com/gallery
```

Recursive crawling with depth limit:

```bash
./spider -r -l 3 https://example.com
```

Custom output directory:

```bash
./spider -r -p ./my_images https://example.com
```

Full custom:

```bash
./spider -r -l 10 -p ./downloads/images https://example.com/photos
```

### 🦂 Scorpion

View and remove metadata from images.

```bash
./scorpion [OPTIONS] FILE1 [FILE2 ...]
```

| Option | Description |
|---|---|
| `-d`, `--delete` | Remove all metadata (overwrites original) |
| (no option) | Display metadata information |
Display metadata:

```bash
./scorpion image.jpg
```

Remove metadata from a single file:

```bash
./scorpion -d image.jpg
```

Batch metadata removal:

```bash
./scorpion -d *.jpg
```

Output example:

```
✅ Canon_40D.jpg
Original size: 7.77 KB
New size: 2.43 KB
Saved: 5.34 KB (68.7% reduction)
```
## Project Structure

```
arachnida/
├── spider/                          # Web crawler tool
│   ├── Makefile
│   ├── spider                       # Executable script
│   └── src/
│       └── com/arachnida/spider/
│           ├── Spider.java          # Main entry point
│           ├── cli/
│           │   └── CliOption.java   # Command-line parsing
│           ├── config/
│           │   └── Config.java      # Configuration management
│           ├── crawler/
│           │   ├── Crawler.java     # Core crawling logic
│           │   └── Node.java        # URL tree node
│           ├── http/
│           │   └── Http.java        # HTTP client
│           ├── parser/
│           │   └── Html.java        # HTML parsing
│           ├── storage/
│           │   └── FileStore.java   # File operations
│           └── util/
│               ├── Uri.java         # URL utilities
│               └── Utils.java       # Helper functions
│
└── scorpion/                        # Metadata tool
    ├── Makefile
    ├── scorpion                     # Executable script
    ├── metadata-extractor-2.19.0.jar
    ├── xmpcore-6.1.11.jar
    └── src/
        ├── Scorpion.java            # Main entry point
        └── com/arachnida/scorpion/
            ├── cli/
            │   └── CliParser.java   # Command-line parsing
            └── util/
                ├── MetadataUtil.java # Metadata operations
                ├── ImageUtil.java    # Image processing
                └── UiUtil.java       # Terminal UI formatting
```
## Technical Details

### 🕸️ Spider

Design Pattern: Modular, Object-Oriented Architecture
- Single Responsibility Principle: Each class has one clear purpose
- Depth-First Search: Recursive crawling algorithm
- Smart Filtering: Duplicate detection, format validation
- HTTP Handling: Custom HTTP client with error recovery
- URL Normalization: Handles relative/absolute URLs correctly
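The URL normalization can be pictured with `java.net.URI`, which resolves relative links against a base page and removes dot-segments. A minimal sketch, assuming a hypothetical `normalize` helper rather than Spider's actual `Uri` class:

```java
import java.net.URI;

public final class UriSketch {
    // Hypothetical helper: resolve a possibly-relative link against the page
    // it appeared on, then normalize it so duplicate URLs compare equal.
    static String normalize(String pageUrl, String link) {
        URI resolved = URI.create(pageUrl).resolve(link).normalize();
        String query = resolved.getRawQuery();
        // Rebuild without the fragment: page#a and page#b are the same resource.
        return resolved.getScheme() + "://" + resolved.getAuthority()
                + resolved.getPath() + (query == null ? "" : "?" + query);
    }

    public static void main(String[] args) {
        // Prints: https://example.com/img/cat.jpg
        System.out.println(normalize("https://example.com/gallery/index.html", "../img/cat.jpg"));
    }
}
```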
Key Components:

- `Crawler`: Core DFS crawling engine
- `Http`: HTTP GET requests with error handling
- `Html`: HTML parsing for links and images
- `FileStore`: Safe file downloads with naming-conflict resolution
- `Uri`: URL manipulation and validation
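How these pieces cooperate can be compressed into one illustrative depth-first loop. This is a sketch only: it fuses fetching, parsing, and saving into a single class with a naive regex, where the real tool separates those concerns into `Http`, `Html`, and `FileStore`:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class CrawlSketch {
    private static final Pattern LINKS =
            Pattern.compile("(?:src|href)\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);
    private final Set<String> visited = new HashSet<>(); // duplicate detection
    private final int maxDepth;                          // the -l limit

    CrawlSketch(int maxDepth) { this.maxDepth = maxDepth; }

    void crawl(String url, int depth) {
        if (depth > maxDepth || !visited.add(url)) return;       // depth cap + dedup
        String html;
        try { html = get(url); } catch (Exception e) { return; } // graceful failure: skip dead pages
        Matcher m = LINKS.matcher(html);
        while (m.find()) {
            String target;
            try { target = URI.create(url).resolve(m.group(1)).toString(); }
            catch (IllegalArgumentException e) { continue; }     // malformed link
            if (target.matches("(?i).*\\.(jpe?g|png|gif|bmp)$")) {
                System.out.println("image: " + target);          // the real tool saves via FileStore
            } else if (target.startsWith("http")) {
                crawl(target, depth + 1);                        // depth-first recursion
            }
        }
    }

    private static String get(String url) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(5000);                             // timeout protection
        con.setReadTimeout(5000);
        try (InputStream in = con.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        new CrawlSketch(5).crawl(args[0], 0);                    // depth limit of 5, like -l's default
    }
}
```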
### 🦂 Scorpion

Design Pattern: SOLID Principles + Clean Architecture
- Single Source of Truth: UI formatting centralized in `UiUtil`
- Separation of Concerns: Each utility class has one responsibility
- Dependency Inversion: High-level modules independent of low-level details
Key Components:

- `MetadataUtil`: EXIF/GPS/IPTC metadata reading
- `ImageUtil`: Metadata stripping via pixel-level operations
- `UiUtil`: ANSI terminal formatting (colors, boxes, progress)
- `CliParser`: Argument validation and parsing
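Metadata display builds on the bundled metadata-extractor jar. Below is a minimal sketch of that library's standard read loop; the categorization, coloring, and TRC-tag filtering Scorpion layers on top are omitted:

```java
import java.io.File;
import com.drew.imaging.ImageMetadataReader;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.Tag;

public final class ReadSketch {
    public static void main(String[] args) throws Exception {
        // Parses EXIF, GPS, IPTC, XMP, ICC, ... into typed directories.
        Metadata metadata = ImageMetadataReader.readMetadata(new File(args[0]));
        for (Directory directory : metadata.getDirectories()) {
            for (Tag tag : directory.getTags()) {
                // e.g. [GPS] GPS Latitude = 54° 59' 22.8"
                System.out.printf("[%s] %s = %s%n",
                        directory.getName(), tag.getTagName(), tag.getDescription());
            }
        }
    }
}
```

Both jars ship in the `scorpion/` directory, so they only need to be on the classpath when compiling and running.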
Metadata Removal Technique:

```java
// Read only pixel data (no metadata)
BufferedImage image = ImageIO.read(file);

// Write back without metadata
ImageIO.write(image, extension, file);
```

Because the file is re-encoded from decoded pixels alone, every metadata segment is dropped, using only standard Java libraries; for lossy formats such as JPEG the image is also recompressed, which contributes to the size reductions shown in the examples.
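Scaled up to the `-d` batch mode, the same two calls just run in a loop with size bookkeeping. A hypothetical standalone driver, sketched here for illustration (the real logic lives in `ImageUtil` and `UiUtil`):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public final class StripSketch {
    public static void main(String[] args) throws Exception {
        for (String path : args) {                        // batch mode: scorpion -d *.jpg
            File file = new File(path);
            long before = file.length();
            String ext = path.substring(path.lastIndexOf('.') + 1).toLowerCase();
            BufferedImage image = ImageIO.read(file);     // pixels only, metadata ignored
            if (image == null || !ext.matches("jpe?g|png|gif|bmp")) {
                System.err.println("skip " + path);       // unreadable or unsupported file
                continue;
            }
            ImageIO.write(image, ext, file);              // overwrite without metadata
            long after = file.length();
            System.out.printf("✓ %s%n  %.2f KB -> %.2f KB (%.1f%% reduction)%n",
                    file.getName(), before / 1024.0, after / 1024.0,
                    100.0 * (before - after) / before);
        }
    }
}
```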
## Security Considerations

### 🕸️ Spider

- ✅ URL Validation: Prevents directory traversal attacks
- ✅ Timeout Protection: Prevents requests from hanging indefinitely
- ✅ Resource Limits: Maximum depth and file count
- ✅ Safe Filename Handling: Sanitizes filenames to prevent path injection (see the sketch below)
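A minimal sketch of the kind of sanitization meant by the last point, assuming a hypothetical helper rather than `FileStore`'s actual code:

```java
import java.nio.file.Path;

public final class SanitizeSketch {
    // Hypothetical helper: reduce a URL-derived name to a safe, flat filename.
    static String sanitize(String urlPath) {
        // Keep only the final segment so "../../etc/passwd" cannot escape ./data/.
        String base = Path.of(urlPath).getFileName().toString();
        // Whitelist a conservative character set; replace everything else.
        return base.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("/gallery/../../../etc/passwd")); // -> passwd
        System.out.println(sanitize("photo name (1).jpg"));           // -> photo_name__1_.jpg
    }
}
```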
### 🦂 Scorpion

- ✅ Complete Metadata Removal: All EXIF, GPS, and camera info stripped
- ✅ Privacy Protection: Removes location data, timestamps, device info
- ✅ No Data Leakage: Only pixel data retained
- ✅ Batch Processing: Safely handles multiple files
Metadata categories removed:

| Category | Examples |
|---|---|
| Camera Info | Make, Model, Lens, Software |
| Location Data | GPS Coordinates, Altitude |
| Timestamps | Date/Time Original, Modified |
| Technical | ISO, Aperture, Shutter Speed, Focal Length |
| Copyright | Author, Copyright, Description |
| ICC Profile | Color space, rendering intent |
Result: Only essential JPEG structure remains (compression, dimensions, components)
## Bonus Features

This project achieved 125/100 thanks to these bonus implementations:
### 🕸️ Spider

- ✅ Object-Oriented Design: Clean package structure
- ✅ Progress Reporting: Real-time crawl statistics
- ✅ Smart Duplicate Detection: URL normalization and tracking
- ✅ Comprehensive Error Handling: Graceful failure recovery
- ✅ Configuration System: Centralized settings management
### 🦂 Scorpion

- ✅ Beautiful Terminal UI: ANSI colors, box drawing, formatting
- ✅ Metadata Categorization: Organized by type (EXIF, GPS, ICC, etc.)
- ✅ File Size Statistics: Before/after comparison with savings
- ✅ Batch Processing Summary: Multi-file operation feedback
- ✅ TRC Tag Filtering: Cleaner output (removes noise)
- ✅ SOLID Architecture: Maintainable, extensible code
## Performance

### 🕸️ Spider

- Crawl Speed: ~5-10 pages/second (network dependent)
- Memory Efficient: Tracks visited URLs in HashSet
- Disk I/O: Optimized file writing with conflict resolution
### 🦂 Scorpion

- Processing Speed: ~100ms per image (depending on size)
- Memory Usage: Minimal (processes one image at a time)
- File Size Reduction: Average 60-80% reduction for photos with metadata
Example Results:

```
Canon_40D.jpg: 7.77 KB → 2.43 KB (68.7% reduction)
GPS_Photo.jpg: 4.23 MB → 1.85 MB (56.3% reduction)
iPhone_IMG.jpg: 2.15 MB → 985 KB (54.2% reduction)
```
## Testing

Both tools have been extensively tested with:
- ✅ Various image formats (JPG, PNG, GIF, BMP)
- ✅ Different metadata standards (EXIF, IPTC, XMP, GPS)
- ✅ Edge cases (no metadata, corrupted files, special characters)
- ✅ Large batch operations (100+ files)
- ✅ Different website structures (relative/absolute URLs, nested pages)
## Grade

Project: CyberSecurity Piscine - Arachnida
Institution: 42 Beirut
Final Score: 125/100 ⭐

- Mandatory Part: 100/100 ✅
- Bonus Part: 25/25 ✅
- Code Quality: Excellent
- Architecture: SOLID principles applied
- Documentation: Comprehensive
## Tech Stack

- Language: Java 11+
- Build Tool: Make
- Libraries:
  - `metadata-extractor` (Drew Noakes) - EXIF/metadata reading
  - `xmpcore` (Adobe) - XMP metadata support
  - Standard Java libraries (ImageIO, HttpURLConnection)
## Author

ITAXBOX
42 Beirut - CyberSecurity Piscine
## License

This project is part of the 42 School curriculum and is intended for educational purposes.
## Acknowledgments

- 42 School for the excellent cybersecurity curriculum
- Drew Noakes for the metadata-extractor library
- 42 Beirut community for support and collaboration
⭐ Star this repository if you found it helpful! ⭐

Made with ❤️ by ITAXBOX