Feliren88
Collaborator

Image Duplicate Finder using Nomic Vision Embeddings

This PR adds a standalone script for detecting duplicate or highly similar images within a directory. The implementation:

  • Uses the Nomic Vision embedding model (nomic-embed-vision-v1) to generate image embeddings
  • Compares embeddings pairwise with PyTorch
  • Supports GPU acceleration for faster processing
  • Includes a configurable similarity threshold to control how strict matching is
  • Writes results to a CSV file
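
For reviewers who want the matching step at a glance, here is a minimal sketch of thresholded pairwise comparison and CSV output. It is not the script from this PR: it uses plain Python lists instead of PyTorch tensors to stay dependency-free, and it assumes cosine similarity as the metric. The names `cosine_similarity`, `find_similar_pairs`, `write_pairs_csv`, the CSV column names, and the default threshold of 0.95 are all hypothetical.

```python
import csv
import io
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_similar_pairs(embeddings, threshold=0.95):
    """Return (name_a, name_b, score) for every pair at or above threshold.

    `embeddings` maps an image name to its embedding vector.
    """
    items = sorted(embeddings.items(), key=lambda kv: kv[0])
    pairs = []
    # combinations() visits each unordered pair exactly once
    for (name_a, vec_a), (name_b, vec_b) in combinations(items, 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # most similar pairs first
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


def write_pairs_csv(pairs, out):
    """Write matched pairs to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["image_a", "image_b", "similarity"])
    for name_a, name_b, score in pairs:
        writer.writerow([name_a, name_b, f"{score:.4f}"])


if __name__ == "__main__":
    demo = {"a.jpg": [1.0, 0.0], "b.jpg": [1.0, 0.0], "c.jpg": [0.0, 1.0]}
    buffer = io.StringIO()
    write_pairs_csv(find_similar_pairs(demo, threshold=0.95), buffer)
    print(buffer.getvalue())
```

Raising the threshold reports only near-identical images, while lowering it also surfaces looser matches such as resized or recompressed copies.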

Key improvements:

  • Single folder processing - finds duplicates within any given directory
  • Memory-efficient design - stores embeddings on CPU while processing
  • Progress tracking with tqdm
  • Comprehensive documentation in README.md
  • Error handling for corrupted or problematic images
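
The folder scan and error handling listed above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `list_images`, `embed_all`, and the extension set are hypothetical, and `embed_fn` stands in for whatever opens an image and runs the embedding model. A real script would likely wrap the loop in tqdm for the progress bar.

```python
from pathlib import Path

# Assumed set of extensions; the actual script may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}


def list_images(folder):
    """Return image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def embed_all(paths, embed_fn):
    """Embed each path, skipping files that fail to load.

    `embed_fn` is a hypothetical callable (path -> vector). Any exception
    it raises is recorded so one corrupted file does not stop the run.
    """
    embeddings = {}
    failed = []
    for path in paths:
        try:
            embeddings[str(path)] = embed_fn(path)
        except Exception as exc:  # corrupted or unreadable image
            failed.append((str(path), repr(exc)))
    return embeddings, failed
```

Returning the failures alongside the embeddings lets the caller report skipped files at the end instead of silently dropping them.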
