This repository provides a set of tools and examples for converting and using powerful vision models, DINOv3 and EdgeTAM (SAM2), within the ONNX ecosystem. The focus is on building efficient, PyTorch-independent inference pipelines for tasks such as one-shot segmentation, foreground extraction, and robust video object tracking. TFLite/LiteRT exports and demos are also included.
├── notebooks/
│   ├── dinov3_onnx_export.ipynb                    # Exports DINOv3 to ONNX
│   ├── dinov3_tflite_export.ipynb                  # Exports DINOv3 to TFLite
│   ├── edgetam_onnx_export.ipynb                   # Exports EdgeTAM encoder/decoder to ONNX
│   ├── foreground_segmentation_onnx_export.ipynb   # Trains and exports a foreground classifier
│   ├── dinov3_one_shot_segmentation_onnx.ipynb     # Demo for one-shot segmentation with ONNX
│   └── dinov3_one_shot_segmentation_tflite.ipynb   # Demo for one-shot segmentation with TFLite
│
└── scripts/
    └── hybrid_tracker.py

Each notebook is self-contained and can be run directly in Google Colab.
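Once exported, the models need only ONNX Runtime and NumPy at inference time. A minimal sketch of loading a DINOv3 export and extracting features is shown below; the file name `dinov3.onnx`, the input shape, and the preprocessing are assumptions that depend on how the export notebook was run.

```python
import numpy as np
import onnxruntime as ort

# Load the exported DINOv3 feature extractor.
# "dinov3.onnx" is an assumed file name; use the path produced by the export notebook.
session = ort.InferenceSession("dinov3.onnx", providers=["CPUExecutionProvider"])

# Placeholder input: a preprocessed RGB image as NCHW float32.
# Real preprocessing (resize + normalisation) should follow the demo notebooks.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

input_name = session.get_inputs()[0].name
features = session.run(None, {input_name: image})[0]
print(features.shape)  # e.g. (1, num_patches, feature_dim), depending on the export
```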
| Notebook | Description | Link |
|---|---|---|
| dinov3_onnx_export.ipynb | Converts the DINOv3 Vision Transformer (ViT) feature extractor to ONNX format. | link |
| dinov3_tflite_export.ipynb | Converts the DINOv3 Vision Transformer (ViT) feature extractor to TFLite format. | link |
| edgetam_onnx_export.ipynb | Exports the EdgeTAM image encoder and mask decoder models to ONNX for efficient segmentation. | link |
| foreground_segmentation_onnx_export.ipynb | Trains a logistic regression classifier on DINOv3 features for foreground segmentation and exports it to ONNX (sketched below the table). | link |
| dinov3_one_shot_segmentation_onnx.ipynb | Demonstrates one-shot segmentation using DINOv3 features and a reference mask, all in ONNX (sketched below the table). | link |
| dinov3_one_shot_segmentation_tflite.ipynb | Demonstrates one-shot segmentation using DINOv3 features and a reference mask, all in TFLite. | link |
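The foreground-segmentation notebook pairs DINOv3 patch features with a lightweight classifier. A hedged sketch of that idea, training a logistic regression on per-patch features and exporting it with skl2onnx, is below; the feature dimension, the random placeholder data, the file names, and the use of skl2onnx are assumptions, not necessarily the notebook's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Placeholder training data: one row per DINOv3 patch feature,
# label 1 = foreground, 0 = background (in practice these come from annotated reference images).
feature_dim = 384  # assumption; depends on the DINOv3 variant
X = np.random.rand(1000, feature_dim).astype(np.float32)
y = (np.random.rand(1000) > 0.5).astype(np.int64)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Export the classifier to ONNX so inference needs neither PyTorch nor scikit-learn.
onnx_clf = convert_sklearn(
    clf, initial_types=[("patch_features", FloatTensorType([None, feature_dim]))]
)
with open("foreground_classifier.onnx", "wb") as f:
    f.write(onnx_clf.SerializeToString())
```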
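The one-shot segmentation demos follow a simple pattern: extract DINOv3 patch features for a reference and a target image, pool the reference features under the reference mask into a prototype, and score target patches by cosine similarity. A hedged sketch of that flow is below; the model file name, the I/O shapes, the square patch grid without a CLS token, and the similarity threshold are all assumptions.

```python
import numpy as np
import onnxruntime as ort

def patch_features(session, image):
    """Run the DINOv3 ONNX export and return L2-normalised patch features of shape (N, C)."""
    input_name = session.get_inputs()[0].name
    feats = session.run(None, {input_name: image})[0][0]  # assumes output shape (1, N, C)
    return feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)

session = ort.InferenceSession("dinov3.onnx", providers=["CPUExecutionProvider"])

# Placeholder images; in practice these are preprocessed reference/target frames.
ref_img = np.random.rand(1, 3, 224, 224).astype(np.float32)
tgt_img = np.random.rand(1, 3, 224, 224).astype(np.float32)

ref_feats = patch_features(session, ref_img)
tgt_feats = patch_features(session, tgt_img)
grid = int(np.sqrt(ref_feats.shape[0]))  # assumes a square patch grid with no CLS token

# Placeholder reference mask at patch resolution (a real mask would be downsampled to the grid).
ref_mask = np.zeros((grid, grid), dtype=bool)
ref_mask[grid // 4: 3 * grid // 4, grid // 4: 3 * grid // 4] = True

# Prototype = mean foreground feature from the reference; score target patches by cosine similarity.
prototype = ref_feats[ref_mask.ravel()].mean(axis=0)
prototype /= np.linalg.norm(prototype) + 1e-8
similarity = tgt_feats @ prototype
pred_mask = (similarity > 0.5).reshape(grid, grid)  # threshold chosen for illustration only
```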
This work builds upon the official implementations and research from the following projects:

- **DINOv3**: facebookresearch/dinov3
- **EdgeTAM**: facebookresearch/EdgeTAM
- **Space-Time Correspondence as a Contrastive Random Walk**: ajabri/videowalk