Ollama-compatible API server for any HuggingFace embedding model with local inference and persistent workers.
⚠️ Embedding Models Only: This project implements Ollama's embedding endpoints (`/api/embed`, `/api/embeddings`) for vector generation. It does not support text generation, chat, or conversational models. For full Ollama functionality with generative models, use Ollama directly.
- 🤗 Any HuggingFace Model: Use any embedding model from HuggingFace Hub
- 🔀 Multi-Model Support: Load and switch between multiple models in a single container
- ⚡ Persistent Workers: 100x+ performance improvement over process spawning
- 🔌 Ollama Compatible: Drop-in replacement for Ollama embedding endpoints only
- 🐳 Production Ready: Docker support with configurable models
- 🚀 GPU Support: Automatic CUDA detection, CPU fallback
- 🌐 No API Keys: Completely local inference
Perfect for embedding-focused applications:
- Vector Search: MongoDB Vector Search, Elasticsearch, Pinecone, Weaviate
- RAG Systems: Document/query embedding for retrieval-augmented generation
- Semantic Search: Content similarity and search applications
- ML Pipelines: Embedding generation for downstream ML tasks
Not suitable for: Text generation, chatbots, or conversational AI (use Ollama for that).
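As a concrete example of the vector-search / RAG use cases above, here is a minimal semantic-search loop against the bridge. This is a sketch only: it assumes the server is running on the default port with the default model, uses the `requests` package, and the documents and query are made up for illustration.

```python
import math
import requests

OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"

def embed(texts):
    # /api/embed accepts a list of inputs and returns one vector per input
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "input": texts})
    resp.raise_for_status()
    return resp.json()["embeddings"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = ["The cat sat on the mat", "Stock prices fell sharply", "Dogs are loyal pets"]
query_vec = embed(["Which animals make good companions?"])[0]
doc_vecs = embed(docs)

# Rank documents by cosine similarity to the query
for score, doc in sorted(zip((cosine(query_vec, v) for v in doc_vecs), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```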
🔀 Load and switch between multiple models seamlessly:
- Build-time Configuration: Specify multiple models with a comma-separated `MODEL_NAME`
- Memory Efficient: All models loaded once at startup, no switching overhead
- Dynamic Selection: Choose model per request via API parameter
- Ollama Compatible: Works with existing Ollama clients
- Error Handling: Clear error messages for invalid model requests
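In practice, per-request selection is just the `model` field of the request body. The sketch below selects a model and surfaces the server's error for one that isn't loaded; the `not/a-loaded-model` name is hypothetical and the exact error payload may vary, so it only relies on the HTTP status and raw body.

```python
import requests

def embed_with(model, texts, base="http://localhost:11434"):
    resp = requests.post(f"{base}/api/embed", json={"model": model, "input": texts})
    if not resp.ok:
        # The server returns a clear error for unknown models; surface it as-is
        raise RuntimeError(f"{resp.status_code}: {resp.text}")
    return resp.json()["embeddings"]

vectors = embed_with("intfloat/e5-small-v2", ["hello"])  # a loaded model
try:
    embed_with("not/a-loaded-model", ["hello"])          # not in MODEL_NAME (hypothetical)
except RuntimeError as err:
    print("rejected:", err)
```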
Fills the gap between Ollama and HuggingFace for embeddings:
| Feature | Ollama | Ollama HF Bridge | HuggingFace Direct |
|---|---|---|---|
| Embedding Models | Limited selection | Any HF model | Any HF model |
| Ollama API Compatible | ✅ | ✅ | ❌ |
| Local Inference | ✅ | ✅ | ✅ |
| Model Conversion Required | ✅ (GGUF) | ❌ | ❌ |
| Production Performance | ✅ | ✅ (Persistent workers) | ❌ (Process per request) |
| Text Generation | ✅ | ❌ | ✅ |
| Specialized Embeddings | Limited | ✅ (Any domain/language) | ✅ |
(more examples further below)
```bash
# Build with 3 different models
docker build --build-arg MODEL_NAME="sentence-transformers/all-MiniLM-L6-v2,intfloat/e5-small-v2,BAAI/bge-large-en-v1.5" -t ollama-multi .

# All models available via /api/tags
curl http://localhost:11434/api/tags

# Switch between models per request
curl http://localhost:11434/api/embed -d '{"model": "intfloat/e5-small-v2", "input": ["text"]}'
curl http://localhost:11434/api/embed -d '{"model": "BAAI/bge-large-en-v1.5", "input": ["text"]}'
```
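If you'd rather discover the baked-in models programmatically, something like the following should work, assuming the usual Ollama `/api/tags` response shape (a `models` array whose entries carry a `name` field):

```python
import requests

BASE = "http://localhost:11434"

# Discover whatever models were baked into the image
tags = requests.get(f"{BASE}/api/tags").json()
names = [m["name"] for m in tags["models"]]

# Embed the same input with each model; dimensionality differs per model
for name in names:
    resp = requests.post(f"{BASE}/api/embed", json={"model": name, "input": ["text"]})
    dims = len(resp.json()["embeddings"][0])
    print(f"{name}: {dims}-dimensional embeddings")
```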
For GPU acceleration (recommended for better performance):
- NVIDIA GPU with recent drivers (550+)
- NVIDIA Container Runtime installed:
```bash
# Add NVIDIA repository (example for Debian / Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
```bash
docker build -t ollama-hf-embed-bridge .

# CPU-only inference
docker run -p 11434:11434 ollama-hf-embed-bridge

# GPU-accelerated inference (recommended)
docker run -p 11434:11434 --gpus all ollama-hf-embed-bridge
```
```bash
# Popular English models
docker build --build-arg MODEL_NAME="sentence-transformers/all-mpnet-base-v2" -t ollama-hf-embed-bridge .
docker build --build-arg MODEL_NAME="intfloat/e5-large-v2" -t ollama-hf-embed-bridge .

# Multilingual models
docker build --build-arg MODEL_NAME="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" -t ollama-hf-embed-bridge .

# Czech models (our original use case)
docker build --build-arg MODEL_NAME="Seznam/small-e-czech" -t ollama-hf-embed-bridge .
docker build --build-arg MODEL_NAME="Seznam/simcse-small-e-czech" -t ollama-hf-embed-bridge .

# Run with custom model (CPU-only)
docker run -p 11434:11434 ollama-hf-embed-bridge

# Run with custom model (GPU-accelerated)
docker run -p 11434:11434 --gpus all ollama-hf-embed-bridge
```
```bash
# Multiple English models
docker build --build-arg MODEL_NAME="sentence-transformers/all-MiniLM-L6-v2,intfloat/e5-small-v2" -t ollama-hf-embed-bridge .

# Mixed model types
docker build --build-arg MODEL_NAME="sentence-transformers/all-mpnet-base-v2,Seznam/small-e-czech,intfloat/e5-large-v2" -t ollama-hf-embed-bridge .

# Run with multiple models
docker run -p 11434:11434 --gpus all ollama-hf-embed-bridge

# Override any model at runtime (downloads on first use)
docker run -p 11434:11434 --gpus all -e MODEL_NAME="BAAI/bge-large-en-v1.5" ollama-hf-embed-bridge

# Override with multiple models at runtime
docker run -p 11434:11434 --gpus all -e MODEL_NAME="BAAI/bge-large-en-v1.5,sentence-transformers/all-mpnet-base-v2" ollama-hf-embed-bridge
```
The container automatically detects and uses GPU when available:
- Requirements: NVIDIA GPU with driver 550+ and NVIDIA Container Runtime
- Compatibility: RTX 20xx, 30xx, 40xx series and newer
- Fallback: Automatically uses CPU if GPU is unavailable or the `--gpus all` flag is omitted
- Performance: 5-10x faster inference with GPU acceleration
Test GPU availability:
```bash
docker run --rm --gpus all ollama-hf-embed-bridge python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
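The detection boils down to the standard PyTorch pattern, shown here as a conceptual sketch (not the bridge's actual worker code):

```python
import torch

# Prefer CUDA when the container sees a GPU; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```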
💡 Docker is recommended for most users. Use local installation only if you need to modify the code or have specific requirements.
```bash
pip install -r requirements.txt
```
```bash
# Build
go build -o ollama-hf-bridge .

# Run (default: localhost:11434)
./ollama-hf-bridge

# Run on custom host/port
./ollama-hf-bridge -host 0.0.0.0 -port 8080
```
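However you run it, a quick smoke test from Python confirms the server answers. This assumes the default port and default model; `all-MiniLM-L6-v2` produces 384-dimensional vectors.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "sentence-transformers/all-MiniLM-L6-v2", "input": ["smoke test"]},
)
resp.raise_for_status()
vec = resp.json()["embeddings"][0]
print(f"OK: got a {len(vec)}-dimensional embedding")  # expect 384 for all-MiniLM-L6-v2
```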
```bash
# English example (default model)
curl http://localhost:11434/api/embed -d '{
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "input": ["Hello world", "How are you?"]
}'

# Czech example
curl http://localhost:11434/api/embed -d '{
  "model": "Seznam/small-e-czech",
  "input": ["Dobrý den", "Jak se máte?"]
}'

# Multilingual example
curl http://localhost:11434/api/embed -d '{
  "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
  "input": ["Hello", "Bonjour", "Hola", "Ahoj"]
}'

# Multi-model selection (choose specific model)
curl http://localhost:11434/api/embed -d '{
  "model": "intfloat/e5-small-v2",
  "input": ["Switching between models", "Dynamic model selection"]
}'

# Without specifying model (uses first available model)
curl http://localhost:11434/api/embed -d '{
  "input": ["Uses default model"]
}'
```
```bash
curl http://localhost:11434/api/embeddings -d '{
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "prompt": "This is a test sentence"
}'

# With specific model selection
curl http://localhost:11434/api/embeddings -d '{
  "model": "intfloat/e5-small-v2",
  "prompt": "Different model, different embeddings"
}'
```
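Note the shape difference between the two endpoints, following Ollama's convention: `/api/embed` takes a batch `input` and returns `embeddings` (a list of vectors), while the legacy `/api/embeddings` takes a single `prompt` and returns one `embedding`. A small sketch handling both:

```python
import requests

BASE = "http://localhost:11434"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Newer batch endpoint: "input" in, "embeddings" (list of vectors) out
batch = requests.post(f"{BASE}/api/embed",
                      json={"model": MODEL, "input": ["one", "two"]}).json()
print(len(batch["embeddings"]))   # 2 vectors, one per input

# Legacy endpoint: "prompt" in, "embedding" (single vector) out
single = requests.post(f"{BASE}/api/embeddings",
                       json={"model": MODEL, "prompt": "one"}).json()
print(len(single["embedding"]))   # vector dimensionality
```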
```bash
# Lists all available models
curl http://localhost:11434/api/tags
```
This server is compatible with existing Ollama clients:
```python
import ollama

# Default port
client = ollama.Client(host='http://localhost:11434')
# Or custom port
client = ollama.Client(host='http://localhost:8080')

response = client.embeddings(model='sentence-transformers/all-MiniLM-L6-v2', prompt='Hello world!')

# Works with any model you've configured
response = client.embeddings(model='Seznam/small-e-czech', prompt='Ahoj světe!')

# Switch between multiple loaded models
response1 = client.embeddings(model='intfloat/e5-small-v2', prompt='First model')
response2 = client.embeddings(model='sentence-transformers/all-mpnet-base-v2', prompt='Second model')
```
Works with any HuggingFace embedding model:
- `sentence-transformers/all-MiniLM-L6-v2` (default)
- `sentence-transformers/all-mpnet-base-v2`
- `intfloat/e5-large-v2`
- `BAAI/bge-large-en-v1.5`
- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- `sentence-transformers/LaBSE`
- `Seznam/small-e-czech` - Czech ELECTRA
- `Seznam/simcse-small-e-czech` - fine-tuned with SimCSE
- `Seznam/dist-mpnet-czeng-cs-en`
- Go HTTP Server: High-performance API server with Ollama compatibility
- Python Workers: Persistent processes for model inference using PyTorch
- Process Pool: Multiple workers for concurrent request handling
- Docker: Production-ready containerization with configurable models
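To illustrate why persistent workers matter, here is a deliberately minimal sketch of the idea (not the project's actual worker protocol): the model loads once at startup, and every subsequent request only pays for inference rather than a fresh process spawn and model load.

```python
import json
import sys

from sentence_transformers import SentenceTransformer

# Load once at startup -- this is the expensive step persistent workers amortize
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Serve requests forever: one JSON object per line in, one per line out
for line in sys.stdin:
    request = json.loads(line)
    vectors = model.encode(request["input"]).tolist()
    print(json.dumps({"embeddings": vectors}), flush=True)
```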
We welcome contributions! This project was collaboratively developed and benefits from community input.
MIT License - see LICENSE for details.