Description
Hi there, thank you for building such a powerful tool in AnythingLLM!
While using the OllamaEmbedder for document embedding, I noticed that the current implementation processes text chunks sequentially (using for...of + await). This results in very poor performance when handling large documents, even when the backend Ollama instance is running on high-end hardware with multiple GPUs.
📌 Problem Description
Currently, in OllamaEmbedder.embedChunks():
Each chunk is sent in a separate /api/embeddings request
Only one chunk is processed per request
It does not leverage Ollama's built-in support for batched input (the input: string[] field on the newer /api/embed endpoint)
This leads to extremely long processing times (e.g., 30+ minutes for 10k+ chunks); a simplified sketch of the current pattern is shown below
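For context, the current behavior is roughly equivalent to the following (a simplified sketch for illustration, not the exact AnythingLLM source):

// One /api/embeddings request per chunk, each awaited before the next starts.
async embedChunks(textChunks) {
  const results = [];
  for (const chunk of textChunks) {
    const res = await this.client.embeddings({
      model: this.model,
      prompt: chunk,
      options: { num_ctx: this.embeddingMaxChunkLength },
    });
    results.push(res.embedding);
  }
  return results;
}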
✨ Suggested Improvement
Please consider adding support for one of the following optimizations:
✅ Option 1: Use Ollama’s Batched Embeddings API (Recommended)
Ollama's newer /api/embed endpoint (embed() in the JS client) accepts an array of strings via the input field:
await client.embed({
  model: "qwen3-embedding:8b",
  input: ["text1", "text2" /* ... */],
});
This drastically reduces network overhead and makes much better use of GPU parallelism.
✅ Option 2: Concurrent Processing (Fallback)
If batched input is not compatible with certain models, please support concurrent requests (e.g., Promise.all() over bounded batches) with a configurable concurrency limit such as maxConcurrentChunks to prevent OOM issues; see the sketch below.
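A minimal sketch of this fallback, assuming the existing per-chunk embeddings() call and a hypothetical maxConcurrentChunks setting (the name is illustrative, not an existing AnythingLLM option):

// Fallback: still one chunk per request, but up to maxConcurrentChunks requests in flight.
async embedChunksConcurrently(textChunks, maxConcurrentChunks = 8) {
  const results = [];
  for (let i = 0; i < textChunks.length; i += maxConcurrentChunks) {
    // Take the next slice of chunks and embed them in parallel.
    const batch = textChunks.slice(i, i + maxConcurrentChunks);
    const embeddings = await Promise.all(
      batch.map((chunk) =>
        this.client
          .embeddings({
            model: this.model,
            prompt: chunk,
            options: { num_ctx: this.embeddingMaxChunkLength },
          })
          .then((res) => res.embedding)
      )
    );
    results.push(...embeddings);
  }
  return results;
}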
🔧 Example Code (for reference)
async embedChunks(textChunks) {
  // Single batched call: /api/embed accepts an array of strings as input.
  const response = await this.client.embed({
    model: this.model,
    input: textChunks, // Batched input
    options: { num_ctx: this.embeddingMaxChunkLength },
  });
  // response.embeddings contains one vector per input chunk.
  return response.embeddings;
}