
[FEAT]: OllamaEmbedder Support Batched or Parallel Embeddings to Improve Performance #4529

@theseaiying

Hi there, thank you for building such a powerful tool as AnythingLLM!

While using the OllamaEmbedder for document embedding, I noticed that the current implementation processes text chunks sequentially (using for...of + await). This results in very poor performance when handling large documents, even when the backend Ollama instance is running on high-end hardware with multiple GPUs.

📌 Problem Description
Currently, in OllamaEmbedder.embedChunks():

Each chunk is sent in a separate /api/embeddings request
Only one chunk is processed per request
It does not leverage Ollama's built-in support for batched input (the input: string[] field on the newer /api/embed endpoint)
This leads to extremely long processing times (e.g., 30+ minutes for 10k+ chunks)
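For reference, the current behavior is roughly equivalent to the simplified sketch below (not the exact AnythingLLM source, just an illustration of the sequential pattern):

async embedChunks(textChunks = []) {
  const results = [];
  for (const chunk of textChunks) {
    // One /api/embeddings request per chunk, fully awaited before the next one starts
    const { embedding } = await this.client.embeddings({
      model: this.model,
      prompt: chunk,
      options: { num_ctx: this.embeddingMaxChunkLength },
    });
    results.push(embedding);
  }
  return results;
}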
✨ Suggested Improvement
Please consider adding support for one of the following optimizations:

✅ Option 1: Use Ollama’s Batched Embeddings API (Recommended)
Ollama’s newer /api/embed endpoint (the embed() method in the official JavaScript client) accepts either a single string or an array of strings via its input field:

await client.embed({
  model: "qwen3-embedding:8b",
  input: ["text1", "text2" /* ... */],
});

This drastically reduces network overhead and fully utilizes GPU parallelism.

✅ Option 2: Concurrent Processing (Fallback)
If batched input is not compatible with certain models, please support concurrent requests using Promise.all() with a configurable concurrency limit (e.g., maxConcurrentChunks) to prevent OOM issues.
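A minimal sketch of what Option 2 could look like, keeping the existing per-chunk /api/embeddings call but limiting how many requests are in flight at once (maxConcurrentChunks is a hypothetical setting, not an existing AnythingLLM option):

async embedChunks(textChunks = []) {
  const maxConcurrentChunks = this.maxConcurrentChunks ?? 16; // hypothetical config value
  const results = [];
  for (let i = 0; i < textChunks.length; i += maxConcurrentChunks) {
    const batch = textChunks.slice(i, i + maxConcurrentChunks);
    // Send up to maxConcurrentChunks requests concurrently, then wait for the wave to finish
    const embeddings = await Promise.all(
      batch.map((chunk) =>
        this.client
          .embeddings({
            model: this.model,
            prompt: chunk,
            options: { num_ctx: this.embeddingMaxChunkLength },
          })
          .then((res) => res.embedding)
      )
    );
    results.push(...embeddings);
  }
  return results;
}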

🔧 Example Code (for reference)

async embedChunks(textChunks) {
  // Single batched /api/embed request instead of one /api/embeddings call per chunk
  const response = await this.client.embed({
    model: this.model,
    input: textChunks, // Batched input
    options: { num_ctx: this.embeddingMaxChunkLength }
  });
  return response.embeddings;
}
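For very large documents it may be safer to split the chunks into sub-batches rather than sending everything in one request. A hedged variant of the same idea (batchSize is an arbitrary illustrative value, not an existing setting):

async embedChunks(textChunks = []) {
  const batchSize = 64; // arbitrary example value; could be made configurable
  const results = [];
  for (let i = 0; i < textChunks.length; i += batchSize) {
    const response = await this.client.embed({
      model: this.model,
      input: textChunks.slice(i, i + batchSize), // batched sub-request
      options: { num_ctx: this.embeddingMaxChunkLength },
    });
    results.push(...response.embeddings);
  }
  return results;
}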
