
Need tokenizer endpoint for embedding service #3111


Open
gavinlichn opened this issue Mar 7, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@gavinlichn

Describe the bug

Clients need to ensure that embedding inputs stay within a limited size (max tokens), so the embedding service should also provide a tokenizer endpoint. Most embedding models ship with a tokenizer model as well, so exposing the tokenizer at the service level makes sense.

Currently the client has to implement a local tokenizer to calculate the token count, which consumes a lot of client resources.

Other engines (e.g. TEI) provide a tokenize endpoint.
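
For context, the client-side workaround mentioned above looks roughly like this (a minimal sketch, assuming the Hugging Face transformers library and a 512-token limit; both the model name and the limit are placeholders):

# Hypothetical client-side workaround: load the tokenizer locally and count
# tokens before calling the embedding endpoint. This duplicates the tokenizer
# on the client, which is the overhead described above.
from transformers import AutoTokenizer

MAX_TOKENS = 512  # assumed context limit of the embedding model
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-large-en-v1.5")

def fits_in_context(text: str) -> bool:
    # encode() includes special tokens, matching what the service would see
    return len(tokenizer.encode(text)) <= MAX_TOKENS

print(fits_in_context("This is my test"))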

To Reproduce
Steps to reproduce the behavior:

  1. Steps to prepare models repository '...'
  2. OVMS launch command '....'
  3. Client command (additionally client code if not using official client or demo) '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Logs
Logs from OVMS, ideally with --log_level DEBUG. Logs from client.

Configuration

  1. OVMS version
  2. OVMS config.json file
  3. CPU, accelerator's versions if applicable
  4. Model repository directory structure
  5. Model or publicly available similar model that reproduces the issue

Additional context
Add any other context about the problem here.

@gavinlichn gavinlichn added the bug label Mar 7, 2025
@dtrawins
Collaborator

dtrawins commented Mar 12, 2025

@gavinlichn The easiest way to get such functionality would be to deploy just the individual tokenizer model and access it via the KServe API. When you deploy an embedding instance, it downloads and exposes the individual pipeline models for the tokenizer and embeddings as well.
You can list the models with a call to curl http://localhost:8000/v1/config
You can get the tokens by sending the text in a KServe infer call like this:
curl -X POST http://localhost:2000/v2/models/Alibaba-NLP%2Fgte-large-en-v1.5_tokenizer_model/infer -H "Content-Type: application/json" -d '{"inputs" : [ {"name" : "Parameter_1", "shape" : [ 1 ], "datatype" : "BYTES", "data" : ["This is my test"]} ]}'

Note that the model name needs to be URL-encoded in the request path. The response will be similar to:

{
    "model_name": "Alibaba-NLP/gte-large-en-v1.5_tokenizer_model",
    "model_version": "1",
    "outputs": [{
            "name": "attention_mask",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [1, 1, 1, 1, 1, 1]
        }, {
            "name": "input_ids",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [101, 2023, 2003, 2026, 3231, 102]
        }, {
            "name": "token_type_ids",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [0, 0, 0, 0, 0, 0]
        }]
}
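
For illustration, the same call from Python, reading the token count off the input_ids shape (a minimal sketch, assuming the requests library and the port and model name used in the curl example above; adjust both for your deployment):

# Call the KServe REST infer endpoint of the tokenizer model and derive the
# token count from the returned input_ids tensor shape.
import urllib.parse
import requests

base_url = "http://localhost:2000"
# URL-encode the model name, as noted above ("/" becomes %2F)
model = urllib.parse.quote("Alibaba-NLP/gte-large-en-v1.5_tokenizer_model", safe="")

payload = {
    "inputs": [
        {
            "name": "Parameter_1",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is my test"],
        }
    ]
}

resp = requests.post(f"{base_url}/v2/models/{model}/infer", json=payload)
resp.raise_for_status()
outputs = {o["name"]: o for o in resp.json()["outputs"]}
token_count = outputs["input_ids"]["shape"][1]  # e.g. 6 for the example above
print(token_count)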

Would that be sufficient?
Note that it is also possible to export the model with automatic truncation of the input text to match the model's context length. Check the --help of the export_models.py script.
