
Need tokenizer endpoint for embedding service #3111


Open
gavinlichn opened this issue Mar 7, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@gavinlichn

Describe the bug

Clients need to ensure that embedding inputs stay within a limited size (max tokens), so the embedding service should also provide a tokenizer endpoint. Most embedding models ship with a tokenizer model as well, so exposing the tokenizer at the service level makes sense.

Currently the client has to implement a local tokenizer to calculate the token count, which consumes a lot of client resources.

Other engines (e.g. TEI) provide a tokenize endpoint.
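
For context, the client-side workaround mentioned above looks roughly like this (a minimal sketch, assuming the Hugging Face transformers library and a 512-token limit; both the model name and the limit are placeholders):

# Hypothetical client-side workaround: load the tokenizer locally and count
# tokens before calling the embedding endpoint. This duplicates the tokenizer
# on the client, which is the overhead described above.
from transformers import AutoTokenizer

MAX_TOKENS = 512  # assumed context limit of the embedding model
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-large-en-v1.5")

def fits_in_context(text: str) -> bool:
    # encode() includes special tokens, matching what the service would see
    return len(tokenizer.encode(text)) <= MAX_TOKENS

print(fits_in_context("This is my test"))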

To Reproduce
Steps to reproduce the behavior:

  1. Steps to prepare models repository '...'
  2. OVMS launch command '....'
  3. Client command (additionally client code if not using official client or demo) '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Logs
Logs from OVMS, ideally with --log_level DEBUG. Logs from client.

Configuration

  1. OVMS version
  2. OVMS config.json file
  3. CPU, accelerator's versions if applicable
  4. Model repository directory structure
  5. Model or publicly available similar model that reproduces the issue

Additional context
Add any other context about the problem here.

@gavinlichn gavinlichn added the bug label Mar 7, 2025
@dtrawins
Collaborator

dtrawins commented Mar 12, 2025

@gavinlichn The easiest way to get such functionality would be to deploy just the individual tokenizer model and access it via the KServe API. When you deploy an embedding instance, it downloads and exposes the individual pipeline models for the tokenizer and embeddings as well.
You can list the models with a call to curl http://localhost:8000/v1/config
You can get the tokens by sending the text in a KServe infer call like this:
curl -X POST http://localhost:2000/v2/models/Alibaba-NLP%2Fgte-large-en-v1.5_tokenizer_model/infer -H "Content-Type: application/json" -d '{"inputs" : [ {"name" : "Parameter_1", "shape" : [ 1 ], "datatype" : "BYTES", "data" : ["This is my test"]} ]}'

Note that the model name needs to be URL-encoded in the request path. The response will be similar to:

{
    "model_name": "Alibaba-NLP/gte-large-en-v1.5_tokenizer_model",
    "model_version": "1",
    "outputs": [{
            "name": "attention_mask",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [1, 1, 1, 1, 1, 1]
        }, {
            "name": "input_ids",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [101, 2023, 2003, 2026, 3231, 102]
        }, {
            "name": "token_type_ids",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [0, 0, 0, 0, 0, 0]
        }]
}
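
For illustration, the same call from Python, reading the token count off the input_ids shape (a minimal sketch, assuming the requests library and the port and model name used in the curl example above; adjust both for your deployment):

# Call the KServe REST infer endpoint of the tokenizer model and derive the
# token count from the returned input_ids tensor shape.
import urllib.parse
import requests

base_url = "http://localhost:2000"
# URL-encode the model name, as noted above ("/" becomes %2F)
model = urllib.parse.quote("Alibaba-NLP/gte-large-en-v1.5_tokenizer_model", safe="")

payload = {
    "inputs": [
        {
            "name": "Parameter_1",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is my test"],
        }
    ]
}

resp = requests.post(f"{base_url}/v2/models/{model}/infer", json=payload)
resp.raise_for_status()
outputs = {o["name"]: o for o in resp.json()["outputs"]}
token_count = outputs["input_ids"]["shape"][1]  # e.g. 6 for the example above
print(token_count)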

Would that be sufficient?
Note that it is also possible to export the model with automatic truncation of the input text to match the model's context length. Check the --help of the export_models.py script.
