
replace_vocabulary crashes when unk_token == pad_token (CLIP case) — ValueError: '<|endoftext|>' is not in list #282

@rola93

Description


Summary

When distilling a static embedding from a model whose tokenizer has the same string for both unk_token and pad_token (CLIP-style tokenizers frequently use "<|endoftext|>" for both), model2vec throws:

ValueError: '<|endoftext|>' is not in list

The exception occurs in tokenizer.py::_rename_added_token, which is called from replace_vocabulary during distill_from_model(...) (and transitively via StaticEmbedding.from_distillation(...) in SentenceTransformers).

Why this happens (root cause)

replace_vocabulary(...) unconditionally calls:

added_tokens = _rename_added_token(unk_token, "[UNK]", added_tokens, pre_tokenized_tokens)
added_tokens = _rename_added_token(pad_token, "[PAD]", added_tokens, pre_tokenized_tokens)

_rename_added_token mutates the vocabulary list in-place:

vocabulary[idx] = new_form

If unk_token == pad_token (e.g., both are "<|endoftext|>"), the first call renames that entry to "[UNK]". The second call then tries to find the original form ("<|endoftext|>") in vocabulary, which now no longer exists, so vocabulary.index(form) raises ValueError.
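A standalone snippet makes the failure mode concrete (the list and helper below only mirror the two quoted lines, not the full _rename_added_token):

vocabulary = ["a", "b", "<|endoftext|>"]

def rename(form: str, new_form: str) -> None:
    idx = vocabulary.index(form)  # raises ValueError once `form` has been renamed
    vocabulary[idx] = new_form

rename("<|endoftext|>", "[UNK]")  # first call succeeds, entry is now "[UNK]"
rename("<|endoftext|>", "[PAD]")  # ValueError: '<|endoftext|>' is not in list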

Expected behavior

distill_from_model should handle the case unk_token == pad_token gracefully (e.g., skip the second rename, or make _rename_added_token idempotent) and proceed to build the static embedding.
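For illustration, the "skip the second rename" option could be as small as a guard around the second call quoted above (a sketch only; whether downstream code relies on a distinct "[PAD]" entry would still need to be checked):

added_tokens = _rename_added_token(unk_token, "[UNK]", added_tokens, pre_tokenized_tokens)
if pad_token != unk_token:
    # When unk_token == pad_token, the entry was already renamed to "[UNK]"
    # above; calling _rename_added_token again would raise ValueError.
    added_tokens = _rename_added_token(pad_token, "[PAD]", added_tokens, pre_tokenized_tokens)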

Actual behavior

It crashes with:

ValueError: '<|endoftext|>' is not in list

originating from model2vec/tokenizer/tokenizer.py::_rename_added_token.

Reproduce it!

This notebook reproduces the error in Google Colab with these dependencies:

Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
PyTorch: 2.8.0+cu126
Transformers: 4.56.1
Sentence-Transformers: 5.1.0
model2vec: 0.6.0
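In case the notebook link gets lost, here is a minimal reproduction sketch (the checkpoint name is just an example; any model whose tokenizer uses one string for both unk_token and pad_token should trigger it):

from model2vec.distill import distill

# CLIP-style tokenizers use "<|endoftext|>" for both unk_token and pad_token,
# which triggers the double rename in replace_vocabulary.
model = distill(model_name="openai/clip-vit-base-patch32")
# -> ValueError: '<|endoftext|>' is not in list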

I tried a simple local patch: rename the current replace_vocabulary to _replace_vocabulary and wrap it as follows:

def _replace_vocabulary(
    tokenizer: Tokenizer, new_vocabulary: list[Token], unk_token: str | None, pad_token: str | None
) -> Tokenizer:
    """The original replace_vocabulary body, unchanged; only renamed with a leading underscore."""
    ...


def replace_vocabulary(
    tokenizer: Tokenizer, new_vocabulary: list[Token], unk_token: str | None, pad_token: str | None
) -> Tokenizer:
    if unk_token is not None and unk_token == pad_token:
        # Both special tokens are the same string: rename it once (to "[UNK]")
        # and drop the pad token so it is not renamed a second time.
        return _replace_vocabulary(
            tokenizer=tokenizer, new_vocabulary=new_vocabulary, unk_token=unk_token, pad_token=None
        )
    # Otherwise behave exactly as before.
    return _replace_vocabulary(
        tokenizer=tokenizer, new_vocabulary=new_vocabulary, unk_token=unk_token, pad_token=pad_token
    )

With this patch the error disappears, but I'm not sure it is the right fix (mainly because of possible consequences later in the distillation pipeline and in other use cases).
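An alternative I considered is making _rename_added_token itself idempotent, roughly like this (hypothetical: I'm only showing the guard, inferred from the call sites quoted above; the rest of the function's bookkeeping would stay as is):

def _rename_added_token(form, new_form, added_tokens, vocabulary):
    if form is None or form not in vocabulary:
        # Nothing to rename: either no special token was configured, or an
        # earlier call already renamed it (the unk_token == pad_token case).
        return added_tokens
    vocabulary[vocabulary.index(form)] = new_form
    # ... original handling of `added_tokens` unchanged ...
    return added_tokens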

Since this is my first contribution to the repository, I’m not sure what the preferred way to address this issue is. I’m happy to implement the fix and open a PR, but I’d like to confirm the correct or recommended approach first.
