### Description

`replace_vocabulary` crashes when `unk_token == pad_token` (CLIP case) — `ValueError: '<|endoftext|>' is not in list`
### Summary

When distilling a static embedding from a model whose tokenizer uses the same string for both `unk_token` and `pad_token` (CLIP-style tokenizers frequently use `"<|endoftext|>"` for both), `model2vec` throws:

```
ValueError: '<|endoftext|>' is not in list
```

The exception occurs in `tokenizer.py::_rename_added_token`, called from `replace_vocabulary` during `distill_from_model(...)` (and transitively via `StaticEmbedding.from_distillation(...)` in Sentence Transformers).
### Why this happens (root cause)

`replace_vocabulary(...)` unconditionally calls:

```python
added_tokens = _rename_added_token(unk_token, "[UNK]", added_tokens, pre_tokenized_tokens)
added_tokens = _rename_added_token(pad_token, "[PAD]", added_tokens, pre_tokenized_tokens)
```

`_rename_added_token` mutates the vocabulary list in place:

```python
vocabulary[idx] = new_form
```

If `unk_token == pad_token` (e.g., both are `"<|endoftext|>"`), the first call renames that entry to `"[UNK]"`. The second call then tries to find the original form (`"<|endoftext|>"`) in `vocabulary`, where it no longer exists, so `vocabulary.index(form)` raises `ValueError`.
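The failure mode can be shown without model2vec at all; here is a minimal standalone sketch of the same double-rename pattern (the `rename` helper is illustrative, not the library's code):

```python
# Renaming the same vocabulary entry twice: the second lookup fails.
vocabulary = ["<|endoftext|>", "hello", "world"]

def rename(form: str, new_form: str, vocabulary: list[str]) -> None:
    idx = vocabulary.index(form)  # raises ValueError if `form` is absent
    vocabulary[idx] = new_form

rename("<|endoftext|>", "[UNK]", vocabulary)  # ok: entry becomes "[UNK]"
rename("<|endoftext|>", "[PAD]", vocabulary)  # ValueError: '<|endoftext|>' is not in list
```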
### Expected behavior

`distill_from_model` should handle the `unk_token == pad_token` case gracefully (e.g., skip the second rename, or make `_rename_added_token` idempotent) and proceed to build the static embedding.
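For the idempotent option, a minimal sketch (the signature is inferred from the calls above; the real `_rename_added_token` also rewrites the matching entry in `added_tokens`, which is elided here):

```python
def _rename_added_token(form, new_form, added_tokens, vocabulary):
    # Hypothetical guard: if `form` is absent (e.g., it was already renamed
    # by a previous call when unk_token == pad_token), return unchanged
    # instead of letting vocabulary.index(form) raise ValueError.
    if form is None or form not in vocabulary:
        return added_tokens
    vocabulary[vocabulary.index(form)] = new_form
    # ... the real implementation would also update `added_tokens` here ...
    return added_tokens
```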
### Actual behavior

It crashes with:

```
ValueError: '<|endoftext|>' is not in list
```

originating from `model2vec/tokenizer/tokenizer.py::_rename_added_token`.
### Reproduce it!

This notebook reproduces the error in Google Colab with these dependencies:

- Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
- PyTorch: 2.8.0+cu126
- Transformers: 4.56.1
- Sentence-Transformers: 5.1.0
- model2vec: 0.6.0
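For reference, the crash should also be reproducible in a couple of lines (the checkpoint is just my example; any model whose tokenizer shares one string between `unk_token` and `pad_token` should trigger it):

```python
from model2vec.distill import distill

# CLIP's tokenizer uses "<|endoftext|>" for both unk_token and pad_token.
m = distill(model_name="openai/clip-vit-base-patch32")
# -> ValueError: '<|endoftext|>' is not in list
```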
As a simple local patch, I renamed the current `replace_vocabulary` to `_replace_vocabulary` and added a wrapper:
```python
def _replace_vocabulary(
    tokenizer: Tokenizer, new_vocabulary: list[Token], unk_token: str | None, pad_token: str | None
) -> Tokenizer:
    """Body unchanged: this is the current replace_vocabulary, only renamed."""
    pass


def replace_vocabulary(
    tokenizer: Tokenizer, new_vocabulary: list[Token], unk_token: str | None, pad_token: str | None
) -> Tokenizer:
    if unk_token is not None and unk_token == pad_token:
        # Both tokens are the same string: drop the pad token so the same
        # vocabulary entry is not renamed twice.
        return _replace_vocabulary(
            tokenizer=tokenizer, new_vocabulary=new_vocabulary, unk_token=unk_token, pad_token=None
        )
    # Otherwise delegate to _replace_vocabulary unchanged.
    return _replace_vocabulary(
        tokenizer=tokenizer, new_vocabulary=new_vocabulary, unk_token=unk_token, pad_token=pad_token
    )
```
The error then disappears, but I'm not sure this is the most effective way to prevent it, mainly because of potential consequences later in the execution and in other use cases.
Since this is my first contribution to the repository, I’m not sure what the preferred way to address this issue is. I’m happy to implement the fix and open a PR, but I’d like to confirm the correct or recommended approach first.