### Description

`replace_vocabulary` crashes when `unk_token == pad_token` (CLIP case) — `ValueError: '<|endoftext|>' is not in list`
### Summary

When distilling a static embedding from a model whose tokenizer uses the same string for both `unk_token` and `pad_token` (CLIP-style tokenizers frequently use `"<|endoftext|>"` for both), `model2vec` throws:

```
ValueError: '<|endoftext|>' is not in list
```

The exception occurs in `tokenizer.py::_rename_added_token`, called from `replace_vocabulary` during `distill_from_model(...)` (and transitively via `StaticEmbedding.from_distillation(...)` in Sentence Transformers).
### Why this happens (root cause)

`replace_vocabulary(...)` unconditionally calls:

```python
added_tokens = _rename_added_token(unk_token, "[UNK]", added_tokens, pre_tokenized_tokens)
added_tokens = _rename_added_token(pad_token, "[PAD]", added_tokens, pre_tokenized_tokens)
```

`_rename_added_token` mutates the vocabulary list in place:

```python
vocabulary[idx] = new_form
```

If `unk_token == pad_token` (e.g., both are `"<|endoftext|>"`), the first call renames that entry to `"[UNK]"`. The second call then tries to find the original form (`"<|endoftext|>"`) in `vocabulary`, where it no longer exists, so `vocabulary.index(form)` raises `ValueError`.
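The failure mode can be shown without model2vec at all; here is a minimal standalone sketch of the same double-rename pattern (the `rename` helper is illustrative, not the library's code):

```python
# Renaming the same vocabulary entry twice: the second lookup fails.
vocabulary = ["<|endoftext|>", "hello", "world"]

def rename(form: str, new_form: str, vocabulary: list[str]) -> None:
    idx = vocabulary.index(form)  # raises ValueError if `form` is absent
    vocabulary[idx] = new_form

rename("<|endoftext|>", "[UNK]", vocabulary)  # ok: entry becomes "[UNK]"
rename("<|endoftext|>", "[PAD]", vocabulary)  # ValueError: '<|endoftext|>' is not in list
```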
### Expected behavior

`distill_from_model` should handle the `unk_token == pad_token` case gracefully (e.g., skip the second rename, or make `_rename_added_token` idempotent) and proceed to build the static embedding.
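For the idempotent option, a minimal sketch (the signature is inferred from the calls above; the real `_rename_added_token` also rewrites the matching entry in `added_tokens`, which is elided here):

```python
def _rename_added_token(form, new_form, added_tokens, vocabulary):
    # Hypothetical guard: if `form` is absent (e.g., it was already renamed
    # by a previous call when unk_token == pad_token), return unchanged
    # instead of letting vocabulary.index(form) raise ValueError.
    if form is None or form not in vocabulary:
        return added_tokens
    vocabulary[vocabulary.index(form)] = new_form
    # ... the real implementation would also update `added_tokens` here ...
    return added_tokens
```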
### Actual behavior

It crashes with:

```
ValueError: '<|endoftext|>' is not in list
```

originating from `model2vec/tokenizer/tokenizer.py::_rename_added_token`.
### Reproduce it!

This notebook reproduces the error in Google Colab with these dependencies:

- Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
- PyTorch: 2.8.0+cu126
- Transformers: 4.56.1
- Sentence-Transformers: 5.1.0
- model2vec: 0.6.0
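For reference, the crash should also be reproducible in a couple of lines (the checkpoint is just my example; any model whose tokenizer shares one string between `unk_token` and `pad_token` should trigger it):

```python
from model2vec.distill import distill

# CLIP's tokenizer uses "<|endoftext|>" for both unk_token and pad_token.
m = distill(model_name="openai/clip-vit-base-patch32")
# -> ValueError: '<|endoftext|>' is not in list
```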
As a simple local patch, I renamed the current `replace_vocabulary` to `_replace_vocabulary` and added a wrapper:
```python
def _replace_vocabulary(
    tokenizer: Tokenizer, new_vocabulary: list[Token], unk_token: str | None, pad_token: str | None
) -> Tokenizer:
    """Body unchanged: this is the current replace_vocabulary, only renamed."""
    pass


def replace_vocabulary(
    tokenizer: Tokenizer, new_vocabulary: list[Token], unk_token: str | None, pad_token: str | None
) -> Tokenizer:
    if unk_token is not None and unk_token == pad_token:
        # Both tokens are the same string: drop the pad token so the same
        # vocabulary entry is not renamed twice.
        return _replace_vocabulary(
            tokenizer=tokenizer, new_vocabulary=new_vocabulary, unk_token=unk_token, pad_token=None
        )
    # Otherwise delegate to _replace_vocabulary unchanged.
    return _replace_vocabulary(
        tokenizer=tokenizer, new_vocabulary=new_vocabulary, unk_token=unk_token, pad_token=pad_token
    )
```
The error then disappears, but I'm not sure this is the most effective way to prevent it, mainly because of potential consequences later in the execution and in other use cases.
Since this is my first contribution to the repository, I’m not sure what the preferred way to address this issue is. I’m happy to implement the fix and open a PR, but I’d like to confirm the correct or recommended approach first.