Force the model to write some tokens mid-generation? #37771

Open
blazgocompany opened this issue Apr 24, 2025 · 3 comments
Labels
Feature request Request for a new feature

Comments

@blazgocompany

Feature request

Here’s an example:

User: Hello, make a Python function for something
Assistant: Here's a function for that:

def function():
    pass                    ← this is a line we tuned the model to generate
import pytest
assert foo == bar           ← execute the tests right after this token was predicted
Result: tests succeeded     ← THESE are the forced tokens; we also tuned the model to generate this
Ok, looks like the function is working…

EDIT:

The LLM is trained to respond with the block shown above. However, since LLMs are bad at detecting when they have made a mistake, they will lean towards saying "succeeded" for everything.
After the inference pass for the token "succeeded" there will be a probability distribution, e.g.

succeeded 0.5
failed 0.3
etc.

So I want to "force" the model to pick "failed" (or "succeeded") even though it is the less likely token. It seems like something very simple, but there is no support for it.
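
To make the idea concrete, here is a minimal sketch of what I mean (the "gpt2" checkpoint and the token strings are just placeholders): run one forward pass manually, look at the next-token distribution at the "Result:" position, and append the less likely token by hand before continuing generation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("Result: tests", return_tensors="pt").input_ids
failed_id = tokenizer(" failed", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    logits = model(prompt_ids).logits[:, -1, :]            # distribution over the next token
probs = logits.softmax(dim=-1)
print(probs[0, failed_id])                                 # e.g. 0.3 while " succeeded" gets 0.5

# force the less likely token by appending it ourselves, then keep generating
prompt_ids = torch.cat([prompt_ids, torch.tensor([[failed_id]])], dim=-1)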

Motivation

In case the point isn't clear already: doing this could make open-source LLMs significantly better for agentic workflows. Unlike stopping generation, calling tools, and otherwise creating delays, this works right between inference passes. Agentic workflows on proprietary LLMs can add up costs fast.

Your contribution

I'm not familiar with this codebase and it seems very complex, but the feature is very simple. Maybe someone could give me some pointers.

blazgocompany added the Feature request label on Apr 24, 2025
@blazgocompany
Author

Asked ChatGPT about it:

Below are some practical “entry points” in 🤗 Transformers that let you overwrite the next-token choice, even if the token you want (e.g. "failed") has lower probability than the model’s top guess.
Everything can be done without changing any C++/CUDA kernels – just a few Python classes and a call to model.generate().

Understand where the decision is made
model.generate()
  └─► _generate()
        └── _greedy_search() / _sample()
              ├── logits = model(...)
              ├── logits = logits_processor(logits)   ← place to hack
              ├── logits = logits_warper(logits)
              ├── next_token = sample(argmax)         ← after hack
              └── append next_token, loop …

Key locations in the source tree (v4.39.*):

src/transformers/generation/utils.py – generation loops
src/transformers/generation/logits_process.py – all LogitsProcessor / Warper classes
src/transformers/generation/configuration_utils.py – GenerationConfig helpers

Three out-of-the-box hooks you can already use
A. forced_decoder_ids / ForcedTokensLogitsProcessor

Works when you know in advance that at position N the token must be X.
Example:
gen_cfg = model.generation_config
gen_cfg.forced_decoder_ids = [(offset_from_prompt, tokenizer('failed', add_special_tokens=False).input_ids[0])]  # add_special_tokens=False avoids picking up a BOS token
model.generate(..., generation_config=gen_cfg)
B. prefix_allowed_tokens_fn

Callable invoked every step; you receive (batch_id, input_ids) and return the set of IDs that are legal next tokens.
If you return only [failed_id] at the step after your asserts, the model is forced to output “failed”.
def constraint(batch_id, input_ids):
    # input_ids is the 1-D sequence generated so far for this batch item
    text = tokenizer.decode(input_ids)
    if text.endswith("Result: "):
        return [failed_id]                       # only one legal token
    return list(range(tokenizer.vocab_size))     # everything allowed otherwise

model.generate(..., prefix_allowed_tokens_fn=constraint)

C. stop-generate-then-append

Let generation stop after the code block, run pytest, then feed the original prompt + "Result: failed\n" back into generate() with past_key_values to keep it fast.
Costs one extra forward pass but zero code inside Transformers.
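
A hedged sketch of option C, assuming a test runner is already available (run_pytest below is a placeholder) and that model, tokenizer and input_ids come from the surrounding code:

import torch

# 1) generate until the code block and the asserts have been written
first = model.generate(input_ids, max_new_tokens=128)

# 2) run the tests outside the model (run_pytest is a placeholder)
tests_passed = run_pytest(tokenizer.decode(first[0], skip_special_tokens=True))
verdict = "succeeded" if tests_passed else "failed"

# 3) append the verdict as plain text and let the model continue from there;
#    reusing past_key_values is an optional optimization to avoid re-encoding the prefix
verdict_ids = tokenizer(f"Result: tests {verdict}\n",
                        return_tensors="pt", add_special_tokens=False).input_ids
second = model.generate(torch.cat([first, verdict_ids], dim=-1), max_new_tokens=128)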
If you need total control: write a custom LogitsProcessor
A LogitsProcessor is only ~15 lines of code; it receives the raw logits and must return modified logits of the same shape.

Example: force a single token at a particular time step

from transformers import LogitsProcessor

class ForceTokenAtPosition(LogitsProcessor):
    def __init__(self, pos: int, token_id: int):
        self.pos = pos                           # absolute length (prompt included) at which to force
        self.token_id = token_id

    def __call__(self, input_ids, scores):
        cur_len = input_ids.shape[-1]            # length incl. prompt
        if cur_len == self.pos:
            mask = scores.new_full(scores.size(), float("-inf"))
            mask[:, self.token_id] = 0.0         # only token_id allowed
            return mask
        return scores

Usage:

processor = ForceTokenAtPosition(
    pos=prompt_len + code_block_len + 1,               # where “failed” must appear
    token_id=tokenizer.convert_tokens_to_ids("failed"),
)
outputs = model.generate(
    input_ids,
    logits_processor=[processor],
    do_sample=False,   # or True – doesn’t matter, there’s only one choice
)

Dynamic forcing (decide after pytest)

class DynamicForceNext(LogitsProcessor):
    def __init__(self):
        self.force_next: int | None = None

    def set(self, token_id: int | None):
        self.force_next = token_id

    def __call__(self, input_ids, scores):
        if self.force_next is not None:
            mask = scores.new_full(scores.size(), float("-inf"))
            mask[:, self.force_next] = 0
            self.force_next = None          # reset for later steps
            return mask
        return scores

In your outer loop run dyn_proc.set(failed_id) immediately after executing the tests.
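
A hedged sketch of that outer loop, stepping generation one token at a time so the tests can run between forward passes (prompt, max_new_tokens, run_pytest, failed_id and succeeded_id are placeholders):

import torch
from transformers import LogitsProcessorList

dyn_proc = DynamicForceNext()
processors = LogitsProcessorList([dyn_proc])

ids = tokenizer(prompt, return_tensors="pt").input_ids
for _ in range(max_new_tokens):
    with torch.no_grad():
        logits = model(ids).logits[:, -1, :]      # next-token logits
    logits = processors(ids, logits)              # applies DynamicForceNext if armed
    next_id = logits.argmax(dim=-1, keepdim=True) # greedy decoding for simplicity
    ids = torch.cat([ids, next_id], dim=-1)

    text = tokenizer.decode(ids[0])
    if text.endswith("assert foo == bar\n"):      # the tests have just been written out
        verdict_id = succeeded_id if run_pytest(text) else failed_id
        dyn_proc.set(verdict_id)                  # forces exactly the next token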

Where to drop the code in the repo
• src/transformers/generation/logits_process.py
– add the new class(es) next to MinLengthLogitsProcessor.
• src/transformers/generation/__init__.py
– export the symbol so users can import it.
• Optionally extend GenerationConfig (add a flag and construct the processor inside GenerationMixin).

Then unit-test:

# tests/generation/test_force_token.py
import torch
# import ForceTokenAtPosition from wherever you place it, e.g.
# from transformers.generation.logits_process import ForceTokenAtPosition

def test_force_token():
    proc = ForceTokenAtPosition(3, 42)
    scores = torch.randn(1, 50)
    ids = torch.tensor([[1, 2, 3]])
    out = proc(ids, scores.clone())
    assert out[0, 42] == 0 and torch.isinf(out).sum() == 49

Performance notes
• A LogitsProcessor runs on the same device as the logits (CUDA tensor) – no CPU-GPU sync if you keep your math inside the tensor world.
• For agentic workflows you usually step through generation anyway, so the cost is negligible compared to another call to the model or, worse, spawning a sandboxed pytest.

TL;DR
You don’t need a new feature in Transformers – the library already has three hooks (forced_decoder_ids, prefix_allowed_tokens_fn, custom LogitsProcessor) that let you deterministically pick “failed” (or any token) even when its probability is low.
Start by prototyping a tiny LogitsProcessor in your own script; once it works, upstream it to src/transformers/generation/logits_process.py so everyone can use it.

This doesn't make much sense to me, but someone who's familiar may know what to do.

@Rocketknight1
Member

It seems like you're asking for structured generation here - cc @gante what do we recommend for people these days? Just using Outlines?

@blazgocompany
Author

Oh, so we don't need a custom Logits Processor?
