
Add harmony tool call parser#116

Open
louis-jan wants to merge 1 commit into ml-explore:main from louis-jan:add-harmony-tool-call-parser

Conversation

@louis-jan
Contributor

As mentioned in issue #61, mlx-swift-lm includes a GPT-OSS model definition, but GPT-OSS requires Harmony formatting end-to-end. Since the thinking case is not yet standardized (see #79), that part can still be parsed at the application layer. Tool calls, however, are available for most supported models, so this PR adds a tool call parser for the GPT-OSS model.

Proposed changes

Adds:

  • HarmonyToolCallParser to parse tool calls for the GPT-OSS model
  • Moves includeStopToken into the ToolHandler implementation, since it is used only in RawTokenLoopHandler. generateLoopTask should not need to know about this setting, as it is ToolHandler-specific logic.
  • In Harmony, <|call|> is both the tool call end tag and an EOS token, so this PR adds a check of the current tool call parser state and detokenizes the final token, so the tool call parse can complete and emit.
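The core difficulty in the last bullet is that one token both closes the tool call and terminates generation. A minimal sketch of the state-machine idea (illustrative only — the type and tag names here are placeholders, not the PR's actual implementation):

```swift
// Hypothetical sketch: a tiny tool-call state machine where the end tag
// doubles as an EOS token (like Harmony's <|call|>), so the final chunk
// must complete the pending tool call instead of just ending the stream.
enum ParseState { case normal, collectingToolCall }

struct MiniToolCallParser {
    let startTag = "<|start|>"   // placeholder tags for illustration
    let endTag = "<|call|>"
    var state = ParseState.normal
    var buffer = ""
    var toolCalls: [String] = []

    // Feed one detokenized chunk; returns true if a tool call just completed.
    mutating func process(chunk: String) -> Bool {
        switch state {
        case .normal:
            if chunk == startTag { state = .collectingToolCall }
            return false
        case .collectingToolCall:
            if chunk == endTag {
                // The EOS chunk closes the collected tool call.
                toolCalls.append(buffer)
                buffer = ""
                state = .normal
                return true
            }
            buffer += chunk
            return false
        }
    }
}
```

The generation loop can then consult `state` when it sees the stop token and, if a tool call is mid-collection, run the chunk through the parser before terminating.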

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Comment on lines -1033 to +1039
Before:

    if includeStopToken {
        if !handler.onStopToken(token, emit: continuation.yield) {
            stopReason = .cancelled
            break
        }
    }

After:

    // Increase tokenCount only if handler emits end token
    if !handler.onStopToken(token, emit: { value in
        tokenCount += 1
        return continuation.yield(value)
    }) {
        stopReason = .cancelled
        break
    }

Contributor Author

@louis-jan louis-jan Feb 20, 2026


@ronaldmannak FYI, take a look at this update. I've refactored so that includeStopToken is encapsulated within RawTokenHandler only. Now, whenever a token needs to be emitted, the count increments automatically, without this function needing to be aware of the includeStopToken flag.
cc @davidkoski

Comment on lines +1315 to +1332
// Check whether toolCallProcessor is currently collecting a tool call.
// Only detokenize in that case (rather than always) for better performance,
// since the EOS token can also be the tool call end tag (e.g. GPT-OSS).
if toolCallProcessor.state == .collectingToolCall {
    detokenizer.append(token: token)
    if let chunk = detokenizer.next() {
        if chunk == toolCallProcessor.parser.endTag {
            // Process the chunk through the tool call processor.
            _ = toolCallProcessor.processChunk(chunk)

            // Emit the completed tool call if one was produced.
            if let toolCall = toolCallProcessor.toolCalls.popLast() {
                _ = emit(.toolCall(toolCall))
            }
        }
    }
}
return true
Contributor Author


@davidkoski, the gpt-oss model returns <|call|> as the tool call end tag, but it is also the EOS stop token. This change checks whether the tool call parse needs to be completed instead of just ending the stream.

Comment on lines +1362 to 1366
if self.includeStopToken {
    if case .terminated = emit(.token(token)) {
        return false
    }
}
Contributor Author


This way generateLoopTask does not need to care about emitting the EOS token, since that is logic specific to this raw token handler.

Comment on lines +31 to +32
var state = State.normal
let parser: any ToolCallParser
Contributor Author


I don't really like this change, but I'm exposing these so that TextToolTokenLoopHandler can check whether the tool parser is mid-parse, and can compare chunks against the endTag.
https://github.com/ml-explore/mlx-swift-lm/pull/116/changes#diff-d636823870a24e57e1e1d2c6b125edc68ba15e35c0122cb749fab0fd714d601fR1315

@ronaldmannak
Contributor

Hi @louis-jan — thanks for tagging me, I appreciate it.
There are a couple of Harmony-specific issues, which is why I didn’t add Harmony parsing in the Generate Token PR:

  • Parsing needs to happen at the token level, not on the tokenized string.
  • Even with token-level parsing, non–tool-calling Harmony output doesn’t map cleanly onto MLX-Swift-LM’s Generation model. That leaves you with two rough options:
    • Lose important nuance (which matters for things like Open Responses), or
    • Expose Harmony tags in the generated text.

Neither is an acceptable option IMO.
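To make the second option concrete: raw GPT-OSS output interleaves Harmony control tokens with the text. A simplified illustration (the exact role/channel headers follow the Harmony format docs and may differ in detail):

```
<|channel|>analysis<|message|>User asked for the weather; call the tool...<|end|>
<|start|>assistant<|channel|>commentary to=functions.get_weather<|message|>{"location": "SF"}<|call|>
```

Stripping these tags without a real parser loses the channel information; leaving them in exposes raw Harmony markup to the app.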
For now, the solution I am using is doing the harmony parsing in the app layer. OpenAI provides a Rust library for Harmony parsing. I built an FFI wrapper for it: Pico Harmony. The catch is that the FFI currently ships with a precompiled version of the Rust library, and I totally understand why folks would be hesitant to depend on that.
I’ve offered for OpenAI to take ownership of the library, but I haven’t heard back yet (hint hint @stephancasas). I also haven’t followed up :)
I’d be happy to transfer ownership to ml-explore as well, but integrating Harmony properly at the MLX-Swift-LM layer would require a fairly substantial refactor.
I guess it depends on where the MLX team wants to take the library.

@davidkoski
Collaborator

davidkoski commented Feb 20, 2026

Very interesting -- I wasn't familiar with harmony, but if I am reading it right, it is describing the structure of the output. Sort of the inverse of UserInputProcessor + chat template. The structure they define looks reasonable, but mlx-swift-lm would need to support outputs from many models. We could make a generic structure (like UserInput) that we could translate tokens -> harmony parser -> ModelOutput.

Aside from tool calling, most models produce bare unstructured output, which we could certainly wrap in this structure.

Although using the official library sounds nice, I don't think we want that in mlx-swift-lm as it complicates the build tremendously and we wouldn't want to require people to use a prebuilt library. Ideally we would allow somebody to build a third party library that did link the official harmony library and would interoperate with the mlx-swift-lm infrastructure just like any other model. I think that mostly requires:

  • definition of something like ModelOutput which could be streamed from generation
  • the right hooks to parse the token stream and turn it into structured output (a state machine)
  • tool parsing and bare text would fit into this for existing models
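A rough sketch of the kind of generic structure described above (the names ModelOutput, OutputParser, and PlainTextParser are illustrative, not existing mlx-swift-lm API):

```swift
// Hypothetical ModelOutput: a generic structured-output type that
// generation could stream, covering text, reasoning, and tool calls.
enum ModelOutput: Equatable {
    case text(String)                               // streamed visible text
    case reasoning(String)                          // e.g. think-tag content
    case toolCall(name: String, arguments: String)  // parsed tool invocation
}

// The hook a third-party Harmony parser could conform to, letting it plug
// into the token loop without mlx-swift-lm linking the Rust library itself.
protocol OutputParser {
    // Consume one detokenized chunk; return any outputs that became complete.
    mutating func consume(_ chunk: String) -> [ModelOutput]
}

// Bare unstructured output (most models) fits the same shape trivially.
struct PlainTextParser: OutputParser {
    mutating func consume(_ chunk: String) -> [ModelOutput] {
        return [.text(chunk)]
    }
}
```

A Harmony-backed conformer would hold the state machine internally, which keeps the streaming constraint visible: each `consume` call can still surface partial text as it arrives.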

The perhaps interesting part is how this interacts with streaming text -- I think the ability to see the text as it is generated is key. Not impossible but it is a constraint on the representation and how it is built.

If Generation had the bare token, I wonder if we could accomplish this with composition? Simply wrap the streamDetails() call in a structured decoder. Maybe not as it affects tool calling, which (in #114) is part of the inner loop.

I will talk with the other mlx folks and see where mlx-lm is headed and think about how we might make something that would support a wide variety of models.

See also:

@ronaldmannak
Contributor

ronaldmannak commented Feb 21, 2026

Those are great links, especially the comprehensive bug report.
I'm not sure I understand your concern about streaming. Harmony uses chunks, and that part doesn't change (apart from the extra metadata, channels, etc.).
If you're considering adding a raw token to Generation, maybe we should add a reasoning option to Generation as well for reasoning models, because that is getting a bit chaotic too (e.g. <think></think> vs [think][/think], and GLM 4.7 doesn't even emit a <think> prefix). It would make sense to have the generate loop take care of that automatically (perhaps at the token level, not sure).
However, just adding a raw token to Generation itself doesn't solve the issue, and it might even be unwanted to detokenize the tokens for gpt-oss before the Harmony parser sees them. So you would probably use generateTokenTask instead of generateTask anyway. At least, that's how I do it in Pico: two different paths for Harmony and non-Harmony.

I do think you want to use the official OpenAI Harmony parser, given the complexity and since the specs might change in the future. You don't want to reinvent the wheel and you also don't want to just support a subset of Harmony. Since you can't use the official Rust parser directly in Swift, you'll need to have an FFI like Pico Harmony. You can have the developer add their own parser as an adapter, but that might just be confusing.

How about this:

  • We add Harmony FFI to MLX-Swift-LM but keep it a separate and optional target so projects that don't support GPT-oss don't have to link the code. I'm happy to rename Pico Harmony to MLX-Harmony or something and create a PR
  • Maybe there's a better solution than bundling the compiled Rust library (even when compiled by the MLX team) and developers may be able to compile the library themselves easily (I am no expert in Swift or Rust packages, but since I know you're good at finding obscure things like JSON 5 support, you might find a better solution)
  • Instead of trying to add full Harmony support in Generation you could maybe instead create a separate generateHarmonyTask that returns streaming Harmony items and internally uses generateTokenTask and the same tool calling code as non-Harmony models.

The developer will have to use generation like so:

import MLXLMCommon
import MLXHarmony

if model.type == "gpt_oss" {
  generateHarmonyTask {...}
} else {
  generateTask {...}
}
