Add Ouro #599

Open

kernelpool wants to merge 9 commits into ml-explore:main from kernelpool:feature/ouro

Conversation

@kernelpool (Contributor) commented Nov 9, 2025

This adds support for the Ouro family of models from ByteDance.

Example

```
mlx_lm.generate --model mlx-community/Ouro-1.4B-4bit -p "who is albert einstein?" -m 4096
==========
Albert Einstein was a German-born theoretical physicist who developed the theory of relativity. He is best known for his famous equation E=mc², which describes the relationship between energy and mass. Einstein was also a pacifist and a strong advocate for civil rights. He was awarded the Nobel Prize in Physics in 1921 for his contributions to the development of the theory of relativity.

==========
Prompt: 27 tokens, 281.049 tokens-per-sec
Generation: 82 tokens, 89.291 tokens-per-sec
Peak memory: 1.061 GB
```

Configuration

Since the Ouro models are looped language models, they have some additional parameters, which can be adjusted as follows. Note that the models run all four UT steps by default.

```python
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("Ouro-1.4B-4bit")

# Standard mlx_lm setup for calling the model directly.
inputs = mx.array(tokenizer.encode("who is albert einstein?"))[None]
cache = make_prompt_cache(model)

# 1. Default (runs all 4 UT steps)
output = model(inputs, cache)

# 2. Override threshold at runtime (some tokens exit at step 1, some at 2, etc.)
output = model(inputs, cache, exit_threshold=0.7)

# 3. Force a specific exit step for all tokens
output = model(inputs, cache, exit_at_step=2)

# 4. Weighted average of all steps
output = model(inputs, cache, use_weighted_exit=True)
```
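For intuition, here is a minimal sketch of how threshold-based exit selection can be expressed, assuming one sigmoid exit gate per UT step (as in the snippet discussed in the review below). The function name and shapes are illustrative, not the PR's internals:

```python
import mlx.core as mx

def choose_exit_steps(gates, exit_threshold):
    # gates: (num_steps, batch, seq) exit-gate logits, one per UT step.
    lambdas = mx.sigmoid(gates)
    exited = lambdas >= exit_threshold
    # Index of the first step whose gate clears the threshold; tokens
    # that never clear it fall back to the final step.
    first = mx.argmax(exited.astype(mx.int32), axis=0)
    return mx.where(mx.any(exited, axis=0), first, gates.shape[0] - 1)
```

A token whose chosen step is s would then take its hidden state from the output of UT step s.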

Benchmarks

Performance benchmarks on Apple M3 Ultra (80 GPU cores, 512GB RAM).

```
python benchmark.py mlx --contexts 2,4,8,16,32,64,128 --max-tokens 200 <model>
```

mlx-community/Ouro-1.4B-4bit

| Context | Prompt Speed | Generation Speed | Memory | Time (200 tok) |
|---------|--------------|------------------|--------|----------------|
| 2K | 1,802 tok/s | 71 tok/s | 3.2 GB | 5.3s |
| 4K | 1,752 tok/s | 60 tok/s | 4.9 GB | 5.6s |
| 8K | 1,557 tok/s | 46 tok/s | 8.2 GB | 11.0s |
| 16K | 1,265 tok/s | 31 tok/s | 14.8 GB | 20.7s |
| 32K | 909 tok/s | 20 tok/s | 28.3 GB | 46.4s |
| 64K | 581 tok/s | 11 tok/s | 54.6 GB | 134.9s |

mlx-community/Ouro-2.6B-4bit

| Context | Prompt Speed | Generation Speed | Memory | Time (200 tok) |
|---------|--------------|------------------|--------|----------------|
| 2K | 904 tok/s | 34 tok/s | 3.5 GB | 8.2s |
| 4K | 870 tok/s | 29 tok/s | 5.5 GB | 13.2s |
| 8K | 778 tok/s | 22 tok/s | 9.0 GB | 21.2s |
| 16K | 631 tok/s | 15 tok/s | 16.0 GB | 41.6s |
| 32K | 453 tok/s | 9 tok/s | 30.0 GB | 97.2s |
| 64K | 289 tok/s | 5 tok/s | 58.0 GB | 269.4s |

Comment on lines 225 to 226
```python
for gate in gates[:-1]:
    lambda_i = mx.sigmoid(gate.squeeze(-1))
```
Member

I would do the sigmoid of the gates on the full vector gates[:-1] and then do the loop.
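A minimal sketch of that suggestion, assuming `gates` is a list of per-step arrays that can be stacked (the per-step loop body is elided):

```python
# Apply the sigmoid once over all steps, then iterate.
lambdas = mx.sigmoid(mx.stack(gates[:-1]).squeeze(-1))
for lambda_i in lambdas:
    ...  # same per-step logic as before
```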

@awni (Member) commented Dec 3, 2025

It's a pretty interesting model and very nicely implemented.

However, I've generally been quite skeptical of models with early exit as it doesn't play very well with GPUs. In this implementation it's less efficient than just running the model in full for every token since you have to save and evaluate all the hidden states and then decide which to keep.

Of course ideally you would try to compute up to the hidden states you actually need, but that's also quite difficult because every time you have to do control flow based on data (the probabilities) you have to stall the GPU.
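To make the stall concrete, here is an illustrative sketch (not this PR's code; `step_fn` and `exit_prob_fn` are hypothetical stand-ins):

```python
def maybe_next_step(h, step_fn, exit_prob_fn, threshold):
    # h is an mx.array; everything up to this point is a lazy graph.
    p = exit_prob_fn(h)
    # .item() forces evaluation: the GPU must drain its queue and the
    # host must read `p` back before deciding what to launch next.
    if p.item() > threshold:
        return h
    return step_fn(h)
```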

I think we could merge this.. or leave it as an experimental PR.. kind of depends if anyone wants to use this model.

@kernelpool (Contributor, Author)

Thanks for the feedback, @awni. Yeah, I did play around a bit with this to see if I could avoid running all loops for the early exits, but ultimately decided to just mirror the reference implementation to keep it clean. I agree with this being experimental, so I'll leave it up to you to decide.
