mlx_lm/models/ouro.py
```python
for gate in gates[:-1]:
    lambda_i = mx.sigmoid(gate.squeeze(-1))
```
I would take the sigmoid of the gates on the full vector `gates[:-1]` and then do the loop.
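The suggested refactor relies on sigmoid being elementwise: applying it once to the stacked gates and then looping produces the same values as calling it inside the loop. A minimal sketch of that equivalence, using NumPy as a stand-in for the `mx` ops (the gate shapes here are assumed for illustration):

```python
import numpy as np

def sigmoid(x):
    # Elementwise logistic, matching what mx.sigmoid computes
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical gate stack: (num_steps, batch, seq_len, 1)
gates = np.random.default_rng(0).normal(size=(4, 2, 8, 1))

# Original pattern: one sigmoid call per loop iteration
per_gate = [sigmoid(gate.squeeze(-1)) for gate in gates[:-1]]

# Suggested pattern: sigmoid once on the full stack, then loop
lambdas = sigmoid(gates[:-1].squeeze(-1))

for lambda_i, reference in zip(lambdas, per_gate):
    assert np.allclose(lambda_i, reference)
```

Fusing the sigmoid into one call avoids launching a tiny elementwise kernel per loop step, which is the usual motivation for this kind of change in MLX code.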
It's a pretty interesting model and very nicely implemented. However, I've generally been quite skeptical of models with early exit, as it doesn't play very well with GPUs. In this implementation it's less efficient than just running the model in full for every token, since you have to save and evaluate all the hidden states and then decide which to keep. Ideally you would compute only up to the hidden states you actually need, but that's also quite difficult: every time you have to do control flow based on data (the probabilities), you have to stall the GPU. I think we could merge this, or leave it as an experimental PR; it kind of depends on whether anyone wants to use this model.
Thanks for the feedback, @awni. I did play around a bit with this to see if I could avoid running all the loops for the early exits, but ultimately decided to just mirror the reference implementation to keep it clean. I agree this is experimental, so I'll leave the decision up to you.
This adds support for the Ouro family of models from ByteDance.
Example
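A minimal generation run with one of the converted 4-bit models (the model name is taken from the benchmarks below; the prompt is just an illustration):

```shell
# Generate with the 4-bit Ouro model via the standard mlx-lm CLI
mlx_lm.generate --model mlx-community/Ouro-1.4B-4bit \
    --prompt "Explain looped language models in one paragraph." \
    --max-tokens 100
```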
Configuration
Since Ouro is a looped language model, it has some additional parameters. These can be adjusted as follows. Note that the models run all four UT (universal transformer) steps by default.
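As a sketch, the looped-model parameters can be overridden at load time through the `model_config` argument of `mlx_lm.load`. The parameter name below (`total_ut_steps`) is an assumption standing in for whatever key the Ouro config actually uses; check the model's `config.json` for the real name:

```python
from mlx_lm import load, generate

# Hypothetical override: run 2 of the 4 default UT steps.
# "total_ut_steps" is an assumed key; consult the model's config.json.
model, tokenizer = load(
    "mlx-community/Ouro-1.4B-4bit",
    model_config={"total_ut_steps": 2},
)

text = generate(model, tokenizer, prompt="Hello", max_tokens=50)
```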
Benchmarks
Performance benchmarks on Apple M3 Ultra (80 GPU cores, 512GB RAM).
```shell
python benchmark.py mlx --contexts 2,4,8,16,32,64,128 --max-tokens 200 <model>
```

where `<model>` is one of:

- mlx-community/Ouro-1.4B-4bit
- mlx-community/Ouro-2.6B-4bit