❓ General Questions
This engine was my favorite. The idea is cool, but as we can all see, there haven't been any new commits or active development.
I'm building a product that works with engines like llama.cpp and others, and MLC-LLM was actually my favorite because it's the only engine that implements GPU parallelism and concurrency for multi-user systems and works on virtually any device (desktop macOS, Linux, and Windows). There are no alternatives: vLLM works only with CUDA or AMD MI300+ without a voodoo magic dance, and it doesn't support multi-ranked systems (different GPU architectures in the same configuration). Llama.cpp is great, but it isn't well optimized for parallelism and works better for a single user.
So, it's sad that the only alternative isn't being developed because of a lack of attention. And I know why, and I know what's killing it.
The key is support for new, top-tier model architectures. If people can’t run the latest models, they’re not interested.
Of course, you may say, "why don’t you implement it yourself?" and you would be completely right!
I WANT TO! I'm not very experienced in C/C++, and I don't have CUDA/ROCm experience, but I do know Python and PyTorch. I understand ML concepts and transformer architecture, and MLC (compared to llama.cpp) allows me to implement it in Python.
I was happy to try implementing a new model architecture with the MLC high-level API, and I even tried once. The issue is that there's no documentation and no community to ask for real-time help (such as a Discord with the people responsible for the project). The only link to a Jupyter example is useless: it doesn't answer the questions I have, and it doesn't cover potential known issues and how to resolve them. I could go on for a long time.
So, my point isn't to blame, but to propose a solution. If there were rich documentation and experienced mentors who could help implement models, I'm sure the community would try to help. Maybe let's focus on dev documentation?
Again, maybe I don’t know a lot about the internal stuff, but that is my honest opinion.
From the user's point of view (i.e., a developer building a product on top of the engine), I need the following to use the library:
- More focus on server/desktop devices (e.g., ROCm/CUDA)
- CPU offload if the model doesn’t fit completely in VRAM
- Structured output via Pydantic models (+streaming)
- Tool-calls streaming in both thinking/non-thinking mode
- OpenAI-compatible server with the latest changes (like parsing thinking output into a separate reasoning property in the chunk)
- Support for:
  - gpt-oss
  - glm4.5-air
  - qwen3 (moe/vl...)
  - granite 4
  - gemma 3
I don't see a reason to use MLC without support for the features I mentioned.
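To make the structured-output and streaming wishlist items concrete, here is a minimal sketch of the kind of request body I'd want MLC's OpenAI-compatible server to accept. The `response_format` shape follows OpenAI's structured-outputs convention; the model name is a placeholder, and the schema is hand-written (rather than generated from a Pydantic model via `model_json_schema()`) only to keep the snippet dependency-free:

```python
import json

# Hand-written JSON schema standing in for what a Pydantic model
# (e.g. Invoice.model_json_schema()) would produce.
invoice_schema = {
    "type": "object",
    "properties": {
        "customer": {"type": "string"},
        "total": {"type": "number"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["customer", "total", "items"],
}

# The request an OpenAI-compatible MLC server would ideally accept:
# streaming enabled AND output constrained to the schema above.
request_body = {
    "model": "qwen3-moe",  # placeholder model name
    "stream": True,        # structured output + streaming together
    "messages": [
        {"role": "user", "content": "Extract the invoice fields."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "Invoice", "schema": invoice_schema},
    },
}

print(json.dumps(request_body, indent=2))
```

Ideally, the streamed chunks coming back would also carry the model's thinking in a separate reasoning field (as in the newer OpenAI-style APIs) instead of mixing it into `content`, which is what the "parsing thinking output" item above is about.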
Thanks!