❓ General Questions
This engine was my favorite. The idea is cool, but as we can all see, there haven't been any new commits or active development.
I'm building a product that works with engines like llama.cpp and others, and MLC-LLM was actually my favorite because it's the only engine that implements GPU parallelism and concurrency for multi-user systems and works on virtually any device (desktop macOS, Linux, and Windows). There are no alternatives: vLLM works only with CUDA or AMD MI300+ without a voodoo magic dance, and it doesn't support multi-ranked systems (different GPU architectures in the same configuration). Llama.cpp is great, but it isn't well optimized for parallelism and works better for a single user.
So, it's sad that the only alternative isn't being developed because of a lack of attention. And I know why, and I know what's killing it.
The key is support for new, top-tier model architectures. If people can’t run the latest models, they’re not interested.
Of course, you may say, "why don’t you implement it yourself?" and you would be completely right!
I WANT TO! I'm not very experienced in C/C++, and I don't have CUDA/ROCm experience, but I do know Python and PyTorch. I understand ML concepts and transformer architecture, and MLC (compared to llama.cpp) allows me to implement it in Python.
I was happy to try implementing a new model architecture with the MLC high-level API, and I even tried once. The issue is that there's no documentation and no community to ask for real-time help (such as a Discord with the people responsible for the project). The only link to a Jupyter example is useless: it doesn't answer the questions I have, and it doesn't cover potential known issues and how to resolve them. I could go on for a long time.
So, my point isn't to blame, but to propose a solution. If there were rich documentation and experienced mentors who could help implement models, I'm sure the community would try to help. Maybe let's focus on dev documentation?
Again, maybe I don’t know a lot about the internal stuff, but that is my honest opinion.
From the user's point of view (i.e., a developer building a product on top of the engine), I need the following to use the library:
- More focus on server/desktop devices (e.g., ROCm/CUDA)
- CPU offload if the model doesn’t fit completely in VRAM
- Structured output via Pydantic models (+streaming)
- Tool-calls streaming in both thinking/non-thinking mode
- OpenAI-compatible server with the latest changes (like parsing thinking output into a separate reasoning property in the chunk)
- Support for:
  - gpt-oss
  - glm4.5-air
  - qwen3 (moe/vl...)
  - granite 4
  - gemma 3
I don't see a reason to use MLC without support for the features I mentioned.
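To make the structured-output and streaming wishlist items concrete, here is a minimal sketch of the kind of request body I'd want MLC's OpenAI-compatible server to accept. The `response_format` shape follows OpenAI's structured-outputs convention; the model name is a placeholder, and the schema is hand-written (rather than generated from a Pydantic model via `model_json_schema()`) only to keep the snippet dependency-free:

```python
import json

# Hand-written JSON schema standing in for what a Pydantic model
# (e.g. Invoice.model_json_schema()) would produce.
invoice_schema = {
    "type": "object",
    "properties": {
        "customer": {"type": "string"},
        "total": {"type": "number"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["customer", "total", "items"],
}

# The request an OpenAI-compatible MLC server would ideally accept:
# streaming enabled AND output constrained to the schema above.
request_body = {
    "model": "qwen3-moe",  # placeholder model name
    "stream": True,        # structured output + streaming together
    "messages": [
        {"role": "user", "content": "Extract the invoice fields."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "Invoice", "schema": invoice_schema},
    },
}

print(json.dumps(request_body, indent=2))
```

Ideally, the streamed chunks coming back would also carry the model's thinking in a separate reasoning field (as in the newer OpenAI-style APIs) instead of mixing it into `content`, which is what the "parsing thinking output" item above is about.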
Thanks!