Is this project abandoned, or almost dead? #3382

@MikeLP

Description

❓ General Questions

This engine was my favorite. The idea is cool, but as we can all see, there have been no new commits and no active development lately.

I'm building a product that works with engines like llama.cpp and others, and MLC-LLM was actually my favorite because it's the only engine that implements GPU parallelism and concurrency for multi-user systems and runs on any kind of device (desktop macOS, Linux, and Windows). There are no real alternatives: vLLM works only with CUDA or AMD MI300+ without a voodoo magic dance, and it doesn't support heterogeneous multi-GPU setups (different GPU architectures in the same configuration). llama.cpp is great, but it isn't well optimized for parallelism and works best for a single user.

So it's sad that the only alternative isn't being developed because of a lack of attention. And I know why, and I know what's killing it.

The key is support for new, top-tier model architectures. If people can’t run the latest models, they’re not interested.

Of course, you may say, "Why don't you implement it yourself?" and you would be completely right!

I WANT TO! I'm not that experienced in C/C++, and I have no CUDA/ROCm experience, but I do know Python and PyTorch. I understand ML concepts and the transformer architecture, and MLC (unlike llama.cpp) lets me implement models in Python.

I was happy to try implementing a new model architecture with the MLC high-level API, and I even tried once. The issue is that there is no documentation and no community to ask for real-time help (such as a Discord with the people responsible for the project). The only link to a Jupyter example is useless: it doesn't answer the questions I have, and it doesn't cover potential known issues and how to resolve them. I could go on for a long time.
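To make "implement it in Python" concrete, here is roughly the level I can work at today. It's a minimal sketch of a gated-MLP block in the style of the existing model files under python/mlc_llm/model/ (which build on TVM's relax.frontend.nn); the class name and parameters are mine, and the exact signatures may differ, so treat it as illustrative:

```python
# Illustrative sketch, modeled on in-tree MLC model definitions;
# not taken from docs, because there are none to take it from.
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import Tensor, op


class GatedMLP(nn.Module):
    """SwiGLU-style feed-forward block used by most recent LLMs."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        # Fused gate+up projection, split again in forward().
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        gate_up = self.gate_up_proj(x)
        gate, up = op.split(gate_up, 2, axis=-1)
        return self.down_proj(op.silu(gate) * up)
```

Writing a block like this is fine; the undocumented part is everything around it (weight loading, KV cache wiring, quantization), and that's exactly where docs or a mentor would matter.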

So my point isn't to blame, but to propose a solution. If there were rich documentation and experienced mentors who could help people implement models, I'm sure the community would try to help. Maybe let's focus on dev documentation?

Again, maybe I don't know a lot about the internals, but that is my honest opinion.

From the point of view of a user (i.e., a developer building a product on top of the engine), here is what I need from the library:

  • More focus on server/desktop devices (e.g., ROCm/CUDA)
  • CPU offload if the model doesn't fit completely in VRAM
  • Structured output via Pydantic models, including streaming (see the first sketch after this list)
  • Tool-call streaming in both thinking and non-thinking modes
  • OpenAI-compatible server with the latest changes, like parsing thinking output into a separate reasoning property in the chunk (see the second sketch after this list)
  • Support for:
    • gpt-oss
    • glm4.5-air
    • qwen3 (moe/vl...)
    • granite 4
    • gemma 3
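To make the structured-output item concrete, this is the workflow I'd like to be able to write against an MLC server. It's a sketch under assumptions: that the server is running locally behind the usual OpenAI-compatible endpoint and that it honors the OpenAI-style json_schema response format; the model id is a placeholder.

```python
# Hypothetical usage sketch: assumes an OpenAI-compatible MLC server
# at localhost:8000 that enforces a json_schema response format.
from openai import OpenAI
from pydantic import BaseModel


class Invoice(BaseModel):
    vendor: str
    total_usd: float


client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="my-local-model",  # placeholder model id
    messages=[{"role": "user", "content": "Extract the invoice: ACME charged $12.50"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": Invoice.model_json_schema()},
    },
)

# Validate straight back into the Pydantic model.
invoice = Invoice.model_validate_json(resp.choices[0].message.content)
print(invoice)
```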
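And for the reasoning property: below is how I'd want to consume a stream where thinking tokens are kept separate from the answer. The reasoning_content field name is my assumption (borrowed from what some other OpenAI-compatible servers emit); the point is only that thinking output shouldn't be mixed into content.

```python
# Hypothetical streaming sketch: assumes the server puts thinking tokens
# in a separate field on the delta instead of mixing them into `content`.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

stream = client.chat.completions.create(
    model="my-local-model",  # placeholder model id
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)  # assumed field name
    if reasoning:
        print(f"[thinking] {reasoning}", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```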

I don't see a reason to use MLC if there is no support for what I mentioned above.

Thanks!
