
Add Arcee (AFM) model support to vLLM #21267


Closed
wants to merge 6 commits

Conversation

@alyosha-swamy (Contributor) commented Jul 20, 2025

[New Model] Support Arcee (Arcee Foundational Models)

1. Purpose (Why this PR?)

Add inference support for Arcee Foundational Model (AFM) so that users can serve it with vLLM in both Python and API-server workflows. AFM uses a unique ReLU² activation in its MLP layers, differentiating it from standard Llama-based models.

2. Model details

Field                 | Value / Reference
Source repo / HF id   | huggingface.co/arcee-ai/AFM-4.5B-Base
Architecture          | Llama-style decoder-only transformer with ReLU² MLP activation
Context length        | 64k tokens
Hidden size / #layers | 4096 / 32
License               | CC BY-NC 4.0
Special quirks        | Uses ReLU² (squared ReLU) activation instead of SiLU in MLP layers

3. Implementation overview

  • Added ArceeForCausalLM class in vllm/model_executor/models/arcee.py with custom ArceeMLP using ReLU² activation
  • Registered model in _TEXT_GENERATION_MODELS in vllm/model_executor/models/registry.py (entry sketched after this list)
  • Updated docs/models/supported_models.md with Arcee entry in text generation table
  • Reused LlamaAttention from existing Llama implementation for attention layers
  • Implemented proper LoRA and Pipeline Parallelism support
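
For reference, a minimal sketch of that registry entry (other entries omitted; this assumes the registry maps the architecture name from config.json to a (module, class-name) pair, as for other Llama-family models):

# vllm/model_executor/models/registry.py (sketch, other entries omitted)
_TEXT_GENERATION_MODELS = {
    # ... existing architectures ...
    "ArceeForCausalLM": ("arcee", "ArceeForCausalLM"),
}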

4. Performance / sanity check

$ python -m vllm.entrypoints.openai.api_server --model arcee-ai/AFM-4.5B-Base --trust-remote-code
$ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "arcee-ai/AFM-4.5B-Base",
    "prompt": "The future of artificial intelligence is",
    "max_tokens": 50
}'

Expected: A coherent, contextually appropriate completion

Observed: " a question that has been asked throughout the history of mankind. The search for an answer to this question has inspired countless works of art, literature, and philosophy. Whether we consider the existentialist ideas of Albert Camus or the religious perspectives of spiritual leaders"
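
An equivalent offline sanity check with the Python API (a minimal sketch; it assumes the checkpoint is reachable from the HF Hub or a local cache):

from vllm import LLM, SamplingParams

# Offline counterpart of the API-server check above.
llm = LLM(model="arcee-ai/AFM-4.5B-Base", trust_remote_code=True)
params = SamplingParams(max_tokens=50, temperature=0.0)
outputs = llm.generate(["The future of artificial intelligence is"], params)
print(outputs[0].outputs[0].text)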

5. Test plan ✔️

Test          | Command                                                                | Expected
Unit          | pytest tests/models/test_arcee.py                                      | All tests pass
Model loading | python -c "from vllm import LLM; llm = LLM('arcee-ai/AFM-4.5B-Base')"  | Model loads without errors
Integration   | vllm serve arcee-ai/AFM-4.5B-Base --trust-remote-code                  | Server starts and responds to requests
Generation    | curl localhost:8000/v1/completions                                     | 200 OK + valid completions

6. Documentation

  • Added row to docs/models/supported_models.md under Text Generation models
  • Model listed as ArceeForCausalLM with example model arcee-ai/AFM-4.5B-Base
  • Marked as supporting LoRA (✅), Pipeline Parallel (✅), and V1 (✅)

Checklist

  • I ran pre-commit run --all-files (ruff formatting)
  • All CI tests pass locally (pytest -q)
  • The PR description follows vLLM's "Essential Elements" template
  • No breaking changes for existing model classes

Notes for reviewers

The key architectural difference from standard Llama models is the MLP activation function. Arcee uses ReLU² (squared ReLU) instead of SiLU (a stripped-down sketch follows the list below):

  • ArceeMLP implements: x = torch.pow(torch.relu(x), 2)
  • No gating mechanism (no gate_proj), only up_proj and down_proj
  • All other components (attention, layer norm, etc.) reuse existing Llama implementations
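
A stripped-down PyTorch sketch of that activation path (plain nn.Linear stands in for vLLM's tensor-parallel linear layers, and the class name here is illustrative rather than the exact module added in this PR):

import torch
from torch import nn

class ArceeMLPSketch(nn.Module):
    """Illustration of the AFM MLP: ReLU^2 activation, no gate_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        # Only up_proj and down_proj; there is no gating branch.
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.up_proj(x)
        x = torch.pow(torch.relu(x), 2)  # ReLU^2 instead of a SiLU-gated MLP
        return self.down_proj(x)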

The model has been tested with an internal HF repo during development, but the official model is arcee-ai/AFM-4.5B-Base.

Test results

seq | Prompt                                  | vLLM Output
0   | "The meaning of life is"                | " a question that has been asked throughout the history of mankind. The search for an answer to this question has inspired countless works of art, literature, and philosophy. Whether we consider the existentialist ideas of Albert Camus or the religious perspectives of spiritual leaders"
1   | "Climate change is primarily caused by" | " human activity, specifically the emission of greenhouse gases such as carbon dioxide (CO2) and methane (CH4). It leads to changes in average temperatures and weather patterns, impacting both nature and human society."
2   | "Machine learning algorithms work by"   | " training a predictive model using labeled training data: the model detects patterns in the training data and learns from it. That model is then tested using a test set, which it must predict to achieve a good accuracy rate."

All outputs are coherent and contextually appropriate.

Signed-off-by: alyosha-swamy <raghav@arcee.ai>
@alyosha-swamy requested a review from hmellor as a code owner on July 20, 2025 22:05

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the documentation (Improvements or additions to documentation) and new-model (Requests to new models) labels on Jul 20, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for Arcee (AFM) models in vLLM. The implementation is well organized: a dedicated ArceeMLP module handles the ReLU² activation while the existing LlamaAttention module is reused. The changes are clean, and the documentation and model-registry updates are appropriate. One high-severity issue was identified: a misleading comment in the weight-loading logic, which should be addressed to keep the codebase clear and maintainable.

Comment on lines 471 to 473
# No special weight name remapping needed (AFM uses standard LLaMA
# naming except no gate_proj)
return loader.load_weights(weights)
Contributor

Severity: high

This comment is misleading and should be removed. While it's true that AFM uses standard LLaMA naming (except for the missing gate_proj), the AutoWeightsLoader does perform weight-name remapping, specifically for the QKV projection weights (fusing the separate q_proj, k_proj, and v_proj into a single qkv_proj). Leaving the comment as-is could cause confusion during future maintenance, since it contradicts the actual code behavior.

"qkv_proj": ["q_proj", "k_proj", "v_proj"],
}
# Supported LoRA modules (same as LLaMA, minus gate_proj)
supported_lora_modules = [
Collaborator

We removed supported_lora_modules a long time ago. Please delete it.

@jeejeelee
Collaborator

…ment

- Remove deprecated supported_lora_modules attribute
- Add ArceeForCausalLM to test registry

Signed-off-by: alyosha-swamy <raghav@arcee.ai>
packed_modules_mapping = {
"qkv_proj": ["q_proj", "k_proj", "v_proj"],
}
# (No MLP prefix since there's no gate_proj)
Collaborator

Suggested change (delete this comment line):
# (No MLP prefix since there's no gate_proj)

self.unpadded_vocab_size += lora_config.lora_extra_vocab_size

# Import DEFAULT_VOCAB_PADDING_SIZE
from vllm.model_executor.layers.vocab_parallel_embedding import (
@jeejeelee (Collaborator) commented on Jul 21, 2025

Why import here?

@jeejeelee
Collaborator

Can you fix the pre-commit failure?

@jeejeelee
Collaborator

It seems that this model is not available on HF

@hmellor
Member

hmellor commented Jul 21, 2025

If this is a private model, please either:

@adarshxs

@hmellor @jeejeelee the model is set to be released as open weights on HF soon; this PR therefore provides day-zero support for the model. The architecture is the same as Llama with a ReLU² activation function.

@hmellor
Member

hmellor commented Jul 21, 2025

If the architecture is the same as Llama with only the activation function changed, it will probably work with --model-impl transformers, meaning no changes to vLLM are necessary.
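
For context, a minimal sketch of that fallback with the offline API (hypothetical usage; it assumes the installed vLLM version accepts model_impl="transformers" as an engine argument):

from vllm import LLM

# Sketch: load AFM through vLLM's generic Transformers backend instead of a
# dedicated model class. `model_impl="transformers"` is assumed to be
# supported by the installed vLLM version.
llm = LLM(
    model="arcee-ai/AFM-4.5B-Base",
    model_impl="transformers",
    trust_remote_code=True,
)
print(llm.generate(["The future of artificial intelligence is"])[0].outputs[0].text)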

- Set is_available_online=False in test registry for CI compatibility
@hmellor mentioned this pull request on Jul 21, 2025
4 participants