Add Arcee (AFM) model support to vLLM #21267
Conversation
Signed-off-by: alyosha-swamy <raghav@arcee.ai>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request adds support for Arcee (AFM) models to vLLM. The implementation is well organized: a dedicated ArceeMLP module handles the ReLU² activation, while the existing LlamaAttention module is reused. The changes are clean, and the documentation and model-registry updates are appropriate. One high-severity issue was identified: a misleading comment in the weight-loading logic. Fixing it is important for the clarity and maintainability of the codebase.
vllm/model_executor/models/arcee.py
Outdated
# No special weight name remapping needed (AFM uses standard LLaMA
# naming except no gate_proj)
return loader.load_weights(weights)
This comment is misleading and should be removed. While it's true that AFM uses standard LLaMA naming (except for the missing gate_proj), the AutoWeightsLoader does perform weight name remapping, specifically for the QKV projection weights (fusing the separate q_proj, k_proj, and v_proj into a single qkv_proj). Leaving this comment as is could cause confusion during future maintenance, as it contradicts the actual code behavior.
vllm/model_executor/models/arcee.py
Outdated
"qkv_proj": ["q_proj", "k_proj", "v_proj"], | ||
} | ||
# Supported LoRA modules (same as LLaMA, minus gate_proj) | ||
supported_lora_modules = [ |
We removed supported_lora_modules a long time ago. Please delete it.

Please also update this model in https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py (see the sketch below).
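For reference, the new entry would look something like this, following the _HfExamplesInfo pattern already used in tests/models/registry.py (a fragment of the existing dict, not a full file; the is_available_online flag anticipates the follow-up commit below, and the exact field names should be checked against the file):

```python
# Sketch: entry added to the text-generation section of
# tests/models/registry.py.
"ArceeForCausalLM": _HfExamplesInfo(
    "arcee-ai/AFM-4.5B-Base",
    # The checkpoint is not public yet, so skip the online check in CI.
    is_available_online=False,
),
```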
…ment
- Remove deprecated supported_lora_modules attribute
- Add ArceeForCausalLM to test registry

Signed-off-by: alyosha-swamy <raghav@arcee.ai>
vllm/model_executor/models/arcee.py
Outdated
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
}
# (No MLP prefix since there's no gate_proj)
Suggested change (delete this line):
# (No MLP prefix since there's no gate_proj)
vllm/model_executor/models/arcee.py
Outdated
self.unpadded_vocab_size += lora_config.lora_extra_vocab_size

# Import DEFAULT_VOCAB_PADDING_SIZE
from vllm.model_executor.layers.vocab_parallel_embedding import (
Why import here?

Can you fix the pre-commit failure?
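(A sketch of the likely fix: hoist the import to module scope in arcee.py, alongside the other imports, rather than importing inside __init__.)

```python
# At the top of vllm/model_executor/models/arcee.py, with the other
# module-level imports:
from vllm.model_executor.layers.vocab_parallel_embedding import (
    DEFAULT_VOCAB_PADDING_SIZE)
```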
It seems that this model is not available on HF.
If this is a private model, please either: […]
@hmellor @jeejeelee the model is set to be released as open weights on HF soon, so this PR provides day-zero support. The architecture is the same as Llama but with a ReLU² activation function.
If the architecture is the same as Llama with only the activation function changed, it will probably work with […], meaning that it's not necessary to make any changes to vLLM.
- Set is_available_online=False in test registry for CI compatibility
[New Model] Support Arcee (Arcee Foundational Models)
1. Purpose (Why this PR?)
Add inference support for Arcee Foundational Model (AFM) so that users can serve it with vLLM in both Python and API-server workflows. AFM uses a unique ReLU² activation in its MLP layers, differentiating it from standard Llama-based models.
2. Model details
3. Implementation overview
- Added ArceeForCausalLM class in vllm/model_executor/models/arcee.py with custom ArceeMLP using ReLU² activation
- Registered in _TEXT_GENERATION_MODELS in vllm/model_executor/models/registry.py
- Updated docs/models/supported_models.md with Arcee entry in the text generation table
- Reused LlamaAttention from the existing Llama implementation for attention layers

4. Performance / sanity check
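The check below can be reproduced with a minimal script along these lines (a sketch; the prompt is an assumption inferred from the completion shown):

```python
# Minimal sanity check for the new model (sketch).
from vllm import LLM, SamplingParams

llm = LLM(model="arcee-ai/AFM-4.5B-Base", trust_remote_code=True)
params = SamplingParams(max_tokens=64, temperature=0.0)
# Prompt assumed from the expected completion about the meaning of life.
out = llm.generate(["The meaning of life is"], params)
print(out[0].outputs[0].text)
```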
Expected: Coherent completion about life's meaning
Observed: " a question that has been asked throughout the history of mankind. The search for an answer to this question has inspired countless works of art, literature, and philosophy. Whether we consider the existentialist ideas of Albert Camus or the religious perspectives of spiritual leaders"
5. Test plan ✔️
- pytest tests/models/test_arcee.py
- python -c "from vllm import LLM; llm = LLM('arcee-ai/AFM-4.5B-Base')"
- vllm serve arcee-ai/AFM-4.5B-Base --trust-remote-code
- curl localhost:8000/v1/completions (fleshed out in the sketch below)
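Fleshing out the last two commands: once the server is up, the OpenAI-compatible completions endpoint can be exercised like this (a sketch using requests; port and payload are the defaults, and the prompt is an assumption):

```python
# Query the endpoint started by
# `vllm serve arcee-ai/AFM-4.5B-Base --trust-remote-code`.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "arcee-ai/AFM-4.5B-Base",
        "prompt": "The meaning of life is",  # assumed prompt
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])
```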
6. Documentation
- Updated docs/models/supported_models.md under Text Generation models
- Added ArceeForCausalLM with example model arcee-ai/AFM-4.5B-Base
Checklist
- pre-commit run --all-files (ruff formatting)
- Tests pass (pytest -q)

Notes for reviewers
The key architectural difference from standard Llama models is the MLP activation function. Arcee uses ReLU² (squared ReLU) instead of SiLU:
- ArceeMLP implements: x = torch.pow(torch.relu(x), 2)
- There is no gate projection (gate_proj), only up_proj and down_proj

The model has been tested with an internal HF repo during development, but the official model is arcee-ai/AFM-4.5B-Base.
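For concreteness, here is a plain-PyTorch sketch of the MLP described above (the actual ArceeMLP in this PR uses vLLM's tensor-parallel linear layers; the class name and shapes here are illustrative):

```python
import torch
from torch import nn


class ArceeMLPSketch(nn.Module):
    """Gate-less Llama-style MLP with ReLU² (squared ReLU) activation."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        # No gate_proj: just up and down projections.
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.up_proj(x)
        x = torch.pow(torch.relu(x), 2)  # ReLU² instead of SiLU gating
        return self.down_proj(x)
```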
Test result
All outputs are coherent and contextually appropriate.