
[Feat] Adding Intern-S1 #39722


Open · wants to merge 14 commits into main

Conversation

@hhaAndroid commented Jul 28, 2025

Adding Intern-S1

This PR adds support for the Intern-S1 models. Please visit https://huggingface.co/internlm/Intern-S1

Features

  • Strong performance across language and vision reasoning benchmarks, especially scientific tasks.
  • Continuously pretrained on a massive 5T token dataset, with over 50% specialized scientific data, embedding deep domain expertise.
  • Dynamic tokenizer enables native understanding of molecular formulas, protein sequences, and seismic signals.

Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_checkpoint = 'xxxx'  # e.g. the released checkpoint from https://huggingface.co/internlm/Intern-S1
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="auto", torch_dtype="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Please briefly describe the image."},
        ],
    }
]

# Tokenize the chat, move it to the model's device, and cast the floating-point
# tensors (the pixel values) to bfloat16.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=32768)
# Decode only the newly generated tokens, skipping the prompt.
decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(decoded_output)

Progress

  • add modeling.py
  • add tokenizer.py
  • add test
  • fix lint

@Rocketknight1 (Member)

cc @zucchini-nlp for VLMs!

@zucchini-nlp (Member)

Taking a look tomorrow

@zucchini-nlp (Member) left a comment

Hey, sorry for the late review, I got caught up in another model release.

The model looks very much like InternVL, and I want us to re-use as much code as possible with modular. In the long term that will make maintenance easier for us, and the review process is much faster when we can spot the differences between models. I left comments below about which classes can be re-used from where.

Feel free to tag me when it is ready for re-review or if you need any assistance :)

Comment on lines +969 to +973
if self.is_moe_model:
    output_router_logits = (
        output_router_logits if output_router_logits is not None else self.config.text_config.output_router_logits
    )
    kwargs['output_router_logits'] = output_router_logits
Member

Will there be model checkpoints which are MoE and others that are not? If yes, for the non-MoE ones we can probably use the InternVL class; it looks identical to me.

Otherwise we need to write the InternS1 model code as MoE-only.

Author

This is a great idea, but we hope that InternS1 can serve as a unified model, supporting not only N-vision + dense but also N-vision + MoE. I’m not sure whether Hugging Face’s guidelines strictly require dense and MoE to be separated into two model folders. Looking forward to your reply! @zucchini-nlp

Member

The transformers philosophy is to add support for an architecture only when there is an official pre-trained checkpoint for it. So if InternS1 has both MoE and dense checkpoints, we'd need to support both. That can be done by having separate InternS1Moe and InternS1Dense decoder layers; imo that is preferable.

Otherwise let's just add the architecture to support the released checkpoints.

Author

Thank you very much for your reply. InternS1 will have two sets of weights: ViT+235B MoE and ViT+8B MoE. Currently, they can be automatically constructed via AutoModel.from_config(config.text_config). The LLM module directly reuses the LLM models already supported in Transformers, so we don't need to distinguish between dense and MoE layers.

In this case, what would be the most reasonable way to provide support?

Member

Oh cool, then it makes everything much easier if we can just instantiate an existing LLM from transformers. All we need to do is make sure the configs are saved with the correct model_type and that the code calls AutoModel.from_config(config.text_config).

I see that is_moe_model is needed only to decide on output_router_logits. Actually, we can use the can_record_outputs attribute, which handles all extra model outputs (see for example Cohere2Vision). It can be a bit tricky for multimodal, so let me know if you need help with it.
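For illustration, a minimal sketch of that pattern (the wrapper class and attribute names are assumptions, loosely following this PR):

from transformers import AutoModel

class InternS1Model(InternS1PreTrainedModel):  # hypothetical composite wrapper
    def __init__(self, config):
        super().__init__(config)
        self.vision_tower = InternS1VisionModel(config.vision_config)
        # config.text_config.model_type decides which LLM gets built,
        # so dense and MoE checkpoints can share this one wrapper class
        self.language_model = AutoModel.from_config(config.text_config)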

Author

Thank you!

Member

I will take a look at can_record_outputs, as using it is much cleaner than checking config values at init time

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, interns1

@hhaAndroid (Author)

@zucchini-nlp Hello, I've pushed a revised version as requested. However, regarding the usage of can_record_outputs, I'm unsure whether it fits my scenario: after adapting it for the MoE model, I need to pass the output_router_logits parameter into the MoE LLM, rather than just capture the output results. Looking forward to your next round of review comments.

@zucchini-nlp (Member) left a comment

Super clean after using modular, thanks for iterating! There are some bits left, especially moving all the code into the modular file. We shouldn't be importing from other models unless it is in the modular file :)

Also, I will take a look at the can_record_outputs thing this week, would be nice to get it sorted

Update: Oh btw, let's make CI green and fix the failing tests. You might need to rebase if unrelated tests are failing

Comment on lines +21 to +25

class InternS1VisionConfig(InternVLVisionConfig):
    r"""
    This is the configuration class to store the configuration of a [`InternS1VisionModel`]. It is used to instantiate
    an InternS1VisionModel model according to the specified arguments, defining the model architecture. Instantiating a
Member

We need to move this to the modular file. Inheriting from other models outside of it is not allowed and goes against the transformers one model, one file philosophy.
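In other words, a one-liner in the modular file should be enough (a sketch, following the same pattern suggested for the processor further below; modular expands it into the full standalone config and handles the renaming):

# modular_interns1.py
class InternS1VisionConfig(InternVLVisionConfig):
    pass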

Comment on lines +57 to +70

class InternS1VisionRMSNorm(InternVLVisionRMSNorm):
    pass


class InternS1VisionAttention(InternVLVisionAttention):
    pass


@auto_docstring
class InternS1VisionPreTrainedModel(InternVLVisionPreTrainedModel):
    config: InternS1VisionConfig


Member

Looks perfect and much less code 🤩

NORM2FN = {"layer_norm": nn.LayerNorm, "rms_norm": InternS1VisionRMSNorm}


class InternS1VisionLayer(GradientCheckpointingLayer):
Member

I meant that the only difference is the drop_path, so it can be inherited and we only write out the new drop_path modules.

For example, this is how it was done for the attention module with new QK-Norm layers:

class Qwen3Attention(LlamaAttention):
    def __init__(self, config: Qwen3Config, layer_idx: int):
        super().__init__(config, layer_idx)
        self.q_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # unlike olmo, only on the head dim!
        self.k_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # thus post q_norm does not need reshape
        self.sliding_window = config.sliding_window if config.layer_types[layer_idx] == "sliding_attention" else None

    @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
    def forward(
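Applied to this layer, a hedged sketch could look like the following (InternS1VisionDropPath and the drop_path_rate argument are illustrative, not the PR's final code):

import torch.nn as nn

class InternS1VisionLayer(InternVLVisionLayer):
    def __init__(self, config: InternS1VisionConfig, drop_path_rate: float = 0.0):
        super().__init__(config)
        # only the new stochastic-depth modules are written out; the forward
        # that applies them would be overridden too, everything else is inherited
        self.drop_path1 = InternS1VisionDropPath(drop_path_rate) if drop_path_rate > 0.0 else nn.Identity()
        self.drop_path2 = InternS1VisionDropPath(drop_path_rate) if drop_path_rate > 0.0 else nn.Identity()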

        return layer_output, attention_weights


class InternS1VisionEncoder(nn.Module):
Member

The forward is identical, so we can inherit and override the init only.
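A sketch of that suggestion (the self.layer attribute name and the config.drop_path_rate field are assumptions):

import torch
import torch.nn as nn

class InternS1VisionEncoder(InternVLVisionEncoder):
    def __init__(self, config: InternS1VisionConfig):
        super().__init__(config)
        # rebuild the layers with a linearly increasing drop-path rate and
        # reuse the inherited forward unchanged
        dpr = torch.linspace(0, config.drop_path_rate, config.num_hidden_layers).tolist()
        self.layer = nn.ModuleList(
            [InternS1VisionLayer(config, drop_path_rate=dpr[i]) for i in range(config.num_hidden_layers)]
        )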



@auto_docstring
class InternS1VisionModel(InternS1VisionPreTrainedModel):
Member

Comment not addressed; this can be copied from InternVLVisionModel.

Comment on lines +353 to +370
if input_ids is None:
    special_image_mask = inputs_embeds == self.get_input_embeddings()(
        torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
    )
    special_image_mask = special_image_mask.all(-1)
else:
    special_image_mask = input_ids == self.config.image_token_id

n_image_tokens = special_image_mask.sum()
special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)

if not is_torchdynamo_compiling() and inputs_embeds[special_image_mask].numel() != image_features.numel():
    n_image_features = image_features.shape[0] * image_features.shape[1]
    raise ValueError(
        f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
    )
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
Member

This was recently refactored into the get_placeholder_mask helper method in all VLMs. Can you update here as well?
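For reference, the refactored pattern in other VLMs looks roughly like this (a hedged sketch; worth double-checking the helper's current signature in transformers):

# get_placeholder_mask builds the expanded mask and raises on a token/feature count mismatch
special_image_mask = self.get_placeholder_mask(
    input_ids, inputs_embeds=inputs_embeds, image_features=image_features
)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)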

Comment on lines +969 to +973
if self.is_moe_model:
    output_router_logits = (
        output_router_logits if output_router_logits is not None else self.config.text_config.output_router_logits
    )
    kwargs['output_router_logits'] = output_router_logits
Member

I will take a look at can_record_outputs, as using it is much cleaner than checking config values at init time

    self,
    images: Optional[ImageInput] = None,
    text: Optional[Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]]] = None,
    audio=None,
Member

audio is not needed; we decided not to add unused modalities to the call signature

Comment on lines +46 to +48
# TODO: It will support temporal information processing in the future.
class InternS1Processor(InternVLProcessor):
    r"""
Member

If there is inheritance, it has to be defined in the modular file. AFAIU the diff is meant to be in the docstring only? We can let modular handle all the re-naming, and simply writing it as below will be enough:

# modular_interns1.py
class InternS1Processor(InternVLProcessor):
    pass

Comment on lines +24 to +34
class InternS1VideoProcessorInitKwargs(VideosKwargs):
    initial_shift: Union[bool, float, int]


@requires(backends=("torchvision",))
class InternS1VideoProcessor(InternVLVideoProcessor):
    valid_kwargs = InternS1VideoProcessorInitKwargs

    def __init__(self, **kwargs: Unpack[InternS1VideoProcessorInitKwargs]):
        super().__init__(**kwargs)

Member

Same here; only defining in modular that the video processor is identical will copy everything else for you:

class InternS1VideoProcessor(InternVLVideoProcessor):
    pass
