Multimodal (vision) support #227
base: main
Conversation
if self._indexed_dataset.has_images and self._truncate_documents:
    raise RuntimeError(
        "Truncating documents with images is not yet supported. Please turn off truncation to use images."
    )
what does that mean in practice? documents with images that are longer than the sequence length are discarded?
     # Calculate basic stats.
     if not self._truncate_documents:
         assert _extension_available, (
             "The C++ extension for dataset sampling is missing."
             " Please make sure Fast-LLM is installed correctly."
         )
-        long_docs_filter = document_sizes > self._parameters.sequence_length + 1
+        long_docs_filter = document_sizes + image_token_sizes > self._parameters.sequence_length + 1
         ignored_documents = long_docs_filter.sum().item()
I guess yes, long docs with images will be ignored. ok
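For readers following the thread, here is a tiny standalone sketch of the filter from the hunk above; the tensor values and sequence length are made up for illustration and are not taken from the codebase:

```python
import torch

document_sizes = torch.tensor([512, 4096, 1024])
image_token_sizes = torch.tensor([0, 256, 8192])
sequence_length = 4096

# Documents whose text tokens plus image tokens cannot fit in one sample are
# dropped entirely (no truncation), matching the discussion above.
long_docs_filter = document_sizes + image_token_sizes > sequence_length + 1
print(long_docs_filter.sum().item())  # 2 documents ignored
```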
Looks good, but I'm worried about the added complexity. Main suggestions:
- Make vision into a separate model.
- Replace the dim and kwarg name changes with simpler alternatives.
- Break down some methods that have grown too big to be properly understandable.
- Avoid abbreviations when possible so names are self-explanatory for everyone.
@@ -133,24 +141,48 @@ def _sample(self) -> None:
     Create a `GPTSampledDataset` with the requested parameters.
     """
     # Get the document sizes, the main information needed for sampling.
-    document_sizes = torch.from_numpy(self._indexed_dataset.get_document_sizes()).to(self._device)
+    document_sizes, image_sizes = self._indexed_dataset.get_document_sizes()
`sample` is getting way too long and complicated to follow. Can we please break it down by step and/or feature? Same for `__getitem__`.
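As a purely illustrative sketch of what such a breakdown could look like (all class and helper names below are hypothetical, not Fast-LLM code), the idea is a top-level method that reads as a short list of named steps:

```python
import numpy as np


class SamplingSketch:
    """Hypothetical decomposition: the top-level method reads as a list of steps."""

    def __init__(self, sequence_length: int) -> None:
        self._sequence_length = sequence_length

    def sample(self, document_sizes: np.ndarray, image_token_sizes: np.ndarray) -> np.ndarray:
        keep = self._filter_long_documents(document_sizes, image_token_sizes)
        return self._build_document_index(keep)

    def _filter_long_documents(self, document_sizes: np.ndarray, image_token_sizes: np.ndarray) -> np.ndarray:
        # Keep only documents whose text + image tokens fit in a sample.
        return document_sizes + image_token_sizes <= self._sequence_length + 1

    def _build_document_index(self, keep: np.ndarray) -> np.ndarray:
        # Placeholder for the shuffling / index-writing step.
        return np.flatnonzero(keep)


print(SamplingSketch(8).sample(np.array([4, 10, 6]), np.array([0, 0, 4])))  # [0]
```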
self._config = config
self._tensor_space = tensor_space
# TODO Soham: fix assert
?
self._head_groups = self._tensor_space.get_tensor_dim(TransformerDimNames.head_groups).global_size
self._local_head_groups = self._tensor_space.get_tensor_dim(TransformerDimNames.head_groups).size
self._local_heads_per_group = self._tensor_space.get_tensor_dim(TransformerDimNames.group_heads).size
self._kv_channels = self._tensor_space.get_tensor_dim(self._transformer_dim_names.kv_channels).size
Why do we need this?
super().__init__(config, tensor_space)

# @torch.compile
def _forward(
Can't this use `super()._forward` instead of copying it?
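A minimal, self-contained sketch of the suggested pattern; the class and method names are illustrative stand-ins, not the actual Fast-LLM classes:

```python
import torch


class BaseLayer(torch.nn.Module):
    """Illustrative stand-in for the parent class."""

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, hidden_size)

    def _forward(self, input_: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.gelu(self.linear(input_), approximate="tanh")


class VisionLayer(BaseLayer):
    def _forward(self, input_: torch.Tensor) -> torch.Tensor:
        # Delegate to the shared implementation instead of copying its body;
        # only vision-specific behaviour would be added around this call.
        return super()._forward(input_)


print(VisionLayer(8)._forward(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```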
    dtype=self._distributed_config.training_dtype.torch,
)

def preprocess(self, tokens, kwargs: dict[str, typing.Any]) -> None:
Can this be broken down a bit?
@@ -71,6 +71,17 @@ class DiffusionLlamaGPTHuggingfaceCheckpointFormat(GPTHuggingfaceCheckpointFormat):
    trust_remote_code: typing.ClassVar[bool] = True


class LlavaGPTHuggingfaceCheckpointFormat(GPTHuggingfaceCheckpointFormat):
I don't understand, is all that added complexity just so we can call it `llava` instead of either `pixtral` or `mistral`?
@@ -63,6 +70,10 @@ def __init__(
    if self._config.enable_dpo:  # TODO better way to pass in?
        self._preprocessors.append(PreferenceSpanPreprocessor(self._config, self._tensor_space))

    if self._config.vision_encoder.enabled:
I think this would be more appropriate as a separate model, like we did for SSM
No, this is intentionally part of the base GPT model. It ensures that multimodal support is transparent and inherited by all model variants (including SSM) without needing separate parallel class hierarchies. We're not introducing architectural silos unless there's a concrete need.
That's an option, but it comes with important drawbacks. Fast-LLM models are not designed to support more than one task. We already pushed `GPT` far beyond what a model should do with hacks and patches, and things are quickly getting difficult to manage, especially when it comes to preprocessing. Keeping everything in the same model means we have to break things down, modularize and simplify. We also need to ensure the PR has no side effects on non-vision models, which is difficult to do with the current implementation.
In short, keeping vision in the same model makes this PR significantly more difficult to merge. Either way, I don't think SSM support is an obstacle, because SSMs need to be integrated into the GPT model, and that's a lot easier and safer to do.
@@ -47,6 +49,10 @@ class LanguageModelBaseConfig(BaseModelConfig):
        desc="Configuration for the transformer architecture.",
        hint=FieldHint.architecture,
    )
    vision_encoder: VisionEncoderConfig = Field(
This shouldn't be part of a "language model" config. How about adding this in a separate `Vision` model instead?
I disagree, see also above. The key advantages of integrating multimodal support directly into the GPT base model are:
- Seamless transition between text-only and multimodal models.
- Automatic inheritance of multimodal support by the existing SSM subclasses, without extra complexity or maintenance.
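A minimal sketch of the config-side argument, using hypothetical dataclasses rather than the actual Fast-LLM config classes: a disabled-by-default vision section leaves text-only configs untouched.

```python
import dataclasses


@dataclasses.dataclass
class VisionEncoderConfigSketch:
    # Hypothetical stand-in for VisionEncoderConfig.
    enabled: bool = False
    patch_size: int = 16


@dataclasses.dataclass
class LanguageModelConfigSketch:
    # Hypothetical stand-in for LanguageModelBaseConfig.
    hidden_size: int = 4096
    # Text-only configs never touch this field, so existing models (including
    # the SSM subclasses) behave as before unless vision is explicitly enabled.
    vision_encoder: VisionEncoderConfigSketch = dataclasses.field(
        default_factory=VisionEncoderConfigSketch
    )


config = LanguageModelConfigSketch()
assert not config.vision_encoder.enabled  # text-only by default
```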
@classmethod
def _create_config_converters(cls) -> list[ParamConverter]:
    cls.architecture = "MistralForCausalLM"
Why are the `architecture` classvars modified here?
@@ -137,7 +137,7 @@ def backward(
     assert self._mode.support_backward
     input_, output = grad_context
     output.backward(output_grad)
-    return input_.grad
+    return input_.grad if input_.grad is not None else torch.zeros_like(input_)
That's not a good idea; it will cause unnecessary operations in the common case where they're not needed. Why is it needed?
@@ -67,7 +68,8 @@ def _set_activation_fn_map() -> None:
     global _ACTIVATION_FN_MAP

     _ACTIVATION_FN_MAP = {
-        ActivationType.gelu: lambda x: torch.nn.functional.gelu(x, approximate="tanh"),
+        ActivationType.gelu: torch.nn.functional.gelu,
This is incorrect, Fast-LLM uses the tanh version.
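For reference, a quick standalone check (not Fast-LLM code) that the exact and tanh-approximate GELU are not interchangeable:

```python
import torch

x = torch.linspace(-3, 3, steps=7)
exact = torch.nn.functional.gelu(x)                       # erf-based GELU
approx = torch.nn.functional.gelu(x, approximate="tanh")  # tanh approximation
print((exact - approx).abs().max())  # non-zero: swapping them changes model outputs
```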
@@ -19,12 +19,26 @@ def get_document_sizes(self) -> np.ndarray:
     and derived classes should try to avoid holding the whole array in memory.
     """

     @abc.abstractmethod
I separated this because there were several issues with the previous version.
@@ -1,8 +1,10 @@
 import io
Reworked this file so the various concepts are properly compartmentalized. The next step would be to extract those components into separate classes; that's not strictly needed yet, but it will become necessary if we keep adding stuff.
✨ Description
Multi-modal support, starting with pixtral's vision encoder
Type of change
Select all that apply:
Changes
List the key changes introduced in this PR:
- `prepare` now optionally takes images and image positions (where they should appear in the text). Images are stored in the memmap file along with text.
- `GPTBaseModel` optionally has a vision encoder attached (it could have audio/video encoders in the future), which consists of a conv2D, a vision transformer, and an MLP adapter.
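For orientation, here is a rough, self-contained sketch of that pipeline (patch conv2D → vision transformer → MLP adapter). The shapes, layer counts, and hyperparameters are illustrative assumptions, not the actual implementation:

```python
import torch


class VisionEncoderSketch(torch.nn.Module):
    def __init__(self, patch_size: int = 16, vision_dim: int = 256, text_dim: int = 512) -> None:
        super().__init__()
        # Conv2D turns the image into a sequence of patch embeddings.
        self.patch_conv = torch.nn.Conv2d(3, vision_dim, kernel_size=patch_size, stride=patch_size)
        # A small vision transformer over the patch sequence.
        layer = torch.nn.TransformerEncoderLayer(d_model=vision_dim, nhead=4, batch_first=True)
        self.transformer = torch.nn.TransformerEncoder(layer, num_layers=2)
        # MLP adapter projects vision features into the language model's hidden size.
        self.adapter = torch.nn.Sequential(
            torch.nn.Linear(vision_dim, text_dim),
            torch.nn.GELU(approximate="tanh"),
            torch.nn.Linear(text_dim, text_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        patches = self.patch_conv(images)            # (batch, vision_dim, h/p, w/p)
        tokens = patches.flatten(2).transpose(1, 2)  # (batch, num_patches, vision_dim)
        return self.adapter(self.transformer(tokens))


# The resulting image tokens would be spliced into the text sequence at the
# stored image positions before the language model consumes it.
print(VisionEncoderSketch()(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 16, 512])
```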