Add support for Prithvi geospatial model in serving mode #20307

Draft: wants to merge 15 commits into main
Conversation

@mgazz commented Jul 1, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Sample implementation enabling tensor inputs to support the Prithvi geospatial model in serving mode.

Opened in support of RFC #20234.

Test Plan

See the "Try it out" section in the RFC.

Test Result

(Optional) Documentation Update

christian-pinto and others added 14 commits June 25, 2025 11:38
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
- Improved formatting around
- Made is_pooling_model a @property in ModelConfig

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
- Add Tensor modality
- Consolidate the processing of Image Embeddings and Tensors given that both use base64 encoded tensors

Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
…l modalities

Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
mergify bot added the documentation (Improvements or additions to documentation), frontend, multi-modality (Related to multi-modality, #4194), and structured-output labels on Jul 1, 2025
mergify bot commented Jul 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgazz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @mgazz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the Prithvi geospatial model in serving mode, specifically enabling it to process raw tensor inputs. This involves extending the multi-modal input handling capabilities, introducing a new 'tensors' modality, and adapting the vLLM core engine to support models that do not require a tokenizer or KV cache, such as attention-free models.

Highlights

  • Prithvi Model Integration: I've added comprehensive support for the Prithvi geospatial model in serving mode. This includes configuring it to use float16 dtype and enforce_eager=True for optimal performance, and introducing a new test case to validate its multi-modal input handling in embedding mode.
  • Raw Tensor Input Support: I've extended the multi-modal input capabilities to include a new tensors modality. This allows arbitrary tensor data, encoded as base64 strings or dictionaries of strings, to be passed directly in chat completion requests, generalizing the existing image embedding mechanism.
  • Tokenizer-less Model Support: The vLLM engine now gracefully handles models that do not require a tokenizer. This involves skipping tokenizer initialization when skip_tokenizer_init is enabled and expecting prompt_token_ids to be provided in the request's additional_data.
  • Attention-Free Model Optimization: I've introduced a DummyKVCacheManager and adapted the scheduler and core engine to bypass KV cache allocation and batch reordering for models identified as 'attention-free'. This optimizes resource usage for models like Prithvi that do not rely on a KV cache.
  • API and Internal Enhancements: I've enhanced vllm.config with new properties like model_supports_multimodal_raw_input and is_pooling_model. Additionally, multi-modal input processing has been refactored to be more generic, allowing for better handling and passing of diverse multi-modal data to the model's forward pass.
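As an illustration of the raw-tensor request flow described in the highlights above, here is a minimal client-side sketch. The payload keys (`"type": "tensors"`, `"tensors"`) and the plain base64-over-bytes encoding are assumptions based on this PR's description, not the final API schema:

```python
import base64

# Hypothetical sketch of the new "tensors" modality: a tensor is serialized
# to bytes, base64-encoded, and embedded in a chat completion request.
def encode_tensor_b64(raw: bytes) -> str:
    return base64.b64encode(raw).decode("utf-8")

raw = bytes(range(16))  # stand-in for a serialized tensor buffer
request = {
    "messages": [{
        "role": "user",
        "content": [{
            "type": "tensors",                  # new modality (assumed key)
            "tensors": encode_tensor_b64(raw),  # base64-encoded tensor bytes
        }],
    }],
}

# The server would decode this back to the original bytes before building
# the multi-modal input for the model's forward pass.
assert base64.b64decode(request["messages"][0]["content"][0]["tensors"]) == raw
```

This mirrors the existing image_embeds mechanism, which the PR generalizes to arbitrary tensors.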
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize Gemini Code Assist for your GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for the Prithvi geospatial model, which involves significant changes to handle raw tensor inputs and models without a tokenizer. The approach of introducing new interfaces like SupportsMultiModalWithRawInput and a DummyKVCacheManager is well-designed for modularity.

I've found two critical bugs in the new tensor parsing logic in vllm/entrypoints/chat_utils.py that could lead to runtime errors. These should be fixed before merging. The rest of the changes look solid and consistent with the goal of the PR.

Comment on lines 763 to 777
    if modality_str not in ["image_embeds","tensors"]:
        raise Exception("tensors are acceptable only as part "
                        "of 'image_embeds' or 'tensors' modalities.")
    if isinstance(tensor_encodings, dict):
        tensors = {
            k: self._connector.fetch_tensor_encoding(v)
            for k, v in tensor_encodings.items()
        }
        placeholder = self._tracker.add("image_embeds", embeds)
        placeholder = self._tracker.add(modality_str, tensors)

    if isinstance(image_embeds, str):
        embedding = self._connector.fetch_image_embedding(image_embeds)
        placeholder = self._tracker.add("image_embeds", embedding)
    if isinstance(tensor_encodings, str):
        tensor = self._connector.fetch_tensor_encoding(tensor_encodings)
        placeholder = self._tracker.add("image_embeds", tensor)

    self._add_placeholder(placeholder)
Contributor

critical

The parse_tensors function can raise an UnboundLocalError if tensor_encodings is neither a dict nor a str. Using an if/elif/else structure would be safer, and also ensure that the correct modality_str is used when adding the placeholder.

Suggested change:

    if modality_str not in ["image_embeds","tensors"]:
        raise Exception("tensors are acceptable only as part "
                        "of 'image_embeds' or 'tensors' modalities.")
    if isinstance(tensor_encodings, dict):
        tensors = {
            k: self._connector.fetch_tensor_encoding(v)
            for k, v in tensor_encodings.items()
        }
        placeholder = self._tracker.add(modality_str, tensors)
    elif isinstance(tensor_encodings, str):
        tensor = self._connector.fetch_tensor_encoding(tensor_encodings)
        placeholder = self._tracker.add(modality_str, tensor)
    else:
        raise TypeError(f"Unsupported type for tensor_encodings: {type(tensor_encodings)}")
    self._add_placeholder(placeholder)

Comment on lines 819 to +833

    if isinstance(image_embeds, dict):
        embeds = {
            k: self._connector.fetch_image_embedding(v)
            for k, v in image_embeds.items()
    if modality_str not in ["image_embeds","tensors"]:
        raise Exception("tensors are acceptable only as part "
                        "of 'image_embeds' or 'tensors' modalities.")
    if isinstance(tensor_encodings, dict):
        tensors = {
            k: self._connector.fetch_tensor_encoding(v)
            for k, v in tensor_encodings.items()
        }
        future.set_result(embeds)
        future.set_result(tensors)

    if isinstance(image_embeds, str):
        embedding = self._connector.\
            fetch_image_embedding(image_embeds)
        future.set_result(embedding)
    if isinstance(tensors, str):
        tensor = self._connector.\
            fetch_tensor_encoding(tensor_encodings)
        future.set_result(tensor)
Contributor

critical

This asynchronous version of parse_tensors has a critical bug. On line 830, isinstance(tensors, str) will raise a NameError if tensor_encodings is a string, because tensors is only defined in the preceding if block. This should be isinstance(tensor_encodings, str).

Additionally, similar to the synchronous version, using if/elif/else would make the logic more robust.

Suggested change:

    if modality_str not in ["image_embeds","tensors"]:
        raise Exception("tensors are acceptable only as part "
                        "of 'image_embeds' or 'tensors' modalities.")
    if isinstance(tensor_encodings, dict):
        tensors = {
            k: self._connector.fetch_tensor_encoding(v)
            for k, v in tensor_encodings.items()
        }
        future.set_result(tensors)
    elif isinstance(tensor_encodings, str):
        tensor = self._connector.\
            fetch_tensor_encoding(tensor_encodings)
        future.set_result(tensor)
    else:
        raise TypeError(f"Unsupported type for tensor_encodings: {type(tensor_encodings)}")

github-actions bot commented Jul 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which starts with a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

            placeholder = self._tracker.add("image_embeds", embedding)
        if isinstance(tensor_encodings, str):
            tensor = self._connector.fetch_tensor_encoding(tensor_encodings)
            placeholder = self._tracker.add("image_embeds", tensor)
Contributor

Did you mean modality_str here instead of "image_embeds"?

Author

Good catch! I'll fix this with the next commit.

@@ -43,6 +44,9 @@ def _get_data(
        pt_float32 = output.data.to(dtype=torch.float32)
        pooling_bytes = np.array(pt_float32, dtype="float32").tobytes()
        return base64.b64encode(pooling_bytes).decode("utf-8")
    elif encoding_format == "tensor":
        tensor_encoding_io = ImageEmbeddingMediaIO()
        return tensor_encoding_io.encode_base64(output.data)
Contributor

I'm not sure I understand this. The pooler output is a torch.Tensor but ImageEmbeddingMediaIO.encode_base64 expects an image. How does this work?

Author

I highlighted this in the RFC under "Additional features explored to enable tensors".

ImageMediaIO expects an image, while ImageEmbeddingMediaIO expects tensors. I considered reusing ImageEmbeddingMediaIO to encode the tensor output, but I had a hard time reconstructing the tensor on the user side when it was encoded via encode_base64; that is why I added encode_tensor.

If this makes things too confusing, we can add a dedicated class such as TensorIO that performs the encoding, to keep separation of concerns.

@mgazz (Author) commented Jul 2, 2025

Update: I ran a couple of tests using ImageEmbeddingMediaIO and the behaviour is strange.

Here is an example:

from vllm.multimodal.image import ImageEmbeddingMediaIO
import torch

pixel_values = torch.full((6, 512, 512), 1.0, dtype=torch.float16)
image_embeds_media_io = ImageEmbeddingMediaIO()
encoded = image_embeds_media_io.encode_base64(pixel_values)
decoded = image_embeds_media_io.load_base64("", encoded)

Here is the error:

(myenv) mgazz@mgazz-vllm-devpod-6c47989df9-hstsz:~/vllm$ python test.py 
Traceback (most recent call last):
  File "/workspace/vllm/test.py", line 7, in <module>
    decoded = image_embeds_media_io.load_base64("",encoded)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/multimodal/image.py", line 91, in load_base64
    return self.load_bytes(base64.b64decode(data))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/multimodal/image.py", line 88, in load_bytes
    return torch.load(buffer, weights_only=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/myenv/lib/python3.12/site-packages/torch/serialization.py", line 1548, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 0

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

If we encode via encode_tensor, which is part of this PR, it works fine:

from vllm.multimodal.image import ImageEmbeddingMediaIO
import torch

pixel_values = torch.full((6, 512, 512), 1.0, dtype=torch.float16)
image_embeds_media_io = ImageEmbeddingMediaIO()
encoded = image_embeds_media_io.encode_tensor(pixel_values)
decoded = image_embeds_media_io.load_base64("", encoded)
print(type(decoded))

This looks like a bug in the current implementation. Maybe the solution is to update encode_base64 with the implementation used in encode_tensor.

Author

I also see Gemini is suggesting against using torch.save ... #20307 (comment)

Author

I opened a separate issue on this: #20427

…rompt.

Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
@mgazz (Author) commented Jul 2, 2025

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the Prithvi geospatial model, which is a significant step towards handling models with raw tensor inputs. The changes are extensive and well-structured, particularly the introduction of the SupportsMultiModalWithRawInput interface and the DummyKVCacheManager for attention-free models.

My review focuses on the new data paths for tensor inputs. I've found a couple of critical issues in vllm/entrypoints/chat_utils.py related to reference-before-assignment that could lead to runtime errors. I've also pointed out a potential issue with the serialization method in vllm/multimodal/image.py and some minor formatting issues.

Overall, this is a great contribution. Once the identified issues are addressed, the PR should be in good shape.

Comment on lines +760 to 778
    def parse_tensors(self,
                      tensor_encodings: Union[str, dict[str, str]],
                      modality_str: ModalityStr) -> None:
        if modality_str not in ["image_embeds","tensors"]:
            raise Exception("tensors are acceptable only as part "
                            "of 'image_embeds' or 'tensors' modalities.")
        if isinstance(tensor_encodings, dict):
            tensors = {
                k: self._connector.fetch_tensor_encoding(v)
                for k, v in tensor_encodings.items()
            }
            placeholder = self._tracker.add("image_embeds", embeds)
            placeholder = self._tracker.add(modality_str, tensors)

        if isinstance(image_embeds, str):
            embedding = self._connector.fetch_image_embedding(image_embeds)
            placeholder = self._tracker.add("image_embeds", embedding)
        if isinstance(tensor_encodings, str):
            tensor = self._connector.fetch_tensor_encoding(tensor_encodings)
            placeholder = self._tracker.add(modality_str, tensor)

        self._add_placeholder(placeholder)

Contributor

critical

There's a potential UnboundLocalError in parse_tensors. If tensor_encodings is neither a dict nor a str, placeholder will not be assigned before being used in self._add_placeholder(placeholder).

To fix this, add an else clause that raises a TypeError for unsupported types and ensure placeholder is always assigned.
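For reference, a minimal standalone reproduction of the reference-before-assignment pattern described above (simplified; this is not the PR's actual code):

```python
def parse(value):
    # Mirrors the buggy shape: the local is only assigned inside the
    # isinstance branches, so an unhandled type reaches the final reference
    # with the variable never bound.
    if isinstance(value, dict):
        placeholder = "dict"
    if isinstance(value, str):
        placeholder = "str"
    return placeholder  # UnboundLocalError when value is neither

try:
    parse(42)
except UnboundLocalError as exc:
    print(f"caught: {exc}")
```

An if/elif/else chain with a raising else branch guarantees `placeholder` is bound on every path that reaches the final statement.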

Suggested change:

    def parse_tensors(self,
                      tensor_encodings: Union[str, dict[str, str]],
                      modality_str: ModalityStr) -> None:
        if modality_str not in ["image_embeds", "tensors"]:
            raise Exception("tensors are acceptable only as part "
                            "of 'image_embeds' or 'tensors' modalities.")
        if isinstance(tensor_encodings, dict):
            tensors = {
                k: self._connector.fetch_tensor_encoding(v)
                for k, v in tensor_encodings.items()
            }
            placeholder = self._tracker.add(modality_str, tensors)
        elif isinstance(tensor_encodings, str):
            tensor = self._connector.fetch_tensor_encoding(tensor_encodings)
            placeholder = self._tracker.add(modality_str, tensor)
        else:
            raise TypeError(f"Unsupported type for tensor_encodings: {type(tensor_encodings)}")
        self._add_placeholder(placeholder)

Comment on lines +815 to 836
    def parse_tensors(self,
                      tensor_encodings: Union[str, dict[str, str]],
                      modality_str: ModalityStr) -> None:
        future: asyncio.Future[Union[str, dict[str, str]]] = asyncio.Future()

        if isinstance(image_embeds, dict):
            embeds = {
                k: self._connector.fetch_image_embedding(v)
                for k, v in image_embeds.items()
        if modality_str not in ["image_embeds","tensors"]:
            raise Exception("tensors are acceptable only as part "
                            "of 'image_embeds' or 'tensors' modalities.")
        if isinstance(tensor_encodings, dict):
            tensors = {
                k: self._connector.fetch_tensor_encoding(v)
                for k, v in tensor_encodings.items()
            }
            future.set_result(embeds)
            future.set_result(tensors)

        if isinstance(image_embeds, str):
            embedding = self._connector.\
                fetch_image_embedding(image_embeds)
            future.set_result(embedding)
        if isinstance(tensors, str):
            tensor = self._connector.\
                fetch_tensor_encoding(tensor_encodings)
            future.set_result(tensor)

        placeholder = self._tracker.add("image_embeds", future)
        placeholder = self._tracker.add(modality_str, future)
        self._add_placeholder(placeholder)
Contributor

high

In parse_tensors, there's a potential UnboundLocalError. The tensors variable is only assigned within the if isinstance(tensor_encodings, dict): block. If tensor_encodings is a string, tensors will not be defined, leading to an error when used in isinstance(tensors, str).

To fix this, use tensor_encodings instead of tensors in the isinstance check.

        if isinstance(tensor_encodings, dict):
            tensors= {
                k: self._connector.fetch_tensor_encoding(v)
                for k, v in tensor_encodings.items()
            }
            future.set_result(tensors)

        if isinstance(tensor_encodings, str):
            tensor= self._connector.\
                fetch_tensor_encoding(tensor_encodings)
            future.set_result(tensor)

Comment on lines +99 to +106
    # currently not used but it makes it easy
    # for users to reconstruct the result tensor without knowledge of the array shape
    def encode_tensor(self, media: torch.Tensor) -> str:
        buffer_tiff = BytesIO()
        torch.save(media.data, buffer_tiff)
        buffer_tiff.seek(0)
        binary_data = buffer_tiff.read()
        return base64.b64encode(binary_data).decode('utf-8')
Contributor

medium

The encode_tensor function uses torch.save for serialization, which relies on Python's pickle module. This is generally discouraged for data interchange due to security risks and potential versioning issues. Consider using a more robust serialization method like saving the tensor's raw data, shape, and dtype separately using numpy.save or writing its raw bytes.
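A sketch of the pickle-free alternative suggested here: serialize dtype, shape, and raw bytes explicitly instead of going through torch.save. The function names and the length-prefixed JSON header layout are hypothetical; with torch, `raw` would come from something like `tensor.numpy().tobytes()`:

```python
import base64
import json

# Hypothetical pickle-free tensor encoding: a JSON header carrying dtype and
# shape, prefixed with its 4-byte big-endian length, followed by the raw
# buffer, with base64 over the whole payload.
def encode_tensor_safe(raw: bytes, shape: tuple, dtype: str) -> str:
    header = json.dumps({"dtype": dtype, "shape": list(shape)}).encode("utf-8")
    payload = len(header).to_bytes(4, "big") + header + raw
    return base64.b64encode(payload).decode("utf-8")

def decode_tensor_safe(encoded: str):
    payload = base64.b64decode(encoded)
    hlen = int.from_bytes(payload[:4], "big")
    meta = json.loads(payload[4:4 + hlen])
    return payload[4 + hlen:], tuple(meta["shape"]), meta["dtype"]

raw = b"\x00\x3c" * 6  # stand-in for six float16 elements
enc = encode_tensor_safe(raw, (2, 3), "float16")
data, shape, dtype = decode_tensor_safe(enc)
assert (data, shape, dtype) == (raw, (2, 3), "float16")
```

Because nothing here goes through pickle, the receiver can rebuild the tensor with something like torch.frombuffer(data, dtype=...).reshape(shape), avoiding the weights_only concerns discussed above.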

mergify bot added the new-model (Requests to new models) label on Jul 10, 2025
Labels: documentation (Improvements or additions to documentation), frontend, multi-modality (Related to multi-modality, #4194), needs-rebase, new-model (Requests to new models), structured-output, v1
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants