
Interpolate positional embeddings for input images with larger sizes #416

Open
EeroHeikkinen wants to merge 2 commits into mlfoundations:main from EeroHeikkinen:main


@EeroHeikkinen

This PR interpolates the positional embeddings so that the model accepts input images with resolutions larger than the one it was trained at.
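For context, a rough sketch of why a larger input needs different positional embeddings: a ViT has one positional embedding per image patch, plus one for the class token, so the number of positions grows with the resolution. The helper name `num_position_tokens` is hypothetical, and the patch size of 32 and trained resolution of 224 are assumptions based on the ViT-B-32 vision tower named in the example.

```python
def num_position_tokens(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Count the positional embeddings a ViT needs for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid + (1 if class_token else 0)


# Assumed values: patch size 32, trained at 224, evaluated at 448.
trained_tokens = num_position_tokens(224, 32)  # 7 * 7 patches + 1 class token
larger_tokens = num_position_tokens(448, 32)   # 14 * 14 patches + 1 class token
print(trained_tokens, larger_tokens)
```

Under these assumptions the pretrained table covers a 7x7 grid, while a 448-pixel input produces a 14x14 grid, so the table has to be resized before it can be added to the patch tokens.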

Simple example:

import torch
from PIL import Image
import open_clip

model, _, _ = open_clip.create_model_and_transforms('xlm-roberta-base-ViT-B-32', pretrained='laion5b_s13b_b90k')
preprocess = open_clip.image_transform(
    448,
    is_train=False,
    mean=None,
    std=None,
)
tokenizer = open_clip.get_tokenizer('xlm-roberta-base-ViT-B-32')

image = preprocess(Image.open("file.jpg")).unsqueeze(0)
text = tokenizer(["a penguin", "vanilla ice cream", "a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
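The interpolation itself is not shown in this description. The following is a minimal sketch of one common way to resize ViT positional embeddings, not necessarily what this PR does. The function name `resize_pos_embed`, the choice of bicubic mode, and the assumed layout (one class-token row followed by a square grid of patch rows, each of length `width`) are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int, num_prefix_tokens: int = 1) -> torch.Tensor:
    """Resize a (num_prefix_tokens + old_grid**2, width) positional embedding table.

    Hypothetical helper: prefix tokens (e.g. the class token) are kept as-is,
    and only the patch-grid rows are interpolated to new_grid x new_grid.
    """
    prefix = pos_embed[:num_prefix_tokens]
    patch = pos_embed[num_prefix_tokens:]
    old_grid = int(round(patch.shape[0] ** 0.5))
    if old_grid * old_grid != patch.shape[0]:
        raise ValueError("expected a square patch grid")
    width = patch.shape[1]

    # (old_grid**2, width) -> (1, width, old_grid, old_grid), the layout F.interpolate expects
    patch = patch.reshape(1, old_grid, old_grid, width).permute(0, 3, 1, 2)
    patch = F.interpolate(patch, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    # Back to (new_grid**2, width)
    patch = patch.permute(0, 2, 3, 1).reshape(new_grid * new_grid, width)
    return torch.cat([prefix, patch], dim=0)


# Stand-in for a pretrained table: 7x7 grid plus class token, width 768 (assumed ViT-B values).
old_embed = torch.randn(1 + 7 * 7, 768)
new_embed = resize_pos_embed(old_embed, new_grid=14)
print(new_embed.shape)
```

Leaving the class-token row untouched is a design choice in this sketch: that row does not correspond to a spatial location, so only the patch rows are treated as a 2D grid.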
