Skip to content

Question about the action Token and image augmentation #8

@Tschian

Description

@Tschian

action_token_out = transformer_out[:, :, 0, :].

Hello, i don't know why directly take the first dimension of the output as the action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs feature. Would the dimension be changed when they pass through the transformer_decoder?

In addition, about the image augmentation (padding + random_crop), how many crops did you take? I saw around the code, only take the default value: num_crops=1. Doesn't the global feature really get lost if there is only one? Because i saw your code, the feature map is extracted from the cropped image.

Could you help me figure out why and how? Thanks a lot

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions