Question about the action token and image augmentation #8

action_token_out = transformer_out[:, :, 0, :].

Hello, I don't understand why index 0 along the third axis of the output is taken directly as action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs feature. Does the order or the dimension layout change when the tokens pass through the transformer_decoder?
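To make the question concrete, here is a minimal NumPy sketch of what that slice selects, assuming the grouping order I described. The sizes, the variable names, and the toy `self_attention` function are hypothetical and not taken from the repository; the toy layer only illustrates that an attention layer returns one output per input position.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
B, T, D = 2, 3, 4                        # batch, timesteps, embedding width
n_spatial, n_region, n_other = 5, 1, 2   # tokens per group (assumed)

rng = np.random.default_rng(0)
spatial = rng.normal(size=(B, T, n_spatial, D))
region = rng.normal(size=(B, T, n_region, D))
action = rng.normal(size=(B, T, 1, D))
other = rng.normal(size=(B, T, n_other, D))

# Order as described in the question: spatial + region + action + other.
grouped = np.concatenate([spatial, region, action, other], axis=2)
action_index = n_spatial + n_region      # slot of the action token under this order


def self_attention(x):
    """Toy single-head self-attention over the token axis (axis 2)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x                   # same shape as x, one output per input slot


out = self_attention(grouped)
first_token_out = out[:, :, 0, :]              # what the slice in question selects
action_slot_out = out[:, :, action_index, :]   # where I expected the action token

print(out.shape, first_token_out.shape, action_index)
```

Under this assumed order, slot 0 holds a spatial context token and the action token sits at `action_index`, which is why the `[:, :, 0, :]` slice confuses me unless the actual grouping puts the action token first.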

In addition, regarding the image augmentation (padding + random_crop): how many crops do you take? From what I can see in the code, only the default value num_crops=1 is used. Isn't the global feature lost if there is only one crop? As far as I can tell from your code, the feature map is extracted from the cropped image.
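This is how I currently picture the augmentation, as a minimal NumPy sketch. The function name `pad_and_random_crop`, the edge padding mode, the 84x84 image size, and the choice of a crop as large as the unpadded image are all my assumptions, not something I confirmed in the repository.

```python
import numpy as np


def pad_and_random_crop(image, pad, crop_h, crop_w, num_crops=1, rng=None):
    """Pad an (H, W, C) image on each side, then take random crops from it."""
    rng = np.random.default_rng() if rng is None else rng
    # Edge padding is an assumption; the real code may pad differently.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    max_top = padded.shape[0] - crop_h
    max_left = padded.shape[1] - crop_w
    crops = []
    for _ in range(num_crops):
        top = int(rng.integers(0, max_top + 1))    # upper bound is exclusive
        left = int(rng.integers(0, max_left + 1))
        crops.append(padded[top:top + crop_h, left:left + crop_w])
    return np.stack(crops)                         # (num_crops, crop_h, crop_w, C)


image = np.zeros((84, 84, 3), dtype=np.float32)    # hypothetical image size
crops = pad_and_random_crop(image, pad=4, crop_h=84, crop_w=84,
                            num_crops=1, rng=np.random.default_rng(0))
print(crops.shape)
```

If the crop really has the same size as the unpadded image, each crop is just the original shifted by at most `pad` pixels, so a single crop would still cover most of the frame. Whether that matches your setup is what I would like to confirm.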

Could you help me understand why and how this works? Thanks a lot.
