Open
Description
action_token_out = transformer_out[:, :, 0, :].
Hello, I don't understand why index 0 of the output is taken directly as action_token_out. After your grouping, the grouped input should follow this order: spatial_context_feature + region_feature + action_token + other obs feature. Does this order change when the tokens pass through the transformer_decoder?
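To make the indexing question concrete, here is a minimal sketch of what I mean. It assumes transformer_out has the layout (batch, time, token, feature); the sizes and the names n_spatial, n_region, n_obs are hypothetical and not taken from the repository. Each token's features are filled with its own index so the result of a slice is easy to read.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not from the repository).
B, T, D = 2, 3, 4
n_spatial, n_region, n_obs = 1, 2, 2

# Ordering described above:
# spatial_context_feature + region_feature + action_token + other obs feature
action_index = n_spatial + n_region
N = n_spatial + n_region + 1 + n_obs

# Stand-in for transformer_out: every feature vector holds its token index.
token_ids = np.arange(N, dtype=np.float32)
transformer_out = np.broadcast_to(
    token_ids[None, None, :, None], (B, T, N, D)
).copy()

# The slice from the code: selects token 0 in every group.
first_token_out = transformer_out[:, :, 0, :]
# The slice I would expect if the action token sits after the region features.
action_token_out = transformer_out[:, :, action_index, :]

print(first_token_out.shape)       # (2, 3, 4)
print(first_token_out[0, 0, 0])    # 0.0
print(action_token_out[0, 0, 0])   # 3.0
```

Under this assumed ordering, index 0 would select the spatial context token rather than the action token, which is why I am asking whether the order is different from what I described.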
In addition, about the image augmentation (padding + random_crop): how many crops did you take? Looking through the code, I only see the default value num_crops=1 being used. Isn't the global feature lost if there is only one crop? As far as I can tell from your code, the feature map is extracted from the cropped image.
Could you help me understand why and how this works? Thanks a lot.
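For reference, this is a rough sketch of how I understand the padding + random_crop augmentation. The function name pad_and_random_crop, the edge padding mode, the pad width, and the image and crop sizes are all my own assumptions, not values from the repository.

```python
import numpy as np

def pad_and_random_crop(image, pad, crop_h, crop_w, rng, num_crops=1):
    """Edge-pad an (H, W, C) image, then take num_crops random crops."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    max_top = padded.shape[0] - crop_h
    max_left = padded.shape[1] - crop_w
    crops = []
    for _ in range(num_crops):
        # Sample the top-left corner uniformly over all valid positions.
        top = int(rng.integers(0, max_top + 1))
        left = int(rng.integers(0, max_left + 1))
        crops.append(padded[top:top + crop_h, left:left + crop_w])
    return np.stack(crops)

rng = np.random.default_rng(0)
image = np.zeros((84, 84, 3), dtype=np.float32)  # hypothetical image size
crops = pad_and_random_crop(image, pad=4, crop_h=84, crop_w=84, rng=rng)
print(crops.shape)  # (1, 84, 84, 3)
```

If the crop has the same size as the original image and the padding is small, as in this sketch, a single crop is just a slightly shifted copy that still covers most of the image. If the crop in your code is much smaller than the image, a single crop would see only a local region, and that is the case I am worried about.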