Hello, first of all, congratulations on this work, keep it up!
Reading the Relational Proxies paper, I noticed a reasonable argument (page 6, paragraph 1, last sentence in the arXiv version) which states: "Unlike the usual vision transformer, we omit the usage of positional embedding".
This completely makes sense, and I agree with it. But if I understood correctly, it seems to contradict src.networks.ast.AST, in which a learnable positional encoding is used. Is it still valid on the grounds that no inductive bias is imposed by this encoding? Doesn't it change the final embedding if we shuffle the crops? Both approaches may well be valid, but I want to double-check that I'm tackling the problem properly.
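To make my concern concrete, here is a minimal toy sketch (my own code, not the repo's; the single encoder layer and mean pooling are stand-ins I chose for illustration) showing that, with a learnable positional embedding, the pooled representation depends on the crop order, whereas omitting it keeps it order-invariant:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the crop encoder: one transformer encoder
# layer over a sequence of crop features.
dim, n_crops = 64, 8
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True).eval()
pos_emb = nn.Parameter(torch.randn(1, n_crops, dim))  # learnable positional encoding

crops = torch.randn(1, n_crops, dim)
perm = torch.randperm(n_crops)

with torch.no_grad():
    # Without positional embeddings, self-attention is permutation-equivariant,
    # so mean pooling over the crop outputs is unchanged by shuffling the crops.
    out_a = encoder(crops).mean(dim=1)
    out_b = encoder(crops[:, perm]).mean(dim=1)
    print(torch.allclose(out_a, out_b, atol=1e-5))  # True

    # With a learnable positional embedding added before the encoder,
    # shuffling the crops changes the pooled output.
    out_c = encoder(crops + pos_emb).mean(dim=1)
    out_d = encoder(crops[:, perm] + pos_emb).mean(dim=1)
    print(torch.allclose(out_c, out_d, atol=1e-5))  # False (in general)
```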
(edit) At line 29 of src.models.relational_proxies.py, the proxy criterion is defined after the optimizer. The official documentation of the loss specifies that its parameters (the proxies) should be optimized along with the other parameters. I suspect this means the code currently optimizes with fixed proxies; did I miss something?
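For reference, assuming the proxy criterion is something like ProxyAnchorLoss from pytorch-metric-learning (an assumption on my part; that library does register the proxies as learnable parameters), the pattern I had in mind would be to construct the criterion before building the optimizer, or to give its parameters their own optimizer, roughly like this (a sketch, not the repo's code):

```python
import torch
from pytorch_metric_learning import losses

# Hypothetical model and hyperparameters, only to illustrate the ordering issue.
model = torch.nn.Linear(512, 128)
proxy_criterion = losses.ProxyAnchorLoss(num_classes=200, embedding_size=128)

# Option A: one optimizer over both the model parameters and the proxies.
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(proxy_criterion.parameters()), lr=1e-4
)

# Option B: a separate optimizer dedicated to the loss parameters,
# stepped alongside the main one during training.
loss_optimizer = torch.optim.SGD(proxy_criterion.parameters(), lr=1e-2)
```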
Thanks!