feedback on ch 4 #8

@murphyk

Description

p105. The discussion of good performance by AlexNet seems
at odds with the earlier discussion of neural collapse on p94.
After all, AlexNet still has a linear softmax layer at the end.
Maybe clarify that you mean AlexNet has good predictive performance
(by virtue of scaling model and data),
rather than claiming it learns better features.

p108. Eq 4.1.6. Should it be q*(z^l) rather than q*^l,
since it seems that z^l (not just l)
is needed to define the target labels for
ridge regression? Also, can we not drop the superscript l from z^l,
since the linear projection holds for any z?

p110. Projecting onto S^{d-1}. Remind readers how to do this.
How does this relate to layernorm?
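
For concreteness, here is what I assume "projecting onto S^{d-1}" means, and how it differs from layernorm (a toy sketch; the function names are mine):

```python
import numpy as np

def project_to_sphere(z, eps=1e-8):
    # Projection onto S^{d-1}: rescale z to unit Euclidean norm.
    return z / (np.linalg.norm(z) + eps)

def layernorm(z, eps=1e-8):
    # LayerNorm without the learned affine: center, then divide by the std.
    zc = z - z.mean()
    return zc / (zc.std() + eps)

z = np.random.randn(16)
print(np.linalg.norm(project_to_sphere(z)))  # exactly 1: on S^{d-1}
print(np.linalg.norm(layernorm(z)))          # sqrt(d): a centered, rescaled sphere
```

So layernorm (without the affine part) is sphere projection up to centering and a sqrt(d) rescaling, which I assume is the connection worth spelling out.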

p111. "the architecture of ReduNet with those of popular empirically
designed networks, ResNet and NesNeXt shown in Figure 4.3, the similarity
is somewhat uncanny." Typo: "NesNeXt" should be "ResNeXt".
Also, it's odd (from this POV) that ResNet does not have K different branches,
the way that ReduNet and ResNeXt do. Does ResNet learn worse features?

p112. "different clusters are orthogonal to each other."
It looks like the data get mapped to individual points,
one per class, rather than subspaces. So it is not clear
what orthogonality means. What are the d_k values?
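
(For reference, what I would expect "orthogonal clusters" to mean for subspaces, in what I assume is the book's notation:)

```latex
% Orthogonality between the class subspaces spanned by U_j and U_k:
U_j^\top U_k = 0 \in \mathbb{R}^{d_j \times d_k}, \qquad j \neq k.
% If every d_k = 1, this degenerates to orthogonal vectors (one point
% per class on the sphere), which is why the d_k values matter here.
```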

p114. Last eqn. x should be z

p121. Remark 4.4. This is a very important point, and allows
the rate reduction method to be applied to unsupervised data.
This should be highlighted more
(e.g., mentioned earlier in the book), since it was not at all
clear (to me) how you would side-step the dependence on class
labels...

p121. Eq 4.2.4. How does U[K] relate to f? Presumably it is
part of it? Or are you jointly optimizing over f and U?
(I think what you do is max_U max_f DeltaR(f(X)|U),
where the inner max_f is computed by the manually designed
unrolled forward pass, and the outer max_U is SGD. Right?)
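
To make my reading concrete, here is a toy PyTorch sketch of the scheme I have in mind; everything in it is my guess (coding_rate/delta_R are schematic stand-ins for the book's formulas, and a soft membership matrix Pi stands in for U[K]):

```python
import torch

def coding_rate(Z, eps=0.5):
    # R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T), with Z of shape d x n.
    d, n = Z.shape
    return 0.5 * torch.logdet(torch.eye(d) + (d / (n * eps**2)) * Z @ Z.T)

def delta_R(Z, Pi):
    # Schematic rate reduction with an n x K soft membership matrix Pi
    # standing in for the U's; NOT the book's exact formula.
    n, K = Pi.shape
    Rc = 0.0
    for k in range(K):
        w = Pi[:, k]
        Zk = Z * w.sqrt()                       # weight columns by membership
        Rc = Rc + (w.sum() / n) * coding_rate(Zk)
    return coding_rate(Z) - Rc

def unrolled_f(X, Pi, layers=3, eta=0.1):
    # Inner "max over f": the hand-designed unrolled forward pass,
    # i.e. a few projected-gradient-ascent steps on DeltaR.
    Z = X / X.norm(dim=0, keepdim=True)
    for _ in range(layers):
        g, = torch.autograd.grad(delta_R(Z, Pi), Z, create_graph=True)
        Z = Z + eta * g
        Z = Z / Z.norm(dim=0, keepdim=True)     # project back onto the sphere
    return Z

# Outer "max over U": plain SGD on membership logits.
X = torch.randn(8, 64, requires_grad=True)      # d=8 features, n=64 samples
logits = torch.zeros(64, 4, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.5)
for step in range(10):
    Pi = logits.softmax(dim=1)
    Z = unrolled_f(X, Pi)
    loss = -delta_R(Z, Pi)                      # maximize DeltaR(f(X) | U)
    opt.zero_grad(); loss.backward(); opt.step()
```

If this is roughly right, a sentence or two spelling out the inner/outer split would resolve my question.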

p121. "Therefore, we would like to transform the
representations ... Therefore, to ensure the final representations
are amenable to more compact coding...."
These two consecutive sentences are almost identical (as are footnotes 14 and 15).

p122. "improve computational traceability "
Tractability.

p122. If multi-head attention is like your MSSA operator,
presumably we should be using values like K=10,000 or more,
since we are trying to approximate the data as locally piecewise linear,
and we probably need many pieces (of course it also depends
on the embedding dimensionality p_k). But IIUC, people normally use
things like K=32-128. Can you comment?
(Also, how does this relate to "grouped-query attention"?)
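
(Back-of-envelope, assuming the standard multi-head-attention constraint that the heads split the model dimension:)

```latex
% In standard multi-head attention the per-head dimension is p_k = d / K,
% so e.g. d = 768 with K = 10{,}000 would force p_k < 1:
p_k = \frac{d}{K} \quad\Rightarrow\quad K \le d.
% Large K therefore only seems feasible if d grows with K, or if heads
% share projections (as grouped-query attention does for keys/values).
```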

p123. Why is it a reasonable assumption to use a first-order Neumann
series approximation to the inverse? Is it because you apply
layer norm to Z (part of fig 4.13 but not mentioned in the text)?
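
(For reference, the approximation I take you to be making:)

```latex
% Neumann series for a square matrix A with spectral radius \rho(A) < 1:
(I - A)^{-1} = \sum_{k=0}^{\infty} A^k \;\approx\; I + A \quad \text{(first order)},
% which is only reasonable when \|A\| is small -- presumably what
% normalizing Z (layer norm) is buying you, hence my question.
```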

p123. Eq 4.2.11. Unclear where the softmax arises in 4.2.10.
Where do the exp and the normalizing division come from?

p123. "To solve λ∥Z∥_1 − R_ϵ(Z)" .
Justify why you ignore the second term.
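
(For reference, solving the first term alone entrywise gives the usual soft-thresholding proximal step; a minimal sketch, with the interface my own:)

```python
import numpy as np

def soft_threshold(Z, lam):
    # Prox of lam * ||Z||_1 alone: entrywise soft-thresholding.
    # Dropping -R_eps(Z) is exactly the step I think needs justification.
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)
```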

p126. Fig 4.15. Should the green GD box be associated with (b) not (a)?

p126. "we show this could be the case".
Weak sauce! Maybe say "we show that this is the case" :)

p126. "We assume that the initial token
representations Z(1) are sampled from a
mixture of low-rank Gaussians perturbed by noise".
If the tokens in the initial input layer are already representable
by a mixture of low-rank Gaussians, why do we need the subsequent layers?
Is it expected that the sparsity level increases as you go up the layers,
or the relevant subspace dimensionality drops, because we find a more
parsimonious way of representing the same data?

p127. Eq 4.3.2. Explain the connection to eq 4.2.10.

p127. "let the columns of Z_k^l denotes the token representations
from the k-th subspace at the ℓ-th layer."
Is Z_k^l = Z^l U_k? If not, how is it computed?

p128. "This theorem provides a theoretical foundation for the practical
denoising capability of the transformer architecture
derived by unrolling (4.3.2)".
It's reasonable to think of the input tokens as noisy,
and subsequent layers denoising them, but this seems at odds
with the discussion of diffusion models, where we mapped
from Gaussian noise to the empirical token distribution,
and the latter was considered the gold standard of low-rank
representations that we are aiming to reproduce....
In other words, it seems that diffusion models treat
the input data (pixels) as already living on a low-dimensional
subspace (which they do), and we just want to map to that from
a high entropy source, whereas CRATE treats the input data (tokens)
as noisy samples (away from a linear manifold), and we want to remove the noise
from the input itself to map to something which is sparser
and on-manifold. Can you discuss?

p129. "Attention-Only Transformer".
How well does this work? Throwing away the MLP/sparse dictionary
layer drastically reduces the number of parameters, which I imagine
would hurt performance?

p129. Sec 4.3.2. It would be good to show some empirical results for TSSA,
to motivate slogging through all the (quite hairy!) math.

p130. Below Eqn 4.3.11. "TOST" not defined (except in caption of fig 4.18).

p131. "we have avoided discussion of
how tokens are “grouped” into various attention heads via the Π matrix"
This seems like more than just an "implementation detail";
it seems like a crucial part of the algorithm specification.

p131. "We can thus explicitly
find a low-dimensional orthonormal basis for the image of this covariance,
i.e., the linear span of the data in group k."
Does this mean you don't just use backprop to learn
unrestricted U_k?
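
My reading, as a minimal sketch (the function name and interface are mine):

```python
import numpy as np

def subspace_basis(Z_k, dim):
    # Orthonormal basis for the image of the group-k covariance Z_k Z_k^T:
    # the top `dim` left singular vectors of Z_k (columns = tokens in group k).
    # I.e., a closed-form spectral construction of U_k, rather than backprop
    # on an unrestricted U_k -- which is what I am asking about.
    U, _, _ = np.linalg.svd(Z_k, full_matrices=False)
    return U[:, :dim]
```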

p132. "In the following," Maybe put the table (comparing DNNs with ReduNets)
inside a proper latex table, which can be cross referenced.
Also move this to sec 4.1, since it is unrelated to transformers.

p133. Eq 4.5.1. (Neumann) The sum should start at k=0.
