I built something similar in structure, if not in method: a hierarchy of cross-attention layers with learned queries, made sparse by applying an L1 penalty to the attention matrices.
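Roughly, a minimal PyTorch sketch of that arrangement: a stack of cross-attention layers whose queries are learned parameters, with an L1 penalty collected from each layer's attention matrix. One caveat: with a standard row-softmax the L1 norm of every row is 1, so the sketch below gates the scores with a sigmoid instead, where the penalty actually bites. All names and hyperparameters (`SparseCrossAttn`, `query_counts`, `l1_weight`) are illustrative, not what I actually used.

```python
import torch
import torch.nn as nn

class SparseCrossAttn(nn.Module):
    def __init__(self, dim, num_queries):
        super().__init__()
        # Learned queries: each layer summarizes its input into num_queries slots.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        k, v = self.to_k(x), self.to_v(x)
        scores = torch.einsum("bqd,bkd->bqk", q, k) * self.scale
        attn = torch.sigmoid(scores)   # unnormalized gate, so L1 is meaningful
        l1 = attn.mean()               # weights are nonnegative, so mean == mean |a|
        out = torch.einsum("bqk,bkd->bqd", attn, v)
        return out, attn, l1

class CrossAttnHierarchy(nn.Module):
    def __init__(self, dim, query_counts=(64, 16, 4)):
        super().__init__()
        self.layers = nn.ModuleList(SparseCrossAttn(dim, n) for n in query_counts)

    def forward(self, tokens, l1_weight=1e-3):
        x, penalty, attns = tokens, tokens.new_zeros(()), []
        for layer in self.layers:
            x, attn, l1 = layer(x)
            attns.append(attn)
            penalty = penalty + l1_weight * l1
        return x, attns, penalty   # add `penalty` to the task loss
```

Each layer narrows the representation (64 slots, then 16, then 4 in this toy config), and the sparsified attention matrices record which lower-level slots each higher-level slot was built from.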
Discrete hierarchical representations are super cool. The pattern of activations across layers amounts to a “parse tree” for each input. You have effectively compressed the image into a short sequence of integers.
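To make the "short sequence of integers" point concrete, here is a hypothetical reading of such a model's output: assuming each level of the hierarchy exposes per-slot logits over a discrete codebook, the whole image collapses to one integer per slot, read level by level like a derivation. The `image_code` interface here is invented for illustration, not the post's actual API.

```python
import torch

def image_code(per_layer_logits):
    """per_layer_logits: list of tensors, one per layer, each of shape
    (num_slots, num_choices) -- logits over a discrete codebook for every
    slot at that level. Returns one flat list of integers: the 'parse'."""
    code = []
    for logits in per_layer_logits:
        code.extend(int(i) for i in logits.argmax(dim=-1))
    return code

# A toy 3-level hierarchy: 4 coarse, 8 mid, 16 fine slots,
# each choosing one of 32 discrete symbols.
logits = [torch.randn(4, 32), torch.randn(8, 32), torch.randn(16, 32)]
print(image_code(logits))   # 28 integers, each in [0, 32)
```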