I wish this was written with more care. None of the symbols are defined.
Worst of all, they use "channel dimension" in a sequence model. What even is a channel in a sequence of tokens?
This happens as soon as you have a single person with CNN background on the team and it makes zero sense.
What if you actually have channels in your data? What then?
If you have more specific feedback, such as a particular diagram or page and how it could be made better, I will gladly forward it to help improve the paper draft.
Channel mixing is a core component of this architecture, and the keyword "channel" is all over the place, so I have no idea what it is you are critiquing specifically (I could not find any mention of "channel dimension" in the paper).