"A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP"
I can't understand the technical jargon, but my interpretation is that MLP turns out to not be as good as CNN / Transformer. Maybe someone with more expertise can weigh in!
TLDR - we've come full circle.
In a normal CNN, the kernel has a spatial width and height and a certain number of input and output channels. Their MLP blocks are effectively a convolution where things are mixed up spatially (but each channel stays separate), followed by a convolution where channels are mixed up (but only within one spatial location).
The obvious advantage of this architecture is that the weights for remixing channels can be shared regardless of where in the image they are applied. So one would expect this to be more translation-invariant than a typical CNN architecture. But for other tasks, such as image reconstruction, that loss of spatial information might actually hurt performance.
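To make the two-step mixing concrete, here's a toy numpy sketch of such a block (my own illustration, not the paper's actual code; norms, activations, and skip connections all omitted):

```python
import numpy as np

def mixer_block(x, w_tokens, w_channels):
    """One mixing block, stripped to its two matmuls.

    x:          (num_patches, num_channels) patch embeddings for one image
    w_tokens:   (num_patches, num_patches)  spatial mixing, shared across channels
    w_channels: (num_channels, num_channels) channel mixing, shared across positions
    """
    # Spatial mixing: every channel is mixed across positions with the SAME
    # weights -- mixes over space, leaves each channel separate.
    x = w_tokens @ x
    # Channel mixing: every position mixes its channels with the SAME weights
    # -- like a 1x1 convolution, no spatial mixing at all.
    x = x @ w_channels
    return x

rng = np.random.default_rng(0)
patches, channels = 16, 8
x = rng.normal(size=(patches, channels))
out = mixer_block(x,
                  rng.normal(size=(patches, patches)),
                  rng.normal(size=(channels, channels)))
print(out.shape)  # (16, 8)
```

Note that the channel-mixing weights are position-independent (hence the translation-invariance point), while the spatial-mixing weights are tied to absolute patch positions.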
To me, this looks like an approach that isn't really new and doesn't perform better; it's mainly just different from what people usually use.
I think it's quite common to use quasi-"MLP mixtures" in natural language processing (NLP), since your input layer is typically a word embedding, which is itself a shallow MLP. There are also architectures such as deep averaging networks (DANs) that are effectively MLPs "all the way through"; see https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf (NOTE: PDF document)
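The DAN idea fits in a few lines of numpy (a toy illustration of my own, not the paper's code; the names are mine): average the word embeddings of the input, then push the average through plain dense layers, so it's an MLP from input to output with no recurrence or attention.

```python
import numpy as np

def dan_forward(token_ids, embeddings, w_hidden, w_out):
    """Deep Averaging Network forward pass (rough sketch)."""
    avg = embeddings[token_ids].mean(axis=0)   # order-insensitive average embedding
    hidden = np.maximum(0.0, avg @ w_hidden)   # dense layer with ReLU
    return hidden @ w_out                      # class scores

rng = np.random.default_rng(0)
vocab, dim, hid, classes = 100, 16, 32, 3
scores = dan_forward(np.array([3, 14, 15]),          # a 3-token "sentence"
                     rng.normal(size=(vocab, dim)),  # embedding table
                     rng.normal(size=(dim, hid)),
                     rng.normal(size=(hid, classes)))
print(scores.shape)  # (3,)
```

The averaging step throws away word order entirely, which is why DAN counts as "MLP all the way through" rather than a sequence model.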
I don't do ML research and from the outside a lot of it looks like:
1. Mix up the computation in a different way.
2. Check the performance.
3. If it performs well, you publish a paper.
How do these people stay motivated?
That said, there are incremental papers that tweak some parameters, and then there are others like this one that take a big, risky step away from what the rest of the community is doing (i.e., using convolutions for computer vision), and make a discovery on a common benchmark.
Speaking from my own experience, I think what keeps researchers addicted is the variable random reward we get from seeing our new algorithm's performance on a common benchmark. Even better if the performance is near state of the art, but that's not necessary.
That's copium, my dude. There is no doubt great, insightful work in ML, but for the vast majority of publications OP's questions are valid. Most people use ML like a hammer and are trying to turn their problems into nails.
[1.] Weight Agnostic Neural Networks (interactive site with link to paper) - https://weightagnostic.github.io/
You can see it in this paper too - there isn't any motivating theory about how to come up with something like this; the entire paper is "we tried some things, here's what worked and what didn't". (This is just an observation, I'm not criticising the authors at all)
Maybe I'm being oversensitive about my field, but it comes across as dismissive of the field itself. Especially with "how do these people stay motivated?"
> there's a limited number of operations that go into a neural network and the justification for the best architectures is that they have the best performance.
There are a limited number of ways to put ink on a page, but the 'rest of the owl' meme is all about how there's a lot more to it than that.
And the justification of best performance is because it's an unsolved problem that is being incrementally solved. Over the last decade you can see tons of tasks go from 20% accuracy being an achievement to 24% to 30% to 31% to 32% to 38%... eventually to 90%. The motivation is that these little changes on a 'does it do the job' metric add up. It's very motivating to see things slowly move from impossible for computers to pip installable.
I'm not convinced. If you want to stick to the analogy, ML research looks like being able to draw a bunch of different animal parts and trying out different configurations of them; then once the drawing is done, you check whether you got close to an owl. That's what I meant when I said that there's little underlying theory to motivate the decisions that are being made (apart from the resulting performance of course).