MLP-Mixer: An All-MLP Architecture for Vision (arxiv.org)
38 points by g42gregory 33 days ago | 16 comments

I'm not an expert, but I became aware of this paper recently:

"A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP" https://arxiv.org/pdf/2108.13002.pdf

I can't understand the technical jargon, but my interpretation is that MLP turns out to not be as good as CNN / Transformer. Maybe someone with more expertise can weigh in!

MLPs discard more spatial information. That is great for "Do you see a cat?" but bad for "Where is the cat?". So which one is better depends on your use case.

MLPs don't "discard" spatial information. Rather, spatial invariance is a very useful inductive bias for learning that MLPs don't have. If you had a perfect optimizer, an MLP would outperform CNNs (assuming hard enough tasks), but you don't have a perfect optimizer.
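
A rough sketch of what that inductive bias looks like in practice, assuming a 1-D signal, circular padding, and random weights (all names here are made up for illustration). A convolution reuses one kernel at every position, so shifting the input just shifts the output (translation equivariance). A dense layer has no such constraint and would have to learn it from data.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=12)         # toy 1-D "image"
kernel = rng.normal(size=3)          # one small kernel shared across positions
dense_w = rng.normal(size=(12, 12))  # unconstrained dense (MLP) layer weights


def circular_conv(x, k):
    # Apply the same kernel at every position, wrapping around at the edges.
    return sum(k[i] * np.roll(x, -i) for i in range(len(k)))


def dense(x, w):
    # Every output unit has its own independent weight for every input position.
    return w @ x


shift = 3
shifted = np.roll(signal, shift)

# Does "shift the input, then apply the layer" equal "apply the layer, then shift"?
conv_equivariant = np.allclose(
    circular_conv(shifted, kernel), np.roll(circular_conv(signal, kernel), shift)
)
dense_equivariant = np.allclose(
    dense(shifted, dense_w), np.roll(dense(signal, dense_w), shift)
)
print(conv_equivariant, dense_equivariant)
```

The dense layer could in principle represent the same function (a circulant weight matrix), which is the "perfect optimizer" point: the capacity is there, the bias is not.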

MLP is basically the vanilla neural network: the thing everyone sees first when being taught deep learning. It turned out that MLPs are (usually) not very efficient at processing matrix-structured data (e.g. 2D matrices like images). So computer vision researchers developed convolutional neural networks specifically to make image data efficient for neural networks to process. They outperformed MLPs in basically all aspects. Transformers / attention are a relatively new invention, initially made to solve NLP problems more efficiently; but as it turns out, they work great on images as well.

TLDR - we've come full circle.

It appears that this is quite similar to the depthwise separable convolutions Google used in Xception and MobileNets (2017).

In a normal CNN, the kernel has a spatial width and height and a certain number of input and output channels. Their MLP blocks are effectively a convolution where things are mixed up spatially (but channels remain the same) followed by a convolution where channels are mixed up (but only within one spatial block).
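
For anyone who wants to see the two mixing steps concretely, here is a minimal NumPy sketch of one block as I read the description above. It is not the authors' code: the sizes, the weight initialization, the tanh GELU approximation, and the layer norm without learned scale/shift are all simplifying assumptions.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (no learned scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_mlp(rng, dim, hidden):
    # Two-layer MLP mapping dim -> hidden -> dim.
    return (rng.normal(0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0, 0.02, (hidden, dim)), np.zeros(dim))


def mlp(x, w1, b1, w2, b2):
    return gelu(x @ w1 + b1) @ w2 + b2


def mixer_block(x, token_params, channel_params):
    # x has shape (num_patches, channels).
    # Spatial ("token") mixing: transpose so the MLP runs across patches,
    # using the same weights for every channel.
    x = x + mlp(layer_norm(x).T, *token_params).T
    # Channel mixing: the MLP runs across channels,
    # using the same weights for every patch.
    return x + mlp(layer_norm(x), *channel_params)


rng = np.random.default_rng(0)
num_patches, channels = 16, 8  # hypothetical toy sizes
x = rng.normal(size=(num_patches, channels))
token_params = init_mlp(rng, num_patches, 32)
channel_params = init_mlp(rng, channels, 64)

out = mixer_block(x, token_params, channel_params)
print(out.shape)  # (16, 8)
```

Note that the spatial-mixing weights are tied to `num_patches`, so unlike a convolution the block only works at the input resolution it was built for.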

The obvious advantage of this architecture is that the weights for remixing channels can be shared regardless of where in the image it occurs. So one would expect this to be more translation-invariant than a typical CNN architecture. But for other tasks - such as image reconstruction - that loss of spatial information might actually hurt performance.

To me, this looks like an approach that isn't really that new and isn't performing better, but mainly it's just different from what people usually use.

In the conclusion the authors state: "It would be particularly interesting to see whether such a design works in NLP or other domains."

I think it's quite common to use quasi "MLP mixtures" in natural language processing (NLP), since your input layer is typically a word embedding, which itself is a shallow MLP. There are also architectures such as deep averaging networks (DAN) that are effectively MLP "all the way through", see https://people.cs.umass.edu/~miyyer/pubs/2015_acl_dan.pdf (NOTE: PDF document)
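
To make the DAN idea concrete, here is a toy sketch, assuming a made-up five-word vocabulary and random untrained weights (none of this is taken from the linked paper's code): average the word embeddings, then pass the result through plain feed-forward layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes, for illustration only.
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "terrible": 4}
embed_dim, hidden_dim, num_classes = 8, 16, 2

embeddings = rng.normal(0, 0.1, (len(vocab), embed_dim))
w1, b1 = rng.normal(0, 0.1, (embed_dim, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(0, 0.1, (hidden_dim, num_classes)), np.zeros(num_classes)


def dan_forward(tokens):
    ids = [vocab[t] for t in tokens]
    avg = embeddings[ids].mean(axis=0)       # word order is discarded here
    hidden = np.maximum(0.0, avg @ w1 + b1)  # ReLU feed-forward layer
    logits = hidden @ w2 + b2
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()


probs = dan_forward(["the", "movie", "was", "great"])
print(probs)
```

Because of the averaging step, any reordering of the same words gives the same prediction, which is the main thing this kind of model gives up compared to sequence models.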

Kind of cool to see this result.

I don't do ML research and from the outside a lot of it looks like:

1. Mix up the computation in a different way.

2. Check the performance.

3. If it performs well, you publish a paper.

How do these people stay motivated?

The key is that step 1 isn't random: it's guided by past experience, and intuition. At the local level it might be random, but from a higher level, the search is guided.

That said, there are incremental papers that tweak some parameters, and then there are others like this one that take a big, risky step back from the rest of the community (i.e., using convolutions for computer vision), and make a discovery on the common benchmark.

Speaking from my own experience, I think what keeps researchers addicted is the random, variable reward we get from seeing our new algorithm's performance on a common benchmark. Even better if the performance is near state of the art, but that's not necessary.

> The key is that step 1 isn't random: it's guided by past experience, and intuition

That's copium my dude. There is no doubt great insightful work in ML but for the vast majority of publications OP's questions are valid. Most people use ML like a hammer and are instead trying to turn their problems into nails.

In this case I'd argue this is more than just the usual iterative fodder. The architecture itself is not the interesting part, it's the evidence that many decent architectures exist and performant networks don't have to be analogous to human visual systems (CNNs). How do they stay motivated? 1) the pursuit of knowledge 2) publishing as a way to gain prestige and climb up the career ladder

The focus on architecture is interesting. Another paper, Weight Agnostic Neural Networks [1.], explores neural network architecture search and, with some success, focuses on how influential structure alone is.

[1.] Weight Agnostic Neural Networks (interactive site with link to paper) - https://weightagnostic.github.io/

Yeah, the motivation is pretty easy if you consider that much of software engineering is the same process (e.g., performance tuning), and on top of that you get prestige and do some math on the side. It’s basically a research version of test-driven development.

Step 1 is kind of ‘rest of the owl.’ It’s about as vague as ‘write some code.’

What I meant more specifically is that there's a limited number of operations that go into a neural network and the justification for the best architectures is that they have the best performance.

You can see it in this paper too - there isn't any motivating theory about how to come up with something like this; the entire paper is "we tried some things, here's what worked and what didn't". (This is just an observation, I'm not criticising the authors at all)

> I'm not criticising the authors at all

Maybe I'm being oversensitive about my field, but it comes across as dismissive of the field itself. Especially with "how do these people stay motivated?"

> there's a limited number of operations that go into a neural network and the justification for the best architectures is that they have the best performance.

There are a limited number of ways to put ink on a page, but the 'rest of the owl' meme is all about how there's a lot more to it than that.

And the justification of best performance is because it's an unsolved problem that is being incrementally solved. Over the last decade you can see tons of tasks go from 20% accuracy being an achievement to 24% to 30% to 31% to 32% to 38%... eventually to 90%. The motivation is that these little changes on a 'does it do the job' metric add up. It's very motivating to see things slowly move from impossible for computers to pip installable.

> There are a limited number of ways to put ink on a page, but the 'rest of the owl' meme is all about how there's a lot more to it than that.

I'm not convinced. If you want to stick to the analogy, ML research looks like being able to draw a bunch of different animal parts and trying out different configurations of them; then once the drawing is done, you check whether you got close to an owl. That's what I meant when I said that there's little underlying theory to motivate the decisions that are being made (apart from the resulting performance of course).
