
Show HN: Beating Hinton et al.'s capsule net with fewer params and less training - fheinsen
Hello HN,

I recently posted a _work-in-progress_ paper, along with the code necessary for replicating all of its results, at:

    https://github.com/glassroom/heinsen_routing

Among other things, the code in this repo outperforms Hinton et al.'s recent state-of-the-art result in visual recognition [0] while requiring _fewer parameters_ and _an order of magnitude fewer training epochs_.

Most of the original research we do at work tends to be either proprietary in nature or tightly coupled to internal code, so we cannot share it with the world. In this case, however, I was able to remove all traces of internal code and release this as stand-alone open-source software without having to disclose any key IP.

I've reached out to academics in different groups for feedback, and the response so far has been positive, although most have only skimmed the paper. It will likely take a few weeks to get proper feedback from academia.

In the meantime, I figured there are a lot of super-smart, knowledgeable people on HN who would love to take a look at this and share their thoughts. Please feel free to ask questions. Let me know what you think!

[0] https://ai.google/research/pubs/pub46653
======
p1esk
Hey, congrats on publishing!

1. Could you briefly summarize your algorithm (novelty, how it's better, why
it's better, etc.)?

2. Since the original paper, there have been dozens of published attempts to
improve upon it. Do you compare your results to the latest in capsules
research?

3. I personally would like to see ImageNet results. NORB is a toy dataset. If
you beat EfficientNet in terms of both accuracy and number of params/FLOPs,
many people will be impressed (including Hinton). Or match the performance of
a good convnet using 1/10 of the training data.

Don’t take this the wrong way, but two years after the original paper, NORB
results, no matter how good, are underwhelming.

~~~
fheinsen
Great questions. Happy to answer them here.

First of all, this work builds on Hinton et al.’s second paper, the one about
_EM routing of matrix capsules_, from last year:
[https://ai.google/research/pubs/pub46653](https://ai.google/research/pubs/pub46653)
It is only minimally related to the previous paper from two years ago (Sabour
et al.'s)!

RESPONSES TO #1:

* The same algorithm also achieves SOTA in another domain, natural language. _Same code._ I think it’s significant that the same code, without change, produces SOTA in two domains. See the README and tables 3 and 4 in the draft paper.

* It requires fewer parameters: 272K instead of 310K for Hinton et al. (2018)’s model and 2.7M for the best performing CNN on record (Cireşan et al.); see table 2. That’s 10x fewer parameters than the best performing CNN on record.

* It requires an order of magnitude less training: 50 epochs instead of 300 for Hinton et al. (2018)'s model.

* It’s trained with minimal data augmentation, unlike Hinton et al.’s and Cireşan et al.’s models (the latter, in particular, uses a _ton_ of data augmentation). Also, unlike Hinton’s model, it accepts full-size images instead of 32x32 crops that are 9 times smaller. Finally, we do not measure accuracy as a mean of multiple crops. So, the model has fewer parameters, requires less training, and has greater capacity.

* It seems to be learning a form of "reverse graphics" on its own, from only pixels and labels, without having to optimize explicitly for it. See the README, figure 4, and the 24 plots and captions in supplemental figures 6 and 7. This is rather significant, don't you think?
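For what it's worth, the "10x fewer parameters" claim is easy to sanity-check from the counts quoted above (a back-of-the-envelope calculation, using only the numbers from table 2):

```python
# Parameter counts quoted above (from table 2 of the draft paper).
heinsen_routing = 272_000    # this work
hinton_em_2018  = 310_000    # Hinton et al. (2018), EM routing
ciresan_cnn     = 2_700_000  # best-performing CNN on record (Cireşan et al.)

print(round(ciresan_cnn / heinsen_routing, 2))     # 9.93 -> roughly 10x fewer
print(round(hinton_em_2018 / heinsen_routing, 2))  # 1.14 -> modestly fewer
```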

RESPONSES TO #2:

* As far as I know, the best attempt at recreating Hinton et al.’s work on EM routing is by Ashley Gritzman at IBM, from July of this year -- only a bit over two months ago. From what I can tell, his model does not come close to matching Hinton’s performance:

[https://arxiv.org/abs/1907.00652](https://arxiv.org/abs/1907.00652)

[https://github.com/IBM/matrix-capsules-with-em-routing](https://github.com/IBM/matrix-capsules-with-em-routing)

[https://medium.com/@ashleygritzman/available-now-open-source...](https://medium.com/@ashleygritzman/available-now-open-source-implementation-of-hintons-matrix-capsules-with-em-routing-e5601825ee2a)

* There have been a few other efforts, all of which seem to fall short of Hinton's performance. Gritzman does a good job of covering those other efforts in his Medium article. None of these efforts propose any new ideas, as far as I can tell.

RESPONSES TO #3:

* Me too. So does Hinton: [https://openreview.net/forum?id=HJWLfGWRb](https://openreview.net/forum?id=HJWLfGWRb) ... and so does everyone else.

* Alas, as Paul Barham and Michael Isard at Google Brain showed earlier this year, scaling capsule networks to large datasets and output spaces can be challenging. Current software (e.g., PyTorch, TensorFlow) and hardware (e.g., GPUs, TPUs) are highly optimized for a fairly small set of computational kernels, in a way that is tightly coupled with memory hardware, which leads to poor performance on non-standard workloads -- including basic operations on capsules. Source: Barham and Isard (2019) - [https://dl.acm.org/citation.cfm?id=3321441](https://dl.acm.org/citation.cfm?id=3321441) (the PDF is available for free download at that link).

* My draft paper mentions Barham and Isard’s work.
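To make the "non-standard workload" point concrete, here is a toy sketch (my own illustration with made-up shapes, not code from either paper) of the capsule voting step: many small per-pair matrix products, rather than the one large dense matmul that current kernels are tuned for.

```python
import numpy as np

# Voting in EM routing: each input capsule's small pose matrix is multiplied
# by a separate small transform for every output capsule -- lots of tiny
# matmuls instead of one big dense one, which maps poorly onto kernels
# optimized for large dense operations.
n_in, n_out, d = 32, 16, 4
poses = np.random.rand(n_in, d, d)              # input pose matrices
transforms = np.random.rand(n_in, n_out, d, d)  # per-(input, output) weights
votes = np.einsum('iab,iobc->ioac', poses, transforms)
print(votes.shape)  # (32, 16, 4, 4): one small vote matrix per capsule pair
```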

~~~
p1esk
1. Both of Hinton’s capsules papers were released at the same time (Oct
2017). You can see that the first comment on the OpenReview page for the EM
paper is dated Nov 2017. From what I remember, the two papers appear very
similar, with the main difference being in how the routing is implemented.

2. You cite a convnet result from 2011 (!). Don’t you think a modern convnet
would do vastly better on this task?

3. Could input size play a role? Did you try feeding 96x96 inputs to the
models you’re comparing against, to see if they also benefit from it?

4. I’m a bit confused as to why other implementations failed to reproduce
Hinton’s results, given that he open-sourced the code (link in the first
OpenReview comment).

5. OK, ImageNet is too slow; how about CIFAR-10? What would it take to reach,
say, 95%? That would be equivalent to a well-trained ResNet-18. If you can
show such a result, I personally would become more interested, because I’ve
worked quite a bit with CIFAR-10, but not with NORB.

I think you might be onto something, but it’s still not clear that the
capsules approach is scalable and ultimately superior to plain convnets.

~~~
fheinsen
I’m surprised you did not comment on the fact that my version of EM routing
also achieves SOTA on another domain, natural language. _Same code._

Here are the answers to your questions:

1. The final, published version is stamped “ICLR 2018,” so I used that year.

2. I don’t know if a conventional CNN can do this with 10x fewer parameters,
while also learning to do a form of “reverse graphics” without explicitly
optimizing for it. (I wouldn’t know how to get a CNN to do that without
explicitly making it a training objective.)

3. IIRC, the convnet model from 2011 accepts 96x96 images. As to why Hinton
et al. downsample images to be 9x smaller, I suspect (but don’t know for sure)
they had no choice but to do so, to conserve memory and computation with their
version of EM routing. I was able to reduce memory and computation with my
variant of EM routing (by between one and two orders of magnitude) by setting
the first routing layer to accept a variable number of inputs, without regard
to location in the image.

4. Me too. But you asked me about work other than Hinton’s, and that’s all I
could find!

5. CIFAR-10 is on the to-do list (work permitting!) :-)
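Regarding the variable-input point in #3 above, here is a rough sketch of what I mean (illustrative shapes only, not the repo's actual API): the grid of capsules coming out of a conv stem gets flattened into a single set per image, so one routing layer sees every position at once instead of routing separately at each location.

```python
import numpy as np

# Illustrative shapes only (not the repo's actual API): a conv stem applied
# to a full-size 96x96 image yields a grid of capsule activations
# [batch, channels, height, width]. Flattening the spatial grid into one
# set axis lets the first routing layer accept a variable number of input
# capsules, without regard to location in the image.
B, C, H, W = 2, 16, 22, 22
activations = np.random.rand(B, C, H, W)
as_set = activations.reshape(B, C * H * W)  # one set of capsules per image
print(as_set.shape)  # (2, 7744)
```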

~~~
p1esk
How does a regular convnet do on another domain?

Learning to do “reverse graphics” is only useful if you can show it is the
reason behind the performance improvement compared to a plain convnet. Until
we have CIFAR-10 results, it’s not clear.

What I’m saying is: no one has yet demonstrated a clear superiority of any
capsules-based model over the best available plain convnet, even on CIFAR-10.
Looking forward to your results!

~~~
fheinsen
> How does a regular convnet do on another domain?

As far as I know, regular convnets have failed to outperform query-key-value
self-attention models (i.e., transformers based on Vaswani et al.'s work) on
pretty much every sequence task, including natural language tasks.

> Learning to do “reverse graphics” is only useful if you can show it is the
> reason behind performance improvement.

I would strongly disagree. Building systems that can learn "reverse graphics"
on their own has long been a goal of computer vision. It seems a prerequisite
for building machines that can build internal representations of the state of
the physical world around them. Hinton et al.'s 2018 paper has a summary of
recent efforts on this front in the "Related Work" section.

> What I’m saying is - no one has yet demonstrated a clear superiority of any
> capsules based model to the best available plain convnet.

No one is saying otherwise. :-) Convnets are still the right tool for most
production systems in visual recognition today.

That said, I don't think a convnet can achieve 99.1% accuracy on smallNORB
with only 272K parameters, after training from scratch without using any
additional data or metadata of any kind -- like the model using my routing
algorithm. If you think you can do that with a convnet, do it and put it up
online (I'd love to see it :-)

~~~
p1esk
You’re comparing sentence classification done using transformer embeddings to
older results which use inferior embeddings. How do regular convnets do when
you feed them transformer embeddings?

Re learning reverse graphics - ok, maybe it is indeed the main feature of your
work. I’d need to look into that, because from skimming your paper it’s not
immediately clear what’s going on there.

Re convnet accuracy on NORB - I’m willing to make that effort for CIFAR-10 as
soon as you have the results.

~~~
fheinsen
> You’re comparing sentence classification done using transformer embeddings
> to older results which use inferior embeddings. How do regular convnets do
> when you feed them transformer embeddings?

Actually, I'm comparing it to recent models, including XLNet, MT-DNN, Snorkel,
and (of course) BERT. AFAIK, convnets have not been able to outperform
multihead self-attention, even on pretrained embeddings.

> Re learning reverse graphics - ok, maybe it is indeed the main feature of
> your work. I’d need to look into that, because from skimming your paper it’s
> not immediately clear what’s going on there.

I agree, it's not immediately clear. Nonetheless, I find it kind of
unbelievable that a model with so few parameters can seem to do it. (I was
shocked when I first saw the plots.)

> Re convnet accuracy on Norb - I’m willing to make that effort for cifar-10
> as soon as you have the results.

That's a little disappointing... but OK.

Thank you so much for all your questions :-)

~~~
p1esk
Ah, I missed table 4 with the recent models. I looked closer, and it does look
impressive; however, you should ask someone who has worked on that task to
review your experiments (I haven’t).

Actually, it looks like you’ve got a solid paper. I recommend submitting it
either to CVPR or ICML, especially if you can get good results on CIFAR-10.

~~~
fheinsen
Thank you!

Yes, I think this has legs.

Maximizing "bang per bit" (a) seems like _a truly new idea_, as opposed to
some minor tweak on the same old thing, and (b) the evidence so far shows it
_works better than previous methods_.

(FWIW, we've been using this algorithm internally at work with similar
outperformance over other methods, in yet another domain that is neither
vision nor language... but I cannot share those results publicly.)

Before submitting this anywhere, I'd like to get more informal feedback from
other AI researchers. I've reached out to people at Google Brain, Facebook AI,
DeepMind, OpenAI, and a handful of top academic institutions and research
groups. So far, the response has been positive, but I expect it will take
everyone at least a couple of weeks, and probably longer, to read and
understand the draft paper in sufficient detail to give me more than
superficial comments.

New things often look like toys at first. :-)

~~~
p1esk
Keep in mind that someone might steal your ideas. Right now there are probably
a dozen people preparing capsules-related papers for CVPR (due in 2 weeks), so
if one of them comes across your paper, there’s a temptation.

~~~
fheinsen
Thank you for saying that. Sometimes I forget how _petty_ and _small_ people
can be, especially when they are under pressure, academic and otherwise.

I'll take a look at submitting it to CVPR.

In the meantime, please circulate my work. It's on record, online. The more
people who are aware that others have seen it, the less likely someone will
try to plagiarize it.

I'm not under any kind of academic pressure, so I don't need citations,
conference slots, etc. But I do deserve credit for this, don't you think?

PS. And now that you mention it, a couple of people to whom I reached out
mentioned they were under deadline over the next two weeks.

PPS. Send me an email!

~~~
fheinsen
FYI, I reached out to two of those individuals (one, it turns out, is a CVPR
reviewer), and both suggested I first upload this to arXiv, so I did that
yesterday. The paper is now stamped with a date and on the queue for site-wide
notification. Thank you again for your feedback!

~~~
p1esk
Yes, that's a good move.

Let me know when you have CIFAR-10 results; I will try to match your accuracy
using the same number of parameters in a regular convnet. I actually
implemented the original, vector-based capsnet a while ago:
[https://github.com/michaelklachko/CapsNet](https://github.com/michaelklachko/CapsNet)
but I haven't really explored it. Your success on CIFAR-10 would definitely
motivate me to do so.

~~~
fheinsen
Thanks. Will do (work permitting!).

FWIW, a while back I reimplemented and tinkered a bit with the Sabour et al.
version too... and did not see much promise in it.

Note that the routing algorithm I've proposed generalizes to vector capsules
(by setting the dimension of the covector space, _d_cov_, to 1).
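As a sketch of that last point (illustrative shapes only, not the repo's actual API): if each matrix capsule carries a _d_cov_ x _d_out_ pose matrix, setting _d_cov_ = 1 collapses each pose to a single row, i.e., an ordinary capsule vector.

```python
import numpy as np

# Illustrative shapes only: n capsules, each with a d_cov x d_out pose.
n, d_cov, d_out = 8, 4, 4
matrix_poses = np.zeros((n, d_cov, d_out))  # matrix capsules (4x4 poses)

# Setting d_cov = 1 makes each pose a 1 x d_out matrix -- a plain vector.
vector_poses = np.zeros((n, 1, d_out))
print(vector_poses.squeeze(1).shape)  # (8, 4): ordinary capsule vectors
```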

