A New Twist on Neural Networks (ampproject.org)
348 points by sonabinu on Nov 2, 2017 | 165 comments

Hinton explains the concept of capsules in this video: https://www.youtube.com/watch?v=rTawFwUvnLE

Which is a lot better than reading someone tell you about this new idea called "capsules" without going into detail. The only thing is that, when this presentation was given, it seems they hadn't worked on much beyond MNIST (so the new thing now would be the toy-recognition net).

Better source, with date: http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wr... (December 2014, for the lazy).

Thank you for the link to the video - this was excellent. His key critique of convnets is that (paraphrasing) all of these pooling layers (max, avg, whatever) are really a very poor way of attempting to solve what is actually a routing problem.
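
To make the pooling complaint concrete, here is a minimal sketch (plain Python, toy 4x4 inputs that I made up) of 2x2 max pooling. It is only meant to show the effect being criticized: where a feature sits inside each window is thrown away, so two different inputs can pool to the same output.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 over a list-of-lists image."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            # Keep only the strongest activation in each 2x2 window.
            row.append(max(image[r][c], image[r][c + 1],
                           image[r + 1][c], image[r + 1][c + 1]))
        pooled.append(row)
    return pooled

# Hypothetical inputs: the same feature (value 9) at different
# positions inside each 2x2 window.
a = [[9, 0, 0, 0],
     [0, 0, 0, 9],
     [0, 0, 9, 0],
     [0, 9, 0, 0]]
b = [[0, 0, 9, 0],
     [0, 9, 0, 0],
     [9, 0, 0, 0],
     [0, 0, 0, 9]]

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
print(pooled_a, pooled_b, pooled_a == pooled_b)
```

The inputs differ but the pooled maps are identical, which is the "where did it come from" information that a routing mechanism would try to keep.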

At first glance this looks like it's going back to old 1980s research on neural representations; back then this kind of stuff was known as Parallel Distributed Processing.


Indeed. A blast from the past.

Oddly the capsule approach is how I naively thought image recognition worked until I learned more about it.

> The only thing is that, when this presentation was given, it seems they hadn't worked on much beyond MNIST (so the new thing now would be the toy-recognition net).

The paper, however, details their experiments on CIFAR-10 and other datasets in the discussion section.

The results there aren't good enough yet, but the paper proposes some hypotheses for the poorer performance and suggests it could be overcome in the future.

Arxiv-link: https://arxiv.org/abs/1710.09829

Apparently, one of the main changes since the talk is that back then he used a "probabilistic unit" next to his "property vector". Now the length of the property vector indicates the probability.
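
For anyone wondering how a vector length can act as a probability: as far as I can tell from the arXiv paper, it uses a "squashing" nonlinearity that shrinks short vectors toward length 0 and long vectors toward length 1 while keeping the direction. Below is a rough plain-Python sketch of that idea; the function names and the example vectors are mine, not the authors' code.

```python
import math

def squash(s):
    """Rescale s so its length lies in [0, 1) and its direction is unchanged."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        # Zero vector has no direction; leave it at zero.
        return [0.0 for _ in s]
    norm = math.sqrt(sq_norm)
    # New length is |s|^2 / (1 + |s|^2); dividing by |s| normalizes direction.
    scale = sq_norm / (1.0 + sq_norm) / norm
    return [scale * x for x in s]

def length(v):
    return math.sqrt(sum(x * x for x in v))

short = squash([0.1, 0.0])     # weak evidence: length stays near 0
long_ = squash([30.0, 40.0])   # strong evidence: length approaches 1
print(length(short), length(long_))
```

The output length can then be read as "how likely is it that this entity is present", with the direction left free to encode the properties.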

In their second publication they actually use the probabilistic unit, but computing those requires running an EM algorithm for each layer of capsules: https://openreview.net/forum?id=HJWLfGWRb

The author writes "Human children don’t need such explicit and extensive training to learn to recognize a household pet."

This claim seems dubious. Studies have shown humans can react to visual stimuli in as little as 1-3 ms. If a child observes a cat in the room for only 10 seconds, that's already between 3,000 and 10,000 samples from various perspectives. While our human experience may describe this as a single viewing 'instance', our neurons are actually getting extensive, continuous training. Is this accounted for in the literature?

Off-topic, but I'd be really interested in seeing a reference for such a study. A single action potential is usually only 1-3 ms (see https://en.wikipedia.org/wiki/Action_potential and references therein), and retina to LGN to V1 (the first two stops in the early visual pathway in mammals) takes several tens of ms (off-hand cannot find a ref, sorry). If a human can react to a visual stim in anything less than that, it would seem to be because (i) they are anticipating the stim based on some prior cue, or (ii) there is some shortcut directly from the eye to motor neurons, and the rest of the brain isn't involved directly. In any case 1-3 ms still seems extremely fast.

Also, a child observing a cat continuously for 10 seconds is getting highly correlated samples, not new independent instances. The effective number of samples (which I quite agree would be >1 if the child got to examine the cat from different perspectives, or the cat moves around, etc etc) should still be lower than what a putative sampling rate would suggest.
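
A back-of-the-envelope way to see the "correlated samples" point: if you model successive frames as an AR(1) process with lag-1 autocorrelation rho (a big assumption, and the rho values below are illustrative, not measured), the standard large-n approximation for the effective number of independent samples when estimating a mean is n * (1 - rho) / (1 + rho).

```python
def effective_sample_size(n, rho):
    """Approximate independent-sample count for n AR(1) samples.

    Assumes lag-1 autocorrelation rho in [0, 1); this is the usual
    large-n approximation for estimating a mean, not an exact result.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    return n * (1.0 - rho) / (1.0 + rho)

# 10,000 raw "frames" (the upper figure from the parent comment)
# under increasingly strong frame-to-frame correlation.
for rho in (0.0, 0.9, 0.99, 0.999):
    print(rho, round(effective_sample_size(10_000, rho), 1))
```

Under these assumptions the count drops from thousands to a handful as neighbouring frames become nearly identical, so the raw sampling rate says little on its own.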

The optic nerve is directly connected to the superior colliculus, which directly controls saccades and is sensitive to novel (bright) visual stimuli. If such a stimulus hits the periphery just outside the fovea the time from stimulus onset to "fixation" should be minimized, without having to go through the "slow" ventral pathway/stream (V1 etc.). This time should still be more than 1-3 ms, but likely is below 100 ms. No references, sorry.

Edit: First said LGN projects SC, but pathway is even shorter than that.

Edit2: The consensus latency for these fast, superior colliculus-guided "express" saccades seems to be 80-120 ms. See: http://www.scholarpedia.org/article/Human_saccadic_eye_movem...

So in response to the grandparent, 1-3 ms is way too short, but it kind of depends on the definition of "react" in this instance. The delay from a photon hitting the retina to triggering neurons in the superior colliculus to spike is likely pretty low (I would guess ~10 ms?), but when ocular motor movement becomes involved (i.e. stimulus onset -> fixation) things slow down considerably.

Cool, thanks!

Why is there so much expertise being shared without any refs on this topic? Are they hard to find online for some reason?

One of the issues is this type of neuroscience is becoming too "basic", so it's what gets taught in classes as the truth and professors don't necessarily give references anymore. Also I was typing on my mobile phone, but have switched to my laptop now. The scholarpedia (basically peer-reviewed wikipedia) article on saccades should be pretty interesting.

If it's very basic, the reference could be a text book.

I wouldn't trust a textbook, they usually leave out the nature of the evidence for whatever is being claimed (not even citing sources). You really have to get to the primary literature. For example, this sounds factual enough:

"The optic nerve is directly connected to the superior colliculus"

But I wonder how this "direct connection" was established. Was it done in humans or only mice/rats? Is it really always a direct connection or only in eg 80% of people, etc.


This is a good text, at an upper-div/grad level, of fundamental neuroscience with all sources cited.

That particular connection is straightforward to do in humans. A Golgi stain to the rectus muscles/ON and dissection in cadavers would be sufficient to trace the reflex to the SC and then another Golgi stain to that area to get back to the optic nerve. I'm unfamiliar with the toxicity of Golgi stains, but it may be able to be done alive.

Also, the visual systems to the brain-stem are remarkably conserved through evolution. I would not be surprised to see this connection in lampreys. That any significant percentage of humans lack it would be a hell of a paper.

Blind individuals usually have these reflexes too (like Stevie Wonder): https://en.wikipedia.org/wiki/Blindsight

>"This is a good text, at an upper-div/grad level, of fundamental neuroscience with all sources cited."

I was able to check a bit and see no citations: "The human brain contains a huge number of these cells, on the order of 10^11 neurons, that can be classified into at least a thousand different types." https://neurology.mhmedical.com/content.aspx?bookid=1049&sec...

That 10^11 number is out of thin air. How was it determined? That is what a citation is for.

>"I'm unfamiliar with the toxicity of Golgi stains, but it may be able to be done alive."

No, the Golgi stain is very toxic. It depends on a precipitate forming in "random" (no one knows why) cells. Also I see no reason it couldn't spread from cell to cell (via gap junctions, etc) so that method isn't too convincing.

>"Also, the visual systems to the brain-stem are remarkably conserved through evolution."

You can remove a rat's cerebrum and have it stay alive and keep doing stuff: "Cage climbing, resistance to gravity, suspension and muscle tone reactions, rhythmic vibrissae movements and examination of objects with snout and mandible were difficult to distinguish from controls." https://www.ncbi.nlm.nih.gov/pubmed/630411

Rodents are much more reliant on their brainstem than humans; I wouldn't be at all surprised if there are large differences. In fact, there's been a long debate about a similar claim regarding the cortico-spinal tract:

"Direct connections between corticospinal (CS) axons and motoneurons (MNs) appear to be present only in higher primates, where they are essential for discrete movement of the digits. Their presence in adult rodents was once claimed but is now questioned." https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4704511/

I don't know what to tell you then. My copy of Kandel is pretty robust on the citations, IMO. Like, yeah, they don't cite any papers on who discovered the brain, but like, you know you have one. Remember, bio is squishy, especially neuro. We just discovered that the immune system is actually in the brain too, like, 3 years ago.

If you really have a problem with Kandel, use email. Most authors of these types of books NEVER get any email about them and would be thrilled to have some interaction with a reader.

It's nothing specific to Kandel, I just find the standards of scholarship practiced by textbooks to be poor. Like I said, the book will be filled with claims like:

"The optic nerve is directly connected to the superior colliculus"

It seems very factual and set in stone but I bet if you read the primary literature there will be variation and doubt. If you read my last ref you will see they claim direct connections between CST and motoneurons in rats of some ages but not others. Perhaps this optic nerve claim was made based on using animals of a certain age, so it won't generalize. Who knows? That's why there should be a citation.

tl;dr Current textbook practices promote false certainty, and I don't think it is helpful for learning about a topic.

That's what I love about mathematics. It's entirely proof-based. And yet the good books strive to give intuition too and show how new ideas and structures can be used.

Ideally. Problem is, things move too quickly to get into textbooks or even review articles, especially on the experimental side of things. I'm not a neuroscientist myself, and have cobbled together what little I know from books (which I read but do not always trust), papers (which are not always easy for outsiders to interpret), and most importantly talking to neuroscientists, participating in / co-running journal clubs, etc.

Of course, but textbooks are super-broad in scope. I don't think just referencing "Kandel, 5th edition" would be of much help and I don't have these books on hand to actually point you to the right page/section of the book.

I'm pretty sure that was the reasoning to build the ImageNet (about a million labeled images) in the first place. But labeling images is expensive, and there are hints there's more at play with human cognition.

If you see a black cat and a white cat, and someone tells you there are striped cats, you can imagine one. And if you were to come across it, you'd instantly recognize it as a cat. Neural nets can't do that. You can also see a lynx and recognize it as "some kind of cat". Again, neural nets are not there yet. Which is why there are people researching to find new, better algorithms that better mimic what we recognize as intelligence.

Are you sure about your examples of things neural nets can't do? I think GANs might be able to "imagine" striped cats, provided they have been trained on enough images to capture the space of black/white/striped objects. And a lynx being classified as a cat doesn't seem so outlandish. It has to be classified as something and cats are likely the closest in appearance.

Of course these are just based on my intuition of what neural nets are capable of, so if you have examples of cases where these specific tasks were attempted unsuccessfully, I'm interested.

Let me remind you of sofas being classified as cats[0] and people being classified as gorillas[1]. You're overestimating the guesswork convnets are able to do, based on fragile training (which is still better guesswork than what previous models did).

[0] http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.ht...

[1] https://www.theverge.com/2015/7/1/8880363/google-apologizes-...

People being classified as gorillas was actually what I was thinking about regarding the lynx/cat example. The model might have been unsure about the kind of ape it was looking at, but clustering them together is its own kind of achievement.

The thing is that convnets are unable to learn about "macro structures" (or structures in general). A cat has ~4 legs, a tail and pointy ears. Gorillas are black, have a primate-y face and fur. The sofa is lacking the tail, head and pointy ears. People were missing the fur. Yet those things did not prevent the net from misclassification (because those features weren't detected in the learning phase).

Once again, children are able to see a cat and extract all that relevant information: four legs, head, tail, eyes & nose & ears with a particular shape, different than dogs, and, on most cats, fur (except for those alien-looking furless cats, of course).

If you ask a child to draw a hand, they will almost always draw it with five fingers stuck straight out, widely separated. This is a view of a hand that one almost never actually sees; generally you'll have fingers clustered together, occluding each other, foreshortened, etc. So why do they draw it like that?

It's because they're drawing a conceptual, representational model of a hand, not a distillation of visual "hand" characteristics. That's the difference with human learning: it's based on representational model-making, which is not at all the same thing as pattern matching.

The representation is the distillation of pattern matching. They are isomorphic.

Is that a belief or a fact? I believe, but cannot prove, that symbolic representation is not isomorphic to pattern matching.

If you reverse the output of a CNN "hand" classification, it'll give you images that resemble the geometry and shading of fingers, palms, nails, knuckles, etc. -- these, I submit, are the distillation of pattern matching for the actuality of "hands". Under no circumstances will it give you the five widely-separated fingers which a child draws. That's because the child-drawn hand is based not on literal visual stimuli, but rather on an abstract logical model of a hand. That logical model is fully integrated with a similarly abstract model of the world, and includes functional relationships between abstractions, like the knowledge that "hands" can open "jars". The value of these being logical models rather than matched patterns is that they can then be extended to include never-before-seen objects. Confronted with a strange but roughly jar-sized object, a child can surmise that maybe it, too, can be opened with hands. That isn't pattern-matching: it's algebra.

If you look at network activations in higher levels of a convolutional neural network, you will find some units that are activated by the individual parts of the hand. (There is a video out there where they show that for parts of the face, I'll be looking for a link.)

Those activations are a representation of the matched patterns at a similar level of abstraction as the compositional model humans might come up with. They are not exactly the same as what a child would draw, but mostly because the neural network isn't trained to draw with hands. With a bit of work, I'm sure a bunch of PhDs could make a neural network model generate child-like drawings from realistic images.

Algebra is pattern matching of a set of operation rules with regard to a space. Your jar example just extends the domain to physicality. And I agree - until these sorts of learning mechanisms have a wide range of quality realisms to pattern match from - they will not be able to form the type of cross visual/physical knowledge that is a much deeper and abstract understanding of reality. But don't fool yourself. Humans and well..life..are just input output machines with incredible pattern matching capabilities. Algebraic representations are the structural result of that pattern matching.

> But don't fool yourself. Humans and well..life..are just input output machines with incredible pattern matching capabilities.

See, that seems to me like a statement of faith which I just don't share. I think that building relational models of the world via abstract inductive reasoning is qualitatively different than pattern matching. I don't think there's some magic tonnage of pattern matching at which abstract inductive reasoning will suddenly emerge. I don't think that they're isomorphic. I think the AI toolkit still has a few missing pieces.

The only way to induce a consequence in a scenario is to have pattern matched the scenario. Pattern matching can be very abstract. It can use programs that may not halt. You are conflating patterns with exact details. A pattern can be as general as "[wildcard]." The human psyche promotes survival over *, every scenario.

You talk about representations and reasoning but are not assessing the fact that the human brain is literally a decision maker, acting on stored procedures and memory. Any representations and any reasoning will only apply to a select scenario or select objects. Regardless of how you wish to define the pattern, the fact that a subset of abstractness/generality out of the whole of existence is specified implies a pattern that is coded for implicitly or explicitly.

You claim to have the facts on the human brain?

My God, the level of hubris expressed by members of the cult of AI has reached a fever-pitch.

Stored procedures and memory?

Newton, in the age of clocks, managed to present the universe in the image of a clock. Is it any wonder that computer programmers present the universe in the image of the computer?

I would be interested regardless of the outcome.

GANs and style transfer are not the same as being able to recognize and imagine changes on the representation you just learned. Also, look at the GAN examples: even the water in the background is being affected by the "horse to zebra" transfer. You can perfectly imagine a brown horse, standing on the beach, and then being told "now imagine the horse is white" without that "instruction" affecting how you imagine the beach.

Perhaps explain what your preferred dataset is.

+ Brown horses on beaches

+ Some way to indicate "white" "horse".

If you have a good idea about how we can train for your problem, I'm not so convinced that it cannot be solved.

No, the problem I was pointing at is that you want to change a part of the image (horse into zebra), but the style transfer GAN learned to map pixels from one space into another. So it knows it has to change the colour of some things, and add stripes here and there, and also probably there too. But it isn't consistent with the stripes (you can see in some gifs how the pattern suddenly changes and adjusts), and it doesn't recognize (segment) the horse as the only relevant thing that has to be changed.

But that's not bad per se about style transfer. It's an interesting technique, but if you want to convert all horses to zebras in an image, that seems to be a bit too general for current-generation GAN architectures. Maybe it can be improved upon, or a different, novel architecture is required, and not just something you can solve by throwing more data at it.

Yes, the segmentation could be better. I still think you're not pointing out fundamental issues though. :-)

Show a child who has never seen an elephant a picture of one, and she will probably correctly identify the first one she sees.

My 4 year old daughter was able to recognise a fairly artistic photo of a proboscis monkey [0] about one month after having seen an episode of Wild Kratts [1] (in Belgium they only have the cartoon and not any live footage). So this was not a toy that she'd played with, she'd never seen a photo and they don't have them at the local zoo. Nor was it an obvious animal. But clearly the nose is distinctive enough to match up from a cartoon to a photo.

In Dutch it's effectively called a 'nose monkey' (neusaap) which makes the name easier to remember for kids.

[0]: http://timflach.com/work/endangered/slideshow/#74

[1]: https://i.ytimg.com/vi/RPnKpAsxUaM/maxresdefault.jpg

Define child. 2 year old? 5 year old?

I have three kids and I can tell you that they wouldn't be able to do this reliably until probably age 5, and even then maybe.

2+. A child I know only ever saw plush/Lego/picture elephants, yet on the first trip to the zoo, pointed at a live one and exclaimed triumphantly: "'phant!!!"

Right, and those exposures were probably good enough for several hundred thousand iterations of "training."

Plus as the child walks around the elephant in the zoo there's a ton of information. Just one of the angles that the child sees has to match the 3D object that they've been picking up and moving around in their hands.

Correct. It's spatial intelligence.

This thread now contains a cycle.


Most kids 3-5 can identify parts of animals they have never seen as long as there is an analog in their experience.

Not just an "analog" - specific exposure to a variety of different animals.

My 5 year old son was a monster at animal puzzles when he was 3, but we had thousands of hours with animal picture books, animal toys etc... from 4 months on.

In what way would a precise definition advance this discussion? Especially as it has gone several steps without such a definition being an issue.

Ironically, the problem that Dr. Hinton is attempting to solve could be characterized as being that ordinary CNNs have trouble learning what's most relevant in an image.

There is a massive difference between a 6 month old and a 10 year old and both would be considered children.

In that time period all of the brain infrastructure to do single shot/transfer learning at speed is developed. So showing a 10 year old a picture of an elephant with relevant label could probably be learned in a single shot. Not so with a 1 year old.

Sorry, but the relevance of these facts to this discussion still escapes me. You have previously said that your 5-year-old could "probably"/"maybe" (from the same sentence!) reliably do this, which is comfortably within your 10-year bound. Even if only adults were capable of this, Dr. Hinton's observation, appropriately amended to reflect that fact, would still be pertinent.

The relevance is that human inference is built from exposure to billions of "samples" of audio, visual, tactile information and that the CNNs that he invented are only capturing the slightest sample of that. Not even the equivalent of an embryo level of data and processing capability.

People basically ignore that it takes humans YEARS of 24/7 training on ungodly amounts of data to be able to do anything close to reliable inference on even the most basic of tasks. That's the point.

...to which you might add half a billion years of neural architecture evolution. Who do you say are ignoring these facts? No-one in this discussion, as far as I can tell. If anyone is downplaying the difficulty, I would expect to find that among people who think AI is already a solved problem.

I do not want to attribute to you a claim that you have not explicitly made, but you seem to be suggesting that, because of the difficulty, current performance does not rule out the possibility that CNNs alone might eventually be able to perform as well as humans (in terms of things like using existing knowledge to speed the assimilation of new knowledge, including new knowledge in new categories.) Maybe so, but that very much remains to be demonstrated (and the burden of doing so would rest with anyone who wants to make the claim.) Meanwhile, Hinton cannot be faulted for attempting to improve on the process.

I think the gist of your claim is accurate.

Said more clearly, I am all but certain that the logical/mathematical process of correlating identifiable and measurable attributes through iterative search is the correct approach to reach general intelligence goals.

There are many improvements in efficiency, in data acquisition, labeling, processing etc... that will need to happen to make it tractable computationally, but fundamentally I think it's the correct approach.

Where I differ from Hinton is that he seems to think that human level processing requires less data than I believe it does. It's a subtle point actually.

So why does Hinton think otherwise?

My 2 year old daughter could.

Without ever having seen one? Or with having seen pictures of one?

It's true though that we can generalise from descriptions and recognise the real thing from those. If you describe an elephant as a big grey animal with big ears and a trunk they can use to grab stuff, then an adult (not a 2 year old I suspect) seeing one for the first time, will recognise it from that description.

When we see a cartoonish drawing of one, we can still distill the defining characteristics from it and use it to create a description or recognise the real thing. We can recognise a very crude childish drawing of one by looking for these characteristics. We have a lot of additional knowledge that influences our image recognition, and having a big toolbox of general recognition of tons of different objects, we don't really need to train to recognise new objects anymore, because we will distill its identifying characteristics the first time we see it.

Computers clearly don't look like that.

Some pattern recognition seems to be innate, for example chicks just hours old can recognize the shadow of a flying bird as either harmless (long neck short tail) or a predator (short neck long tail).

So, while this pattern in particular doesn't apply to humans (it really doesn't?), many animals have ready-to-use pattern recognition when they are just hours or days old.

Those are instincts bred by evolution because they're vital for survival. Recognising an elephant is not the same thing for us. Instead, we get a generic toolbox for quickly learning to recognise very different objects, from animals to machines to abstract shapes. We're not born being able to recognise them, but we can pick them up very quickly from even rough descriptions.

True, but if we are into exploring the possibilities of what can be done with artificial neural networks, I see no reason to limit our models to human brains only.

Children also understand that a cartoon-cat is a cat, even though they don't look very similar.

> If a child observes a cat in the room for only 10 seconds, that's already between 3,000 to 10,000 samples from various perspectives.

Why does that count as 3,000 - 10,000 samples? Why is it not a single sample? I don't think our brains sample images in that way. And that might be a fundamental difference between how humans process images and how we're expecting computers to do it.

If the cat was perfectly still, not twitching and neither were you, it would be one sample. With both entities moving, you are getting constant, discrete samples of what a "cat" is.

I think what the GP means is that there are no discrete samples and the information stream is instead one continuous sample.

Reacting to visual stimuli is reflexive and not directly related to learning.

The human brain receives roughly 25-50 images' worth of data per second, so fewer than 50 samples per second (consciously we observe only 25 images per second).

Short-term synaptic plasticity works on a timescale of 20ms to few minutes, so also roughly 50 times per second timescale.

If I try to translate this into a deep learning framework, it would mean at most 500 training steps in 10 seconds per neuron.
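
To put the competing numbers in this thread side by side, here is a trivial sketch of the arithmetic. The intervals are just the figures people have quoted above (1-3 ms "reaction" times, ~20 ms plasticity timescale, 25 images per second); none of them is a claim about how the brain actually samples.

```python
def samples_in_window(window_s, interval_ms):
    """How many whole intervals of interval_ms fit into window_s seconds."""
    return (window_s * 1000) // interval_ms

WINDOW_S = 10  # the "cat in the room for 10 seconds" scenario

for label, interval_ms in [
    ("1 ms per sample", 1),
    ("3 ms per sample", 3),
    ("20 ms per sample (50 per second)", 20),
    ("40 ms per sample (25 per second)", 40),
]:
    print(label, samples_in_window(WINDOW_S, interval_ms))
```

So the 3,000 to 10,000 figure and the "max 500 steps" figure differ only in the assumed interval, which is the part nobody here has a reference for.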

Learning to recognize a cat using _unsupervised learning_ in just 10 seconds would be really impressive.

Noam Chomsky would argue it's because (nearly) all of us are born with some constraints in our brain designed for dealing with our world.

At the level of object recognition we are discussing now, the "constraint" (more accurately learning bias) comes in the form of very advanced neural architecture that is able to learn in spatial environment. Not in some kind of pre-trained neuron weight collection.

We have some very particular biases like fear of snakes or heights, but learning to recognize spatial objects is something very general.

Fear is neither proven nor disproven to be a genetic trait.

I believe the human brain also reuses abstract concepts it learns from other data, such as eyes. So when a new animal is shown, it's very easy to find eyes on it, and it does not need another million images of that new animal to detect its eyes.

We need a SpaceX of Deep Learning, where a lot of learning is reused and linked in creating much larger web of knowledge about the world.

But if I give you a single image of a scene with an object you’ve never seen before, you’ll likely be able to instantly segment it, describe it, and relate it to things you do know.

Does this mean you expect AI to be able to train on one video of one cat from different angles?

It's not accounted for in the literature because brains do not work like computers processing 1 frame every x milliseconds.

So we should be using 10 second videos instead of images to train AI?

> To teach a computer to recognize a cat from many angles, for example, could require thousands of photos covering a variety of perspectives. Human children don’t need such explicit and extensive training to learn to recognize a household pet.

Human children see their pet from a million different viewpoints every day

Sure, but they're not being explicitly trained to do anything. It just happens because that's what children do. Also, you don't have to label their cat experiences to identify them from the other millions of experiences they have in a day. You don't need to, they just figure it out without even realizing it.

That's pretty great.

Sure, but at some point you do label "This is our cat Tabby. He lives here." Now every time they see a cat in the house, it is implicitly labelled. I'm definitely disagreeing with "extensive" more than "explicit" in the original quote, but I think it's silly to differentiate explicit vs implicit when talking about a human vs computer

How do you know it's the same cat? A neural network needs to be explicitly told, "these are all the same cat".

This article says that google photos’ model can still have trouble telling your pets apart if you have the same breed - it kind of disproves your point. The model needs to be explicitly told which cat is which. That said, I’m pretty sure a toddler would probably need to be told which is which too.

Don't forget humans don't just see pictures - they see a video upon which they are learning.

Micro-movements of cat body parts and their more general character traits could be the only hint that separates two supposedly same cats.

Yes, and toddlers (my 14 month old...) can see a cat in a picture, and recognize a cat in real life, even if it's a different color.

Or one photo rotated 36 times during training? Instead of having CNN with just translation invariance, you can have one with rotational invariance as well.
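
Reading "one photo rotated 36 times" as plain data augmentation in 10-degree steps, a rough pure-Python sketch looks like the following. The helper names are made up, it uses nearest-neighbour sampling on a list-of-lists image, and it does not build rotational invariance into the network itself; it only multiplies the training views.

```python
import math

def rotate_nearest(image, degrees, fill=0):
    """Rotate a 2D list-of-lists image about its center (nearest neighbour)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel back into the source image.
            dx, dy = x - cx, y - cy
            sx = round(cos_t * dx + sin_t * dy + cx)
            sy = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
            # Pixels that map outside the source keep the fill value.
    return out

def rotated_copies(image, n=36):
    """Return n copies of image rotated in equal steps (10 degrees for n=36)."""
    step = 360.0 / n
    return [rotate_nearest(image, k * step) for k in range(n)]

# Tiny stand-in for a photo.
photo = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
copies = rotated_copies(photo)
print(len(copies), copies[9])
```

On real photos you would want interpolation and padding, but the idea is the same: one labelled image becomes 36 labelled views.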

Humans see video; it's easier to derive stuff from videos than from static photos. It's awesome we can teach a car to drive from single photos, but models that incorporate a time factor, such as 3D convolutions + RNNs, perform better...

If you could give an adversarial network control over a 3D camera around an object, you could train that way.

A final exam task for my high school informatics was to find coordinates of a cat in a picture within bounding box of that cat.

We weren't given anything to refer to except that the picture is 1000x1000 rgb array, and that the cat had stripes on the body.

I wonder whether the brain, rather than recognizing geometry first and then deducing the object from secondary characteristics as today's computer vision does, instead synthesizes its decision in reverse from many nearly independent "circuits" (people can guess that a cat is a cat even if they look at its partial image, or even if the cat is painted purple and the person has never seen a purple cat in real life).

Children don't see random pictures, they see videos (pictures in a sequence).

It amazes me how often this fact is ignored in any discussion of AI.

The real world is a pretty damn good "training environment".

I like your sentiment but then I realized of course it is. There is nothing more real than what is real. If suddenly we were dropped in a more random universe with less probabilistic rules, then what we have now wouldn't be good. Short of another reality, that's almost tautological.

Can the article link please be changed to this non AMP link?

The article does not mention the first and second authors of the research work, which is an atrocious thing to do.

The paper was authored by Sara Sabour, Nicholas Frosst, Geoffrey E Hinton in that order.

Capsules are Hinton's idea. His speech "What is Wrong With Convolutional Neural Nets?" was 2014 and he was already working with capsules. http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wr...

The credit for the paper should go to all researchers of course, but Hinton is the main driving force behind the research.

I'd say the first and second authors still warrant a mention, at least.

The article was updated to include the two other authors' names. There is an update at the end of the article mentioning this.

I would argue that ideas are cheap until they are proven correct, at least relative to the effort needed to prove them and, in some cases, to apply them to real-world scenarios.

I see society give disproportionately large credit to the so-called "leaders", because they pioneered certain "ideas".

That's wrong.

History buries too many talented individuals who go unrecognized because a few pioneers soak up all the attention society can give.

In the professional research community, authors are listed because they are deemed critical contributors to the work. Plain and simple.

One might argue that any number of researchers could have performed the labor of testing the idea, whereas the originator of the idea itself is non-fungible.

I agree but it's unsurprising. Hinton has a bit of "star power" in the ML community. He's done a lot of pioneering work on neural networks, and stated that he's been working on this capsule concept for years.

I get why the focus is on him but they could have at least mentioned the others.

I had the good fortune of seeing Nicholas Frosst speak at Google I/O in Waterloo this year, and he was great.

That guy has a clear passion for what he's doing, and seems immensely knowledgeable on the subject. My favourite speaker of the event by far.

I haven't heard yet from Ms. Sabour, but if Frosst is at all representative of the team (as I'm sure is at minimum the case), then it's a shame they aren't mentioned. From what I understand so far they work rather closely with Mr. Hinton on a regular basis.

The biggest name usually gets all the fame on academic work - a "Matthew effect".

Moreover, the last author is usually the most senior person or someone who funded the project; usually perceived as important or more important than the first author who "did all the work".

> usually perceived as important or more important than the first author who "did all the work"

To be accurate, that person is considered the one who will claim the credit after the student graduates. Not that they are important to the work.

Most academic papers are mostly students' work under general guidance from the advisor.

Please can we not use AMP links, but the direct URL, thanks

Why? The AMP site for this specific article is significantly cleaner and easier to read.

Edit: whoops apparently I'm not allowed to question this.

Not everyone uses mobile, desktop experience is awful.

Also I'd prefer to see the original content rather than a mangled Google version. There are significant issues with AMP (both technically and morally) and it would help if we don't propagate its usage where possible. Thanks

I was referring to desktop experience, it's significantly better from what I see.

Are you serious? The page stretches across my widescreen monitor. I can't even see the whole image because of that. Reading such long lines is a hellish chore. Why do you think this is better than a properly formatted page?

snap your browser to the side of your screen, and voila! it's half the length of your screen now

Again, how is this better than a page that is already properly formatted?

On Desktop, it is horrible.

There are no margins, and zoom is messed up on Chrome/Firefox.

Desktop is what I was referring to, it's significantly better in my opinion. Just straight text full width, no wasted space and junk all over the screen.

I sincerely hope you have no authority over design of any software or web page that any human actually has to read. There is extensive research proving that text wider than about 60-70 characters is increasingly difficult to follow and understand. There is further research showing the benefit of margins for visual understanding as well.

Basically what you get with this AMP link goes against all known facts about how the human eyes & brain best receive textual information, unless you just happen to have a huge zoom or tiny screen.

It's obnoxious that Wired styled their AMP page like that, but it's not a problem with AMP itself.

Surely you don't use a default non-100% zoom, it is crazy how Google who also happens to make a browser can't get such trivial details right.

Also, the zero margins are not helpful on a large display.

These alone make the experience grotesque and awful for me; DOM-distiller is so much better than AMP in terms of readability and layout.

You can get the same experience by just requesting the AMP version from Wired directly (just append `/amp` to the original URL), no reason to go through Google's cache for that.

You can still get the broken AMP formatting on the publisher's site:


It looks awful on the computer, but it is also perfect to save to pocket and read in my Kobo, which is what I usually do.

Yes, it would be great to use the actual publisher URL: https://www.wired.com/story/googles-ai-wizard-unveils-a-new-...

omg is this really how amp urls are supposed to be ?! isn't this crazy? www-wired-com wtf... https://www-wired-com.cdn.ampproject.org....

They're trying to hide that all amp pages are actually hosted by google.

Which would actually be counterproductive for Google: If people start to get used to these "fake" URLs, soon they will not be able to tell www-google-com.evilwebsite.com from the actual Google website, and will be more susceptible to phishing attacks.
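To make the phishing concern concrete, here is a small sketch of the check a person (or a tool) has to do: what matters is the end of the hostname, not the friendly-looking prefix. The helper `hostname_belongs_to` is hypothetical, and a plain suffix comparison like this is an assumption that ignores the Public Suffix List, so treat it as illustration only.

```python
from urllib.parse import urlparse

def hostname_belongs_to(url, trusted_domain):
    """True if the URL's host is `trusted_domain` or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    trusted = trusted_domain.lower()
    return host == trusted or host.endswith("." + trusted)

# The prefix looks like google.com, but the host really belongs to evilwebsite.com.
print(hostname_belongs_to("https://www-google-com.evilwebsite.com/login", "google.com"))  # False
print(hostname_belongs_to("https://www.google.com/", "google.com"))                       # True
# Same pattern as the AMP cache host: it belongs to ampproject.org, not wired.com.
print(hostname_belongs_to("https://www-wired-com.cdn.ampproject.org/", "wired.com"))      # False
```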

Why do you think Google cares about random phishing attacks? They care about your data, and they will get it no matter what. Maybe the phishing site is also using Google Analytics!

Based on the domain name, I thought the AMP project was starting to incorporate machine learning into the project.

Here's a youtube video that explains the new capsule idea: https://www.youtube.com/watch?v=VKoLGnq15RM

(I'll admit that I don't fully understand it yet), but I think the major thing that capsules try to fix is that a CNN only looks at a small window of the image at a time. Since the capsules aggregate more information, they can learn more general features.

Also, he notes that the paper was done on the MNIST data set (small images), and may not generalize to larger images, but the initial results are promising.

Congrats to Hinton, et al on publishing. Should see more info at NIPS 2017 in December. Quite admirable, embarking on a late-career "Year Zero" course correction, all in the name of advancing the field ;)

How does the human brain handle "invariance"? Not just of the spatial variety. But transformational, temporal, conceptual, and auditory invariance as well?

Some background on "columns" from bio-inspired computational neuroscience startup Numenta:

Why Does the Neocortex Have Layers and Columns, A Theory of Learning the 3D Structure of the World


I guess the key is our ability to extrapolate, to imagine things from a single picture. When a child who has been shown a proper image of a cat sees a cartoon cat, the child has extrapolated from the cat's body contours. Or rather, I would say it is some sort of metadata that is learnt out of each experience, like the way we model in OO with classes and object instances. We are somehow able to abstract the class out of an image even if it is just a single image, and I feel it is the metadata that gets refined over time rather than stored pixels of actual images.

I believe that the kernels learned by a deep net (especially the detailed ones) are basically what this guy is talking about (a small nnet that recognizes basically one feature). I suppose you could sample a large number of capsules, but that would be equivalent to just making a bigger deep net.

There is dynamic routing between capsules, compared to static routing in normal deep nets. The routing itself is learned.
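For anyone who wants to see the moving parts, here is a rough pure-Python sketch of routing-by-agreement as I read it from the paper (arXiv 1710.09829). As far as I can tell, the learned part is the transformation matrices that produce the prediction vectors (`u_hat` below); the coupling coefficients are recomputed by a few iterations for each input. The toy `u_hat` values and the function names are made up for illustration, not taken from the authors' code.

```python
import math

def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """u_hat[i][j] is lower capsule i's predicted vector for upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    for _ in range(iterations):
        # Each lower capsule splits its vote across the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the resulting output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Toy input: three lower capsules agree about upper capsule 0, disagree about 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
for j, vec in enumerate(v):
    print(j, math.sqrt(sum(x * x for x in vec)))
```

The printed lengths play the role of "how likely is this entity present": the capsule whose inputs agree ends up with the longer output vector, and the coupling coefficients shift toward it.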

It's probably more than that otherwise that guy wouldn't waste his time.

Yes, it's about feeding in the 3D context too. It means recognizing a feature once and then, given a spatial transformation, being able to say: yes, it's the same feature, just turned 30deg right, 40deg up, for example, without having to train the model with a _picture_ of an object taken from all sides and perspectives. Humans use binocular vision [1], but AI can be programmed to do more.

This is practically introducing AI to the real world: an object is more than the picture of it.

[1] https://en.wikipedia.org/wiki/Binocular_vision

Small correction:

"this guy" is "the guy" behind the deep learning revolution.

>Abstract. In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarizes relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.


I think the key is not looking at how well humans perform, but how badly they make mistakes. For example, over the halloween period we interpret 2 flashing LEDs as scary cat eyes. We might not do the same taken out of the temporal context. How we "fail" is a possible indicator to how we succeed.

I'm not sure I follow your example of LEDs as cat eyes and Halloween.

Do you mean a Halloween decoration involving LEDs that a human interprets as a representation of cat eyes? That's not really a mistake.

Or a human mistaking flashing LEDs as cat eyes? In which case I can't see how the mistake would be limited to the Halloween period.

I mean we decide that the LEDs are supposed to be cat's eyes because everyone has halloween decorations up, whereas another time of year, we might conclude (correctly) that they are just flashing LEDs

It’s interesting that they mention that Geoff’s inspiration is coming from biology. I think there is a lot more mining we could do in this area. We don’t have to capture the implementation details, just the salient ingredients that make intelligence work in biological organisms.

This paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1692705/pdf/106... I read in a neuroscience class once makes the argument that there are various "levels and loops" or abstraction layers to the way the brain functions. Kind of a similar line of reasoning.

Oh... Just that then.


I haven't read the research papers yet, but as someone new to machine learning and image recognition...

> Hinton’s capsule networks matched the accuracy of the best previous techniques on a standard test of how well software can learn to recognize handwritten digits

Is the journalist just saying that capsule networks can perform well on MNIST? Don't most state of the art techniques perform with 99%+ accuracy on MNIST?

Yes, their first paper is on MNIST. That they get high accuracy isn't earth-shattering, but since they are doing something very different from other approaches, it's still noteworthy. The real benefit is in the generalization performance:

> We then tested this network on the affNIST data set, in which each example is an MNIST digit with a random small affine transformation. Our models were never trained with affine transformations other than translation and any natural transformation seen in the standard MNIST. An under-trained CapsNet with early stopping which achieved 99.23% accuracy on the expanded MNIST test set achieved 79% accuracy on the affnist test set. A traditional convolutional model with a similar number of parameters which achieved similar accuracy (99.22%) on the expanded mnist test set but only achieved 66% on the affnist test set.
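To put those two affNIST numbers side by side, it can help to compare error rates rather than accuracies. A quick back-of-the-envelope sketch, using only the 79% and 66% figures quoted above (the helper name is mine):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors that the new model avoids."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# CNN baseline: 66% on affNIST, CapsNet: 79% on affNIST (figures from the quote).
print(f"{relative_error_reduction(0.66, 0.79):.1%}")  # 38.2%
```

So on the transformed digits the capsule model makes roughly 38% fewer errors than the convnet (21% vs 34% error), even though the two are essentially tied on the expanded MNIST test set.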

Yes that seems like a good first test for generalization. Did they publish these images somewhere?

In machine learning terms, current neural network techniques are a local minimum. Capsule networks might not perform as well now, but may offer a way forward which will outperform them in the long term as they're developed and refined. Being able to match MNIST results, rather than do worse, is a good reason to work on them.

Siraj Raval's implementation of Capsule Network using tensorflow. video: https://www.youtube.com/watch?v=VKoLGnq15RM code: https://github.com/llSourcell/capsule_networks

Eventually they’re going to start connecting specialized neural networks together into a neural network of neural networks and that’s where the real magic is going to happen.

Eventually Jesus will come back to Earth and that is where the real magic is going to happen.

There it is, the obligatory "inception neural networks will lead to the singularity" comment.

If you consider it, a convolutional neural network is applicable to any type of picture, including those that are not pictures of 3D scenes, such as seismic data. So, in order to handle pictures of 3D scenes well, you are going to have to make extra assumptions about the data. This Geoffrey Hinton does, by assuming that a scene consists of objects with associated pose parameters.

So same concept as face detection by Viola Jones ? Look at smaller features and a superset/composition of them ?

CNNs do the same (hierarchy of features, combining lower, simpler features into more complex ones on each level). This is a slightly different concept.

Fodor's "The Mind Doesn't Work That Way" is a great book explaining the shortcomings of both connectionist and modular models of cognition. He basically says neither should work, nor combinations thereof. Never seen anything more than dismissal of his work.

This hackernoon article is what cleared up the concept of capsules for me: https://hackernoon.com/what-is-a-capsnet-or-capsule-network-...

Wow, this should reignite the Chomsky vs Norvig debate. This is the kind of science Chomsky wants.

Layman's interpretation of capsules is that they're designed to facilitate inverse graphics. It's like a pixel shader in reverse.

“AI wizard”? Wired, you are killing me.

Wizard is an established and well respected title within computer science.

is there a 'toy' example where one could compare a 'regular' NN as compared to a 'capsule' NN? Code??

That's what they've done with MNIST dataset.

https://github.com/naturomics/CapsNet-Tensorflow Tensorflow Implementation of CapsNet.

Is there a link to the original papers?

what's the difference between concatenating the layer and adding more layers?

I thought Hassabis was in charge there?

That's DeepMind, which keeps a separate kitchen from Google.
