Reading your post, you actually seem quite confused.
>Imagine a hypothetical language that is so compressed, so non-redundant, so little correlated, that it's indistinguishable from random noise. Learning this language may seem an impossible task.
Well yes: learning a class of strings in which each digit of every finite prefix is statistically independent of every other digit is very hard, bordering on impossible (or at least, impossible to do better than uniform-random guessing).
>But in fact it's very easy to produce text in this language. Just produce random noise!
But that isn't the learning problem being posed! You are not being asked to learn `P(string | language)` (which is, in fact, the uniform distribution over arbitrary-length strings), but `P(language | string_1, string_2, ..., string_n)`. By Bayes' rule, and given the independence you've posed, that posterior is proportional to the prior times a likelihood that factorizes as `P(character_1 | language) x P(character_2 | language) x ... x P(character_m | language)`. If the actual strings are sampled from a uniform distribution over arbitrary-length strings, then we have two possibilities:
1) The prior is over a class of languages, some of which are not optimally compressed and thus do not render each character (or even each string) conditionally independent. In this case, the posterior will favor languages that do render each character conditionally independent, but we won't be able to tell one such hypothesis apart from another. We've learned very little.
2) The prior is over a class of languages all of which yield strings full of conditionally-independent noise: no hypothesis can compress the data. In this case, the evidence-probability and the likelihood cancel, and our posterior over languages equals our prior (we've learned nothing).
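Case 2 can be checked numerically. A minimal sketch (the hypothesis names and prior weights are invented for illustration) in which every language in the class assigns the same likelihood to uniform-random bits, so Bayes' rule hands the prior straight back:

```python
import random

random.seed(0)

# Hypothetical prior over three candidate languages, all of which
# emit i.i.d. uniform random bits.
prior = {"lang_A": 0.7, "lang_B": 0.2, "lang_C": 0.1}
data = [random.randint(0, 1) for _ in range(100)]

# Every hypothesis assigns the same likelihood, 0.5 per character...
likelihood = {h: 0.5 ** len(data) for h in prior}

# ...so the evidence term cancels it exactly, and the posterior
# equals the prior: no amount of such data teaches us anything.
evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
print(posterior)
```

However long you make `data`, the posterior never moves.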
>Real language, of course, has tons of statistical patterns, and is definitely not random. But I don't see how it is harder to learn than, say, a more redundant audio recording of the same words, or a video recording of the person speaking them. That extra information is irrelevant and will just be discarded by any smart algorithm anyway.
Noooo. Compression does not work that way. Compression works by finding informative patterns in data, not by throwing them away. If your goal is to learn the structure in the data, you want the structure to be more redundant rather than less.
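This is easy to sanity-check with a general-purpose compressor: patterned, redundant data compresses dramatically, while i.i.d. noise barely compresses at all (the inputs here are toy data, not anything from the discussion above):

```python
import random
import zlib

random.seed(1)

# Highly redundant, patterned input vs. i.i.d. random bytes of equal length.
patterned = ("the quick brown fox " * 200).encode()
noise = bytes(random.randrange(256) for _ in range(len(patterned)))

# The compressor exploits the structure in the patterned data; the
# structureless noise gives it nothing to work with.
print(len(patterned), len(zlib.compress(patterned)))  # huge reduction
print(len(noise), len(zlib.compress(noise)))          # essentially none
```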
I'm telling you, once the paper is submitted, I can send you a copy and just show you the equations and inequalities demonstrating this fact.
>Most people do. As I said some people don't, and they function fine. See this: http://www.bbc.com/news/health-34039054
Differences in sensorimotor-cortex function that leave the brain unable to perform top-down offline simulation with high subjective-sensory precision don't invalidate the broad theory that cortical microcircuits are generative models (in particular, hierarchical ones -- possibly just large hierarchies whose individual nodes are very simple distributions).
>Robots are never going to have exactly the same internal states and experience as humans. They could be very, very different, in structure, to the human brain.
Duh. However, if we want them to work, they probably have to run on free-energy minimization somehow. There is more necessity at work here than connectionism believes in, but that's a fault of connectionism.
>Being exactly like humans isn't the goal. Mimicking humans is an interesting diversion, but it's not necessary, or the goal in and of itself.
I didn't say that a working robot's representations had to exactly match those of humans. In fact, doing so would be downright inefficient, since robots would have completely different embodiments to work with, and would thus be posed different inference problems in both perception and action. What's shared is that they would, necessarily, be inference problems.
>And you may be right that a robot without vision would be disadvantaged. I think that's mostly anthropomorphism, imagining how disadvantaged blind humans are (and in fact even blind humans can function better than most people expect.) But even if it's true, my point is that sight is not strictly necessary for intelligence.
Sight isn't. Some kind of high-dimensional sense-data is.
>In fact I think vision may even be a disadvantage. So much of the brain is devoted to visual processing. While text, and even language itself, are hacks that evolution created relatively recently. A brain built purely for language could be much more efficient at it than we can probably imagine. Ditching vision could save a huge amount of processing power and space.
That's putting the cart before the horse. Language is, again, an efficient but redundant (ie: robust against noise) code for the models (ie: knowledge, intuitive theories, as you like) the brain already wields. You can take the linguistic usage statistics of a word and construct a causal-role concept from them in the absence of a verbal definition or sensory grounding for the word -- which is arguably what children do when they read a word before anyone has taught it to them -- but doing so will only work well when the concept's definition is itself mostly ungrounded and abstract.
So purely linguistic processing would work fairly well for, say, some of mathematics, but not so much for more empirical fields like social interaction, ballistic-missile targeting, and the proper phrasing of demands made to world leaders in exchange for not blowing up the human race.
Hold on. Let's say the goal is passing a Turing test. I think that's sufficient to demonstrate general intelligence and do useful work. In that case, all that's required is mimicry. All you need to know is `P(string)`, and you can produce text indistinguishable from a human's.
>Noooo. Compression does not work that way. Compression works by finding informative patterns in data, not by throwing them away. If your goal is to learn the structure in the data, you want the structure to be more redundant rather than less.
OK, let's say I convert English words to shorter Huffman codes. This should be even easier for a neural network to learn, because it can spend less effort figuring out spelling. Of course, some encodings might make it harder for a neural net to learn, since NNs make assumptions about how the input is structured, but in theory it doesn't matter.
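For concreteness, here's roughly what that conversion looks like: a minimal word-level Huffman coder over a made-up toy corpus. The word-to-code mapping is a bijection (encoding round-trips exactly), but the characters inside each word are gone:

```python
import heapq
from collections import Counter

def build_huffman(freqs):
    """Build a prefix-free Huffman code (word -> bitstring) from frequencies."""
    # Heap items: (frequency, tiebreak, tree); a tree is a word or a (left, right) pair.
    heap = [(f, i, w) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

corpus = "the cat sat on the mat the cat ran".split()
codes = build_huffman(Counter(corpus))
encoded = "".join(codes[w] for w in corpus)

# Decoding greedily works because Huffman codes are prefix-free;
# the word sequence is recovered exactly, the spellings are not stored.
decoder = {v: k for k, v in codes.items()}
decoded, buf = [], ""
for bit in encoded:
    buf += bit
    if buf in decoder:
        decoded.append(decoder[buf])
        buf = ""
assert decoded == corpus
```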
>Some kind of high-dimensional sense-data is [necessary]... purely linguistic processing would work fairly well for, say, some of mathematics, but not so much for more empirical fields
These are some really strong assertions that I just don't buy, and I don't think you've backed up at all.
Humans have produced more than enough language for a sufficiently smart algorithm to construct a world model from it. Any fact you can imagine is contained somewhere in the vast corpus of all English text. English contains a huge number of patterns that give massive hints about meaning. E.g. that kings are male, that males shave their faces and females typically don't, or that cars are associated with roads and are a type of transportation, etc.
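Those hints really do sit in raw co-occurrence statistics. A toy sketch (the six-sentence "corpus" is invented) counting which words appear near "king" versus "queen":

```python
from collections import Counter

# Invented miniature corpus standing in for "all English text".
corpus = [
    "the king said he would rule",
    "the king shaved his beard",
    "the queen said she would rule",
    "the queen wore her crown",
    "the car drove down the road",
    "the road is used for transportation",
]

def cooccurrences(target, window=5):
    """Count words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[words[j]] += 1
    return counts

# Gendered pronouns co-occur differently with "king" and "queen" --
# exactly the kind of statistical hint the text alone provides.
print(cooccurrences("king"))
print(cooccurrences("queen"))
```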
Even very crude models can learn these things, and can produce nearly sensible dialogue from movie scripts -- models with millions of times fewer nodes than the human brain has neurons. It's amazing this is possible at all. Of course, a full AGI should be able to do a thousand times better and completely understand English.
Trying to model video data first is wasted processing power. It's setting the field back. Really smart researchers spend so much time eking out 0.01% better benchmarks on MNIST/ImageNet/whatever, with entirely domain-specific, non-general methods. So much effort is put into machine vision, when language is so much more interesting and useful, and closer to general intelligence. Convnets et al. are a dead end, at least for AGI.
I'd need to see the math on this: how do the Huffman codes preserve a semantic bijection with the original English while throwing out the spellings as noise? If you're throwing out information, rather than moving it into prior knowledge (bias-variance tradeoff, remember?), you shouldn't be able to biject your learned representation back to the original input.
Also, spelling isn't all noise. It also encodes morphology, verb conjugation, etc.
>Humans have produced more than enough language for a sufficiently smart algorithm to construct a world model from it.
Then why haven't you done it?
>Any fact you can imagine is contained somewhere in the vast corpus of all English text.
Well no. Almost any known fact I can imagine, plus vast reams of utter bullshit, can be reconstructed by coupling some body of text somewhere to some human brain in the world. When you start trying to take the human (especially the human's five exteroceptive senses and continuum of emotions and such) out of the picture, you're chucking out much of the available information.
There's damn well a reason children have to learn to speak, understand, read, and write, and then have to turn those abilities into useful compounded learning in school -- rather than just deducing the world from language.
>Even very crude models can learn these things. Even very crude models can produce nearly sensible dialogue from movie scripts.
Which doesn't do a damn thing to teach the models how to shave, how to tell kings from queens by sight, or how to avoid getting hit by a car when crossing the street.
>Models with millions of times fewer nodes than the human brain. It's amazing this is possible at all.
The number of nodes isn't the important thing in the first place! It's what they do that's actually important, and by that standard, today's neural nets are primitive as hell:
* Still utterly reliant on supervised learning and gradient descent.
* Still subject to vanishing gradient problems when we try to make them larger without imposing very tight regularizations/very informed priors (ie: convolutional layers instead of fully-connected ones).
* Still can't reason about compositional, productive representations.
* Still can't represent causality or counterfactual reasoning well or at all.
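The vanishing-gradient point in particular is easy to see numerically: backpropagating through a stack of sigmoid layers multiplies the gradient by the local derivative sigmoid'(x) <= 0.25 at each layer, so even in the best case the signal shrinks geometrically with depth:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Best case for the gradient: sigmoid'(0) = 0.25, the derivative's maximum.
grad = 1.0
depth = 20
for _ in range(depth):
    s = sigmoid(0.0)
    grad *= s * (1.0 - s)  # multiply by the layer's local derivative

print(grad)  # 0.25**20, about 9.1e-13: the signal is gone after 20 layers
```

This is one reason architectural priors (like convolutional weight sharing) and careful regularization were needed to train deeper nets at all.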
>Trying to model video data first is wasted processing power. It's setting the field back. Really smart researchers spend so much time eking out 0.01% better benchmarks on MNIST/ImageNet/whatever, with entirely domain-specific, non-general methods. So much effort is put into machine vision, when language is so much more interesting and useful, and closer to general intelligence. Convnets et al. are a dead end, at least for AGI.
Well, what do you expect to happen when people believe in "full AGI" far more than they believe in basic statistics or neuroscience?
I don't think anything like that exists today, or ever will. And in fact you are making an even stronger claim than that: not just that vision would be helpful, but that it's absolutely necessary.