This isn't an engineering problem, it's a philosophical problem: We are blind to most things. We can only see what we can relate to personally. We can use language and other symbol systems to expand our understanding by permuting and recombining our personal experiences, but everything is grounded in our interactive developmental trajectory.
The kitten-in-a-cart experiment demonstrates this clearly: http://io9.gizmodo.com/the-seriously-creepy-two-kitten-exper... Interaction is crucial for perception. Sensation is not experience.
And here's the rub: Once you give an AI an animal-like personal developmental trajectory to use for grounding the semantics of their symbol systems, you end up with something which is not particularly different or better than a human cyborg.
Look at word2vec. By applying simple dimensionality reduction to the frequencies with which words occur near each other in news articles, it can learn really interesting things about the meanings of words. The famous example is that the vector for "king", minus the vector for "man", plus the vector for "woman", equals the vector for "queen". It's learning the meaning of words, and the relationships between them.
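The analogy arithmetic can be sketched with hand-made toy vectors. Everything below is invented for illustration (real word2vec embeddings are learned, not hand-set, and have hundreds of dimensions):

```python
import numpy as np

# Hand-made 4-dimensional toy vectors. The dimensions are an invented
# stand-in for latent features a model might discover, roughly:
# (royalty, maleness, femaleness, personhood).
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.8]),
    "woman": np.array([0.1, 0.1, 0.9, 0.8]),
    "car":   np.array([0.0, 0.1, 0.1, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vecs[a] - vecs[b] + vecs[c]
    candidates = {w: v for w, v in vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen, for these toy vectors
```

Note the exclusion of the three query words from the candidate set; standard word2vec analogy evaluation does the same, which matters for a later point in this thread.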
Recurrent NNs, using similar techniques, but with much more complexity, can learn to predict the next word in a sentence very well. They can learn to write responses that are almost indistinguishable from humans. And it's incredible this works at all, given RNNs have only a few thousand neurons at most, and a few days of training, compared to humans' billions of neurons trained over a lifetime.
All of the information of our world is contained in text. Humans have produced billions of books, papers, articles, and internet comments. Billions of times more information than any human could read in their entire lifetime. Any information you can imagine is contained in text somewhere. I don't think it's necessary for AIs to be able to see, or interact with the world in any way.
If you can predict the word a human would say next, with enough accuracy, then you could also produce answers indistinguishable from theirs. Meaning you could pass the Turing test, and perform any language task they could do just as well. So language prediction alone may be sufficient for AGI.
This is the theory behind the Hutter Prize, which proposes that predicting (compressing) wikipedia's text is a measure of AI progress. The Hutter Prize isn't perfect (it's only a sample of wikipedia, which is very small compared to all the text humans have produced), but the idea is solid.
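The prediction-compression link behind the Hutter Prize can be made concrete: an idealized arithmetic coder spends about -log2 p(symbol) bits per symbol, so a model that predicts the text better compresses it into fewer bits. A toy sketch (the sample text and both models are my own invention):

```python
import math
from collections import Counter

text = "the cat sat on the mat. the cat sat on the hat."

def bits_needed(text, prob):
    """Total bits an ideal coder would use, given per-symbol probabilities."""
    return sum(-math.log2(prob(c, text[:i])) for i, c in enumerate(text))

# Model 1: uniform over the alphabet (knows nothing about English).
alphabet = sorted(set(text))
uniform = lambda c, ctx: 1 / len(alphabet)

# Model 2: unigram character frequencies estimated from the text itself
# (cheating slightly, but fine for illustration).
counts = Counter(text)
unigram = lambda c, ctx: counts[c] / len(text)

# The better predictor compresses the same text into fewer bits:
print(bits_needed(text, uniform) > bits_needed(text, unigram))  # True
```

A model that also used context (bigrams, or a neural net) would need fewer bits still, which is exactly the axis the Hutter Prize measures.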
Communicate the concept of "green" to me in text.
The sound of a dog barking, a motor turning over, a sonic boom, or the experience of a Doppler shift. Beethoven's symphony.
Sour. Sweet. What does "mint" taste like? Shame. Merit. Learn facial and object recognition via text.
Tell a boxer how to box by reading?
Hand-eye coordination, bodies in three-dimensional space.
Look, I love text, maybe even more than you do. But all these things imbue, structure, and influence our text; they are not contained in it.
To make substantial inroads toward something that looks like human-esque AI, text is not enough. The division of these fields is artificial, based on our currently limited tech and the specialisation of our researchers and faculties.
When we read, we play back memories, visions, sounds, feelings, etc, and inherent ideas gained through experience of ourselves as physical bodies in space.
Strong AI, at least anything vaguely recognisable as such, must work with algorithms and machinery that understand these things, but which then work at the next level of abstraction to combine them into proper human-type concepts.
Of course, there is the question about why we would want to create a human like AI, it's my contention that human like AI isn't actually what many of us would want, but that's another topic...
I won't touch the qualia aspect, but everything necessary to flawlessly pretend to understand the color green, the sound of a dog's bark, or the experience of hearing a sonic boom can be represented with text. As an existence proof, you could encode an adult human as a serial stream of characters.
But if you must pretend to be sighted and hearing, there are many descriptions of green, of dogs barking, of motors, etc, scattered through the many books written in English (and other languages.)
Are these descriptions perfect? Maybe not. But they are sufficient to mimic or communicate with humans through text: sufficient to pass a Turing test, to answer questions intelligently, to write books and novels and political arguments, etc. If that's not AGI, I don't know what is.
I don't propose that a human could lose all of their senses and still be able to communicate. But I do believe computers could, if they are designed to. Humans are not designed to work without those senses.
Now we are just speculating. We believe a computer might be able to understand things for which it doesn't have the sense - but that is speculation and totally untested, and certainly can no longer be justified by using human minds as an example.
That seems hard to believe.
Even if this were a true statement, it might still not be enough. There is a class of functions that are simply not learnable without some prerequisite knowledge. This is directly analogous to a one-time pad in crypto. It is entirely possible that the function 'language' is in this class of unlearnable functions. While certain varieties of intelligence may be learnable tabula rasa by a powerful neural net, the surface form of human natural language (the part you're recommending we measure) may simply not have enough information in it to decode the whole picture. It is entirely possible that you need to supply some information of your own as well, in a specific manner, so as to act as a kind of decryption key. A record needs a record player, even if you can make similar sounds with cassettes and CDs.
And so, I'm willing to bet that you simply cannot, using raw, uninformed statistical techniques, predict what word a human would say next. You need to understand more of the underlying structure of humans first.
I will agree, however, that the success towards the Hutter Prize is a valuable demonstration of AI progress. Simply because I believe that maximal compression and the kind of intelligence I'm talking about are one and the same thing. You need to offload as much of the semantic weight of the corpus into the encryption algorithm as you can. That means building a very complex model of natural language. And if you accept the premise that this model is not simply learnable by observing the surface form, then that means building Strong AI
I don't understand how this could possibly be the case. We can already make great progress on language understanding with simple methods like word2vec, or perhaps even markov chains. There are tons of statistical patterns in text that can be learned by computers.
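A Markov chain of the kind mentioned here is only a few lines. The toy corpus below is invented; the point is just that next-word statistics fall out of raw text with no supervision beyond the text itself:

```python
import random
from collections import defaultdict

# A minimal word-level Markov chain: it learns which word tends to follow
# which, purely from adjacency counts in the training text.
corpus = "the king wore a crown . the queen wore a crown . the dog chased a cat ."

follows = defaultdict(list)
words = corpus.split()
for a, b in zip(words, words[1:]):
    follows[a].append(b)

def next_word(word, rng=random):
    """Predict a plausible next word by sampling from observed successors."""
    return rng.choice(follows[word])

# "wore" is always followed by "a" in this corpus, so the model is certain:
print(next_word("wore"))  # a
print(set(follows["the"]))  # {'king', 'queen', 'dog'}
```

Even this crude model has implicitly learned that "king", "queen", and "dog" belong to the class of things that follow "the", which is the germ of the distributional patterns word2vec exploits at scale.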
If it turns out that the easiest, or even only means of doing this is by emulating the human brain, then it is entirely possible that we inherit a whole new set of constraints and dependencies such that world-simulation and an emobdied mind are required to make such a system learn. If this turns out not to be the case, that there's some underlying principle of language we can emulate (the classic "airplanes don't fly like birds" argument) then it may be the case that text is enough. But that's in the presence of a new assumption, that our system came pre-equipped to learn language, and didn't manufacture an understanding from whole cloth. That the model weights were pre-initialized to specific values.
I don't think it's so weird to imagine that natural language really doesn't convey a ton of explicit information on its own. Sure, there's some there, enough that our current AI attempts can solve little corners of the bigger problem. But is it so strange to imagine that the machinery of the human brain takes lossy, low-information language and expands, extrapolates, and interprets it so heavily as to make it orders of magnitude more complex than the lossy, narrow channel through which it was conveyed? That the only reason we're capable of learning language and understanding each other (the times we _do_ understand each other) is because we all come pre-equipped with the same decryption hardware?
1) They appear to have hand-crafted the skeleton of a grammar with their nodes, super nodes, and slot collocations. This is directly analogous to something like an X-bar grammar, and is not learned by the system; therefore, if anything, it's strengthening a Chomskyan position: the system is learning how a certain set of signals satisfies its extant constraints.
2) They don't appear to go beyond generative grammar, which already seems largely solvable by other ML methods, and which is a subset of the problem of "language". Correct me if I'm wrong here; it's a very long paper and I may have missed something.
There's also the related phenomenon of "subtext" -- the idea that some language has a different meaning than what's said. For instance, when I ask about whether a signature line on a form is required, and the other person says, "Yes, it's required. However you think best to get the signature." There's a subtext there of, "This signature won't actually be checked, so don't worry about it."
Of course we have no idea how the Voynich manuscript is encrypted (which would make the assumptions of word2vec wrong), or if it even has any meaning at all. And it's an incredibly small dataset compared to modern text corpuses, so there is probably significant uncertainty and overfitting. And other problems like inconsistent spellings, many errors in transcriptions, etc. But in principle this is a good strategy.
>how would a person do if they were locked in a room with lots of books written in a language unknown to them?
If you spent all day reading them, for years, and you somehow didn't get bored and kept at it, eventually you would start to see the patterns. You would learn how "slithy toves" are related to "brillig", even if you have no idea how that would translate to English. Study it long enough, and you may even be able to produce text in that language, indistinguishable from the real text. You may be able to predict the next word in a sentence, and identify mistakes, etc. Perhaps carry out a conversation in that language.
And I think eventually you would understand what the words mean, by comparing the patterns to those found in English. Once you have guesses for translations of just a few words, you can translate the rest. Because you know the relationships between words, and so knowing one word constrains the possibilities of what the other words can be.
If the translation it produces is nonsense, the words you guessed must have been wrong, and you can try again with other words. Eventually you will find a translation that isn't nonsense, and there you go. This would be very difficult for humans, because the number of hypotheses to test is so large, and analyzing text takes forever. Computers can do it at lightspeed though.
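The guess-and-check loop described above can be sketched with a deliberately easy stand-in for an unknown language, a Caesar cipher: propose a hypothesis, decode under it, and reject it if the output is nonsense. The word list and sentence are invented for illustration:

```python
import string

# Tiny stand-in for "knowledge of English": a set of recognized words.
known_words = {"the", "cat", "sat", "on", "mat"}

def shift(text, k):
    """Apply a Caesar shift of k to lowercase letters."""
    table = str.maketrans(string.ascii_lowercase,
                          string.ascii_lowercase[k:] + string.ascii_lowercase[:k])
    return text.translate(table)

ciphertext = shift("the cat sat on the mat", 3)

def crack(ciphertext):
    """Try every hypothesis; keep the one whose decoding isn't nonsense."""
    for k in range(26):
        candidate = shift(ciphertext, -k % 26)
        if all(w in known_words for w in candidate.split()):
            return candidate
    return None

print(crack(ciphertext))  # the cat sat on the mat
```

A real unknown language has an astronomically larger hypothesis space than 26 shifts, which is the commenter's point about why humans can't do this but fast search over constrained hypotheses might.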
More generally, has any attempt been made to identify the meanings of words in any sufficiently large corpus of text in a known foreign language (for example, Finnish), without being provided with a translation into English, and then compare the identified meanings to the actual meanings, as a first step towards translation?
Doing this without any translated words at all, would be more difficult. But I believe possible. It's actually a project I want to try in the near future.
This statement is false. There is a well-known thought experiment called Mary’s Room, the gist of which is that knowing all conceivable scientific knowledge about how humans perceive color is still not a substitute for being a human and perceiving the color red: https://philosophynow.org/issues/99/What_Did_Mary_Know
The experience of seeing red is an example of what is called “qualia”.
Given Google's AI systems that identify cats, birds, etc., it is reasonable to imagine AI technology evolving toward systems that can discuss those objects at the level of a typical person. With an AI based on text alone, however, there is no possibility of that. It would be like discussing color with a blind person or sound with a deaf person.
In any case, at some level everything is symbols. A video is just a bunch of 1's and 0's, as is text, and everything else. A being raised on only text input would have qualia just like a being raised on video input. It would just be different qualia.
It may sure look indistinguishable, but on the inside it just wouldn't be the same.
If you assert that a person can understand everything there is to know about the color red and then still not understand what it is like to see red, you have either contradicted yourself or assumed dualism.
Also, it's a thought experiment. Some people will claim the answer to that question is no, she learned nothing. Others will claim that she did. It's that thing she learned beyond the physical that theoretically cannot be conveyed by science, or even possibly by language.
Word2vec is definitely an impressive algorithm. But at the end of the day, it's just a tool that cranks out a fine-grained clustering of words based on (a proxy measure for) contextual similarity (or rather: an embedding in a high-dimensional space, which implicitly allows the words to be more easily clustered). And yes, some additive relations between words, when the signal is strong enough.
But to say that it's "learning" the "meaning" of these words is really quite a stretch.
I remember that when I first read A Clockwork Orange, it took me a while, but I finally started to understand the meanings of those words/phrases (though I may never have encountered them before). It did feel like my brain was rewiring itself to a new language. It'd be interesting to see how some type of language AI would treat these works.
edited to add there's a wiki article on the language of A Clockwork Orange, Nadsat: https://en.wikipedia.org/wiki/Nadsat
Gender was just an example. There is a lot of semantic information learned by word2vec, and the vectors have been shown to be useful in text classification and other tasks. It can learn subtle stuff, like the relationships between countries, celebrities, etc. All that information is contained in a few hundred dimensions, which is tiny compared to the number of neurons in the brain.
You say, as many people do, that the operation "king - man + woman = queen" indicates that it understands that the relation between "king" and "queen" is the same as the relation between "man" and "woman". But something much simpler is going on.
What you're asking it for, in particular, is to compute the vector represented by "king - man + woman", and then find the vector Q with the highest dot product with this synthetic vector, out of a restricted vocabulary.
The dot product is distributive, so distribute it: you want to find the maximum value of (king * Q) - (man * Q) + (woman * Q).
So you want to find a vector that is like "king" and "woman", and not like "man", and is part of the extremely limited vocabulary that you use for the traditional word2vec analogy evaluation, but not one of the three words you used in the question. (All of these constraints are relevant.) Big surprise, the word that fits the bill is "queen".
(I am not the first to do this analysis, but I've heard it from enough people that I don't know who to credit.)
It's cool that you can use sums of similarities between words to get the right answer to some selected analogy problems. Really, it is. It's a great thing to show to people who wouldn't otherwise understand why we care so much about similarities between words.
But this is not the same thing as "solving analogies" or "understanding relationships". It's a trick where you make a system so good at recognizing similarities that it doesn't have to solve analogies or understand relationships to solve the very easy analogy questions we give it.
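The distributivity point is easy to check numerically. With random unit vectors standing in for real embeddings (everything below is invented for illustration), scoring a candidate against the combined vector "king - man + woman" gives exactly the same numbers, and therefore the same winner, as summing three separate similarity scores:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "car", "dog"]

def unit(v):
    return v / np.linalg.norm(v)

# Random unit vectors as stand-ins for learned embeddings.
vecs = {w: unit(rng.normal(size=8)) for w in vocab}

query = vecs["king"] - vecs["man"] + vecs["woman"]
candidates = [w for w in vocab if w not in ("king", "man", "woman")]

# Score 1: dot product with the combined "analogy" vector.
combined = {w: float(query @ vecs[w]) for w in candidates}
# Score 2: three separate similarities, summed.
separate = {w: float(vecs["king"] @ vecs[w]
                     - vecs["man"] @ vecs[w]
                     + vecs["woman"] @ vecs[w])
            for w in candidates}

# Distributivity of the dot product: the two scores agree for every word,
# so the "analogy answer" is just the most-similar-overall remaining word.
assert all(abs(combined[w] - separate[w]) < 1e-9 for w in candidates)
assert max(combined, key=combined.get) == max(separate, key=separate.get)
```

This is the whole argument: no relational structure is consulted, only similarities plus the exclusion of the three query words.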
There's also the relationship it has with the world.
If literally the only thing you know is the relationship between words, but you have a perfect knowledge of the relationship between words, you'll quickly determine that "Day" and "Night" are both acceptable answers, and have no means of determining which is the right one. At the very minimum, you need a clock, and an understanding of the temporal nature of your training set, to get the right one.
What do you see?
What do you smell?
What do you hear?
What does the landscape look like?
What memory does this bring up?
These are messages that the language is communicating. If an AI can't understand at least some of the content of the message then can it compose one effectively? I'm not certain it can understand the meaning from words alone, but we can certainly try.
And then there is the fundamentally dynamic aspect of language, which strengthens the need for a rich understanding of the world that words describe and convey.
That's a bit overstated. I think maybe you mean to say "all of the information you'd need to be an effective citizen of our world is contained in text"? Or something similar? I think even that is too strong a claim, but it's at least understandable.
As stated the assertion doesn't make any sense. There is more information in a glass of milk than could be stored on all of the computers on earth, and Heisenberg showed that it's impossible to even record all of the information about a single particle.
Certainly there is no textual information available about my grandfather's eyes, but that information is accessible in the world for those who'll look. You seem to be underestimating the quantity of information you absorbed as a baby, just by reacting robotically to the people and events around you and absorbing the relationships between percepts and feelings.
If the much-vaunted "general intelligence" consists in both vague and precise causal reasoning and optimal control with respect to objects in the real world, then no, it obviously cannot be done with mere language. At least one sensor and one effector will be needed to train an ML/AI model to perform those tasks.
The internet is so large and so comprehensive (especially if you include digitized books and papers, e.g. libgen or google books) that I doubt there is any important information that can be learned from video data but not obtained from text data.
Uhhhh... there's this thing called entropy. A corpus of video data is vastly higher in entropy content than text data, and probably a good deal more tightly correlated too, making it much easier to learn from.
Remember, for a human, the whole point of speech and text is to function as a robust, efficient code (in the information-theoretic sense) for activating generative models we already have in our heads when we acquire the words. This is why we have such a hard time with very abstract concepts like mathematical ones: the causal-role concepts (easy to theorize how to encode those as generative models) are difficult to acquire from nothing but the usage statistics of the words and symbols, in contrast to "concrete", sense-grounded concepts, which have large amounts of high-dimensional data to fuel the Blessing of Abstraction.
Nevermind, I should probably just get someone to let me into a PhD program so I can publish this stuff. If only they'd consider the first paper novel enough already!
I don't know if humans have generative models in their heads. There are people who have no ability to form mental images, and they function fine. Regardless, an AI should be able to get around that by learning our common patterns. It need not mimic our internal states, only our external behavior.
No, I meant a pair of specific information-theoretical quantities I've been studying.
>And yes videos are highly redundant. But I don't see how that's any kind of advantage.
Representations are easier to learn for highly-correlated data. Paper forthcoming, but conceptually so obvious that quantifying it is (apparently) non-novel.
>I don't know if humans have generative models in their heads.
The best available neuroscience and computational cognitive science says we do.
>Regardless, an AI should be able to get around that by learning our common patterns. It need not mimic our internal states, only our external behavior.
Our external behavior is determined by the internal states, insofar as those internal states are functions which map sensory (including proprioceptive and interoceptive) statistics to distributions over actions. If you want your robots to function in society, at least well enough to take it over and kill everyone, they need a good sense of context and good representations for structured information, behavior, and goals. Further, most of the context to our external behavior is nonverbal. You know how it's difficult to detect sarcasm over the internet? That's because you're trying to serialize a tightly correlated high-dimensional data-stream into a much lower-dimensional representation, and losing some of the variance (thus, some of the information content) along the way. Humans know enough about natural speech that we can usually, mostly reconstruct the intended meaning, but even then, we've had to develop a separate art of good writing to make our written representations conducive to easy reconstruction of actual speech.
Deep learning can't do this stuff right now. Objectively speaking, it's kinda primitive, actually. OTOH, realizing the implications of our best theories about neuroscience and information theory as good computational theories for cognitive science and ML/AI is going to take a while!
I know what you are saying, but I don't think it's true.
Imagine a hypothetical language that is so compressed, so non-redundant, so little correlated, that it's indistinguishable from random noise. Learning this language may seem an impossible task. But in fact it's very easy to produce text in this language. Just produce random noise! As stated, that's indistinguishable from real text in this language.
Real language, of course, has tons of statistical patterns, and is definitely not random. But I don't see how it is harder to learn than, say, a more redundant audio recording of the same words, or a video recording of the person speaking them. That extra information is irrelevant and will just be discarded by any smart algorithm anyway.
>The best available neuroscience and computational cognitive science says we do.
Most people do. As I said some people don't, and they function fine. See this: http://www.bbc.com/news/health-34039054
>Our external behavior is determined by the internal states, insofar as those internal states are functions which map sensory (including proprioceptive and interoceptive) statistics to distributions over actions. If you want your robots to function in society, at least well enough to take it over and kill everyone, they need a good sense of context and good representations for structured information, behavior, and goals.
Robots are never going to have exactly the same internal states and experience as humans. They could be very, very different, in structure, to the human brain. Being exactly like humans isn't the goal. Mimicking humans is an interesting diversion, but it's not necessary, or the goal in and of itself.
And you may be right that a robot without vision would be disadvantaged. I think that's mostly anthropomorphism, imagining how disadvantaged blind humans are (and in fact even blind humans can function better than most people expect.) But even if it's true, my point is that sight is not strictly necessary for intelligence.
In fact I think vision may even be a disadvantage. So much of the brain is devoted to visual processing. While text, and even language itself, are hacks that evolution created relatively recently. A brain built purely for language could be much more efficient at it than we can probably imagine. Ditching vision could save a huge amount of processing power and space.
Reading your post, you actually seem quite confused.
>Imagine a hypothetical language that is so compressed, so non-redundant, so little correlated, that it's indistinguishable from random noise. Learning this language may seem an impossible task.
Well yes, learning a class of strings in which each digit of every finite prefix is statistically independent of every other digit is very hard, bordering on impossible (or at least, impossible to do better at than uniform-random guessing).
>But in fact it's very easy to produce text in this language. Just produce random noise!
But that isn't the learning problem being posed! You are not being asked to learn `P(string | language)` (which is, in fact, the uniform distribution over arbitrary-length strings), but `P(language | string_1, string_2, ..., string_n)`, whose likelihood, by the way you've posed the problem, factorizes into `P(character_1 | language) x P(character_2 | language) x ... x P(character_m | language)`. If the actual strings are sampled from a uniform distribution over arbitrary-length strings, then we have two possibilities:
1) The prior is over a class of languages some of which are not optimally compressed, and which thus do not render each character (or even each string) conditionally independent. In this case, the posterior will favor languages that do render each character conditionally independent, but we won't be able to tell apart one such hypothesis from another. We've learned very little.
2) The prior is over a class of languages all of which yield strings full of conditionally-independent noise: no hypothesis can compress the data. In this case, the evidence-probability and the likelihood cancel, and our posterior over languages equals our prior (we've learned nothing).
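Case 2 is a two-line Bayesian update. When every language hypothesis assigns the same probability to the observed strings, the likelihood cancels in the normalization and the posterior equals the prior. The hypothesis names and prior values below are arbitrary:

```python
from fractions import Fraction

# Arbitrary prior over three language hypotheses.
priors = {"lang_A": Fraction(1, 2), "lang_B": Fraction(1, 3), "lang_C": Fraction(1, 6)}

def posterior(priors, likelihood):
    """Bayes' rule: posterior ∝ prior × likelihood, renormalized."""
    unnorm = {h: p * likelihood(h) for h, p in priors.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Every hypothesis says the observed string was one of 2**16 equiprobable
# strings, i.e. no hypothesis compresses the data better than any other.
uniform_likelihood = lambda h: Fraction(1, 2**16)

# The likelihood cancels: we've learned nothing about which language it is.
print(posterior(priors, uniform_likelihood) == priors)  # True
```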
>Real language, of course, has tons of statistical patterns, and is definitely not random. But I don't see how it is harder to learn than, say more redundant audio recording of the same words, or a video recording of the person speaking them. That extra information is irrelevant and will just be discarded by any smart algorithm anyway.
Noooo. Compression does not work that way. Compression works by finding informative patterns in data, not by throwing them away. If your goal is to learn the structure in the data, you want the structure to be more redundant rather than less.
I'm telling you, once the paper is submitted, I can send you a copy and just show you the equations and inequalities demonstrating this fact.
>Most people do. As I said some people don't, and they function fine. See this: http://www.bbc.com/news/health-34039054
Differences in sensorimotor cortex function that leave the brain unable to perform top-down offline simulation with a high subjective-sensory precision don't invalidate the broad theory that cortical microcircuits are generative models (in particular, hierarchical ones, possibly just large hierarchies in which the individual nodes are very simple distributions).
>Robots are never going to have exactly the same internal states and experience as humans. They could be very, very different, in structure, to the human brain.
Duh. However, if we want them to work, they probably have to run on free-energy minimization somehow. There is more necessity at work here than connectionism believes in, but that's a fault in connectionism.
>Being exactly like humans isn't the goal. Mimicking humans is an interesting diversion, but it's not necessary, or the goal in and of itself.
I didn't say that a working robot's representations had to exactly match those of humans. In fact, doing so would be downright inefficient, since robots would have completely different embodiments to work with, and thus be posed different inference problems in both perception and action. The fact that they would be, necessarily, inference problems is the shared fact.
>And you may be right that a robot without vision would be disadvantaged. I think that's mostly anthropomorphism, imagining how disadvantaged blind humans are (and in fact even blind humans can function better than most people expect.) But even if it's true, my point is that sight is not strictly necessary for intelligence.
Sight isn't. Some kind of high-dimensional sense-data is.
>In fact I think vision may even be a disadvantage. So much of the brain is devoted to visual processing. While text, and even language itself, are hacks that evolution created relatively recently. A brain built purely for language could be much more efficient at it than we can probably imagine. Ditching vision could save a huge amount of processing power and space.
That's putting the cart before the horse. Language is, again, an efficient but redundant (ie: robust against noise) code for the models (ie: knowledge, intuitive theories, as you like) the brain already wields. You can take the linguistic usage statistics of a word and construct a causal-role concept for them in the absence of a verbal definition or sensory grounding for the word, which is arguably what children do when we read a word before anyone has taught it to us, but doing so will only work well when the concepts' definitions are themselves mostly ungrounded and abstract.
So purely linguistic processing would work fairly well for, say, some of mathematics, but not so much for more empirical fields like social interaction, ballistic-missile targeting, and the proper phrasing of demands made to world leaders in exchange for not blowing up the human race.
Hold on. Let's say the goal is passing a Turing test. I think that's sufficient to demonstrate general intelligence and do useful work. In that case, all that is required is mimicry. All you need to know is P(string), and you can produce text indistinguishable from a human.
>Noooo. Compression does not work that way. Compression works by finding informative patterns in data, not by throwing them away. If your goal is to learn the structure in the data, you want the structure to be more redundant rather than less.
OK, let's say I convert English words to shorter Huffman codes. This should be even easier for a neural network to learn, because it can spend less effort trying to figure out spelling. Of course, some encodings might make it harder for a neural net to learn, since NNs make some assumptions about how the input should be structured, but in theory it doesn't matter.
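For what it's worth, word-level Huffman coding does preserve an exact bijection with the original word sequence, so no information is lost; only spelling-level redundancy is squeezed out. A minimal sketch (the sample sentence is invented):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a prefix-free code; frequent words get codes no longer than rare ones."""
    # Each heap entry carries a unique integer so ties never compare the dicts.
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in c1.items()}
        merged.update({w: "1" + code for w, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

def decode_stream(bits, decode):
    """Greedy decoding works because the code is prefix-free."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode:
            out.append(decode[buf])
            buf = ""
    return out

words = "the cat sat on the mat because the cat liked the mat".split()
codes = huffman_codes(Counter(words))
decode = {c: w for w, c in codes.items()}
encoded = "".join(codes[w] for w in words)

# The most frequent word gets the shortest code, and decoding is exact:
assert all(len(codes["the"]) <= len(codes[w]) for w in codes)
assert decode_stream(encoded, decode) == words
```

Whether a neural net finds this representation easier or harder to learn from is a separate, empirical question; the bijection itself is not in doubt.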
>Some kind of high-dimensional sense-data is [necessary]... purely linguistic processing would work fairly well for, say, some of mathematics, but not so much for more empirical fields
These are some really strong assertions that I just don't buy, and I don't think you've backed up at all.
Humans have produced more than enough language for a sufficiently smart algorithm to construct a world model from it. Any fact you can imagine is contained somewhere in the vast corpus of all English text. English contains a huge amount of patterns that give massive hints to the meaning. E.g. that kings are male, or that males shave their face and females typically don't, or that cars are associated with roads, which is a type of transportation, etc.
Even very crude models can learn these things. Even very crude models can produce nearly sensible dialogue from movie scripts. Models with millions of times fewer nodes than the human brain. It's amazing this is possible at all. Of course a full AGI should be able to do a thousand times better and completely understand English.
Trying to model video data first is wasted processing power. It's setting the field back. Really smart researchers spend so much time eking out a 0.01% better benchmark on MNIST/ImageNet/whatever, with entirely domain-specific, non-general methods. So much effort is put into machine vision, when language is so much more interesting and useful, and closer to general intelligence. Convnets, et al., are a dead end, at least for AGI.
I'd need to see the math for this: how will the Huffman codes preserve a semantic bijection with the original English while throwing out the spellings as noise? It seems like if you're throwing out information, rather than moving it into prior knowledge (bias-variance tradeoff, remember?), you shouldn't be able to biject your learned representation to the original input.
Also, spelling isn't all noise. It's also morphology, verb conjugation, etc.
>Humans have produced more than enough language for a sufficiently smart algorithm to construct a world model from it.
Then why haven't you done it?
>Any fact you can imagine is contained somewhere in the vast corpus of all English text.
Well no. Almost any known fact I can imagine, plus vast reams of utter bullshit, can be reconstructed by coupling some body of text somewhere to some human brain in the world. When you start trying to take the human (especially the human's five exteroceptive senses and continuum of emotions and such) out of the picture, you're chucking out much of the available information.
There's damn well a reason children have to learn to speak, understand, read, and write, and then have to turn those abilities into useful compounded learning in school -- rather than just deducing the world from language.
>Even very crude models can learn these things. Even very crude models can produce nearly sensible dialogue from movie scripts.
Which doesn't do a damn thing to teach the models how to shave, how to tell kings from queens by sight, or how to avoid getting hit by a car when crossing the street.
>Models with millions of times fewer nodes than the human brain. It's amazing this is possible at all.
The number of nodes isn't the important thing in the first place! It's what they do that's actually important, and by that standard, today's neural nets are primitive as hell:
* Still utterly reliant on supervised learning and gradient descent.
* Still subject to vanishing gradient problems when we try to make them larger without imposing very tight regularizations/very informed priors (i.e. convolutional layers instead of fully-connected ones).
* Still can't reason about compositional, productive representations.
* Still can't represent causality or counterfactual reasoning well or at all.
>Trying to model video data first is wasted processing power. It's setting the field back. Really smart researchers spend so much time eking out a 0.01% better benchmark on MNIST/ImageNet/whatever, with entirely domain-specific, non-general methods. So much effort is put into machine vision, when language is so much more interesting and useful, and closer to general intelligence. Convnets, et al., are a dead end, at least for AGI.
Well, what do you expect to happen when people believe in "full AGI" far more than they believe in basic statistics or neuroscience?
I don't think anything like that exists today, or ever will exist. And in fact you are making an even stronger claim than that. Not just that vision will be helpful, but absolutely necessary.
If I said "That is one huge fan", could you predict what my next word will be? It would depend a lot on context and the ability to reason within a complex world model. Depending on whether I had gone to a concert or to a wind tunnel, your distribution over guesses would shift. If I had gone whale watching you might even suspect I made a typo. Changing "huge" to "large" would lead to anywhere from major adjustments to none at all, depending on the guess.
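A toy count-based sketch of that context dependence (hypothetical sentences; a real model would condition on far richer context than one earlier word):

```python
from collections import Counter, defaultdict

# Toy corpus: the same "one huge fan" prefix continues differently by context.
corpus = [
    "we went to the concert and saw one huge fan cheering",
    "we toured the wind tunnel and saw one huge fan spinning",
    "we toured the wind tunnel and saw one huge fan blowing",
]

# Count next-word frequencies after "fan", conditioned on a context word
# appearing earlier in the sentence (a crude stand-in for a world model).
nxt = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for ctx in ("concert", "tunnel"):
        if ctx in words:
            i = words.index("fan")
            nxt[ctx][words[i + 1]] += 1

def p_next(ctx, w):
    total = sum(nxt[ctx].values())
    return nxt[ctx][w] / total

# The distribution over the word after "huge fan" shifts with context:
assert p_next("concert", "cheering") == 1.0
assert p_next("tunnel", "spinning") == 0.5
```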
So while I agree an AI could emerge from text alone, it would have to be very sophisticated to do this.
Doesn't that depend on the syntax of a language? My guess is that it would not work for a language with a more flexible word order than English, so it wouldn't quite work for languages such as Russian, Turkish, and Finnish (among others).
But I do believe it is possible, someday. Probably within our lifetime.
Edit - To be more clear, I also agree AGI could be possible with text as the only input. I just think we need a new paradigm. Ostensibly AGI is meant to mimic human intelligence (minus the pitfalls), so IMO the best approach will be the one that mimics the underlying processes of human intelligence, not just the results. Traditional statistical NLP methods will probably have some role in this final system, but not at the heart of it, as far as mimicking intelligence through its underlying processes goes.
It's learning a meaning, not the meaning. It's just a probabilistic model for the occurrence of a word based on the words that surround it. This should not serve as a base for the rest of your claims.
Anyway -- the improvements gained by multi-modal systems essentially disprove your thesis. Which is good news! We're making progress.
Maybe we can do that with just words, but humans certainly don't. Words are related to concepts first, then we figure out the meaning of the rest.
Look at the animal kingdom: some animals are walking within about 5 minutes of being born and running within hours.
How about when you include multimedia recordings? Or give the machine a camera and wheels?
Strangely (or maybe horribly?) there are a number of studies of children raised in Romanian orphanages that somewhat cover this area.
Under Nicolae Ceaușescu the government outlawed abortion (with some exceptions) in an attempt to increase the birth rate (https://en.wikipedia.org/wiki/Abortion_in_Romania). Coupled with a poor economy, this led to masses of infants and children being given over to government "care" in orphanages. These were pretty bleak places for infants and children, with infants often spending hours in a crib with little stimulation.
Here's a really good article about the effects this sort of institutional "care" has on children: http://www.americanscientist.org/issues/feature/2009/3/the-d...
Here's The Bucharest Early Intervention Project (tons of info about this subject area): http://www.bucharestearlyinterventionproject.org/
There's lots of basic research being done to better understand the biological brain. Progress is slow and steady, but the brain remains poorly understood at this point in time. Most applied research has pursued more pragmatic methods because these methods have had faster progress and proven more useful in practice.
And if you look at what humans do that we characterize as "intelligence", it's a substantial list of different functions: "Human intelligence is the intellectual capacity of humans, which is characterized by perception, consciousness, self-awareness, and volition. Through their intelligence, humans possess the cognitive abilities to learn, form concepts, understand, apply logic, and reason, including the capacities to recognize patterns, comprehend ideas, plan, problem solve, make decisions, retain information, and use language to communicate. Intelligence enables humans to experience and think." If I take an engineering perspective and look at the list of functions that an AGI would have to perform to be an AGI, I would definitely want to give it more than a text-in, text-out interface so that it could have, for example, perception and volition.
For example, imagine a theoretical "AGI" that exclusively deals with a stream of text in (as human language) and a stream of text out (again, as human language). If you ask it any questions about its physical surroundings, it's either going to make things up (which isn't perception and therefore fails both at being useful and meeting the definition of intelligence above), or it's going to get information about its physical surroundings via some proxy that feeds it a linguistic stream of information about its surroundings. But it doesn't matter if it's getting perceptual information from a proxy or from more directly embedded sensory interfaces; if it's getting perceptual information and able to use it sensibly, then it's performing the function of perception. And in that case the proxy source of perceptual information may as well be considered part of the "intelligent" system.
These kinds of articles seem a bit silly to me, because they seem to imply that we should expect to be able to create a system which is "intelligent" but which only performs the function of understanding and producing meaningful language. But if you're only handling language, you're taking away all of the other functions that are part of the definition of "intelligence" above, which leaves you with a system which is far short of anything we'd consider "intelligent".
Indeed. However, we cannot rule out the possibility that we engineer a system without a "body" (I think you are referring to the Embodied mind thesis?). It is a complicated topic, and discussions about it are futile without precise definitions of loaded terms like "intelligence", "body", etc.
A rather well-defined test is the classic Turing test and I wouldn't dismiss the possibility that it can be passed by a bodyless machine/program-thing.
On the contrary, to make progress we should give up on making precise definitions of non-technical terms such as 'body' and 'intelligence'. They are folk notions that don't have relevant precise definitions; talking about them in an engineering context is distracting.
You don't need a definition of 'beauty' to paint. You don't need a definition of 'justice' to practice law. You don't need a definition of 'intelligence' to build clever robots.
IIRC the test is a binary classifier, but intelligence is a spectrum that's fuzzy and therefore inherently hard to define.
I.e., how low someone is willing to go before assuming a lack of intelligence in a human is not a good definition of general intelligence, as that's circular reasoning.
I would suppose that babies possess general intelligence, but they lack the knowledge about the environment.
Why do you assume that? The creators of AlphaGo certainly couldn't beat it.
No, it's really not. For the Turing test, the AI is meant to be adversarial: its objective is to convince you that it is human.
AlphaGo's objective isn't to "play like a human," it is to win. If they gave it an objective of playing like a human, I'm sure AlphaGo could play in a way that would be indistinguishable from a human.
> Since they have access to the AlphaGo system, they can just calculate the probability that each move corresponds to one AlphaGo would make.
Peeking at the system/data is cheating. Obviously the person who sets up a Turing test knows which player is AI.
It could just play unbelievably badly and appear to be a beginner. That wouldn't prove intelligence.
> Peeking at the system/data is cheating
Someone ignorant of computers would hardly ever suspect a machine. Of course, omitting this rule would require a judge smarter than the computer.
If you talk statistics, i.e. the machine only has to convince a fair share of humans, then the definition of the threshold is a problem. Intelligence would then depend on the development of the society. I thought this was about an intrinsic value.
It's an interesting thought experiment, but hardly conclusive, just observational.
Sure, which is why it's not a very good metric. The correct metric for looking at whether computational game intelligence has exceeded human capacity is that computers can consistently beat humans.
To be clear, I'm not convinced that we'll ever make a generalized intelligence which can pass the Turing Test. My point was merely that the fact that humans create the system is not a good argument for why it's impossible: in many domains, we can already create computer systems which vastly outperform ourselves.
It is very hard to dial down an Elo 3000+ program to the 1800 level of a club player and not make it computer-like.
What is usually done is to lower the search depth and add some random blunders, but it is still obvious to a stronger player that it is a program.
I edited the post, did you read that? You are making my point, you can't bootstrap a definition for artificial intelligence by comparison to humans, when human intelligence is not well defined either.
The first versions of AlphaGo were certainly inferior to human players, but the current version is superior to any human.
I made a hopeful hypothesis and immediately countered it: human intelligence might just not be optimized for recognizing intelligence. It is optimized for other things, such as not wasting energy, and because of that it indeed judges something that plays Go very well, but does nothing else, to be rather less intelligent.
You do make a strong point there: specialized computers are stronger than humans at a specific task, but we are talking about general intelligence. I have to admit, too, that I have a hard time getting the bigger picture and get confused too easily. I didn't read any of the literature that would rather well define the problem, as the OP put it, so the discussion is likely less informative.
In my opinion the comparison is still unequal, because the computer used a ton more resources and memory. There aren't enough Go professionals to pool together and let their averaged opinion learn and play while consuming as much energy.
Easier to build, parallelize, extend, maintain. Possibly somewhat safer too.
But that's a big if. You've just taken one really hard problem (learning about the world) and turned it into an even harder problem (simulating the world).
But Turing argued that if the machine could be shown to do the same tasks that humans can do, and act indistinguishable from a real human, then surely it's intelligent. Its internal properties don't matter, at least not to judge intelligence.
In practice, a good class of tests to use is those that must evoke an emotional response to produce a sensible answer. An example is art interpretation. Questions involving allegory. Interpreting a poem, etc.
Important to note that whatever the challenge is, it must always be a new example - as in never been seen before. Anything that is already in the existing corpus, the computer can simply look up what is already out there. In other words, there is no one concrete thing you can use again and again repeatedly.
Example of test that would foil a computer: A personally written poem and having discussion about it.
There's a great book which gives an overview of this field, The Talking Heads Experiment: Origins of Words and Meanings by Luc Steels, which discusses many of the advances made in this field (including for instance how having a grammar, as opposed to just stringing words related to what you want to say at random, is an evolutionary advantage because it boosts communicative success). It's published as open access, so go grab your free copy! :)
Chapter 4 in particular has a very interesting discussion of what's problematic with the machine learning approach -- that it takes a lot of training examples for a classifier to start making interesting decisions -- and presents a selectionist alternative to that, where distinctions (as in e.g. nodes in decision trees) are grown randomly and they're reinforced / pruned based on feedback. Crucially, the categories (semantic distinctions) are not labels given at the outset, but they emerge along with the language, based on the environment the agents encounter and the tasks they're using language for.
In general, I'd recommend Chapters 1 and 2 for a quick introduction, but in a pinch, I attempted to give a 50,000-foot summary in an essay I wrote (look under the heading Evolutionary Linguistics):
I realize that engineering applications of these ideas might be a long way off (and perhaps they'll never materialize), but boy are these exciting discoveries about the very fabric of language :)
That would raise the question of whether there could be machines mightier than a Turing-complete one. I'm sure that's missing your point.
Even if that were true, which I am not sure about, such a robot will have a very different moral status, as an artifact, and one that can be reproduced cheaply and indefinitely. This is very useful indeed, so on this axis it could be counted as 'better'.
The idea of development being important for AI is an old one, but it hasn't had much concrete success. Brooks's Cog robot at MIT is a prominent example of a robot that didn't do very much, despite this approach being taken in a good-faith effort by talented and well-supported people.
We basically just need a system that comes up with efficient representations of the world such that it can reason about it, i.e. that it can tell you which hypotheses about the world are likely true given some data. This computation allows you to make predictions, and predictions are really at the heart of intelligence. If you can follow a hypothetical trajectory of generated, hallucinated or simulated samples of reality, i.e. samples that likely correspond to what actually happens in the world (and in the agent's own brain), then you can perform actions that are targeted at some purpose (e.g. maximizing reward signals).

However, there are many sources of data that essentially give you the same information. Whether you create a representation by directly interacting with the world or just watch many examples of how the world generally evolves over time and how different entities interact with one another, you essentially get the same idea about how the world works (except that in the first case you also learn a motor system).

I think the anthropomorphism is really misplaced here, because computer systems are not dependent on actually performing in the real world. Since computers have near-unlimited, noiseless memory and super-fast access to that memory, they can plan interactions by careful reasoning on the fly, instead of needing to learn e.g. motor skills for manipulating objects, eating and handwriting before getting anywhere near the performance of computers with respect to reliable external memory. A computer system also does not have hormone and neuromodulator levels that need to be met for healthy development (e.g. dopamine), so the intuition that deprivation of interaction with the world prevents learning is extremely misleading.
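A minimal sketch of that planning-by-prediction idea, with a deliberately trivial, invented stand-in world model (everything here is for illustration only):

```python
from itertools import product

# Toy world: state is a number, actions nudge it, reward peaks at state 10.
def model(state, action):
    return state + action          # the agent's predictive model (exact here)

def reward(state):
    return -abs(state - 10)

def plan(state, horizon=5):
    """Search hypothetical trajectories with the model; act on the best one."""
    best_action, best_return = None, float("-inf")
    for actions in product([-1, 0, 1], repeat=horizon):
        s, ret = state, 0.0
        for a in actions:          # simulated rollout: nothing is executed
            s = model(s, a)
            ret += reward(s)
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action

assert plan(0) == 1    # below the goal: the best first move is up
assert plan(20) == -1  # above the goal: down
```

The point is only the structure: trajectories are simulated against the model, never executed in the world, and only the best first action is acted on.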
On what basis do you conclude that "biologically" is important here? There may be some reason to suspect that human-like intelligence requires human-like ability to sense and interact with the outside world, but I see less reason to suspect that it is important that the mechanism of the sensors or manipulators must be biological.
The cheapest way to make a brain is still the old way.
I would say that Helen Keller's life argues that an intelligent machine must be able to have experiences and the capacity to associate experiences with language. The machine probably doesn't need all of the perceptual modalities that we have, as Helen Keller demonstrated, but it should probably have some similarities with our own so that there would be common ground for initiating communication about experiences. A machine with just a text-in / text-out interface has nothing in common with us.
Anne Sullivan's equally remarkable (IMHO) role as a teacher is not really an issue here, as training is also an option for AI, though it might be evidence in a rather different discussion about whether unsupervised learning alone, particularly as practiced today, is likely to get us to AI (clearly, the evolution of intelligence can be cast as unsupervised learning, but that is a very long and uncertain process...)
If you think about audio/visual data, deep nets make sense: if you tweak a few pixel values in an image, or shift every pixel value by some amount, the image still retains basically the same information. In this context, linearity (weighting values and summing them up) makes sense. It's not clear whether this makes sense for language.

On the other hand, deep methods are state of the art on most NLP tasks, but their improvement over other methods isn't the huge gap seen in computer vision. And while we know there are tight similarities between lower-level visual features in deep nets and the initial layers of the visual cortex, the justification for deep learning in NLP is simpler and less specific: what I see is the fact that networks have huge capacity to fit to data and are deep (rely on a hierarchy of features). My guess is we may need a fundamental breakthrough in a newfangled hierarchical learning system that is better suited for language to “solve” NLP.
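A quick sketch of the intuition about pixel perturbations (a random array standing in for an image; the text example is equally made up):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # stand-in for an MNIST-sized image

# Tweak a few pixels and shift every value slightly: the
# information-bearing structure barely moves.
noisy = image.copy()
noisy[:3, :3] += 0.05                 # perturb a few pixels
noisy += 0.1                          # global shift

corr = np.corrcoef(image.ravel(), noisy.ravel())[0, 1]
assert corr > 0.99                    # nearly the same signal

# By contrast, "tweaking one token" in text can flip the meaning:
sent = ["the", "king", "ruled"]
tweaked = ["the", "car", "ruled"]     # one-word change, large semantic change
assert sum(a != b for a, b in zip(sent, tweaked)) == 1
```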
I think there are similar limitations with control and inference. When it comes to AlphaGo the deep learning component is responsible for estimating the value of the game state; the planning component is done with older methods. This is much more speculative, but when it comes to the work on Atari games, for example, I suspect that most of what is being learned (and solved) is perception of useful features from the raw game images. I wonder whether the features for deducing game state score are actually complex.
I think what I'm trying to say is that when we look at the success of deep learning, we have to separate out what part of that is due to the fact that deep learning is the go-to blackbox classifier, and what part of this is due to the systems we use actually being a good model for the problem. If the model isn't good, does that model merely need to be tweaked from what we currently use, or does the model have to completely change?
Arguing from the other direction, neural networks have also proven able to deal with very sharp features. For example, the value and policy networks in AlphaGo are able to pick up on subtle changes in the game position. The changes from the placement of single stones can be vast in Go, and this is by no means solved only by the Monte Carlo tree search. Without MCTS, AlphaGo still wins ~80% of the time against the best hand-crafted Go program. The value and policy networks have pretty much evolved a bit of boolean logic, simply from the gradient provided by the smoothness of averaging over a lot of training data.
I have a pet theory that the discovery of sharp features and boolean programs might heavily rely on noise. If the error surface becomes too discrete, we basically need to fall back to pure random optimization (i.e. trying a random direction and keeping it if it is better). That allows us to skip down the energy surface even without the presence of a gradient. Of course, such noise can also lead to forgetting, but it seems that elsewhere the gradient will be non-zero again, so any mistakes will be corrected by more learning (or it simply leads to further improvement if the step was in the right direction). Surely, our episodic memory helps in the absence of gradient information as well. If we encounter a complex, previously unknown Go strategy, for example, it will likely not smoothly improve all our Go-playing abilities by a small amount. Instead, we store a discrete chaining of states and actions as an episodic memory, which allows us to reuse that knowledge simply by recalling it at a later point in time.
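A minimal sketch of that fallback, on a deliberately piecewise-constant loss where the gradient is zero almost everywhere (a toy setup, not a claim about real training):

```python
import random

random.seed(0)

def loss(x):
    # A staircase: piecewise constant, so the gradient is zero almost
    # everywhere and plain gradient descent would stall on any plateau.
    return abs(int(x))

x = 40.0
for _ in range(20000):
    candidate = x + random.gauss(0, 1.0)   # random perturbation ("noise")
    if loss(candidate) <= loss(x):         # keep it only if no worse
        x = candidate

assert loss(x) == 0   # pure random search still skips down the staircase
```

Accepting equal-loss moves lets the search drift along each flat step until it happens to fall off the edge, which is exactly the "skipping down without a gradient" behavior described above.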
Isn't that basically Monte Carlo?
It is funny how every AI post on HN turns into a speculative discussion full of phrases like "I think", "likely", "I suspect", "My guess", etc., when all the research is available for free and everyone is free to download and read it to get a real understanding of what's going on in the field.
>what I see is the fact that networks have huge capacity to fit to data and are deep (rely on a hierarchy of features).
Actually, recurrent neural networks like LSTMs are Turing-complete, i.e. for every halting algorithm it is trivial to implement an RNN that computes it.
It is non-trivial to learn these parameters from algorithm IO data, but for many tasks it is possible too.
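As a toy illustration of a recurrent state update computing an algorithm (a hand-wired cell, not a trained LSTM, and deliberately simplified):

```python
def parity_rnn(bits):
    """A hand-wired recurrent cell: state h in {0, 1} tracks running parity.

    h_t = |h_{t-1} - x_t| implements XOR on {0, 1} inputs, so a single
    recurrent unit computes parity over input of unbounded length -- something
    a fixed-depth feedforward net cannot do for arbitrary lengths.
    """
    h = 0.0
    for x in bits:
        h = abs(h - x)   # XOR as a simple piecewise-linear update
    return int(h)

assert parity_rnn([1, 0, 1, 1]) == 1   # three ones -> odd parity
assert parity_rnn([1, 1]) == 0
assert parity_rnn([0] * 100) == 0
```

Note |a - b| can itself be written with two ReLUs, relu(a - b) + relu(b - a), so the update stays within standard network building blocks.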
>I suspect that most of what is being learned (and solved) is perception of useful features from the raw game images.
It is not this simple: deep enough convnets can represent computations, and the consensus is that the middle and upper layers of convnets represent some useful computation steps. Also note that the human brain can only do so many computation steps to answer questions in dialogue, due to time and speed limits.
>My guess is we may need a fundamental breakthrough in a newfangled hierarchical learning system that is better suited for language to “solve” NLP.
This is being worked on, see the first link for Memory Networks and Stack RNNs, DeQue RNNs, Tree RNNs. Deep learning is a very generic term, there are dozens of various feedforward and recurrent architectures that are fully differentiable. The full potential of such models has not been nearly reached yet and maybe language understanding will be solved in the coming years (again, the first link shows that it is in process of being solved).
EDIT (reply to below): in general these statements are either vague and nonspecific, or perfectly correct and non-informative, comments that don't have much to do with my original point.
>Turing-completeness is quite broad and nonspecific, like I said.
It is, but feedforward models (and almost every Bayesian/statistical model) don't possess it even in theory, while RNNs do.
>Doing "some computation" is an obvious statement that doesn't add any information.
Let me be more specific: currently researchers think that the later stages of CNNs do something that is more interpretable as computation than as mere pattern matching. Our world doesn't require a 50-level hierarchy, but resnets with 50+ layers do well, apparently because they learn some non-trivial computation.
>the jury is still out on whether any of those RNN approaches will be the needed breakthrough.
Sure, we'll see. Maybe there won't be a need for any breakthrough, just incremental improvement of models. And even current models, when scaled up to next-gen hardware (see Nervana), can surprise us again with their performance.
My skepticism is about success in the sense of commercially useful systems that can process language and function "off the leash" of human supervision without the results being dominated by unacceptably bad results.
Look at the Xbox One Kinect vs. the Xbox 360 Kinect. On paper the newer product is much better than the old one, but neither is any easier or more fun to use than picking up the gamepad. In the current paradigm, researchers can keep putting up better and better numbers without ever crossing the threshold to something anybody can make a living off.
This is probably due to the fact that the field is very interesting and has lots of undefined boundaries, so people like to take educated guesses based on the knowledge they might have and on their intuition. Fair enough for this discussion.
> maybe language understanding will be solved in the coming years
OK, here comes my guess: I think reasoning about and producing computer programs should be easier than reasoning about and producing natural language. So if that's possible (big if), then it should come first. And then maybe the NLP will be solved with the help of code writing computers. Or maybe just by code writing computers, and nobody here has a job anymore :)
I wonder if it's just a different kind of "noise". Higher level, more structured.
> My guess is we may need a fundamental breakthrough in a newfangled hierarchical learning system that is better suited for language to “solve” NLP.
It seems fairly evident that there are many hierarchies inside the brain, each level working with outputs from lower-level processing units. In a sense, something like AlphaGo is hierarchy-poor - it has a few networks loosely correlated with a decision mechanism.
But the brain probably implements a "networks upon networks" model, that may also include hierarchical loops and other types of feedback.
I think, to have truly human level NLP, we'd have to simulate reasonably close the whole hierarchy of meaning, which in turn is given by the whole hierarchy of neural aggregates.
EX: "How long do stars last?" means something very different in a science class than in a tabloid headline. Is that tabloid talking divorce or obscurity? Notice how, three sentences in, I am clarifying "last".
EDIT: a combination of noise, I should say, and paucity of information.
Asking a computer to solve all the ambiguity in human language perfectly is asking it to solve it far better than any human can.
For human-level NLP, you need to model the mechanism by which the relationship network is generated, and ground it in a set of experiences - or some digital analogue of experiences.
Naive statistical methods are not a good way to approach that problem.
So no, Wikipedia will not provide enough context, for all kinds of reasons - not least of which is the fact that human communications include multiple layers of meaning, some of which are contradictory, while others are metaphorical, and all of the above can rely on unstated implication.
Vector arithmetic is not a useful model for that level of verbal reasoning.
For an AI to determine its own goals, well, now you get into awareness... consciousness. At a fundamental mathematical level, we still have no idea how these work.
We can see electrical signals in the brain using tools and know it's a combination of chemicals and pulses that somehow make us do what we do ... but we are still a long way from understanding how that process really works.
I'd actually just say that we've not really defined these very well, and so arguing about how far along the path we are to them isn't that productive.
In a similar context they probably end up parsed to the same question, assuming correct inflection, posture, etc. Spoken conversations are messy, but they also have redundancy and pseudo-checksums. Written language tends to be more formal because it's a much narrower channel and you don't get as much feedback.
PS: It's also really common for someone to ask a question when they don't have enough context to understand what question they should be asking.
A further comment on deep methods being state of the art currently:
I wonder how well these tasks really measure progress in natural language understanding (I really don't like isolating that term as some distinct subdiscipline of broader AI goals, but so be it). Some of Chris Manning's students have at least started down the path of examining some of these new-traditional tasks in language, and found that perhaps they are not so hard as they claim to be.
 A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. Chen, Bolton & Manning [https://arxiv.org/abs/1606.02858]
IME as a chatbot developer, people don't talk to them in conversational English so much as spit out what they want the bot to do.
But something about the very use of hierarchy in trying to solve NLP makes me queasy. I think it's more (poetically-metaphorically) like Reed-Solomon codes than hierarchies (to the extent that those don't actually overlap). There is Unexplained Conservation of Information That Really Isn't There To Start With.
Language is only a communication protocol, most efficient in an interactive context (dialog), that allows two agents with shared but not identical sets of experiences to achieve understanding in some domain and context, with the caveat that understanding is unprovable and not absolute. Understanding is only empirically tested and behaviorally probed (e.g. long after a successful conversation, Agent Alice discovers that Agent Bob "did not get it" as she expected).
By analyzing language alone, without experiences, the machine, using something like word2vec, may discover semantic dependencies (e.g. man + cassock = pedophile) but not true semantics that has consequences in the world.
Even with unlimited language corpora the machine does not have the set of axioms that humans have (experiences and observed stories). These axioms are needed to build further more abstract knowledge.
So I'm not surprised computers keep struggling with language applications. Once they succeed, strong AI will not be much further away.
I'll raise your bet; I'm willing to believe that once we succeed in building a general understanding of language, we'll look back and see that we simultaneously have solved Strong AI. To twist the old saying, I think that language is what the human brain does.
 Yes, we can talk about P-zombies if you want. But I mean more in the Turing Test sense here.
 Yes, I know the progress has been impressive. The progress in the 60s with GOFAI was impressive at first too. Then it plateaued.
 I'm particularly referring to Sapir-Whorfishm and the various communication heuristics proposed by Grice. But I'd throw Chomskian Universal Grammar in there too.
Another grounding source is related to ontologies. We are already building huge maps of facts about the world like "object1 relation object2".
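A minimal sketch of the "object1 relation object2" idea: a fact base of triples with a naive pattern-matching query. The facts and relation names here are hypothetical placeholders, not from any real knowledge base.

```python
# A tiny triple store: facts as (subject, relation, object) tuples.
facts = {
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
    ("Louvre", "located_in", "Paris"),
}

def query(subj=None, rel=None, obj=None):
    """Return all triples matching the fields that are not None."""
    return {
        (s, r, o) for (s, r, o) in facts
        if (subj is None or s == subj)
        and (rel is None or r == rel)
        and (obj is None or o == obj)
    }

print(sorted(query(rel="located_in")))
```

Real systems add inference (e.g. chaining `located_in` facts transitively), but the storage model really is about this simple.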
Another source of "common sense" is word embeddings. In fact it is possible to embed all kinds of things: shopping bags, music preferences, network topologies, anything, as long as we can observe objects in context.
Then there is unsupervised learning from video and images. For example, starting from pictures, cut them into a 3x3 grid, shuffle the tiles, and then task the network with recovering the original layout. This automatically extracts semantic information from images, unsupervised. A variant is to take frames from video, shuffle them around, then task the network with recovering the original temporal order. Using this process we can cheaply learn about the world and provide this knowledge as "common sense" for NLP tasks.
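The key property of these "jigsaw" pretext tasks is that the labels come for free. A sketch of just the data-generation step (the network itself is omitted; the tiles here are placeholder strings standing in for image patches):

```python
import random

def make_jigsaw_example(tiles, rng=random):
    """Shuffle 9 tiles of a 3x3 grid; the permutation is the free label.

    tiles: list of 9 tile objects (in practice, image patches).
    Returns (shuffled_tiles, order) where order[pos] is the original
    index of the tile now sitting at position pos.
    """
    order = list(range(9))
    rng.shuffle(order)
    shuffled = [tiles[i] for i in order]
    return shuffled, order

tiles = [f"tile{i}" for i in range(9)]
x, y = make_jigsaw_example(tiles)

# Sanity check: the label is exactly the information needed to undo
# the shuffle, so no human annotation was required.
restored = [None] * 9
for pos, orig_idx in enumerate(y):
    restored[orig_idx] = x[pos]
print(restored == tiles)
```

A network trained to predict `y` from `x` is forced to learn what object parts look like and how they fit together, which is where the cheap "common sense" comes from.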
I am not worried about grounding language. We will get there soon enough; we're just impatient. Life evolved over billions of years, and AI is just emerging now. Imagine how much computing power is in the collected brains of humanity, compared with how much computer time we give AI to learn. AI is still starved of raw computing power and experience. Human brains would have done much worse with the same amount of computing.
Ontologies are much the same; they are interesting for the problems they solve, but it's not clear how well those problems relate to the more general problem of language.
Word embeddings are also quite interesting, but again, they are typically based entirely on whatever emergent semantics can be gleaned from the structure of documents. It's not clear to me that this is any more than superficial understanding. Not that they aren't very cool and powerful. Distributional semantics is a powerful tool for measuring certain characteristics of language; I'm just not sure how much more useful it will be in the future.
Unsupervised learning from video and images is a strictly different problem that seems to me to be much lower down the hierarchy of AI Hardness. It's more like a fundamental task that is solvable in its own universe, without requiring complete integration of multiple other universes. Whether the information extracted by these existing technologies is actually usefully semantic in nature remains to be seen.
I agree that we'll get there, somewhat inevitably; not trying to argue for any Searlian dualistic separation between what Machines can do and what Biology can do. I'm personally interested in the 'how'. Emergent Strong AI is the most boring scenario I can imagine; I want to understand the mechanisms at play. It may just be that we need to tie together everything you've listed and more, throw enough data at it, and wait for something approximating intelligence to grow out of it. We can also take the more top-down route, and treat this as a problem in developmental psychology. Are there better ways to learn than just throwing trillions of examples at something until it hits that eureka moment?
Regarding the "internal world", we already see the development of AI mechanisms for attention, short term memory (references to concepts recently used), episodic memory (autobiographic) and semantic memory (ontologies).
I think language is a UI to our own brain. It allows us to interact with its knowledge system and representation of the world. The self is a thin client running on the vast knowledge system. If you think about it, conscious thinking is not where the real thinking happens: we get intuition signals from the brain about what is true or false, and those are required for our higher-level thinking. So the thinking we do is also a thin client running on top of the Brain OS. Both thinking and language are serialization tools for a representation of the world that evolved solely for communication with other brains. Since we don't have a direct neural link with other brains, we have to serialize, and hence language-based thinking.
So I think that to evolve language understanding in machines, we might have to simulate many intelligent agents in a simulated environment and let them collaborate, similar to how our brains collaborated and gave rise to natural languages.
Even within language there is a whole spectrum of skill. Animals like parrots, crows, and great apes, and then people, can learn language at various levels.
Some deep learning models can already learn basic language skills too. The question is, how far can these techniques go. Maybe, pretty far.
I don't think AI will be able to fully grasp the intricacies of human language until it has "lived" long enough to form the links between ideas and experiences. Mainly in the physical realm, as obviously a lot of our human development is shaped by our environments. They will need eyes, ears, and maybe noses. We should also consider giving them subconscious or instinctive reactions to certain stimuli. An AI wouldn't immediately know that, for example, rotten meat is bad to humans because it lacks a nose to send a signal of danger and disgust.
We should also consider the idea of communicating with AI in regular face-to-face speech. Talking is not the same as writing, and it conveys a lot of information beyond just the words.
There are many types of application programmers, but there are 2 types in particular that are interesting. One of them is the purely technology-driven developer. He uses all the new tools, he's read Knuth's books a hundred times, he knows how to build elegant systems. However, he only takes enough interest in the business as is necessary to know what to build. At the end of the day, he'll build the most elegant beautiful system that almost never accomplishes the business goal. He knows how to describe a problem, but he doesn't really know the problem.
The second likes programming, he finds technology fun, but he is really driven by trying to understand the full context of the business. Writing software is a means to see an impact on people. He's driven by seeing a business problem solved. I've only met 2 people in my career who are ACTUALLY like this, they're rare... which is maybe a good thing because they write shit code.
A good engineering team tries to get both of these guys, you have the tech guy making sure your platform is maintainable, and you have the business driven guy who makes sure it's useful. One guy understands the structure of the tool, the other understands the structure of the world the tool is in.
A language is a tool: it can be elegant, it can be beautiful, it can be technically perfect... and, just like poetry, it can have very little practical utilitarian purpose. When I look at how we're using ANNs to develop language today, this is how it feels to me. We're spending so much time trying to figure out how to get a computer to build the most technically perfect sentence that we're missing the maybe more interesting problem of trying to get a computer to understand the world. My son right now isn't old enough to construct a sentence, but he understands what certain things in the world do. He clearly understands that cars move things; he understands you can use the hose to get things wet. He's not that old, but he's developing a mental model of the world. He just doesn't know how to describe it yet.
To me, having a computer look at a crane and then print the word "crane" is interesting, but even more interesting would be giving it 3 pictures (a crane, a building, and a pile of rubble) and teaching it how they combine, the visual equivalent of 1+1=2.
The major flaw i see in manager/corp/team analysis of workers is that it misses out on a portion of the population that is genuinely independently functional and creates and shares the value they create. They don't work for companies because they either don't need to or own their own. These are the ideals worth keeping in mind.
Language isn't some side feature. It's a complicated interface layer that lives on top of an enormously rich, dynamic internal model of the world one lives in.
The article barely touches on this fundamental aspect.
Subjective feelings are not necessarily part of that.
This article on how consciousness evolved does a good job of explaining how it works (finally), and I think it's something we could emulate.
Coding up a simulation that understands itself as one conceptual entity among many is not that interesting. The trick is having a subjective experience of that understanding. It seems to me that: subjective experience + sufficiently advanced conceptual understanding is what gives rise to, is the definition of, self-awareness.
See my other posts in this thread for more thoughts on this.
Neural networks are going to make huge inroads to the AI language problem simply by exposing the AI to example after example of words in varying contexts. But I wonder if the real problem is getting those neural networks to let go of unnecessary data? Humans rely on excited neurons to recognize patterns, but our neurons let a lot of sensory input pass us by to keep from getting bogged down in the details. Are the image-recognition AI's described in the article capable of selective attention? Will they get bogged down in the morass of information in trying to pattern-match every word to every image and context they learn?
Yes they are: https://indico.io/blog/sequence-modeling-neural-networks-par... https://github.com/harvardnlp/seq2seq-attn
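The mechanism in those links boils down to dot-product attention. A minimal single-query sketch, stripped of the learned projections real models use: the query scores each input vector, softmax turns the scores into weights summing to 1, and the output is a weighted average, so the model attends selectively rather than weighting every input equally. The vectors here are made up for illustration.

```python
from math import exp

def attend(query, keys, values):
    """Single-query dot-product attention over a list of key/value vectors."""
    # Score each key against the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax (shifted by the max for numerical stability).
    m = max(scores)
    weights = [exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Output is the weight-blended value vector.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

out, w = attend(query=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                values=[[10.0], [20.0], [30.0]])
# The first key matches the query best, so it gets the most weight.
print(max(w) == w[0])
```

This is also a direct answer to the "selective attention" worry above: the softmax weights are precisely a learned mechanism for letting most of the input pass by.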
I wonder if there's AI focused research that analyses how children learn language as an aspect of their research? Especially kids learning a "new" language.
Paradox? In the extended reals, where infinity is the biggest number, infinity plus one gets you infinity, just as you'd expect. In, say, the study of ordinal numbers, where there are many infinite quantities, it doesn't make any sense to talk about "the biggest number".
"Infinity plus one" is a paradox to a child who believes that infinity is a finite number. When they realize what infinity actually means, the paradox will be resolved.
I once told Rod Brooks, back when he was proposing "Cog" (look it up), that he'd done a really good insect robot, and the next step should be a good robot mouse. He said "I don't want to go down in history as the guy who built the world's best robot mouse". "Cog" was a dud, and Brooks went back to insect level AI in the form of robot vacuum cleaners.
We need more machines which successfully operate autonomously in the real world. Then they may need to talk to humans and each other. That might work.
The big problem in AI isn't language, anyway. It's consequences. We don't have common sense for robots. There's little or no understanding of the consequences of planned actions. We need to get this figured out before we can let robots do much.
EDIT: This is the most volatile comment I've ever posted. It has been going +2, -2, +2, -2 for the last 35 minutes. People seem to love it or hate it.
If you're talking about just being victims of machine logic, we've been suffering that since the invention of the traffic-light traffic jam.
As for reality, creating an AI that will actively hate us is a feat of about the same difficulty as creating an AI that would love us. Those are two opposite points of a tiny island called "has a more-or-less human mind" that floats on a vast ocean named "we all die". The biggest challenge of surviving superhuman AIs is locating that island.
As the words of wisdom say, "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."
The cortical regions are good at creating hierarchical feature maps, and we had a bunch of search algorithms lying around in the parts bin. Presto: present-day A.I.
This approach meshes well with 'big-data' companies in possession of large compute stacks and data sets. So, it's the direction things went.
A perfect opportunity for disruption..
The current wave is on borrowed time.
Even very basic creatures with less intelligence "learn" because they "want to live".
That is the key - You want to stay alive.
You can't be immortal. You don't live to learn forever; you live to stay alive and feel happy. And that is what drives us to learn.
Could this be true in case of machines?
Regardless of the rich context a word may reside in, or the infinite sample pool of text upon which we may unleash our learning robots, as long as they are all words, you will never encounter the real apple to which the symbol is linked, nor reach the reality to which all the symbols are linked. The machine will learn something. It just won't be anything like what we know or understand, or what generated the paper trail in the first place. They will be awkward simulations, which is exactly what we have.
My now 20-month-old son wasn't born literate, but already spoke. A grunt, a groan, a giggle, and a moan. These are his words. There is nuance, there is rhythm, there is intention, and there is tone. Not that I'm any good at writing baby books, but there is a reason babies enjoy rhyming and puns and silliness. The point is, words offer so much more than their meaning. This is the language they understand. They don't know English, and we aren't teaching them English. But they are slowly but surely articulating themselves, be it that they're hungry, lonely, or just want that grape. In fact, who is teaching whom? Parents learn the language of their child first to be any good at parenting. The expressions of their children's intelligence precede the expressions of our own. Maybe all we need is a machine that groans.
I find it no coincidence that the philosophy of Ludwig Wittgenstein evolved with his experiences teaching children. And if I were to make a bold prediction, the field of AI will benefit immensely from all the young and talented AI researchers who start having kids of their own. The comments here already seem to attest to this. It's either that or giving up and deciding to teach preschoolers for a while. Either way, we'll soon have our book on AI that will do what Philosophical Investigations did for philosophy. And I can't wait.
But once you have a system with human-level speech recognition and semantic mapping, where do we go? The ability to have a meaningful dialogue with a machine seems very difficult to model as a machine learning problem (what constitutes ground truth? What's the reward function look like?), and also has to deal with many unknowns. For example, ask a smart assistant like Alexa or Siri about functionality it wasn't programmed with, and you get a terse "Sorry, can't help you with that." But ask a child, and you prompt a question-answer dialogue (i.e. learning) or perhaps feigned understanding. My toddler son is an expert at giving me the answer he thinks I want to hear, even when he has no idea what I'm talking about :) There are certainly many new problems which we can begin to think about tackling, but certainly no sign IMO that we're running out of applications for deep learning in the field of language.
Aren't you describing learning, in general? Physics, math, biology, etc
There are things you cannot put into words, and yet you think them. There are things that you can't put into words and yet you can make people around you understand them. There are things you understand without even knowing you understand them. But even before we go there: there are so many things that people can make utterances about that are not possible to collect into example sets and train models on.
How do you collect examples of whatever it is that makes people lie on the beach to get a sun tan? How do you collect examples of imagination, dreams, abstract thinking, all those things that your brain does that may be a side-effect of self-aware intelligence, or the whole point of self-aware intelligence in the first place?
How do you collect a data set that's as big as the whole world you've experienced in your however many years of life? And even if you could, what machine has the processing power to train on that stuff, again and again, until it gets it right?
Machine learning meaning is hopeless, folks. Fuggeddabout it. There's not enough data in the whole world, and there's no machine big enough to process it if it existed. We'll make some advances in text processing, sure; we'll automate some useful stuff like translation (for languages close to each other) and captioning (for photographs), and then we'll stall until the next big thing comes about a few generations from now.
That's what the current state of the art suggests.
I am a very bad example though, for one because English is just my second language. Sure, there is thinking before words are learned. Language is a complicated problem to talk about, just like self-awareness, and consciousness is a very nebulous term to me. Still, you'd have to prove that language is theoretically unfit. Any such logic might be incomplete if you suppose you cannot put it into words. A complete first-order logic is expressible, however, following Gödel's completeness theorem.
Without a human analyzing the transcripts, it's very difficult for the chat bots to know which of the inputs they receive are "good" or "better."
Even the idea of good language is subjective. We all know there is such a thing, but nearly everyone has different ideas of what this is.
However, if language is more integral than that, if language is more a facet of intelligence than a building block, if language is the structure of sentience rather than something sentience leverages, then no, all the chat bots in the world won't help. We need something that can integrate more than just plain-text embeddings of incredibly intricate and complex structures. We can't crack this code without a key.
My guess is that this latter scenario is the likely case.
A real conversation is about conveying understanding, not about the words spoken.
AlphaGo was trained on however many zillions of games and by playing against itself, but does it actually understand anything about the game? Or can it simply react to the current state of the game and suggest what the next move should be? It will never have a leap of intuition causing it to say "the only winning move is not to play."
Intelligence is not purely reactive.
Basically, you need to ask questions that require meta-cognition, like, "What does Mary think about you?" That requires:
* Understanding of yourself as an entity.
* Understanding of Mary as another entity, with its own state.
* The capability to use previous interactions to approximate that other entity state.
This is, to me, the most significant way in which we can mimic how the human cognitive process develops associations between things. Auto-association is key here.
In addition, understanding how to calculate similarities between vectors is also important.
Imagine downloading "english teenager slang 2016 v2.0" to your home AI, so it can understand what the hell your kids are saying :)
So if you don't want the system to be gameable, such public blobs of weights may need to be avoided.
You can't "understand" language without having the model of the world that humans construct during their lives and education.
So pretty much the next step for language recognition is indistinguishable from sentience.
Since success has a higher priority for researchers than explicable success, and if the "singularity" is just progress that is not understood, it may be almost here - and not require true AI.
Though to be fair, by that definition, the singularity has always been with us, since we don't understand how we think.
My theory, and I'd love to find someone offering a similar and more fleshed-out hypothesis, is that conscious experience serves as a universal data type. It can encode and play-back any type of knowledge and memory, and relationships among them, from the color of the teacher's shirt that time you broke your bone in 3rd grade, to the formula for electron energy in quantum mechanics.
Unfortunately, the word consciousness is almost forbidden in most scientific circles. The dominant view is that there is no Hard Problem of Consciousness and that any discussion of it is quackery, or at least "not science". This taboo is holding us back.
I was partly inspired to pursue the path I took in R&D by observing that the industry didn't seem to want to consider someone thinking differently or working on the true foundation of A.I (the hard problems).
I figured, if I was able to write software up and down the stack for the billion-dollar network infrastructure equipment that powers and services the internet, I probably knew what I was doing w.r.t. engineering.
A networked system with a missing foundation...The rest is history and I look forward to making disclosures about my work in the near future.
In the meantime, you should know that there are quite capable and industry proven individuals working on this. They aren't quacks, maintain graduate degrees from the top universities of America, and have a proven track record in the industry. The spotlight just doesn't shine in their direction. Of course, once a functional model is proof'd, I'm sure that will change. Such is the history of new paradigms and those who, through deep and new understanding, seek to usher them in...
This is also why I think that deep learning / neural networks are only going to take us so far. I think there is more to the story of how the brain works than only neural networks that make predictions, and frankly I do not think that any system that does not at least attempt to do "that" (simulating consciousness's model building/manipulation feature) will have much better luck at language processing/understanding.
It seems that this phenomenon, whatever it is, plays a central role in sensory perception, and there's reason to think that it's present even in animals with simple brains. So I suspect that we're looking for some kind of simple operation that can happen on the scale of a small number of neurons, maybe even a single neuron.
This is all speculation of course, informed by some knowledge and intuition, but speculation nonetheless. But it's the only way to push the frontier, and the unwillingness to engage with consciousness as a matter of serious study seems to be a major failing of brain science and AI.
This is important to understand in the midst of today's AI hype.
So, the foundational problems remain...
They remain because there is no foundation to these cortical systems. Anyone who states this is railed at and laughed at. So, you get what you get...
The article states :
"Machines that truly understand language would be incredibly useful–but we don’t know how to build them."
There are people and groups who know how to build them. They are focusing on the 'foundation' first. That is not where the spotlight or money are directed. So, they remain in the dark.
We gained headway with a very trivial model of neurons and cortex-like hierarchical neural network designs, and the money sent people off to the races. People began writing wrappers, stuccoing the top floor, hacking up scaffolding, applying any C.S. concept they could find in the parts bin to fancify the top floor.
That's where all of the attention and money is: What does your system do? What benchmark can it beat? What data can it classify? What cool trick can it do to impress us? So you get impressive trick systems that require massive amounts of data, training, and answer maps to obscure the lack of intelligence. As there is none explicitly designed into these systems, the system cannot convey its understanding.
It's nothing more than an answer map with annealing routines and memory... very similar to cortical regions.
The foundation and supporting layers up to the top have been ignored, and they aren't getting any spotlight or money, nor are the individuals who continue to toil on them.
They're considered to be 'philosophers' and jokers, not real scientists/engineers/industry leaders. The A.I space shuts out a huge pool of varying opinions via its "if you don't have a PhD, need not apply" attitude. If you're approaching it from methods other than the ones subscribed to, and you're not a name or a face and don't have a laundry list of papers, you get the: good luck (thumbs up).
And people stand around and wonder why the fundamental problems remain? Come on...
In any event, it won't remain that way for long, and that will be due to someone or some group actually investing the time and energy to build a sound foundation. This begins first and foremost with deep philosophical questions about the nature of the universe and intelligence. The answers derived serve as a guiding light for further scientific and engineering pursuits.
This article should be: AI's lack of a foundation. Who's going to build it? Who's going to invest the time to understand what exactly it is, as opposed to hacking away at it?
It's the truth but would be considered a 'hit piece'.
Until someone constructs a proper foundation, no one is going to give credence to the idea that current A.I lacks it. Hindsight is 20-20 as is a force-fed neo-cortex.
Ultimately language use requires a few skills:
* a good parser
* motor cognition/coordination
* a good memory
* situational awareness
The first two in the list are what small children struggle with the most. Fortunately, we can eliminate motor coordination as a need for AI. Although extremely powerful parsers demand specialized expertise to produce, this part of the problem is straightforward. I write multi-language/multi-dialect parsers as an open source hobby.
I discount vocabulary and situational awareness, because most children still haven't figured these out until they enter high school, long after they have learned the basics of speech. That pattern of human behavior suggests that while it might be hard to teach these skills to a computer, you can put them off a long ways down the road, until after basic speech is achieved.
If somebody paid me the money to do this research my personal plan of attack would be:
1. Focus on the parser first. Start with a text parser and do audio to text later. Don't worry about defining anything at this stage. When humans first learn to talk and listen they are focusing upon the words and absolutely not what those words mean.
The parser should not be parsing words. Parsing words from text is easy. The parser should be parsing sentences into grammars, which is harder but still generally straight forward with many edge cases.
2. Vocabulary. Attempt to define the words comprising the parsed grammar. Keep it simple. Don't worry about precision at first. Humans don't start with precision, and humans get speech wrong all the time. This is especially true for pronouns. Just provide a definition.
3. Put the vocabulary together with the parsed grammar. It doesn't even have to make sense. It just has to have meaning for words and the words together in a way that informs an opinion or decision to the computer. Consider this sentence as an example: I work for a company high up in the building with a new hire that just got high and gets paid higher than my high school sweetheart.
4. If the sentence is part of a paragraph or a response in a conversation, you can now focus on precision. You have additional references to draw upon. You are going to redefine some terms, particularly pronouns. Using the added sentences, decide whether new definitions apply more directly than the original definitions. This is how humans do it. These repeated processing steps mean wasted CPU cycles, and it's tiring for humans too.
5. Formulate a response. This could be a resolution to close the conversation, or it could be a question asking for additional information or clarity. Humans do this too.
6. Only based upon the final resolution determine what you have learned. Use this knowledge to make decisions to modify parsing rules and amend vocabulary definitions. The logic involved is called heuristics.
The only way all this works is to start small, like a toddler, and expand until the responses become more precise, faster, and more fluid. At least... this is how I would do it.
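Steps 1-4 above can be sketched in a few lines. This is a deliberately toy illustration, with a hypothetical mini-lexicon standing in for real vocabulary: parse words, attach rough definitions without worrying about precision, then use the previous sentence to re-resolve pronouns (crudely, the way a first guess works).

```python
# Hypothetical mini-lexicon: word -> rough definition category.
LEXICON = {
    "alice": "person", "dog": "animal", "ball": "object",
    "threw": "action", "the": "determiner", "a": "determiner",
    "it": "pronoun", "she": "pronoun",
}

def parse(sentence):
    """Steps 1-2: split into words and attach a rough definition to each."""
    words = sentence.lower().strip(".").split()
    return [(w, LEXICON.get(w, "unknown")) for w in words]

def resolve_pronouns(parsed, previous_parsed):
    """Step 4: replace each pronoun with the most recent noun-like word
    from the previous sentence (a crude, human-like first guess that a
    real system would refine with more context)."""
    candidates = [w for w, d in previous_parsed
                  if d in ("person", "animal", "object")]
    out = []
    for w, d in parsed:
        if d == "pronoun" and candidates:
            out.append((candidates[-1], LEXICON[candidates[-1]]))
        else:
            out.append((w, d))
    return out

first = parse("Alice threw the ball.")
second = resolve_pronouns(parse("She threw it."), first)
print(second)
```

Note how the crude resolver wrongly binds "she" to "ball": exactly the kind of imprecise first pass the plan accepts, to be corrected later by the heuristics of step 6.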
But look at what Deepmind does: it takes these ideas (and also ideas from systems neuroscience), implements them as differentiable modules and trains them on data in end-to-end fashion. This works really well.
Learning is very important, much more important than architecture. If you have a model that can learn you can add more structure later - again this is what modern deep learning is all about.
But that happens in text too: we group things into paragraphs and add a lot of punctuation and as we read we sometimes skim a bit, return as needed, reread what we missed the first time. (Or in texts/IMs our cultures are in the process of building whole new sub-dialects of error correction codes like emoji and "k?".)
A lot of people would think a machine is broken if it hemmed and hawed as much as people do in a normal conversation; if it needed full paragraphs of text to context set and/or explain itself.
The biggest thing lacking in voice recognition right now is not the lack in word understanding or any of the other NLP areas of research: it's in a lot of the little nuance of conversation flow. For now, most of the systems aren't very good at interruptions, for instance. From the easy like "let me respond to your question as soon as I understand what you are asking to save us both time" to the harder but perhaps more important things like "No [that's not what I mean]" and "Wait [let me add something or let me change my mind]" and "Uh [you really just don't get it]" and presumably really hard ones like clears throat [listen carefully this time].
The point should not be that we hit 100% accuracy: real people in real conversations don't have 100% accuracy. The issue is how do you correct from the failures in real time and keep that "conversational" without feeling strained or overly verbose (such as the currently common "I heard x, is that correct?" versus "x?" and head nod or very quick "yup").
We don't consciously think about the error correction systems in play in a conversation so that makes them hard to judge/replicate and it's easy to imagine there's an uncanny valley waiting there for us to get from no "natural error correction" ability across to supporting error correction in a way that it works with our natural background mechanisms.
At least in my mind, that's probably the next big area to study in language recognition is deeper looks into things like error correction sub-channels and conversational timing (esp. interruption) and elocution ("uh", "um", "you know", "that thing", "right, the blue one"). I'd even argue that what we have today is probably already getting to "good enough" for the long run if it didn't require us to feel like we have to be "so exact" because you only get one sentence at a time and you don't have good error correcting channels with what we have today.
If you pay attention for about 5 minutes it's just nonsense. I mean, they are clearly repeating things from somewhere else that were sensible in their original context, but now they seem to be saying things nearly randomly.