This is true, but it is only part of the answer.
Another part of the answer is what I call the Long Tail of Grammar. It turns out that if you try to write down all the rules of grammar, you will not get 40 or 60 rules, but something more like hundreds or maybe even thousands. Most of those rules are obscure, rare, archaic, or usable only in specific contexts or with specific words. But they are part of the language: a native speaker can use and comprehend them without difficulty, and an NLP system must be able to "understand" them in order to extract the correct meaning from a sentence.
As just a minor example off the top of my head, compare the phrase "peeled peach" with "hairy-peeled peach". The former phrase means a peach without a peel, while the latter means a peach with a hairy peel. So a good NLP system must not only recognize the existence of the two grammatical rules, but also be able to disambiguate them correctly.
A good example of this is the Winograd schema. You might think you can figure out a good algorithm for anaphora resolution (i.e. if you see "Sally called and she said hello.", who is "she"?) that relies only on the structure of a sentence, without considering semantics.
But here's a counterexample:
"The city councilmen refused the demonstrators a permit because they feared violence."
Who are 'they'?
"The city councilmen refused the demonstrators a permit because they advocated violence."
Now who are 'they'?
If you're like most people, even though only the verb changed, the binding of 'they' based on the deeper semantic meaning also changed.
These sentences are called Winograd schemas, and there are plenty more like them.
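A purely structural resolver of the kind hypothesized above might just bind a pronoun to the nearest preceding noun phrase. A minimal sketch (the function and data layout are my own, for illustration only) shows why that fails on Winograd pairs: it returns the same antecedent for both sentences.

```python
# Naive anaphora heuristic: bind the pronoun to the nearest preceding
# candidate noun phrase. Purely structural; no semantics involved.
def nearest_antecedent(candidates, pronoun_index):
    """candidates: list of (token_index, noun_phrase) pairs."""
    before = [np for i, np in candidates if i < pronoun_index]
    return before[-1] if before else None

# Both Winograd sentences share one surface structure, so the heuristic
# returns "the demonstrators" for both -- wrong for "feared violence".
cands = [(1, "the city councilmen"), (4, "the demonstrators")]
print(nearest_antecedent(cands, 8))  # the demonstrators
```

Since the two sentences are structurally identical, no rule of this kind can separate them; only the semantics of "feared" vs. "advocated" breaks the tie.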
Presumably, such sentences would appear in a paragraph providing additional clues about whom the word 'they' refers to. Given additional context, you could then ask, "Were the protesters or the councilmen fearing violence?" Document summarization and fact extraction systems could then approximate humans at such a task.
What's interesting is that word sense ambiguity underlies a lot of comedy. For instance, "Time flies like an arrow; fruit flies like a banana." The close juxtaposition of the word "like" being used in two different contexts is what makes this sentence "funny". I think it's not too far off to say that we could eventually teach AI systems to recognize humor.
"I dropped the egg on my glass living room table and it broke!"
"I dropped my hammer on my glass living room table and it broke!"
These are both ill-defined semantically, but if you asked most native English speakers "what broke" for each sentence, they'd probably say "egg" for the first and "table" for the second. It could be the other, but it would be surprising.
So, to solve just the "Dropped X on Y, Z broke" problem, we'd need to teach the computer to understand the effect of the relative 'fragility scores' of each object. Personally, I never sat down and memorized a chart of these as a human. You could perhaps use machine learning to derive the data by analyzing a large corpus of text, and match humans most of the time, but then that's just one sentence type solved, out of any number of other tricky constructions. So the long tail of semantic understanding quickly becomes a very fun set of problems to solve, for certain definitions of fun. :)
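As a toy illustration of the 'fragility scores' idea (the scores and names below are invented for illustration, not taken from any real system), the "Dropped X on Y, Z broke" problem reduces to comparing two numbers:

```python
# Toy resolver for "I dropped X on Y and it broke": guess that the
# more fragile of the two objects is the one that broke.
# The scores are made up; a real system might derive them from
# corpus statistics, as suggested above.
FRAGILITY = {"egg": 0.9, "hammer": 0.1, "glass table": 0.7}

def what_broke(dropped, landed_on):
    return max((dropped, landed_on), key=lambda obj: FRAGILITY[obj])

print(what_broke("egg", "glass table"))     # egg
print(what_broke("hammer", "glass table"))  # glass table
```

Of course, this handles exactly one sentence type; the long tail means there's a similar little model lurking behind every tricky construction.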
A few more examples to consider how you would teach a computer to understand, from a Winograd Schema corpus:
John couldn't see the stage with Billy in front of him because he is so [short/tall]. Who is so [short/tall]?
The sculpture rolled off the shelf because it wasn't [anchored/level]. What wasn't [anchored/level]?
The older students were bullying the younger ones, so we [rescued/punished] them. Whom did we [rescue/punish]?
I tried to paint a picture of an orchard, with lemons in the lemon trees, but they came out looking more like [light bulbs / telephone poles]. What looked like [light bulbs / telephone poles]?
E.g. http://cs.rochester.edu/research/lore/
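For experimenting, one of the schemas above can be represented as a template plus the special word that flips the answer. The field names below are my own invention, not any official corpus format:

```python
# One Winograd schema: a template with a slot whose filler flips the
# correct antecedent of "them". Field names are illustrative only.
schema = {
    "template": "The older students were bullying the younger ones, "
                "so we {verb} them.",
    "candidates": ["the older students", "the younger ones"],
    "answers": {"rescued": "the younger ones",
                "punished": "the older students"},
}

for verb, answer in schema["answers"].items():
    sentence = schema["template"].format(verb=verb)
    print(f"{sentence} -> 'them' = {answer}")
```

A system is evaluated on whether it picks the right candidate for each filler; guessing by structure alone scores 50%.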
Or are we just well-designed, auto-trained neural networks? :D
In the glass table example, the bot could be explicitly 'dumb' or pedantic and ask for clarification. Or perhaps even better, the bot could simulate natural conversation: assume one or the other (with maybe a dash of built-in knowledge about the world), make that assumption explicit to the user, and allow the user to correct it. This might even make the bot more human and pleasant to interact with.
"A violent mob requested a demonstration from the councilmen. The councilmen refused the permit, because they feared the violence."
I suspect grammar begets normalization, with primary and secondary keys just like in relational databases. People are just not very good at it. E.g., I'd contest the consistency of those 1000 grammar rules. Case in point: the word "violence" needs the definite article because violence is an abstract concept (which the parent missed). All the while, the indefinite and definite articles serve other purposes, e.g. the quantifiers from logic (for all, there exists), which are at odds with the naive countability of violence.
So language is ambiguous, NLP is done probabilistically, and thus it is hard, with at least exponential complexity.
Edit: What I mean is that the problem here is contraction omitting context. Of course databases worked before relational databases, but sometimes you really want third normal form.
No, searching for "inciting violence"/"fearing violence" in Google gives thousands of hits. (Also in Google Books, if you want to claim all these websites are wrong.)
It is perfectly OK to use "violence" without an article.
Wouldn't omitting the "the" in my sentence mean they feared violence in general? Sure, broad contracts are welcome, but then there couldn't be a specific answer to who "they" are. I guess that's in agreement with what you said.
Is that a rule of grammar, or simply the meanings of the adjectives "peeled" and "hairy-peeled"?
- Xed Y = a Y that has been Xed, where X is a verb
- Xed Y = a Y that is equipped with an X, where X is a noun.
For the latter pattern, consider examples such as "red-lipped woman", "rosy-fingered dawn", "sharp-toothed pike", "horned owl", etc.
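A rough sketch of how a system might pick between the two "Xed Y" patterns, using tiny hand-made lexicons and a strip-the-"ed" stemmer as deliberately naive stand-ins for a real morphological analyzer (everything here is hypothetical):

```python
# Disambiguate "Xed Y" (peeled peach = peach that has been peeled)
# vs "modifier-Xed Y" (hairy-peeled peach = peach equipped with a
# hairy peel). Lexicons and stemming are toy stand-ins.
VERBS = {"peel", "paint", "boil"}
NOUNS = {"peel", "lip", "horn"}

def interpret(x, y):
    if "-" in x:  # "hairy-peeled": a modifier signals the noun pattern
        modifier, stem = x.rsplit("-", 1)
        base = stem[:-2]  # naively strip "ed"
        if base in NOUNS:
            return f"a {y} equipped with a {modifier} {base}"
    base = x[:-2]
    if base in VERBS:
        return f"a {y} that has been {x}"
    return "unknown"

print(interpret("peeled", "peach"))        # a peach that has been peeled
print(interpret("hairy-peeled", "peach"))  # a peach equipped with a hairy peel
```

Note that "peel" is in both lexicons, exactly the ambiguity under discussion; here the presence of a hyphenated modifier is what tips the reading toward the noun pattern.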
Because syntactic change.
There's a difference between coining a new word and changing a grammar rule.
I'd much rather feed it edge cases to accommodate than rules to follow.
Long story short, there's no such thing as grammar; it's a nice fiction for talking about communication. The deeper you get into NLP, the more you (1) see what Jelinek was talking about when he said "every time I fire a linguist, accuracy goes up" (he was hiring physicists and information theorists), and (2) realize that basically every thought, belief, and statement is deeply ambiguous, and that most human communication is ad hoc.
Also, the more time you spend looking at live data from users, the more you realize that the notion of language as a generally shared system of meaning is not real. Trivial communication about basic tasks is doable, although you often will fuck up there too when talking to someone with deeply different cultural expectations.
If you think that _any_ sentence has unambiguous meaning, you should try to meet some people who are very different from you. Or get into deeper conversations with the people you know.
I'm not sure what you mean by this. Certainly adjectives are a thing, and if I learn a new adjective, "anguilliform" for example, I have never heard that in context, but I know exactly how to use it. That is grammar, right?
(Side note: look into the difference between prescriptive and descriptive linguistics for a sense of where I'm coming from on that point.)
So, you learn a new adjective. Surely you can use it like any other adjective, right? Sure. But someone can also very comfortably use it in a way that violates your notions of how adjectives work, and you'll probably know what they mean. Or you'll use it in a context that makes sense to you, but not to someone you're talking to. The way to use language is whatever way allows others to understand you. Paradoxically, that doesn't actually mean adhering to some arbitrary set of rules. Here's an important paper. Basic idea: you get a small community isolated, and they rip the rules to shreds, but what they end up building is often much higher bandwidth and allows for more complex ideas.
The point I'm pushing on is this: engineers especially think of language as an agreed-upon set of rules which can be used correctly or incorrectly. Turns out that in practice it's a chaotic mess of individuals who abuse the rules mercilessly with minimal regard for how they're supposed to use them, and still get along happily. Developing an understanding of the inherent fuzziness of words and structure in communication can actually help a person develop significant capacity for self-expression.
The failure to understand this is one of the reasons most engineers write shitty poetry. :P
I don't think that's right either. It's not fiction, it's definitely something real.
I think your insistence on me being precise here is a wonderful illustration of two competing approaches to language and how they make a synthesis. I'm fairly confident you know what I mean but are choosing not to accept my phrasing, so this exchange may also be an interesting illustration of how language is also the negotiation of power.
Someone who has never seen a peach might assume that "hairy-peeled peach" refers to a hairy peach without a peel. If you've seen a peach before, this makes no sense. So the assumption is that the peach has a hairy peel.
This is why purely statistical / supervised learning-based NLP alone is not enough.
The following are all germane but not mentioned: text analysis/mining, controlled vocabularies, indexing, taxonomies, ontologies, the semantic web, latent semantic analysis, latent Dirichlet allocation, corpus analysis, document similarity analysis, tf-idf, n-grams, and skip-grams, to mention a few.
In general the article is a good idea, but there needs to be more of a description of the domain landscape, and then "paths" plotted through that landscape that lead to interesting and useful competency.
With regards to NLP, I have a site that uses a spider to collect headlines for stocks, and I have been working on clustering, sentiment analysis, and text summarization. But I haven't completed it yet.
The two questions about PhDs do feel a little misplaced for a startup audience. Who here stops and thinks "Am I supposed to have a PhD to do that?" when setting out to start something new? (<insert theranos reference here>)
PhD is just an academic title and as such it is neither a necessary nor a sufficient prerequisite to approach NLP from a mathematical angle.
The best in what? If you mean pushing the boundaries of research, yes, then your path there will likely involve a PhD. If you mean building the best technology products, then being able to read, understand and implement the biggest recent advances is enough and usually requires far less mathematical knowledge.
(That was intentionally written generically since I think that it applies to more than just NLP)
Yes, I mean pushing the boundaries, but I think it is important to stress that a PhD is not a prerequisite for doing that. Anyone who is talented enough can learn the necessary math.
Getting paid for research work is a different story of course. A PhD undoubtedly helps with that, but this is Hacker News. People might figure something out.
What I want are more tools for Natural Language Generation. Can anyone recommend some good ones? (beyond what's on Wikipedia)
(...though I don't agree that NLP is mature, just more mature than this. Lots of room to improve NLP as well!)
Looks less like NLG, and more like picking existing responses from a (probably huge) corpus using ML. Hard to replicate unless you have access to the kind of data Google has.
In a way, vision is easier. Instead of discrete symbols (words), it's a continuous signal, which is much easier to interpret and generate with neural networks. By comparison, the best language models are behind the best image generation models (2-3 years behind, by my estimate).
For example, there are few applications of GANs to text and many applications to images, GANs being the hottest thing in deep learning right now. So you have to keep in mind that NLP is by and large still not solved. There is no decent conversational chatbot yet. We can reason over small pieces of text, but that is far from full understanding. NLP at this level is hard.
What you can easily do now is classify text, detect sentiment and entities, compute word vectors, and do grammatical parsing and summarization. All low-level stuff.
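As an example of that low-level end, sentiment detection in its crudest form is just lexicon lookup. The word lists below are invented for illustration; real systems learn weights from labeled data:

```python
# Minimal lexicon-based sentiment scorer: count positive vs negative
# words. The lexicons are toy examples, not a real sentiment lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def sentiment(text):
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))   # positive
print(sentiment("What a terrible, awful day"))  # negative
```

It's exactly this kind of shallow counting that Winograd-style sentences defeat, which is the gap between "low level" and understanding.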
The goal is to just feed gigabytes of raw text to a huge, complex neural network, and hope it will extract relevant features.
Most advancements in ML are not accomplished by some new super algorithm. Rather, advances are made when new datasets are presented!
It's practical, readable, and it's free.
e.g. for NLP:
> Do I enjoy participating in the act of piloting an aircraft? Or am I expressing an appreciation for man-made vehicles engaged in movement through the air on wings?
Clearly the latter, as the former begets the infinitive, "I love to fly ...".
Maybe I am wrong (going by the American usage of the gerund, I clearly am), but then "I want going flying" sounds ridiculous in any case. Maybe I am missing the difference; as a second-language speaker, I'd love to be corrected.
Can someone point me to a satisfying demo of a professional text summarization software?
http://smmry.com (demo here)
Copy a few articles into text files and get working on implementing some of these methods until you have enough of an understanding to construct your own methods for the fun of it.
Here's some good reading material:
Edit: Don't get deterred by the math formulas in these papers. They look far more complicated than they actually are.
I played around with this a bit to develop https://www.findlectures.com, so knowing what works/doesn't work there I'm developing some NLP scripts to support my use cases.
The only thing one really needs is to count the frequency of words in each document and do very simple math. Tf-idf is still very relevant today and gives you a very good idea of how statistics is used in text mining.
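A bare-bones version of that math (function name and corpus are mine, for illustration): term frequency within each document times log inverse document frequency across the corpus.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf scores computed from raw word counts."""
    docs = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return scores

corpus = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(corpus)
print(scores[0]["the"])  # 0.0 -- "the" appears in every document
```

Words that appear everywhere score zero, so the highest-scoring words in each document make decent keyword candidates, which is most of what you need for basic text mining.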
A few months later and after many iterations + a whole lot of testing, the algorithm now can extract super relevant keywords 90%+ of the time!
I wish I knew about the WikiCorpusExtractor. Thanks for the link!