Hacker News
How to Get into Natural Language Processing (ycombinator.com)
340 points by craigcannon on Jan 20, 2017 | 75 comments

> Why is NLP Hard? ... Language is highly ambiguous - it relies on subtle cues and contexts to convey meaning.

This is true, but it is only part of the answer.

Another part of the answer is what I call the Long Tail of Grammar. It turns out that if you try to write down all the rules of grammar, you will not get 40 or 60 rules, but something more like 100s or maybe even 1000s of rules. Most of those rules are obscure, rare, archaic, or useable only in specific contexts or with specific words. However, they are part of the language, a native speaker will be able to use them and comprehend them without difficulty, and an NLP system must be able to "understand" them in order to extract the correct meaning from a sentence.

As just a minor example off the top of my head, compare the phrase "peeled peach" with "hairy-peeled peach". The former phrase means a peach without a peel, while the latter means a peach with a hairy peel. So a good NLP system must not only recognize the existence of the two grammatical rules, but also be able to disambiguate them correctly.

> or useable only in specific contexts or with specific words.

A good example of this is the Winograd Schema. You might think you can figure out a good algorithm for anaphora resolution (e.g., if you see "Sally called and she said hello.", who is "she"?) that relies only on the structure of a sentence, without considering semantics.

But here's a counterexample:

"The city councilmen refused the demonstrators a permit because they feared violence."

Who are 'they'?

"The city councilmen refused the demonstrators a permit because they advocated violence."

Now who are 'they'?

If you're like most people, even though only the verb changed, the binding of 'they' based on the deeper semantic meaning also changed.

These sentences are called Winograd schemas [1], and there are plenty more like them.

[1] https://en.wikipedia.org/wiki/Winograd_Schema_Challenge

This is a recognized problem and is called Word Sense Disambiguation. It's hard but not intractable. One issue is that such sentences are themselves ambiguous and even human readers may disagree on their meaning. A statistical system can make a guess based on a large corpus of word-context pairs, which can approximate what a human does when attempting to disambiguate the meaning. It won't be perfect, but again, part of this is due to the fact that the sentence as it stands alone is insufficient.

Presumably, such sentences would be contained in a paragraph that would provide additional clues as to whom the word 'they' refers. Given additional context, you could then ask, "were the protesters or the councilmen fearing violence?" Document summarization and fact extraction systems could then approximate humans in such a task.

What's interesting is that word sense ambiguity underlies a lot of comedy. For instance, "Time flies like an arrow; fruit flies like a banana." The close juxtaposition of the word "like" being used in two different contexts is what makes this sentence "funny". I think it's not too far off to say that we could eventually teach AI systems to recognize humor.

Just wanted to say thanks for a great and interesting comment. Never seen the complexity of NLP summed up so well.

re the council people sentences: I don't understand the problem. they're ill-defined sentences. we use heuristics to parse them but those heuristics can fail (the council denied the demonstrators permit because they feared violence... and the council was obliging). just teach the computer the heuristics like we learn them.

That's exactly the issue. The way we learn them is through world experience, which is sometimes hard to figure out how to transfer into a computer.


"I dropped the egg on my glass living room table and it broke!"

"I dropped my hammer on my glass living room table and it broke!"

These are both ill-defined semantically, but if you asked most native English speakers "what broke" for each sentence, they'd probably say "egg" for the first and "table" for the second. It could be the other, but it would be surprising. So, to solve just the "Dropped X on Y, Z broke" problem, we'd need to teach the computer to understand the effect of the relative 'fragility scores' of each object. Personally, I never sat down and memorized a chart of these as a human. You could perhaps use machine learning to derive the data by analyzing a large corpus of text[1], and match humans most of the time, but then that's just one sentence type solved, out of any number of other tricky constructions. So the long tail of semantic understanding quickly becomes a very fun set of problems to solve, for certain definitions of fun. :)
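A minimal sketch of the "fragility score" idea, assuming a hand-made table of scores. The numbers and the `what_broke` helper are invented for illustration and do not come from any real system or corpus:

```python
# Hypothetical fragility scores (0 = sturdy, 1 = very fragile); the values are made up.
FRAGILITY = {
    "egg": 0.95,
    "glass table": 0.6,
    "hammer": 0.05,
}


def what_broke(dropped: str, surface: str) -> str:
    """Guess the referent of "it" in "I dropped X on Y and it broke"."""
    # Pick whichever object is more fragile; ties go to the dropped object.
    if FRAGILITY[dropped] >= FRAGILITY[surface]:
        return dropped
    return surface


print(what_broke("egg", "glass table"))
print(what_broke("hammer", "glass table"))
```

This only covers the one "Dropped X on Y, Z broke" pattern, which is the point of the paragraph above: every other construction would need its own table or its own learned knowledge.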

A few more examples to consider how you would teach a computer to understand, from a Winograd Schema corpus[2]:

John couldn't see the stage with Billy in front of him because he is so [short/tall]. Who is so [short/tall]?

The sculpture rolled off the shelf because it wasn't [anchored/level]. What wasn't [anchored/level]?

The older students were bullying the younger ones, so we [rescued/punished] them. Whom did we [rescue/punish]?

I tried to paint a picture of an orchard, with lemons in the lemon trees, but they came out looking more like [light bulbs / telephone poles]. What looked like [light bulbs / telephone poles]?

[1] e.g. http://cs.rochester.edu/research/lore/

[2] http://www.cs.nyu.edu/faculty/davise/papers/WinogradSchemas/...

Reading this post, I get the feeling that no AI training could be as good as humans until we train on all human-perceivable aspects together: an entire human-brain equivalent.

Or are we just a well-designed, auto-trained neural network? :D

That's pretty cool. I wonder if the recent work on caption generation mixed with video understanding can get some sense of likelihood of disambiguated sentences to evaluate possible hypotheses.

I wonder if at least in the context of chat bots these kinds of problems are mitigated or avoided by the conversational nature of these bots.

In the glass table example, the bot could be explicitly 'dumb' or pedantic and ask for clarification. Or perhaps even better, the bot could simulate natural conversation: assume one or the other (with maybe a dash of built-in knowledge about the world), make that assumption explicit to the user, and allow the user to correct it. This might even make the bot more human and pleasant to interact with.

context is everything.

"A violent mob requested a demonstration from the councilmen. The councilmen refused the permit, because they feared the violence."

I suspect grammar begets normalization, with primary and secondary keys just like in relational databases. People are just not very good at it. E.g., I'd contest the consistency of those 1000 grammar rules. Case in point, the word "violence" needs the definite article, because violence is an abstract concept (which the parent missed). All the while, the indefinite and definite articles serve other purposes, e.g. the quantifiers from logic (for all, there exists), which are at odds with the naive countability of the violence.

So language is ambiguous, NLP is done probabilistically, and it is thus hard, with at least exponential complexity.

Edit: What I mean is, the problem here is contraction omitting context. Of course databases worked before relational databases, but sometimes you really want the third normal form.

> Case in point, the word "violence" needs the definite article, because violence is an abstract concept (which the parent missed).

No, searching for "inciting violence"/"fearing violence" in Google gives thousands of hits. (Also in Google Books, if you want to claim all these websites are wrong.)

It is perfectly OK to use "violence" without an article.

OK, I'm not a native speaker and admittedly didn't reaffirm my claim before posting. Anywho, headline and newspaper speak is not exactly the proper grammar I am talking about, right?

Wouldn't omitting the "the" in my sentence mean they feared violence in general? Sure, broad contracts are welcome, but then there couldn't be a specific answer to who they are. I guess that's in agreement with what you said.

> compare the phrase "peeled peach" with "hairy-peeled peach"

Is that a rule of grammar, or simply the meanings of the adjectives "peeled" and "hairy-peeled"?

In my mind there are two very distinct patterns:

- Xed Y = a Y that has been Xed, where X is a verb

- Xed Y = a Y that is equipped with an X, where X is a noun.

For the latter pattern, consider examples such as "red-lipped woman", "rosy-fingered dawn", "sharp-toothed pike", "horned owl", etc.
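The two patterns can be sketched as a toy rule. Everything here is an assumption made for illustration: the tiny `VERBS` and `NOUNS` lexicons, the crude `stem` helper, and the heuristic that a hyphenated modifier signals the noun reading.

```python
# Tiny hand-made lexicons; both sets are illustrative assumptions.
VERBS = {"peel", "boil", "paint"}
NOUNS = {"peel", "lip", "finger", "tooth", "horn"}


def stem(word: str) -> str:
    """Very crude '-ed' stripping; real morphology is messier."""
    base = word[:-2] if word.endswith("ed") else word
    # Undo consonant doubling, e.g. "lipp" -> "lip".
    if len(base) > 2 and base[-1] == base[-2]:
        base = base[:-1]
    return base


def interpret(adjective: str, noun: str) -> tuple:
    """Return (noun, relation, argument) for an "Xed Y" phrase."""
    modifier, _, head = adjective.rpartition("-")
    base = stem(head)
    if modifier and base in NOUNS:
        # "hairy-peeled peach": hyphenated modifier, so read X as a noun.
        return (noun, "has", f"{modifier} {base}")
    if base in VERBS:
        # "peeled peach": read X as a verb.
        return (noun, "underwent", base)
    if base in NOUNS:
        # "horned owl": bare noun reading.
        return (noun, "has", base)
    return (noun, "unknown", adjective)


print(interpret("peeled", "peach"))
print(interpret("hairy-peeled", "peach"))
```

Note that "peel" sits in both lexicons, so the rule only works because the hyphen check runs first. That ordering is exactly the kind of hand-tuned disambiguation the thread is arguing about.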

yeh it seems more like another example of word sense disambiguation.

Not just that, grammar rules can come into existence/change spontaneously. Here's an example and name for this phenomenon:

Because syntactic change.

That isn't really a rule change. The lexicon just added a new preposition, 'because,' to the old list. It means almost the same thing as the old preposition 'because of.'

There's a difference between coining a new word and changing a grammar rule.

No, it is a rule change. It's turning a sentence fragment into a sentence.

This seems like a great argument for automatic grammar learning. How far has current research taken us in that direction?

I'd much rather feed it edge cases to accommodate than rules to follow.

They tried that in the 70s and 80s until they realized they were wasting their time. Probabilistic context free grammars were a thing, also.

Long story short, there's no such thing as grammar, but it's a nice fiction for talking about communication. The deeper you get into NLP the more you (1) see what Jelinek was talking about when he said "every time I fire a linguist, accuracy goes up" (he was hiring physicists and information theorists), and (2) realize that basically every thought, belief, and statement is deeply ambiguous, and that most human communication is ad hoc.

Also, the more time you spend looking at live data from users, the more you realize that the notion of language as a generally shared system of meaning is not real. Trivial communication about basic tasks is doable, although you often will fuck up there too when talking to someone with deeply different cultural expectations.

If you think that _any_ sentence has unambiguous meaning, you should try to meet some people who are more different than you are. Or get into deeper conversations with the people you know.

Edit: typo

> there's no such thing as grammar

I'm not sure what you mean by this. Certainly adjectives are a thing, and if I learn a new adjective, "anguilliform" for example, I have never heard that in context, but I know exactly how to use it. That is grammar, right?

I'm being hyperbolic, but what I'm suggesting is that grammar is a convenient fiction. The part where you speak of "how to use it" points to the breakdown in your thinking.

(Side note: look into the difference between prescriptive and descriptive linguistics for a sense of where I'm coming from on that point.)

So, you learn a new adjective. Surely you can use it like any other adjective, right? Sure. But someone can also very comfortably use it in a way that violates your notions of how adjectives work, and you'll probably know what they mean. Or you'll use it in a context that makes sense to you, but not to someone you're talking to. The way to use language is in the way that allows others to understand you. Paradoxically that doesn't actually mean adhering to some arbitrary set of rules. Here's an important paper [0]. Basic idea: you get a small community isolated, and they just rip the rules to shreds, but what they end up building is often much higher bandwidth and allows for more complex ideas.

The point I'm pushing on is this: engineers especially think of language as an agreed upon set of rules which can be used correctly or incorrectly. Turns out that in practice it's a chaotic mess of individuals who abuse the rules mercilessly with minimal regard for how they're supposed to use it, and still get along happily. Developing an understanding of the inherent fuzziness of words and structure in communication can actually help a person develop significant capacity for self expression.

The failure to understand this is one of the reasons most engineers write shitty poetry. :P

[0] http://onlinelibrary.wiley.com/doi/10.1111/1467-9481.00177/f...

> what I'm suggesting is that grammar is a convenient fiction.

I don't think that's right either. It's not fiction, it's definitely something real.

Let me try that again: grammar is a way of describing some conventions that are often used but whose force is much weaker than almost everyone thinks. Natural language processing on the basis of grammar gets some of the most frequent uses, but immediately its limitations become extremely clear.

I think your insistence on me being precise here is a wonderful illustration of two competing approaches to language and how they make a synthesis. I'm fairly confident you know what I mean but are choosing not to accept my phrasing, so this exchange may also be an interesting illustration of how language is also the negotiation of power.

> I'm fairly confident you know what I mean


Your example isn't even a specific grammar rule. It's a very broad language rule, which is: "descriptions should conform to the real world."

Someone who has never seen a peach might assume that hairy peeled peach refers to a hairy peach without a peel. If you've seen a peach before, this makes no sense. So the assumption is that the peach has a hairy peel.

This is why purely statistical / supervised learning-based NLP alone is not enough.

I think this is a good idea for a series, although I think more detail needs to be given on the actual path; that is, after all, the purpose of the series. Most of this article seemed to be describing what NLP is and why it's hard. This isn't bad and some attention should be given to it, but people looking to find the path into NLP will already be familiar with most of this information. I was expecting a bit more of a syllabus-type format. There was mention of needing some college-level algebra and statistics; I would have liked more detail in this area, with links to more resources (classes, articles, datasets, etc). Keep up the good work!

Agreed on more substantive detail needed. I was surprised at the lack of mention of many of the basic techniques and domains that a person interested in NLP should consider learning about.

The following are all germane but not mentioned: text analysis/mining, controlled vocabularies, indexing, taxonomies, ontology, semantic web, latent semantic analysis, latent dirichlet allocation, corpus analysis, document similarity analysis, tf-idf, ngrams, and skip grams just to mention a few.
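Two of the items on that list, n-grams and skip-grams, are small enough to sketch directly. This is a simplified illustration: the function names are made up, and `skip_bigrams` covers only the pair case, with "at most `max_skip` tokens in between" as the assumed definition.

```python
def ngrams(tokens, n):
    """Contiguous n-token windows."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Ordered token pairs with at most `max_skip` tokens between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # j ranges over the next (max_skip + 1) positions, clipped to the end.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


tokens = "the councilmen feared violence".split()
print(ngrams(tokens, 2))
print(skip_bigrams(tokens, 1))
```

With `max_skip=0` the skip-bigrams reduce to ordinary bigrams, which is a quick sanity check on the definition.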

In general the article is a good idea, but there needs to be more of a description of the domain landscape and then "paths" plotted through that landscape that lead to interesting and useful competency.

That's a great point. I wonder if there would be a better way to introduce meaningful, actionable topics of study to an introductory-level audience of people who may have never heard of NLP.

I researched deep learning for nlp for a year and compiled this list of papers and articles about some of the most interesting topics.


Have you built anything interesting?

I did some kaggle contests using image stuff.

With regards to nlp, I have a site that is using a spider to collect headlines for stocks and I have been working on clustering, sentiment analysis, and text summarization. But I haven't completed it.


I like the idea of the Paths series, though some of the points in this first article read like they could be written about most "emerging technologies". Anyway, I'm looking forward to the next one!

The two questions about PhDs do feel a little bit misplaced for a startup audience. Who here stops and thinks "Am I supposed to have a PhD to do that?" when setting out to start something new? (<insert theranos reference here>)

I think a better question would be: How much math do I need to approach NLP in a way that enables me to be among the best?

PhD is just an academic title and as such it is neither a necessary nor a sufficient prerequisite to approach NLP from a mathematical angle.

> How much math do I need to approach NLP in a way that enables me to be among the best?

The best in what? If you mean pushing the boundaries of research, yes, then your path there will likely involve a PhD. If you mean building the best technology products, then being able to read, understand and implement the biggest recent advances is enough and usually requires far less mathematical knowledge.

(That was intentionally written generically since I think that it applies to more than just NLP)

>If you mean pushing the boundaries of research, yes, then your path there will likely involve a PhD.

Yes, I mean pushing the boundaries, but I think it is important to stress that a PhD is not a prerequisite to do that. Anyone who is talented enough can learn the necessary math.

Getting paid for research work is a different story of course. A PhD undoubtedly helps with that, but this is Hacker News. People might figure something out.

Perhaps the distinction between "analysts" and "builders" would be better context for discussing math background in startups than PhDs.

There are a ton of libraries and tools available for NLP, so I feel that side is relatively mature.

What I want are more tools for Natural Language Generation. Can anyone recommend some good ones? (beyond what's on Wikipedia)

I tried playing around with NLG a couple of years ago. The only tool I was able to get up and running well enough to do anything was SimpleNLG[1]. It's Java, not my forte, but was straightforward otherwise. Here's a basic example of use: https://github.com/simplenlg/simplenlg/wiki/Section%20V%20%E...


+ this. It's actually much harder than folks think; most solutions are still rules and sentence fragments that you string append. Automated Insights and Narrative Science are both doing interesting work, but more is needed here.

(...though I don't agree that NLP is mature, just more mature than this. Lots of room to improve NLP as well!)

I'm not sure what methods they use, but the "single sentence reply suggestions" created by Google's Inbox are the highest quality natural language generation that I have come across.

I believe this is the paper that describes the approach they are using: http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf

Looks less like NLG, and more like picking existing responses from a (probably huge) corpus using ML. Hard to replicate unless you have access to the kind of data Google has.

If you think about it, all natural language generation is simply picking from a list of possible words to go next. They just do that at a sentence level.
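A toy version of "picking the next word from a list", assuming a bigram count table and greedy selection. This is not how Google's system works; `train_bigrams` and `generate` are hypothetical helpers and the corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count which word follows which in a whitespace-tokenized corpus."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows, start, max_words=5):
    """Greedily append the most frequent next word until we run out."""
    out = [start]
    while len(out) < max_words and follows[out[-1]]:
        out.append(follows[out[-1]].most_common(1)[0][0])
    return " ".join(out)


corpus = "the cat sat on the mat the cat sat on the rug"
model = train_bigrams(corpus)
print(generate(model, "the", max_words=5))
```

Swapping the unit from words to whole sentences gives the "pick from a list of responses" approach discussed above; the data requirement grows accordingly.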

Heh. I suppose, if you have enough sentences to choose from for the various combinations of words, tenses, conjugations, prepositions, and so forth. Can't imagine there's many entities that would have that much data.

I'm a little surprised GATE [1], the General Architecture for Text Engineering tool is not mentioned. It is incredibly flexible, open source and has a very long track record as a research and prototyping tool.

[1] https://gate.ac.uk/

If you want to play with NLP, then just try Gensim, sklearn and Keras. If you're serious about NLP, it's hard stuff; you need a PhD in the field.

In a way, vision is easier. Instead of discrete symbols (words), it's a continuous signal, which is much easier to interpret and generate with neural networks. By comparison, the best language models are behind the best image generation models (2-3 years behind, in my estimation).

For example, there are few applications of GANs to text, and many applications to images, GANs being the hottest thing in deep learning now. So you have to keep in mind that NLP is by and large still not solved. There is no decent conversational chat bot yet. We can reason over small pieces of text but that is far from full understanding. NLP at this level is hard.

What you can easily do now is classify text, detect sentiment and entities, build word vectors, and do grammatical parsing and summarization. All of that is low-level stuff.

So what you are saying is that a computer would be able to understand sign-language more easily than spoken language?

Computers can transcribe spoken language into text with great accuracy, but they can't understand the meaning of text at the same level of accuracy yet. Meaning is much harder than simple transcription. Voice recognition is to speech as OCR is to print. What we want is to speak to computers and have them understand what we mean, like humans. Such an AI would be able to carry a conversation, extract data from and reason over documents, or perform complex actions based on verbal commands. It would need to have a good physical and conceptual understanding of the world, otherwise it could not use reasoning.

NLP right now looks like computer vision did 5 years ago: DL methods are starting to work really well, so a lot of "traditional" methods to process text might soon become obsolete.

The goal is to just feed gigabytes of raw text to a huge, complex neural network, and hope it will extract relevant features.

The problem is datasets. How can you distinguish a good result from a bad result? In some cases, depending on the user, it could be both at the same time.

Most advancements in ML are not accomplished by some new super algorithm. Rather, advancements are reached when new datasets are presented!

This book was really helpful for me when I was getting started with natural language processing: http://www.nltk.org/book/

It's practical, readable, and it's free.

Love the concept of this "How to" series. Seems like it'd be a good opportunity to spotlight the interesting HN threads on any given topic.

e.g. for NLP:

- https://news.ycombinator.com/item?id=11686029

- https://news.ycombinator.com/item?id=11690212

- https://news.ycombinator.com/item?id=1839611

> Take this simple example: “I love flying planes.”

> Do I enjoy participating in the act of piloting an aircraft? Or am I expressing an appreciation for man-made vehicles engaged in movement through the air on wings

Clearly the latter, as the former begets the infinitive, "I love to fly ...".

Maybe I am wrong, going by the American usage of the gerund I clearly am, but then "I want going flying" sounds ridiculous in any case. Maybe I am missing the difference, so as a second language speaker, I'd love to be corrected.

note to self: I read that it might not be the gerund in this case, but the present participle.

If you are interested in working in NLP, feel free to reach out to Kapiche. The website, Twitter or hello at kapiche dot com are all good options.

I think "Paths" is a terrific idea. There have been times where I've wanted to do a "first principles" look at a topic but don't want to go back through my HN upvotes. "Paths" allows for a curated and practical advice-driven jumping off point. Looking forward to more content. Best of luck with it!

So glad that you enjoyed the post! What are other topics that you've wanted to hear about?

http://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf is fun and easy. Text analysis with bash

Fun problem: Write a parser for the English language. See it fail at tweets :)

I don't need to write a program to do that.

> text summarization are examples of NLP in real-world products

Can someone point me to a satisfying demo of a professional text summarization software?

The autotldr bot on reddit gets a lot of praise:

http://smmry.com (demo here)


Worth mentioning here: the text summarization engines are pretty much only usable on news articles or text with clean paragraph structure (yes, non-tech literature works too). Popular text summarization tools, as they stand now, fail on anything else.

I actually disagree with both of you. SMMRY/AutoTLDR does an acceptable job when pasting URLs of the latest news, but it's not something I would actually want to consume. More of a showcase that summarization AI has made some huge progress in recent years, but it's still not at a point where I'd pay for it as a service.

How? Just get started working on a fun problem. A good place to start is keyword extraction. You don't need a PhD or expensive tools. All you need is some free time and willingness to read some cool stuff.

Copy a few articles into text files and get working on implementing some of these methods until you have enough of an understanding to construct your own methods for the fun of it.

Here's some good reading material:










Edit: Don't get deterred by the math formulas in these papers. They look far more complicated than they actually are.

Another fun thing is to paste article text into some API, like the Watson demo, so you can see what kinds of things are possible:


I played around with this a bit to develop https://www.findlectures.com, so knowing what works/doesn't work there I'm developing some NLP scripts to support my use cases.

I never thought about this particular use-case. The subtitle for TED talks should be an ocean of info for you to extract keywords from :D Pretty neat site you got there. I will be using it. Thanks!

I would say that a good example for starting in this field would be to implement something like Tf-Idf [0] for identifying keywords in a set of documents. I don't know where one can find current datasets for this, but I made WikiCorpusExtractor [1] to build sets of documents from Wikipedia.

The only thing one really needs is to count the frequency of words in each document and do very simple math. Tf-Idf is still very relevant today and gives you a very good idea of how statistics is used in text mining.

[0] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[1] https://github.com/joaoventura/WikiCorpusExtractor
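The "count frequencies and do very simple math" part can be sketched in a few lines. This uses one common variant (term frequency normalized by document length, times log(N / document frequency)); other weightings exist, and the `tf_idf` helper and toy documents are made up for illustration. It assumes every document is non-empty.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

A word that appears in every document ("the" here) gets a score of zero, which is why no explicit stopword list is needed when you have a document set to compare against.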

I started even simpler than that. I started by just eliminating stopwords and counting the frequency of each word in the document itself. I did not use a set of documents, as the goal was for the algorithm to be used on the spot for a single block of text.
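A minimal sketch of that single-document starting point, assuming a tiny hand-picked stopword list and a simple regex tokenizer. Both are placeholders, and this is not the commenter's actual algorithm.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "for", "that"}


def keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword tokens in a block of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


text = ("The parser reads the sentence and the parser builds a tree. "
        "The tree is the output of the parser.")
print(keywords(text, top_n=2))
```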

A few months later and after many iterations + a whole lot of testing, the algorithm now can extract super relevant keywords 90%+ of the time!

I wish I knew about the WikiCorpusExtractor. Thanks for the link!

Thank you very much for the reading material

u welcome!

I strongly recommend Stanford's YouTube mini-course by Dan Jurafsky & Chris Manning


Just noting that the article is better titled "why to get into ..."

One of the best places to start is reading this patent from Berkeley Lab/DOE which word2vec was based on https://www.google.com/patents/US7987191

What insight does the patent offer, which you won't get from reading the word2vec papers?
