
Ask HN: Getting into NLP in 2018? - sosilkj
For someone with a reasonable background in software development, what new skills might need to be acquired to move specifically into NLP (natural language processing)?

Does doing NLP essentially mean having to learn ML at this point?

Also, how to showcase skills to potential employers?
======
syllogism
I think it's probably a bad strategy to try to be the "NLP guy" to potential
employers. You'd be much better off as a software engineer on a project
with people who have ML or NLP expertise.

NLP projects fail a lot. If you line up a job as a company's first NLP person,
you'll probably be setting yourself up for failure. You'll get handed an idea
that can't work, you won't know enough about how to push back to change it
into something that might, etc. After the project fails, you might get a
chance to fail at a second one, but maybe not a third. This isn't a great way
to move into any new field.

I think a cunning plan would be to angle to be the person who "productionises"
models. Tonnes of data scientists these days are surprisingly light on
programming skills, and they expect to hand off a notebook to someone who'll
"make it work". I think that sort of role would position you well to take over
the actual modelling as well.

As a bit of background, I'm the lead author of spaCy, a popular NLP library.
I'd like to stress that spaCy's definitely just one part of a solution ---
nothing is a "make NLP go now" button. I do think you'll need to know machine
learning to have really successful NLP projects. Another thing you might want
to check out is our annotation tool Prodigy:
[https://prodi.gy](https://prodi.gy). This is especially good for
experimenting, as you really need to annotate data to get anything done.

~~~
AlexCoventry
> NLP projects fail a lot.

Has having seen a lot of NLP projects given you any insight into prerequisites
for success?

~~~
syllogism
Yeah I've been giving talks about this:
[https://www.youtube.com/watch?v=jpWqz85F_4Y](https://www.youtube.com/watch?v=jpWqz85F_4Y)

The biggest problem I see people having is they don't realise that defining a
consistent annotation scheme takes a lot of work. If you don't define your
problem well, you won't be able to collect consistent annotations, and your
model will always perform poorly. You need an annotation scheme that "carves
reality at the joints". There will always be boundary cases, but some ways of
dividing things up are just bad.

For instance, if you want to collect opinions about bands you like on social
media, you don't want to have an NER category "BANDS_I_LIKE". That's just
dumb: detect bands in general (so the model handles ambiguities), run the
classifier over lots of text, get a frequency list, and then mark the ones you
like.
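That detect-generically-then-filter workflow can be sketched in plain Python; the band names and counts here are made up, and a simple list stands in for the output of a real NER model:

```python
from collections import Counter

# Toy stand-in for an NER model's output: pretend these are the BAND
# entities a generic "detect bands" model extracted from social posts.
detected_bands = [
    "Radiohead", "Blur", "Radiohead", "Oasis",
    "Blur", "Radiohead", "Pulp",
]

# Step 1: frequency list of all detected bands.
band_counts = Counter(detected_bands)

# Step 2: mark the ones you care about *after* extraction, instead of
# baking a "BANDS_I_LIKE" category into the annotation scheme itself.
liked = {"Radiohead", "Pulp"}
liked_mentions = {band: n for band, n in band_counts.items() if band in liked}

print(liked_mentions)
```

The point is that "bands" is a learnable, well-bounded category, while "bands I like" is a filter you can apply trivially afterwards.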

Another example. Let's say you were looking for suspicious activity in some
dataset like the Panama papers. You might have a bunch of indicators for
suspicious activity, like companies that change name a lot, or the same
director on companies from very different industries. A lot of people have
gotten the message that they should make their models as direct and end-to-end
as possible, so they might try to label emails as
"INDICATES_SUSPICIOUS_ACTIVITY". This label is unlikely to be the easiest way
to do things. For one thing, it's bad to have one category that applies to a
bunch of disjoint sets. If you've got a single label for "A or B", you won't
have a linear decision boundary. Linearly separable problems are much easier
to learn. Another problem here is that the decision you're trying to make
rests on tonnes of world knowledge. The business goals would be much better
met by learning some simple text annotations like NER, a few simple text
categories etc, and using a rule-based approach to stitch things together.

Basically, don't just work on having more powerful solutions. Make sure you've
tried hard to have easier problems as well --- that part tends to be higher
leverage.
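As an illustration of that last point, here is a toy sketch (all company names, fields, and thresholds invented) of stitching simple per-company annotations together with explicit rules, instead of learning one end-to-end "INDICATES_SUSPICIOUS_ACTIVITY" label:

```python
# Pretend these per-company facts came from simple, well-defined text
# annotations (NER for company/director names, a few text categories).
companies = {
    "Acme Holdings": {"name_changes": 4,
                      "director_industries": {"shipping", "cosmetics", "mining"}},
    "Plain Bakery":  {"name_changes": 0,
                      "director_industries": {"food"}},
}

def suspicious(facts, max_name_changes=2, max_industries=2):
    # Rule 1: the company changes name a lot.
    # Rule 2: the same director sits on companies in very different industries.
    return (facts["name_changes"] > max_name_changes
            or len(facts["director_industries"]) > max_industries)

flags = {name: suspicious(facts) for name, facts in companies.items()}
print(flags)
```

Each rule is inspectable and tunable on its own, which is exactly what a single opaque end-to-end label wouldn't give you.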

------
aglionby
I'd recommend following a course to get an idea of what's involved. The
Stanford course [1, 2] goes deep into the neural network side of things;
although lots of state-of-the-art systems use neural networks, starting by
thinking about language and what makes it difficult to process (with the
help of some (computational) linguistics) is also helpful. Knowledge of e.g.
morphology may help you make decisions down the line with e.g. stemming for
word embeddings. The Cambridge course [3] gives an introduction in this area
(disclosure: I TA it).

Aside from that, yes: lots of NLP these days, though not all of it, involves
ML in some form. Most things I've seen are done in Python, given all the
NLP-specific and general ML libraries available.

In terms of textbooks, Speech & Language Processing by Jurafsky & Martin is
great. Might not be the best way of diving in, but it's a good resource to go
deeper into things. The draft third edition, while only partially complete, is
online for free [4].

[1]
[http://web.stanford.edu/class/cs224n/](http://web.stanford.edu/class/cs224n/)

[2]
[https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_...](https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6)

[3]
[https://www.cl.cam.ac.uk/teaching/1819/NLP/](https://www.cl.cam.ac.uk/teaching/1819/NLP/)

[4]
[https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/)

------
Maro
I've put ML/NLP into production this year to (successfully) automate away much
of our ~1,000-person call center. I tried a lot of things along the way, and
played around with SKL (scikit-learn) to see which features ended up being
useful. But in the end, what worked and is in production right now is fairly
simple tokenization, removing stopwords, and a simple rule engine, where the
rules come from a back-end ML job that does nothing beyond
median/mean/counts/ratios to find good rules. Overall the fanciest library
call I have is a fuzzy string match; I even got rid of SKL to reduce
dependencies. It works very well, it's easy to understand and tunable, and I
can add exceptions/logging/etc. at each step.
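A minimal sketch of that kind of pipeline might look like the following. The stopword list, historical examples, and labels are all invented, and the "back-end ML job" really is nothing beyond counts, as described:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "to", "of", "is", "i"}

def tokenize(text):
    # Plain tokenization: lowercase word characters, stopwords dropped.
    return [t for t in re.findall(r"[a-z']+", text.lower())
            if t not in STOPWORDS]

# "Back-end ML job": count which tokens co-occur with which label in
# (invented) historical data, and keep the winners as rules.
history = [
    ("refund my order please", "refund"),
    ("i want a refund", "refund"),
    ("where is my package", "tracking"),
    ("package not delivered", "tracking"),
]
token_label_counts = Counter()
for text, label in history:
    for tok in tokenize(text):
        token_label_counts[(tok, label)] += 1

# Simple rule table: token -> (most frequent label for it, count).
rules = {}
for (tok, label), n in token_label_counts.items():
    if n >= rules.get(tok, ("", 0))[1]:
        rules[tok] = (label, n)

def classify(text):
    # The "rule engine": each known token votes for its label.
    votes = Counter(rules[t][0] for t in tokenize(text) if t in rules)
    return votes.most_common(1)[0][0] if votes else "unknown"

print(classify("please refund me"))
```

Every step is a plain dict or counter, so adding exceptions and logging at any stage is straightforward, which matches the appeal Maro describes.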

When it comes to DL stuff, I think the most useful thing in production
projects will be "embeddings", and all the relatively simple stuff you can do
once you have the word -> vector mapping. It's simple stuff; pick up the new
O'Reilly book 'Deep Learning Cookbook' [1], the first 4 chapters already cover
this [2]. Popular libraries have this baked in [3]; I think soon this will be
like making SQL calls in Django projects...

[1] [https://www.amazon.com/Deep-Learning-Cookbook-Practical-Reci...](https://www.amazon.com/Deep-Learning-Cookbook-Practical-Recipes-ebook/dp/B07DK1ZZXT)

[2]
[https://github.com/DOsinga/deep_learning_cookbook/blob/maste...](https://github.com/DOsinga/deep_learning_cookbook/blob/master/04.2%20Build%20a%20recommender%20system%20based%20on%20outgoing%20Wikipedia%20links.ipynb)

[3]
[https://pytorch.org/tutorials/beginner/nlp/word_embeddings_t...](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)
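The word -> vector idea can be illustrated with a toy sketch. These 3-d vectors are invented stand-ins for real pretrained embeddings (word2vec/GloVe/fastText typically give 100-300 dimensions), and the "relatively simple stuff" shown is nearest-neighbour lookup by cosine similarity:

```python
import math

# Toy word -> vector mapping standing in for real pretrained embeddings.
vectors = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.0],
    "car":   [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word):
    # Nearest neighbour by cosine similarity over the vocabulary.
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("cat"))
```

With real embeddings the same few lines give you similar words, simple document vectors (averaging), and rough analogies; that's the cheap win being described.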

~~~
mlevental
do you mean literally phone calls? if so how are you doing the speech to text?

~~~
Maro
The thing I did makes the call unnecessary. So there's no speech-to-text
involved.

Detail: deliveries in the Middle East. There are no zip codes, street names
are iffy, people don't know their address, there are multiple names for
everything (at least English and Arabic), and there's no "database" for
lookups (Gmaps/OSM doesn't have good enough coverage/accuracy). So when
delivery companies accept a package for delivery, they get the
(customer_name, customer_phone, customer_address) tuple (and the physical
package), where everything is totally free text; sometimes the address is like
"I live in the X laborcamp, near the gate, call me when you get here". So by
default there's a large call center which calls the recipient, and the CC
agent tries to figure out where the driver should go based on the
conversation. The outcome of the phone call is a (lat, lon) that the CC agent
drops on Gmaps. Then, the next day, packages are assigned to drivers based on
the (lat, lon); each driver has a zone (cities are broken into zone polygons).

Challenge: given the historic delivery data, can you (how well can you?)
automate the (customer_name, customer_phone, customer_address) -> (lat, lon)
prediction?

Answer: as I wrote above, it's doable for 40-80% of orders, depending on what
accuracy we're willing to accept, with fairly simple ML.
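One toy sketch of that kind of prediction, using the stdlib's difflib for the fuzzy string matching (the addresses and coordinates below are invented, and the real system is presumably far more involved):

```python
from difflib import SequenceMatcher

# Invented history of successfully delivered orders:
# free-text address -> the (lat, lon) that worked.
history = {
    "villa 12 near corniche gate, abu dhabi": (24.47, 54.35),
    "x laborcamp near the gate":              (25.20, 55.27),
    "building 7 marina walk dubai":           (25.08, 55.14),
}

def predict(address, threshold=0.6):
    # Fuzzy-match the new free-text address against history; only answer
    # when confident enough, otherwise fall back to the call center.
    best, score = None, 0.0
    for known, latlon in history.items():
        s = SequenceMatcher(None, address.lower(), known).ratio()
        if s > score:
            best, score = latlon, s
    return best if score >= threshold else None

print(predict("I live in the X laborcamp, near the gate"))
```

The confidence threshold is what lets you trade off coverage against accuracy, which is exactly the 40-80% knob mentioned above.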

------
telchar
I've gotten into NLP somewhat in the last 2 years. In my case I was aware of a
large body of text data we (data science dept) weren't doing anything with and
pitched some project ideas to my director about how we might make use of it.
We took a "build it and they will come" approach and I set up a text
processing infrastructure using some open source NLP software and
Elasticsearch to parse and index our documents as they come in.

~12 months after we started that we got a request for an event detection model
using data that only existed in our text database, so I was able to put the
infrastructure we set up to use.
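In miniature, that parse-and-index setup might look like this; a pure-Python inverted index stands in for Elasticsearch, and the documents are invented:

```python
import re
from collections import defaultdict

# Tiny stand-in for "parse and index our documents as they come in":
# token -> set of document ids containing it.
index = defaultdict(set)
docs = {}

def ingest(doc_id, text):
    docs[doc_id] = text
    for token in set(re.findall(r"\w+", text.lower())):
        index[token].add(doc_id)

def search(query):
    # Return ids of all docs containing every query token.
    tokens = re.findall(r"\w+", query.lower())
    hits = [index[t] for t in tokens]
    return sorted(set.intersection(*hits)) if hits and all(hits) else []

ingest(1, "Server outage reported in the billing system")
ingest(2, "New billing feature shipped to production")
ingest(3, "Outage postmortem scheduled for Friday")

print(search("billing outage"))
```

Elasticsearch does all of this (plus analyzers, scoring, and scale) for you; the value of the infrastructure is that later projects, like the event detection model, can query the text immediately.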

Setting up that infrastructure with good text parsing defaults was a good
intro to NLP although I have been intentionally avoiding getting too deep into
the NLP-specific methods (e.g. linguistics) for this project. The event
detection model gave me a nice focused project to work on which was very
helpful for learning more NLP.

FWIW I actually chose not to use ML for this project (aside from parsing with
the NLP software) despite that being the easier route since we lacked
sufficient training data to train a good model. But in general one would
probably need to know how to use ML for this sort of thing.

Hopefully this will be a useful anecdote, though it's probably not directly
applicable to your situation.

------
ocdtrekkie
Possibly too simple for the OP, but I found this free book pretty solid:
[https://www.syncfusion.com/ebooks/natural_language_processin...](https://www.syncfusion.com/ebooks/natural_language_processing_succinctly)

For me, it turned NLP from "voodoo" into something I'm kinda interested in
tackling for fun, and it was a pretty short/easy read to begin with. Probably
the biggest thing I picked up from it, though, was that advancement in NLP
probably requires far more understanding of the English language than
particularly talented coding.

------
tinyhouse
> Does doing NLP essentially mean having to learn ML at this point?

Yes. Everything in NLP nowadays involves ML. Most NLP problems have some
structure (e.g., generating a sequence from another sequence, as in machine
translation, or predicting a tag for each word in a sequence, as in
part-of-speech tagging). Once you have good ML fundamentals, it's not that
difficult to get into NLP.
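As a concrete example of the tag-per-word structure, here is the classic most-frequent-tag baseline for part-of-speech tagging (the toy training data is invented; in practice you'd use a real treebank):

```python
from collections import Counter, defaultdict

# Invented tagged corpus: each sentence is a list of (word, tag) pairs.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dog", "NOUN"), ("food", "NOUN")],
]

counts = defaultdict(Counter)
for sent in train:
    for word, tag_label in sent:
        counts[word][tag_label] += 1

def tag(words):
    # Predict one tag per word: each word's most frequent training tag,
    # with NOUN as a crude fallback for unseen words.
    return [counts[w].most_common(1)[0][0] if w in counts else "NOUN"
            for w in words]

print(tag(["the", "dog", "sleeps"]))
```

Real taggers condition on context rather than treating each word independently, but the input/output shape (a sequence in, a tag per token out) is the structure being described.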

Also, even though different tasks in NLP share structure and characteristics,
it's a large field with different areas of expertise. You don't need to know
everything. Focus on the problems that interest you first.

~~~
benterris
> Everything in NLP nowadays involves ML.

Some really nice projects do NLP without using ML at all. For instance,
Duckling [1] (a library made by Facebook to find entities in a text) works
100% with parsing rules, and is surprisingly effective.

I agree with your point though: most of the time there is ML at some point in
your pipeline, so you can't really avoid learning it!

[1]
[https://github.com/facebook/duckling](https://github.com/facebook/duckling)
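In the same rule-based spirit (though far simpler than Duckling itself, which normalizes values and handles many languages), entity extraction with nothing but parsing rules might look like this; the rule set below is a made-up sketch:

```python
import re

# A tiny Duckling-flavoured sketch: no ML, just surface patterns
# mapped to entity labels.
RULES = [
    ("time",   re.compile(r"\b([01]?\d|2[0-3]):[0-5]\d\b")),
    ("date",   re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("amount", re.compile(r"\$\d+(?:\.\d{2})?")),
]

def extract(text):
    entities = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            entities.append({"label": label,
                             "value": m.group(),
                             "span": m.span()})
    return entities

ents = extract("Meet at 14:30 on 2018-12-24, budget $20.50")
print(ents)
```

For well-structured entities like times, dates, and amounts, rules like these are precise, fast, and debuggable, which is why libraries like Duckling get away without ML.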

------
ragona
I'm not an NLP expert by any stretch of the imagination, but I've enjoyed
playing with NLP tools for several years now. I think this is a really
exciting time to be a user of NLP libraries, and I can't recommend tools like
spaCy and textacy enough. They're just a genuine pleasure to use as compared
to the old days of raw NLTK.

------
matchagaucho
Being an enabler of ML models that utilize NLP categorization can be extremely
valuable.

In our own product, I have a backlog wishlist story to parse a repository of
docs and flag the ones containing locations, people, places, etc., then let a
linear regression determine if there are correlations.

But I'd probably hire an ETL or Data Engineer to accomplish that task.

------
teacpde
Off-by-one error? You probably meant 2019.

------
gullywhumper
Here's a collection of a lot of resources including different libraries,
tutorials, datasets, and languages:

[https://github.com/keon/awesome-nlp](https://github.com/keon/awesome-nlp)

------
brutus1213
Check out the Chris Manning book and Stanford course videos.

~~~
sloaken
Thanks :-) This fits my needs exactly: Natural Language Processing with Dan
Jurafsky and Chris Manning, 2012, on YouTube.

~~~
AlexCoventry
Yoav Goldberg's book is also a good introductory survey:
[https://www.amazon.com/Language-Processing-Synthesis-Lecture...](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1681732351)

------
opportune
never mind

~~~
tinyhouse
It's going to be difficult to get a job in NLP at a decent company without
good ML fundamentals and experience with deep learning. Obviously, working
with toolkits like the ones you mentioned is very important, since you learn
the most by building stuff, especially in an applied field such as NLP, and
there's no reason to do everything from scratch. But in job interviews no one
cares whether someone has experience using spaCy or any other specific tool.

~~~
opportune
Maybe my knowledge is out of date then, since I was doing NLP in 2016. I'll
delete my comment, but I stand by my opinion that you shouldn't dive straight
into implementing some state-of-the-art ML model without learning the basics
of computational linguistics.

~~~
bitL
With the current transfer learning explosion, you can get state-of-the-art
NLP into your app within one month without any prior expertise or deep
understanding (yeah, it needs a lot of effort, but it's doable).

