
100 NLP Papers - godelmachine
https://github.com/mhagiwara/100-nlp-papers
======
osipov
Unless you are a researcher (in academia or a corporate research lab), you
should think twice before spending your time with these papers.

I have seen repeated examples of information technology industry professionals
who go off on a wild goose chase of trying to parse the papers and reproduce
them. If you are a machine learning practitioner or a data scientist in the
industry, it is highly likely that you are going to waste your time with these
papers. Here's a concrete example from the list: "John Lafferty, Andrew
McCallum, Fernando C.N. Pereira: Conditional Random Fields: Probabilistic
Models for Segmenting and Labeling Sequence Data, ICML 2001." This used to be
the defining paper in the early 2000s. Today it is important only as a road
marker in NLP research history, on a path that turned out to be unproductive.

Those who have not spent meaningful time in academia working on publishing
their own research papers tend to fetishize them. The reality is that even the
best papers in the field are a mess of ideas designed to please fickle
reviewers and academic superiors. Most papers explore nooks and crannies of
ideas that are irrelevant to an industry practitioner and are filled with
assumptions that turn out to be impractical.

Unfortunately, reading research papers has become a self-reinforcing status
symbol: practitioners name-drop them to show off their in-crowd status rather
than relying on the ideas in the papers as a source of useful, practical
information.

~~~
whymauri
Uh, I disagree? Some of the best scientific discoveries of the 2000s came from
insights found in old papers (50s, 60s, 70s). For example, optogenetics.

And now with Transformers, Hopfield learning and continuous Hebbian dynamics
are making a small comeback. I mean, sure, don't implement the paper verbatim,
but it's depressing to discard decades worth of work and insights only to
rebuild it all again. Our disregard for past 'unsexy' work is one of the
largest inefficiencies in science, hands down.

~~~
melenaboija
The comment starts with "Unless you are a researcher...".

If you are a practitioner you are just trying to use the result of some
research that someone else has done before, mostly to not have to do that
research again.

Sure, this result is possible thanks to revisiting old ideas, but using it
does not mean you are discarding anything.

------
stevesimmons
Does anyone have a similar list of NLP papers, but focused on recent best
practices for commercial applications, rather than foundational academic
research?

~~~
JHonaker
Yeah, this would be a great resource if it exists. I constantly have to point
out to project managers that neural NLP models by and large have very
different goals than they do. If you're not trying to do something that
amounts to computing a feature of a language (producing similar output, the
sentiment of a sentence or passage, parsing into SVO, etc.), they're not all
that helpful. Laymen pretty much all assume you can get these models to reason
about text, which we're very far away from.
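To make "computing a feature of a language" concrete, here's a toy,
lexicon-based sentiment scorer (the word lists and scoring rule are made up
for illustration; real systems are model-based, but the output is the same
kind of thing: a feature of the text, not reasoning about it):

```python
# Toy lexicon-based sentiment: computes a feature of the text,
# it does not "understand" anything. Word lists are illustrative only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def sentiment_score(sentence: str) -> float:
    """Return a score in [-1, 1]: (positive - negative) over all polar words."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("I love this great library"))  # 1.0
print(sentiment_score("The docs are terrible"))      # -1.0
```

Asking this kind of system "why is the review negative?" is the reasoning
step laymen expect, and it's exactly what's missing.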

~~~
jcims
I wonder if there is room for a reddit clone that only links to papers. Each
subreddit is a community of interest, and there are time series analytics
available for any of the votes/references/etc.

------
thelazydogsback
It would be helpful if research papers included the publication date
prominently displayed in the header/abstract of the paper -- but none of them
do. You can get an idea by looking at what the paper itself references in the
endnotes, but that varies.

It would be a useful search-engine feature or plugin to derive the pub-date
information (and possibly the source journal(s), if appropriate) from meta-
data elsewhere and/or by cross-referencing the papers that cite each other and
include dates.

It's also very annoying that you can _usually_ find a free version of most
papers, but you have to wade through gobs of hits that want you to pay for it
first -- and it's not always obvious w/o following each link.

------
hallqv
Agree with previous posts re: reading papers being a potential rabbit hole for
NLP practitioners. One paper that could be pretty useful for practical
applications is this one:
[https://arxiv.org/pdf/1904.12848.pdf](https://arxiv.org/pdf/1904.12848.pdf)

It outlines strategies for data augmentation in NLP, as well as other ML
tasks. Finding task-specific labeled data is often one of the most pressing
issues for applying ML outside of academia.
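For a flavor of what augmentation means here, this is a toy token-replacement
sketch (the synonym table is invented; the paper itself uses stronger signals
like back-translation and TF-IDF-weighted word replacement, not this):

```python
import random

# Hypothetical tiny synonym table, purely for illustration. Real augmentation
# pipelines derive replacements from back-translation or embedding neighbors.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}

def augment(sentence: str, rng: random.Random) -> str:
    """Produce a label-preserving variant by swapping in synonyms."""
    out = []
    for word in sentence.split():
        choices = SYNONYMS.get(word.lower())
        out.append(rng.choice(choices) if choices else word)
    return " ".join(out)

rng = random.Random(0)  # seeded so augmented data is reproducible
print(augment("the quick dog looks happy", rng))
```

Each augmented sentence keeps the original label, so a small labeled set can
be stretched into a larger training set.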

------
staticautomatic
I've been going down the academic NLP rabbit hole lately, and at least in my
domain (unsupervised key phrase extraction), the problem isn't the papers,
it's the code (surprise!).

Let's start with the fact that in applied NLP, everyone has a plan until they
get punched in the face by any number of pre-processing issues. And let's set
aside the fact that in the end it's all going to regress to supervision,
without which you can't optimize. Let's also set aside the fact that
performance against a "gold standard" SemEval dataset doesn't mean shit in a
lot of real world applications.

So you try out the standard-issue "top of the line" algo, like YAKE, which is
so fucking slow in pure Python that it'll choke a Bayesian optimizer. You sit
around for a while debating whether to port it to Cython, having little idea
whether the effort will pay off, because you aren't sure how well YAKE is
going to work to begin with and it might get bested by another algo anyway.
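For anyone who hasn't been down this hole: the core of these unsupervised
extractors is lightweight statistical scoring of candidate phrases. This is
a toy sketch of that idea (frequency-scored unigrams/bigrams with a made-up
stopword list), not YAKE itself, whose features are much richer (casing,
position, dispersion, and so on):

```python
from collections import Counter

# Illustrative stopword list; real extractors use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def keyphrases(text, top_k=3):
    """Score candidate unigrams and bigrams by frequency, skipping stopwords."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok not in STOPWORDS:
            counts[tok] += 1
        # A bigram is a candidate only if neither word is a stopword.
        if i + 1 < len(tokens) and tok not in STOPWORDS and tokens[i + 1] not in STOPWORDS:
            counts[tok + " " + tokens[i + 1]] += 2  # weight multi-word phrases up
    return [phrase for phrase, _ in counts.most_common(top_k)]

text = "key phrase extraction finds key phrases in text key phrases matter"
print(keyphrases(text))
```

The pure-Python version above is fine on a paragraph; the performance problem
is that the real algorithms do this with many more features over large
corpora, which is where the Cython debate starts.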

So you go looking through the literature and you're delighted to find that
within just the last few months, there have been some really cool and
promising algos coming out with solid benchmarks, and there's code available
to boot. Yay!

So you download the "weakly supervised" statistical one and it turns out to be
a fucked up polyglot of Bash, C++, and a stale version of OpenJDK, some of
which you have to compile yourself with g++, and then you have to dump your
corpus into text files even though you've already got it in memory, run it
through a tokenizer you neither want nor need, and then read the results back
out of other text files. Sure, there's a docker version. It's full of bloat
and solves some of the more negligible problems at hand.

Then you download a graph-based algo and it's such an undocumented mess of
spaghetti it might as well have been written by an Italian restaurant. So you
spend a really unreasonable amount of time just trying to figure out which
function even takes your text as an input, and you read through a bunch of
other functions trying to figure out if it needs to be pre-tokenized or not
and if it wants the input as sentences or not or whatever. It also wants your
input as a text file.

Then you download a language model-based algo and you think you're going to
run the BERT variant you have at hand, but you double-check the paper and it
happens to perform way better with ELMo, and then, if you're lucky, you don't
spend a whole day trying to get AllenNLP running because you're using WSL on a
laptop without a GPU and the non-GPU TensorFlow dependency is shitting itself
all over the stack trace. You finally get the environment going in all its
bloated glory even though you just wanted the pre-trained ELMo model, which
you finally get deployed to Cortex or whatever and breathe a sigh of relief.
And then it turns out your corpus is so domain-specific that your matrix is
sparser than Swiss cheese because it's chock full of unks.

What have you learned after all this? That building an ensemble model which
plays nicely with spaCy or SparkNLP is going to be an order of magnitude
harder. Have fun!

~~~
Der_Einzige
I wrote a summarizer which (when using the right settings) performs
unsupervised key phrase extraction using language models. It is available
here:
[https://github.com/Hellisotherpeople/CX_DB8](https://github.com/Hellisotherpeople/CX_DB8)
It seems that it would be very useful to you.

Like most data science code, it's non-trivial to install (it used to be
easier back when it was still updated), mostly because some dependencies are
out of date and I will not risk a lawsuit from my current employer due to the
similarity between this work and my day-to-day work. There is a Jupyter
notebook available that will let you use it without an install.

