
Understanding Text Pre-Processing for Topic Modeling [pdf] - wewake
http://www.cs.cornell.edu/~xanda/winlp2017.pdf
======
mark_l_watson
Useful results. I usually use long stop word lists and sometimes use stemming,
but I will stop doing both: the paper makes a convincing case for removing only
the few most common words and for keeping the information that stemming discards.
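
For anyone who wants to try the "only drop the few most common words" approach, here is a minimal sketch of how it might look. It assumes the documents are already tokenized into lists of strings. The function name `remove_top_k_words` and the default `k=10` are my own placeholders, not something taken from the paper.

```python
from collections import Counter


def remove_top_k_words(docs, k=10):
    """Drop only the k most frequent tokens from a tokenized corpus.

    docs is a list of token lists. No stemming is applied, so every
    surface form that is not in the top k is kept as-is.
    """
    # Count every token occurrence across the whole corpus.
    counts = Counter(tok for doc in docs for tok in doc)
    # The k most frequent tokens become the (short) stop word list.
    stop = {word for word, _ in counts.most_common(k)}
    filtered = [[tok for tok in doc if tok not in stop] for doc in docs]
    return filtered, stop
```

Returning the stop set along with the filtered documents makes it easy to eyeball what was removed before training a topic model on the result.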

~~~
wewake
Same here -- I've always used stemming but this paper convinced me against it.
That being said, one should still write a few rules to normalize forms such as
"Google's" to "Google", so the same word does not end up as two separate entries
in the vocabulary (e.g. when "Google" is already present in the data).
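
A rough sketch of the kind of rule I mean, assuming tokens arrive as plain strings. The helper names `strip_possessive` and `merged_counts` are hypothetical. Note that this naive pattern also turns contractions like "it's" into "it", so real data would need an exception list.

```python
import re
from collections import Counter

# Matches a trailing 's (straight or curly apostrophe) that follows a word character.
_POSSESSIVE = re.compile(r"(?<=\w)['\u2019]s$", re.IGNORECASE)


def strip_possessive(token):
    """Remove a trailing possessive marker: "Google's" -> "Google"."""
    return _POSSESSIVE.sub("", token)


def merged_counts(tokens):
    """Count tokens after normalization, so "Google's" and "Google" share one entry."""
    return Counter(strip_possessive(tok) for tok in tokens)
```

This only touches the possessive suffix and leaves every other surface form alone, which keeps it much narrower than a general stemmer.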

