
High-reproducibility and high-accuracy method for automated topic classification - jestinjoy1
http://amaral-lab.org/publications/high-reproducibility-and-high-accuracy-method-automated-topic-classification/#.VMzwjSiN1WU
======
matt4077
I had some sort of violent dopamine release just reading the headline.

I'm working on a project to make (EU) law more accessible. So if anybody here
knows good methods to visualise/summarise long legal texts (30-300 pages), you
could do something for humanity by posting a reply.

(Word clouds just don't cut it in these cases.)

~~~
bane
A classic summarization method is:

1. Split your text into sentences.

2. Remove stopwords (keeping a copy of each original sentence).

3. For the remaining words in each sentence, calculate a synthetic TF-IDF
score (tf-idf each word, then sum the scores).

4. Keep the n highest-scoring sentences (in the order they appear), or all
the sentences with synthetic TF-IDF scores above some threshold.

There's your summary.
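The steps above can be sketched in a few lines of Python. This is a minimal toy version: the sentence splitter, tokenizer, and stopword list are crude placeholders, and each sentence is treated as its own "document" for the IDF computation.

```python
# Toy extractive summarizer: score sentences by summed tf-idf, keep the top n.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "this"}

def summarize(text, n=2):
    # 1. split into sentences (naive regex splitter)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # 2. tokenize and drop stopwords, keeping the original sentences alongside
    token_lists = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # document frequency of each word across sentences
    df = Counter(w for tokens in token_lists for w in set(tokens))
    n_docs = len(sentences)

    # 3. synthetic score: sum of tf * idf over the sentence's words
    def score(tokens):
        tf = Counter(tokens)
        return sum(tf[w] * math.log(n_docs / df[w]) for w in tf)

    # 4. keep the n highest-scoring sentences, restored to original order
    ranked = sorted(range(n_docs), key=lambda i: score(token_lists[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```

A real pipeline would swap in a proper sentence tokenizer and a corpus-level IDF, but the scoring logic is the same.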

~~~
riazrizvi
I find it works better with an additional step 2.5) Lemmatize the remaining
words, using, for example, Python's NLTK library.

~~~
ccleve
When you lemmatize, what are you doing exactly? Are you merely reducing words
to a more common form, thereby reducing IDF? For example, are you reducing
"walking" or "walked" to "walk", and then using the IDF of walk?

~~~
bane
Yes, exactly.

Related is the idea of "stemming", which uses an algorithm to try to reduce
inflection and find the common form that the various versions of a word come
from. Porter's algorithm is a well-known stemming algorithm. However, you
sometimes end up with weird "non-inflected" tokens at the end (e.g.
'enhancement' might become 'enhanc').

Lemmatization is considered "better" in that it uses a dictionary of
inflected forms that map back to the non-inflected form. So in theory, if the
dictionary is comprehensive, you can replace each inflected form with its
correct non-inflected form (e.g. 'enhancement' -> 'enhance').

If your dictionary isn't comprehensive and the lemmatizer comes across a
token it doesn't recognize, you can fall back to a stemming algorithm.
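The "dictionary first, stemmer as fallback" idea can be illustrated with a toy sketch. The lemma dictionary and suffix rules below are made-up examples, not real linguistic resources; in practice you'd use something like NLTK's WordNetLemmatizer and PorterStemmer.

```python
# Toy normalizer: dictionary lemmatization with a crude stemming fallback.
# The LEMMAS table and suffix list are illustrative placeholders only.
LEMMAS = {"enhancement": "enhance", "walked": "walk", "walking": "walk", "mice": "mouse"}

def crude_stem(word):
    # Very rough suffix stripping, in the spirit of Porter's algorithm.
    # Can produce weird non-words (e.g. "running" -> "runn").
    for suffix in ("ement", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(word):
    # Prefer the dictionary lemma; fall back to the stemmer for unknown tokens.
    return LEMMAS.get(word, crude_stem(word))
```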

------
abeppu
I find it interesting that this appears to be written by a group of physicists
rather than NLP or ML researchers, and I think you can kind of see that in the
way they approach the problem. I think a bunch of the work done after LDA
among ML and NLP people tended towards (a) using Hierarchical Dirichlet
Process models as a platform from which to explore Bayesian nonparametrics
more generally (b) better inference algorithms for topic models and (c)
somewhat richer models (e.g. author-topic models, syntax-aware topic models,
etc.).

And it's not like the people in this field haven't been aware of network-
oriented methods. But rather than using community detection as a mechanism
for topic discovery, people instead focused on networks among topics (to see
how topics are related), networks among authors (so that social network
information informs topic discovery), or networks among documents (where
link/reference information was explicitly part of the model).

These authors seem to get solid results in part by having totally different
values/aesthetics. Unlike the Bayesian nonparametrics people, they clearly
don't mind picking arbitrary, inflexible parameters (e.g. the 5% threshold),
nor do they want their model to have a clear, generative form, nor are they
particularly concerned with having a new algorithmic insight (since they hand
the hard work off to InfoMap, and discuss none of its details), nor do they
attempt to advance the expressiveness of their topic model (they proceed with
the most basic bag-of-words model available). But they do seem to get good
results on the basic task with a very pragmatic, pipeline approach.

~~~
shanusmagnus
Two of the authors (Kording and Acuna) are definitely not physicists; much of
their previous work you might describe as psychology with a strong math
modeling background. Interesting that the pub is in a physics journal though.

------
jetsnguns
It was interesting to see a take on the problem from researchers outside the
NLP and ML fields, but the authors only considered classic LDA and PLSA for
comparison. I am not currently involved in topic modeling, but I know there
exist techniques and modifications to classic models that improve topic
discovery (like tf-idf weighting). Can you suggest any modern methods from NLP
and ML communities that address the same issues and can rival the authors'
findings?

------
helderts
Modeling the word co-occurrence graph and then pruning "weak" edges (or
achieving similar pruning by using community detection to find clusters)
works somewhat like a "feature selection" based on something that resembles
bare mutual information or tf*idf.

I'm not entirely familiar with LDA, but from what I was able to understand
from their intro, it feels like their LDA application could have used some
feature selection.
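The graph-pruning idea above can be sketched quickly. This is an illustrative toy, not the paper's method: words co-occurring in the same sentence get a weighted edge, and edges below a threshold are dropped before any community detection (e.g. InfoMap) would run.

```python
# Toy word co-occurrence graph with weak-edge pruning.
# The min_weight threshold is an arbitrary illustrative choice.
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences, min_weight=2):
    edges = Counter()
    for sent in sentences:
        words = set(sent.lower().split())
        # count each unordered word pair once per sentence
        for a, b in combinations(sorted(words), 2):
            edges[(a, b)] += 1
    # keep only the "strong" edges; community detection would run on the rest
    return {pair: w for pair, w in edges.items() if w >= min_weight}
```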

------
avyfain
You can see the source code of a previous iteration of the algorithm here:
[https://bitbucket.org/andrealanci/topicmapping/src](https://bitbucket.org/andrealanci/topicmapping/src)

------
b0b0b0b
I'm confused by the discussion of multi-lingual corpora. Is it common in topic
modeling to consider documents drawn from disjoint vocabularies, or is it just
a kind of thought experiment?

~~~
3pt14159
Pretty common when you don't control the data source, or for multi-language
government agencies (for example, in Canada you may have your court case in
French if you desire).

------
b6
I haven't dug into the details of the paper yet, but I want to commend the
authors for 1.) making it possible to actually download the PDF and 2.) giving
some indication, within the actual document, when the paper was published. I'm
being a little bit snarky, but I'm very sincere in thanking them.

~~~
rcpt
>making it possible to actually download the PDF

The journal they published in, Physical Review X, is a newer open-access
journal from APS (along the same lines as PLOS ONE or Nature Scientific
Reports). I think it's great but not everyone agrees. To read more on the
debate around the open-access phenomenon look at
[http://blogs.berkeley.edu/2013/10/04/open-access-is-not-the-problem/](http://blogs.berkeley.edu/2013/10/04/open-access-is-not-the-problem/)
and
[http://www.sciencemag.org/content/342/6154/60.full](http://www.sciencemag.org/content/342/6154/60.full)

------
curiously
Is there an open source implementation I can use?

What about that sentiment analysis NLP tool that someone posted on HN last
year? That was also very good.

