

Test shows big data text analysis inconsistent, inaccurate - joe_the_user
http://www.computerworld.com/article/2878080/test-shows-big-data-text-analysis-inconsistent-inaccurate.html

======
languagehacker
Holy shit! Stop the presses! Enriching source data by filtering out low-signal
information and selecting for high-signal information improves unsupervised
clustering?

It's great to see physicists getting into other disciplines to show everyone
else how wrong they are. As we all know, when you study physics and math,
you're at the top of the reductionism pyramid, so you don't even really need
to worry about having a background or experience in the thing you're studying.
You just throw data at a formula, and if it doesn't work, you get to go to town
berating professionals in the field using that formula. That's how it works,
right?

Pre-processing your topic-modeling data is a prerequisite to getting good
results. This is fairly common knowledge. Not pre-processing is certainly the
more naive approach. Computational linguists have backgrounds in things like
syntax, semantics, and discourse so that they _know_ what components of
language to select for and how to format them without misrepresenting the
source data.
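
To make the pre-processing point concrete, here is a minimal sketch of the kind of filtering meant above, assuming plain-text documents. The stop-word list, the thresholds (`min_doc_freq`, `max_doc_ratio`), and the sample documents are illustrative assumptions, not what any particular pipeline uses.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real pipelines use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(docs, min_doc_freq=2, max_doc_ratio=0.9):
    """Drop stop words, rare terms, and terms found in nearly every document."""
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n_docs = len(docs)
    keep = {
        t for t, df in doc_freq.items()
        if df >= min_doc_freq and df / n_docs <= max_doc_ratio
    }
    return [[t for t in tokens if t in keep] for tokens in tokenized]

# Hypothetical sample documents.
docs = [
    "The wiki covers dragons and dragon lore.",
    "Dragons appear in the lore of the game.",
    "The game wiki lists weapons and armor.",
    "Armor and weapons for the game.",
]
cleaned = preprocess(docs)
```

A real pipeline would usually add stemming or lemmatization (so "dragon" and "dragons" collapse into one term) and often part-of-speech filtering; this sketch only shows the frequency-based part.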

This article characterizes a straw man to tear down in the service of
advertising a proprietary solution. It's basically an advertisement with a
scientific reference for good measure. We call these "white papers" -- not
serious journalism.

I can make these assertions from personal experience. I used latent Dirichlet
allocation on high-fidelity natural language data to provide topic modeling
for hundreds of thousands of wikis. We used the data for ad optimization as
well as recommendations -- both of which provided statistically significant
improvements in engagement. The approach worked. The recommendations were
reproducible. I used all open-source software.
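
For readers who have not seen it, latent Dirichlet allocation can be sketched in a few dozen lines. The following is a toy collapsed Gibbs sampler, not the software used for the wikis above; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the sample documents are assumptions for illustration. Fixing the random seed is one simple way to make runs repeatable.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over already-tokenized documents.

    Returns (doc_topic, topic_word, vocab), where the first two are count tables.
    """
    rng = random.Random(seed)  # fixed seed, so repeated runs give the same result
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized p(topic = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

# Hypothetical, already pre-processed documents.
sample_docs = [
    ["wiki", "dragons", "lore"],
    ["dragons", "lore", "game"],
    ["game", "wiki", "weapons", "armor"],
    ["armor", "weapons", "game"],
]
doc_topic, topic_word, vocab = lda_gibbs(sample_docs, n_topics=2)
```

Each row of `doc_topic` counts how many of a document's tokens ended up in each topic, and each row of `topic_word` counts how often each vocabulary word was assigned to that topic. At any real scale you would use an established open-source implementation instead of a loop like this.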

I guess I should have gotten into physics.

------
alexcasalboni
Big data text analysis is an approach, not a method.

They merely showed that Latent Dirichlet Allocation is not the best method to
solve the specific problem of classifying human language from unstructured
text. More accurate and reproducible techniques will keep coming along, and
one of them will eventually become the most widely used. TopicMapping sounds promising. :)

------
irickt
discussion: https://news.ycombinator.com/item?id=8976877

paper: http://amaral-lab.org/media/publication_pdfs/PhysRevX.5.011007.pdf

