Topic mining with LDA and Kmeans and interactive clustering in Python (ahmedbesbes.com)
I'm still waiting to see interesting topic modeling results on non-news data. News tends to cover a relatively small number of topics, which makes it easy to discover meaningful word distributions, but reviews, messages, and comments written by "normal humans" almost never seem to behave that way.

I have run LDA on tweets countless times and it worked great.

Here is an example of taking tweets that contain 'diabetes' and correlating topics with counties: https://www.linkedin.com/in/karl-dailey-02557b65/treasury/po...

The original topics were actually better, but I was asked to adjust the language to wipe out common language across all tweets. You can still see interesting things going on, though. Northeast: charity, hospitals, research. South: Kool-Aid, sweet tea, etc.

I have also done topic modeling on customer surveys for Comcast (I can't show them), and the topics identified 3 key features that lead to low customer satisfaction.

I have also used LDA on grocery purchases (just for fun), and it worked out really well too.

But what real insights do the topics give? The only LDA results I've seen are "fairly obvious" or "not understandable". Seeing a "car, road, light, drive, trip, ..." topic is not insightful - this is a topic that is obvious to most humans.

A more interesting topic would be one that is understandable but was not obvious, even to the people intimately involved in the data's subject matter. These are harder to discover, but they do exist - and I have never seen LDA surface them.

I did a project around this (AutoTagger): search Google for all the subjects in the Library of Congress, spider the first 10 pages, strip off anything that is not body text, build a vocabulary with frequencies, merge the vocabularies to get a subject vocabulary, and reject words common across all vocabularies.

Then, when presented with a new article on any subject, create the vocabulary and frequencies for that article, strip out the words stripped previously, and match it against all subjects.

Very easy to parallelize, very good results (surprisingly good, in fact, for such a simple algorithm).
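
For anyone who wants to try the idea, here is a rough sketch of that pipeline in Python. It is not the AutoTagger code. It assumes the pages for each subject have already been fetched and reduced to plain body text (the Google search and spidering steps are left out), it uses a naive regex tokenizer, and it scores the match with cosine similarity, which is my assumption since the comment above does not say how the match is computed. All function names are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase a text and count its word tokens (naive tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_subject_vocabularies(pages_by_subject):
    """Merge per-page counts into one vocabulary per subject, then drop
    words that appear in every subject's vocabulary.
    Expects at least two subjects."""
    vocabularies = {}
    for subject, pages in pages_by_subject.items():
        merged = Counter()
        for page in pages:
            merged.update(word_counts(page))
        vocabularies[subject] = merged

    # Words shared by all subjects carry no signal. Keep the set so the
    # same words can be stripped from new articles later.
    common = set.intersection(*(set(v) for v in vocabularies.values()))
    for vocab in vocabularies.values():
        for word in common:
            del vocab[word]
    return vocabularies, common


def cosine(a, b):
    """Cosine similarity between two word-count mappings (assumed metric)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_subjects(article, vocabularies, common):
    """Score a new article against every subject, best match first."""
    counts = word_counts(article)
    for word in common:
        counts.pop(word, None)  # strip the previously rejected words
    scores = {s: cosine(counts, v) for s, v in vocabularies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-in for the crawled pages (made-up data, two subjects).
pages_by_subject = {
    "astronomy": [
        "The telescope shows the planet and the star.",
        "The orbit of the planet around the star.",
    ],
    "cooking": [
        "Stir the sauce and the onion in the pan.",
        "The oven bakes the bread and the sauce.",
    ],
}
vocabularies, common = build_subject_vocabularies(pages_by_subject)
ranked = rank_subjects("The star and the planet share an orbit.",
                       vocabularies, common)
```

Each subject's vocabulary is built independently and each subject is scored independently, which is where the easy parallelism comes from: both loops can be split across workers with no shared state beyond the set of common words.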

"20 newsgroups" is a pretty ubiquitous human-generated data set for testing LDA and other NLP techniques. I've run LDA on it myself and it recovers topics fairly well.

The thing is, reviews, messages, comments, etc, in general, do tend to revolve around some central topic(s). For example, this comment right now is about LDA. It's not totally random.

> interesting topic modeling results on non-news data.

This is easy enough to do. I'm wondering what exactly your definition of "interesting" is?

Interesting in that it provides some real insight - confirming that "crash", "frozen", etc. are similar in app reviews is not interesting. Interesting topics are more precise: people talking about the video being frozen or not playing is a single topic, but because "video" appears in so many other reviews, it would not be detected as a topic by LDA.

Positive reviews: http://pastebin.com/enEZzKmC

Negative reviews: http://pastebin.com/vkGn7ZLH

Amazon Echo, roughly 34k reviews. https://en.wikipedia.org/wiki/Non-negative_matrix_factorizat...

It's surprisingly useful for finding interesting trends in reviews.

blekko did some work with LDA against our 2,000 "slashtags", almost all of which were non-news-oriented. Sorry that this is not a free link:

http://ieeexplore.ieee.org/document/7050791/

