
Topic mining with LDA and Kmeans and interactive clustering in Python - ahmedbesbes
http://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
======
kmonad
Hey Ahmed, nice write up!

One thought I had was that I would be careful using KMeans on tSNE-transformed
data. From my understanding, while tSNE preserves local relationships between
data points, it (heavily) distorts long-distance relationships. What this means
(if I get this right) is that you get almost arbitrary clustering results if
your k does not equal the number of local clusters (a thing you don't know in
advance). So clustering on distances between data points in tSNE coordinates
seems a risky way to draw conclusions. I am not saying this is wrong, but if
what I am saying is correct (please point out if I got something wrong), then
your analysis may work better if you ran KMeans on, say, the final
5-dimensional PCA space, and left tSNE to visualize the data (and perhaps
labeled the data points by the clustering on PCAs?). Do you agree / does that
make sense?
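
A minimal sketch of that suggestion, assuming scikit-learn: cluster in a
5-dimensional PCA projection and use tSNE only to produce 2D coordinates for
plotting. The synthetic `make_blobs` data, the choice of k=4, and all variable
names are illustrative assumptions, not taken from the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the real feature matrix: 300 points, 20 dims, 4 groups.
X, _ = make_blobs(n_samples=300, n_features=20, centers=4, random_state=0)

# Cluster in a low-dimensional linear projection, where distances still
# reflect the global geometry of the data.
X_pca = PCA(n_components=5, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)

# Use tSNE only to get 2D plotting coordinates; the labels come from PCA space.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)

# One (x, y, cluster) row per point, ready for any scatter plot.
plot_rows = np.column_stack([X_2d, labels])
```

Coloring the tSNE scatter by `labels` then shows how the PCA-space clusters
land in the 2D picture, without letting tSNE distances drive the clustering.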

~~~
HIP_HOP
He fits KMeans on the output of TfidfVectorizer, not on the tSNE output, and
predicts on that same matrix. tSNE is only used for 2D visualisation.

kmeans = kmeans_model.fit(vz)

kmeans_clusters = kmeans.predict(vz)

~~~
kmonad
You're right. In that case, I would worry about the high dimensionality of
that array. Does MiniBatchKMeans do some internal dimensionality reduction?
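
For reference, MiniBatchKMeans does not reduce dimensionality internally; it
fits on small random batches instead of the full dataset, and it accepts the
sparse TF-IDF matrix as is. One common way to shrink the feature space first is
TruncatedSVD (LSA), sketched here under assumptions: `vz` and `kmeans_model`
echo the names quoted above, while the toy corpus, the component count, and k
are made-up stand-ins.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus standing in for the news descriptions.
docs = [
    "stock markets fall as investors worry about inflation",
    "central bank raises interest rates to fight inflation",
    "investors cheer strong quarterly earnings from banks",
    "markets rally after interest rates are left unchanged",
    "team wins the championship after a dramatic final",
    "star striker scores twice in the cup final",
    "coach praises team after the championship victory",
    "injured striker will miss the next cup match",
]

# High-dimensional sparse TF-IDF matrix (one column per vocabulary term).
vz = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA: TruncatedSVD works directly on sparse input; re-normalizing the rows
# makes Euclidean k-means behave like clustering on cosine similarity.
lsa = make_pipeline(TruncatedSVD(n_components=3, random_state=0),
                    Normalizer(copy=False))
vz_reduced = lsa.fit_transform(vz)

kmeans_model = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
kmeans_clusters = kmeans_model.fit_predict(vz_reduced)
```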

------
stuartaxelowen
I'm still waiting to see interesting topic modeling results on non-news data.
News tends to follow a relatively small number of topics, making it easy to
discover meaningful word distributions, but reviews, messages, and comments
written by "normal humans" almost never seem to.

~~~
autokad
I have run LDA on tweets countless times and it worked great.

Here is an example of taking tweets that mention 'diabetes' and correlating
topics with counties:
[https://www.linkedin.com/in/karl-dailey-02557b65/treasury/po...](https://www.linkedin.com/in/karl-dailey-02557b65/treasury/position:755225215/?entityUrn=urn%3Ali%3Afs_treasuryMedia%3A\(ACoAAA3JzV0BT9zF9Db_PxdLfE28gQQfK_PzmhM%2C1486424963566\))

The original topics were actually better, but I was asked to adjust the
vocabulary to remove language common to all tweets. You can still see
interesting things going on though. North East: charity, hospitals, research.
South: Kool-Aid, sweet tea, etc.

I have also done topic modeling on customer surveys for Comcast (I can't show
them), and the topics identified three key features that lead to low customer
satisfaction.

I have also used LDA on grocery purchases (just for fun), and it worked out
really well too.

~~~
stuartaxelowen
But what real insights do the topics give? The only LDA results I've seen are
"fairly obvious" or "not understandable". Seeing a "car, road, light, drive,
trip, ..." topic is not insightful - this is a topic that is obvious to most
humans.

A more interesting topic would be one that is understandable but was not
obvious, even to the people intimately involved in the data's subject matter.
These are harder to discover, but they do exist, and I have never seen LDA
surface them.

~~~
autokad
I see quantifiable and visualizable insights. It's also good for generating
features, and I can also create interaction features with important n-grams.
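
One way to read the feature-generation point, sketched with scikit-learn: take
each document's LDA topic proportions as features, then multiply them by an
indicator for a chosen n-gram to get interaction features. The toy documents,
the bigram "blood sugar", and the topic count are made up for illustration;
this is an assumption about the approach, not the commenter's actual pipeline.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets, invented for this sketch.
docs = [
    "checked my blood sugar after lunch today",
    "blood sugar is high again after sweet tea",
    "charity walk raises money for diabetes research",
    "hospital research team studies diabetes treatment",
    "donate to the diabetes charity this weekend",
    "sweet tea and soda spike my blood sugar",
]

# Count unigrams and bigrams; LDA expects raw counts rather than TF-IDF.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(docs)

# Document-topic proportions: one row per document, each row sums to 1.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# 1.0 where the chosen bigram occurs in the document, 0.0 elsewhere.
bigram = "blood sugar"
col = vectorizer.vocabulary_[bigram]
has_bigram = (counts[:, col].toarray().ravel() > 0).astype(float)

# Interaction features: topic proportion times bigram indicator.
interactions = doc_topics * has_bigram[:, None]
features = np.hstack([doc_topics, interactions])
```

`features` can then be fed to any downstream model alongside other columns.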

If it can't give you what you are looking for, maybe you are asking a fish to
climb a tree.

------
RockyMcNuts
Good stuff ... if you're into this, you might also be interested in Richard
Socher's Stanford NLP class -
[http://cs224d.stanford.edu/](http://cs224d.stanford.edu/)

I have a dumb question about Bokeh ... if I use RStudio I can create some
markdown, export to PDF, and send it around to team/clients. Typically I can't
do anything interactive without hosting on a Shiny server or referencing some
server-side magic.

If I use Bokeh, is it as simple as creating a Jupyter notebook, saving to
HTML, and then anyone can see all the D3 magic? Or are there any drawbacks
like file size, compatibility, or browsers not letting you open it due to
security or whatnot? If anyone has pointers, tutorials, or examples, I would
love to hear them; look me up if it's not deemed on-topic.
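
As a rough sketch of the standalone route, assuming a recent Bokeh install:
`file_html` with `INLINE` resources embeds the plot data and Bokeh's own
JavaScript library (BokehJS, not D3) into a single HTML file that opens in a
browser without a server or network access. The data, title, and file name
below are hypothetical.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import INLINE

# A tiny interactive scatter plot with pan, zoom, and hover tools.
p = figure(title="Hypothetical example", tools="pan,wheel_zoom,hover,reset")
p.scatter([1, 2, 3, 4], [4, 1, 3, 2], size=12)

# INLINE bundles BokehJS into the page instead of linking to a CDN,
# so the file works offline at the cost of a larger file size.
html = file_html(p, resources=INLINE, title="Standalone Bokeh plot")

with open("standalone_plot.html", "w", encoding="utf-8") as f:
    f.write(html)
```

The main trade-off is size: every such file carries its own copy of BokehJS
plus the plotted data. Callbacks written in Python still need a Bokeh server;
only the built-in tools and JavaScript callbacks work in a static file.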

~~~
Tarq0n
If you only need basic filtering and brushing for interactivity, there's a new
library called crosstalk that can do that. Some htmlwidgets like Plotly also
support some interactivity without the need for a server.

[http://rstudio.github.io/crosstalk/](http://rstudio.github.io/crosstalk/)

Edit: Note that you can knit R Markdown to HTML as well, which is what enables
most of this stuff.

~~~
RockyMcNuts
Hmm, thanks, might check that one out.

For my next project maybe I'll try both ways, with ggplot and R Markdown to
HTML, and with Bokeh in Python.

The ggplot output will just knit to PNGs. It would be interesting if ggplot
could output D3 with intelligent rollovers, maybe eventually some scriptable
animations.

One could use ggplotly, but I don't want to post anything to, or host anything
on, a server; it has to work 100% offline.

~~~
Tarq0n
The R version of plotly has been fully offline ever since they open-sourced
it. Alternatively you could also try ggvis, but it's not as polished as ggplot
yet.

~~~
RockyMcNuts
Thank you! I did not know this, having started with the online version... will
give it a shot.

