

Latent Dirichlet Allocation on Tweets - wellecks
https://wellecks.wordpress.com/2014/09/04/these-are-your-tweets-on-lda-part-i/

======
k8si
Shameless plug: if you're interested in MALLET you might also be interested in
FACTORIE: [http://factorie.cs.umass.edu/](http://factorie.cs.umass.edu/)

~~~
eob
Is development on FACTORIE ongoing? I played with it a bit a long while back
and loved the "imperative-declarative[1]" idea you all were pushing for model
construction toolkits.

[1] edit: Because I'm not sure if I just made that phrase up or if it came
from one of your papers: the idea is that ML libraries that take declarative
model descriptions are great, but what's even better is if we also have an
imperative API that can dynamically generate those declarative specs for us,
even based on train-time inputs, so we can essentially "program" the structure
of a model but still benefit from keeping everything generalized and
declarative at the base.

~~~
k8si
Development is very much ongoing and that's really interesting/nice-to-hear
feedback :)

------
datahipster
I did something similar to this with tweets during the Boston Marathon bombing
[0]. One of the coolest things that I saw was that the topics themselves were
neatly ordered in time. In other words, you can visualize how the distribution
of vocabulary evolves with time [1].
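
A rough sketch of how a plot like that could be built, assuming each tweet has already been assigned a single dominant topic by a fitted LDA model: bucket the tweets by hour and normalize the topic counts within each bucket. The function name `topic_share_by_hour` and the `(timestamp, topic_id)` input format are hypothetical, and the toy data is made up for illustration; none of it is taken from the linked post.

```python
from collections import Counter, defaultdict
from datetime import datetime


def topic_share_by_hour(assignments):
    """Return {hour: {topic_id: share}} from (timestamp, topic_id) pairs.

    Assumes each tweet was already given one dominant topic by some
    fitted topic model; this function only does the time bucketing.
    """
    counts = defaultdict(Counter)
    for timestamp, topic_id in assignments:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[hour][topic_id] += 1

    shares = {}
    for hour in sorted(counts):
        total = sum(counts[hour].values())
        shares[hour] = {t: n / total for t, n in counts[hour].items()}
    return shares


# Toy assignments, invented for this example.
toy = [
    (datetime(2020, 1, 1, 15, 5), 0),
    (datetime(2020, 1, 1, 15, 40), 0),
    (datetime(2020, 1, 1, 16, 10), 0),
    (datetime(2020, 1, 1, 16, 20), 1),
]
shares = topic_share_by_hour(toy)
```

Plotting `shares` as a stacked area chart over the hour keys would show how the topic mix shifts over time.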

It would be interesting to extend the LDA model to include a temporal
variable. Never got around to doing it, but it seems like it would work well
for social media data.

[0] [http://blog.dc.esri.com/2013/04/18/the-evolution-of-discussion-around-the-boston-marathon-events/](http://blog.dc.esri.com/2013/04/18/the-evolution-of-discussion-around-the-boston-marathon-events/)

[1] [http://blog.dc.esri.com/files/2013/04/topic-distribution2.png](http://blog.dc.esri.com/files/2013/04/topic-distribution2.png)

------
Daishiman
It's a fun exercise for a single tweet feed, but it is unfortunately not very
useful when applied to larger-scale learning, since you start hitting the
model's limitations: a fixed number of topics, a potentially long runtime,
and the fact that it's not an online method (although there are online
variants of LDA).

~~~
xtacy
Would you like to point us to some progress on how to overcome these
limitations with LDA?

~~~
tansey
The gensim [0] package has a nice implementation of online LDA that can handle
massive streaming datasets. If you want to avoid specifying the number of
topics, you can use HDP-LDA. David Blei (a co-inventor of LDA) has a reference
implementation on his website [1], along with many other variants of LDA.

[0] [http://radimrehurek.com/gensim/](http://radimrehurek.com/gensim/)

[1]
[http://www.cs.princeton.edu/~blei/topicmodeling.html](http://www.cs.princeton.edu/~blei/topicmodeling.html)

------
misiti3780
Shameless Plug: I did this in Python:

[http://josephmisiti.github.io/using-latent-dirichlet-allocation-to-categorize-my-twitter-feed.html](http://josephmisiti.github.io/using-latent-dirichlet-allocation-to-categorize-my-twitter-feed.html)

------
dhotson
Very cool!

I've read that LDA doesn't work well on short documents. Your approach of
concatenating all tweets for a user appears to work quite well. One other
technique I've seen is to concatenate multiple tweets together that contain
the same hashtag.
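
A minimal sketch of the hashtag-pooling idea, assuming tweets arrive as plain strings. The helper name `pool_by_hashtag`, the hashtag regex, and the sample tweets are all hypothetical, and the choices here (case-insensitive tags, dropping untagged tweets) are just one way to do it.

```python
import re
from collections import defaultdict

# Simplified hashtag pattern; real tweets may need a stricter one.
HASHTAG = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Concatenate tweets sharing a hashtag into one pseudo-document each.

    A tweet with several hashtags is added to every matching pool;
    tweets without hashtags are dropped in this sketch.
    """
    pools = defaultdict(list)
    for text in tweets:
        # Lowercase only for matching, so #LDA and #lda share a pool.
        for tag in set(HASHTAG.findall(text.lower())):
            pools[tag].append(text)
    return {tag: " ".join(texts) for tag, texts in pools.items()}


# Made-up sample tweets.
tweets = [
    "Fitting topics with #LDA tonight",
    "Short documents hurt #lda quality",
    "No hashtag here",
]
docs = pool_by_hashtag(tweets)
```

Each value in `docs` can then be tokenized and fed to the topic model as a single, longer document.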

One of our intern students at 99designs did some work on applying LDA to
classify graphic design tasks:

[http://99designs.com.au/tech-blog/blog/2014/01/22/Swiftly-Machine-Learning-1/](http://99designs.com.au/tech-blog/blog/2014/01/22/Swiftly-Machine-Learning-1/)

.. you might find it interesting. :)

~~~
biomimic
LDA has also been used to identify genes related to lifespan. See
"Statistical modeling of biomedical corpora: mining the Caenorhabditis
Genetic Center Bibliography for genes related to life span":
[http://www.ncbi.nlm.nih.gov/pubmed/16681860](http://www.ncbi.nlm.nih.gov/pubmed/16681860)

------
BIackSwan
We did something similar about 3 and a half years ago where we detected online
communities based on their follower network and text of tweets. Something you
might be interested in: [http://www.slideshare.net/akshayubhat/twitter-lda](http://www.slideshare.net/akshayubhat/twitter-lda)

