
Ask HN: Machine learning to classify tweets into topics – Where to start? - Sujan
I want to build something that can classify tweets (from my timeline or
lists) by topic. I can of course provide a dataset of tweets pre-classified
for each variant, but then the software should be able to learn from that and
apply that logic to new, future tweets.

Variant A) Take all tweets of a timeline and decide if each one is "relevant"
or "irrelevant".

Variant B) Decide which of x topics a tweet belongs to.

Variant C) Group together tweets that are related to each other.

Where to start? What are the correct terms to Google? What libraries or
software should I look at?

(I am most comfortable with JS and PHP, but of course this is only
semi-relevant.)
======
mindcrime
A) and B) are variations of a classification problem. You can use any kind of
classifier algorithm / approach: Naive Bayes, an ANN, etc. C) is arguably
closer to a clustering problem, but you have to figure out how to define the
notion of distance / density for the clustering.

Basically, from a "what to google for" perspective, I'd say read up on
"classification algorithms" and "clustering algorithms". The respective
Wikipedia pages aren't a bad place to start reading.

[https://en.wikipedia.org/wiki/Statistical_classification](https://en.wikipedia.org/wiki/Statistical_classification)

[https://en.wikipedia.org/wiki/Cluster_analysis](https://en.wikipedia.org/wiki/Cluster_analysis)
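As a toy illustration of treating A) and B) as classification, here's a tiny
Naive Bayes text classifier written from scratch in plain Python (the tweets
and labels are invented examples; in practice you'd reach for a library like
scikit-learn rather than hand-rolling this):

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # crude tokenizer; real tweets need more careful preprocessing
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for w in tokenize(text):
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, text):
        words = tokenize(text)
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihood of each word under this label
            lp = math.log(self.label_counts[label] / total)
            n = sum(self.word_counts[label].values())
            for w in words:
                # Laplace smoothing so unseen words don't zero everything out
                lp += math.log((self.word_counts[label][w] + 1)
                               / (n + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.train([
    ("new js framework released today", "tech"),
    ("python library for machine learning", "tech"),
    ("great goal in the match tonight", "sports"),
    ("the team won the championship game", "sports"),
])
print(nb.predict("a new python framework"))  # -> tech
```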

------
syllogism
[https://github.com/explosion/spaCy/blob/develop/examples/tra...](https://github.com/explosion/spaCy/blob/develop/examples/training/train_textcat.py)

You'll need to `pip install spacy-nightly` -- it uses the spaCy 2 alpha.
Until spaCy 2 stabilises the API will be a bit of a moving target, and
unfortunately it's not in the docs yet.

You'll likely get better results by running `spacy download en_vectors_web_lg`
and doing `nlp = spacy.load('en_vectors_web_lg')`. This will download word
vectors, trained on an enormous web dump by Stanford NLP using their GloVe
algorithm.

Once the model is trained you can do `nlp.to_disk(output_directory)`, and then
run `spacy package <model directory> <package directory>`. This will set up
the model data as a Python package, so that you can run `setup.py sdist`. You'll
then get a self-contained Python package that exposes a `load()` function, to
give you back the `nlp` object. (Note that if you do base the model on the
GloVe vectors the package will be enormous, like 1GB. Shrug?)

If you're starting one step back and don't have the data annotated yet, you
might be interested in our annotation tool Prodigy. There's a demo video of
the text classification workflow here:
[https://www.youtube.com/watch?time_continue=638&v=5di0KlKl0f...](https://www.youtube.com/watch?time_continue=638&v=5di0KlKl0fE)

------
chasedehan
JS and PHP won't get you far here - you will need something else, like R or
Python, to start looking at it.

Check out this course:
[https://www.datacamp.com/courses/intro-to-text-mining-bag-of...](https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words)

This will help you figure out how to convert those words into variables which
can be used for modeling.

------
itamarst
[http://www.nltk.org/book/ch06.html](http://www.nltk.org/book/ch06.html)

------
arrmn
I'm currently working on something similar. These were my first two ideas:

Train your own word2vec model on a Twitter dataset, then represent each tweet
as the tf-idf-weighted average of its word vectors. You get a vector for each
tweet, and tweets about the same topic should end up close to each other.
Then try clustering algorithms; you can use the cosine distance to find the
nearest X tweets.

The second idea would be to train doc2vec on Twitter data.

Another worthwhile idea could be LDA, though I haven't tried it myself.
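The tf-idf / cosine-distance part of the first idea can be sketched in plain
Python (toy tweets; a real pipeline would additionally weight pretrained
word2vec vectors by these tf-idf scores rather than compare raw tf-idf
vectors directly):

```python
import math
import re
from collections import Counter

tweets = [
    "new python release with faster startup",
    "python 3 gets a speed boost",
    "our team won the cup final",
]

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

docs = [Counter(tokenize(t)) for t in tweets]
n_docs = len(docs)
# document frequency: how many tweets each term appears in
df = Counter(w for doc in docs for w in doc)

def tfidf(doc):
    # term frequency times inverse document frequency
    return {w: tf * math.log(n_docs / df[w]) for w, tf in doc.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d) for d in docs]
# the two python tweets should be closer to each other than to the sports one
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```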

------
byoung2
You could look at naive Bayesian classifiers or logistic regression
classifiers. There are libraries for both in most languages and they are
suited for your application.

------
mmikeff
I'm lazy and would start with TextRazor or the Alchemy API.

------
gerenuk
For topic modeling, take a look at gensim along with k-means. Also, you can
use tf-idf to improve the accuracy.
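A toy sketch of the k-means step, on 2-D stand-ins for tweet vectors (real
input would be the tf-idf vectors you get out of gensim; the deterministic
"first k points" init below is only for reproducibility -- real k-means uses
k-means++ or random restarts):

```python
def kmeans(points, k, iters=20):
    # naive deterministic init: first k points as starting centroids
    centroids = list(points[:k])
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return clusters

# two obvious groups of "tweet vectors" (toy 2-D stand-ins)
pts = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
       (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
clusters = kmeans(pts, 2)
print([len(c) for c in clusters])  # two clusters of three points each
```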

------
kk58
For tweets, Naive Bayes with a bag-of-words approach works very well.

You need to do a ton of preprocessing first. Think text transformation.
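A minimal example of the kind of text transformation meant here (the rules
are illustrative, not exhaustive -- real tweets also need handling for
retweets, emoji, unicode, etc.):

```python
import re

def preprocess(tweet):
    t = tweet.lower()
    t = re.sub(r"https?://\S+", " ", t)   # strip URLs
    t = re.sub(r"@\w+", " ", t)           # strip @mentions
    t = re.sub(r"#", "", t)               # keep hashtag text, drop the '#'
    t = re.sub(r"[^a-z0-9\s']", " ", t)   # drop remaining punctuation
    return t.split()

print(preprocess("Loving the new #Python release! https://t.co/abc @sujan"))
# -> ['loving', 'the', 'new', 'python', 'release']
```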

