
Ask HN: Deep learning algorithms to aggregate technology topics from the web - larryfreeman
A friend and I were talking about putting together a project that crowdsources cool technology topics that leveraged the latest algorithms.

I'm assuming that we should use something like PyTorch and possibly leverage the great work done by Yann LeCun.

What deep learning algorithms are recommended for this project? If you can specify a technical paper or library with sample code, that would be awesome.
======
visarga
I did something similar for another language. I crawled millions of articles
first and built word2vec on the plain text. To compute the embedding of a
topic, I summed the vectors of its main keywords - 3-4 well chosen words are
enough. The embedding of an article was obtained by summing (or averaging) the
vectors of its words. I skipped the stop words (also tried tf-idf) to reduce
the noise. The final step was to compute the similarity score of an article
relative to a topic. This is extremely easy and fast: a dot product between
the two vectors. Scores over 0.3 (or 0.5) indicate similarity. The main advantage
of this method is that it only requires a topic vector, not a whole dataset of
training examples. But if you have such a dataset, then you can average the
most central keywords per topic, and get topic vectors.
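
A minimal sketch of the steps above, for illustration only. The tiny 3-d
vectors in TOY_VECTORS are made up and stand in for a trained word2vec model,
the stop-word list is a placeholder, and normalising the vectors before the dot
product (so the score is a cosine similarity) is an assumption, since the 0.3
and 0.5 thresholds only make sense on a bounded scale.

```python
import math

# Hypothetical 3-d word vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "neural":   [0.9, 0.1, 0.0],
    "network":  [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "gpu":      [0.6, 0.1, 0.3],
    "recipe":   [0.0, 0.9, 0.2],
    "oven":     [0.1, 0.8, 0.3],
    "bake":     [0.0, 0.9, 0.1],
}
# Placeholder stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "on", "in", "of", "and", "to"}


def embed(words, vectors=TOY_VECTORS):
    """Average the vectors of known, non-stop words, then L2-normalise."""
    known = [vectors[w] for w in words if w not in STOP_WORDS and w in vectors]
    if not known:
        return None
    dim = len(known[0])
    mean = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else None


def similarity(a, b):
    """Dot product of two unit vectors, i.e. cosine similarity."""
    if a is None or b is None:
        return 0.0
    return sum(x * y for x, y in zip(a, b))


# Topic vector from 3 well-chosen keywords, article vectors from their words.
topic = embed(["neural", "network", "gpu"])
ml_article = embed("training a neural network on the gpu".split())
food_article = embed("bake the recipe in the oven".split())

ml_score = similarity(topic, ml_article)
food_score = similarity(topic, food_article)
print(f"ml article: {ml_score:.2f}, food article: {food_score:.2f}")
```

With real embeddings, `TOY_VECTORS` would be replaced by a lookup into the
trained model, and the threshold would be tuned on a handful of labelled
articles.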

If you have hundreds of classes and a training dataset with 500+
examples per class, you can also try fastText, Vowpal Wabbit or even Naive
Bayes. If you want to use neural nets, there are some 1D CNNs floating around
on GitHub, but they don't work all that well compared to simpler classifiers
or a simple dot product between vectors. Hundreds of classes usually make
classifiers sluggish and accuracy is not so great compared to the binary case
(spam/not spam). I wouldn't try to do that to predict the best subreddit for
an article for example, because there are too many subreddits, but with
vectors it's still OK.
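
For the Naive Bayes option, a rough sketch of a multinomial classifier with
add-one smoothing, written in plain Python so it runs without any library.
The function names `train_nb` and `predict_nb` and the four training
sentences are made up for illustration; a real setup would use the 500+
examples per class mentioned above, or an existing implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs is a list of (label, text) pairs. Returns the counts needed to score."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][w] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training set with two classes.
model = train_nb([
    ("ml", "neural network training on gpu"),
    ("ml", "deep learning model training"),
    ("cooking", "bake bread in the oven"),
    ("cooking", "simple pasta recipe"),
])
ml_guess = predict_nb(model, "gpu training for a neural model")
food_guess = predict_nb(model, "oven recipe")
print(ml_guess, food_guess)
```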

------
visarga
Take a look here as well:
https://hackernoon.com/the-unreasonable-ineffectiveness-of-deep-learning-in-nlu-e4b4ce3a0da0

It is exactly about news classification with deep learning.

