
An ImageNet-like text classification task based on Reddit posts - sweezyjeezy
http://www.evolution.ai/blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/
======
samcodes
The post mentions not getting great results with OpenAI Transformer. I haven't
tried that, but using a similar framework, ULM-FiT, I narrowly beat the
fasttext benchmark on a 250-class dataset we use internally. I will follow up
with how it does on this data set.

~~~
sh33mp
ULM-FiT and OpenAI's Transformer* are quite different. Both are pretrained
language models, but ULM-FiT is a standard stack of LSTMs with a particular
recipe for fine-tuning, whereas OpenAI's Transformer uses the much newer
Transformer architecture, with no really fancy tricks in the actual
fine-tuning. I suspect the difficulty is with the Transformer model itself:
this is not the first time I've heard that it is difficult to train.

* = To be clear, this refers to OpenAI's pretrained Transformer model. The Transformer architecture was from work at Google.

------
ppod
This looks fantastic. In particular, the focus on many-class classification is
important, it's a common real-world task that is often overlooked. I have some
suggestions:

More baseline metrics would be useful, e.g. plain accuracy, plus micro and
macro F1 with unbalanced classes.
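
Both averages fall out of per-class true/false positive counts; a minimal
pure-Python sketch (toy labels invented here for illustration, no library
dependencies):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class, micro-, and macro-averaged F1 from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it wasn't p
            fn[t] += 1  # was t, but we missed it

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    macro = sum(per_class.values()) / len(labels)  # each class weighted equally
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, micro, macro

# Toy imbalanced case: class "b" is rare and always misclassified.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
per_class, micro, macro = f1_scores(y_true, y_pred)
```

Note that for single-label multiclass problems, micro-F1 equals accuracy
(here 0.8), while macro-F1 (here about 0.44) punishes the total miss on the
rare class, which is exactly why both are worth reporting on unbalanced data.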

It would be very useful to know inter-annotator agreement for the manual
classification and human performance for the task of identifying the original
subreddit. I'm not a huge fan of creating artificial categories when natural
ones are available. In practice there will be a real difference between the
26th and 27th league of legends subreddit, it might be some subtle topical
focus shift or something political or tonal.

Is there some kind of standard measure for trading off precision and recall
when classifying in a hierarchical class structure? That is, you start predicting
general high-level categories and move down to the most specific class you can
get to before confidence falls below a threshold? Then the evaluation measure
gives you more credit for getting lower down the tree (rewarding information
gain in the class hierarchy).
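
The prediction scheme described above can be sketched directly; this is a
hypothetical toy implementation (the tree, probabilities, and threshold are
all made up, and a real system would rescore at each level rather than use
one fixed probability per node):

```python
def hierarchical_predict(probs, children, threshold=0.5, root="root"):
    """Walk down a class tree, returning the deepest node whose
    predicted probability stays at or above `threshold`.

    probs:    dict mapping each node to the model's confidence in it
    children: dict mapping each node to a list of its child nodes
    """
    node = root
    while True:
        candidates = children.get(node, [])
        if not candidates:
            return node  # reached a leaf
        best = max(candidates, key=lambda c: probs.get(c, 0.0))
        if probs.get(best, 0.0) < threshold:
            return node  # too unsure to commit to anything deeper
        node = best

# Hypothetical hierarchy mirroring the gaming example in this thread.
children = {"root": ["gaming"], "gaming": ["finalfantasy"],
            "finalfantasy": ["FFVIII"]}
probs = {"gaming": 0.95, "finalfantasy": 0.7, "FFVIII": 0.3}
pred = hierarchical_predict(probs, children, threshold=0.5)
# stops at "finalfantasy": confidence in FFVIII (0.3) is below the threshold
```

An evaluation measure along the lines suggested could then score a prediction
by the depth of the deepest correctly-predicted ancestor, so stopping one
level early costs less than a wrong leaf.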

~~~
sweezyjeezy
A nice comment, good to see other people are thinking about this! I agree
with you about the imbalanced classes, and I do have a copy of this. The main
issue is that this dataset was built by looking only at subreddits with 1000
posts or more, so the class imbalance is somewhat unrealistic. If I do
publish an imbalanced version it will include all subreddits, not just the
carefully selected 1013.

re: the reason for the artifice. First of all, note that none of the labels
here are exactly superficial. I did make a taxonomy, but I only used it to
filter out subreddits; I did not combine the posts from different subreddits
in the same category. The main reason was to combat the fact that these are
otherwise not great labels: many subreddits are subsets of others, e.g.
r/gaming -> r/finalfantasy -> r/FFVIII, and you don't know a priori that this
follows a hierarchy (N.B. categorising all subreddits would require
significant resources).

Worse than this, you have subreddits that don't really follow any obvious
kind of categorisation, e.g. r/askreddit (by far the most populous in terms
of self-posts) or, more randomly, subreddits devoted to podcasts like
r/joerogan. They are basically places where people go for broad, like-minded
chat, and they can overlap with just about anything. I would argue that this
ambiguity is actually not always realistic: for the examples I have worked on
in the past, labels were reasonably unambiguous.

------
minimaxir
Fun fact: Reddit uses an NLP technique similar to the t-SNE vizzes to combine
the content of subreddits for building recommendations:
[https://www.youtube.com/watch?v=tKISLQ87GO8](https://www.youtube.com/watch?v=tKISLQ87GO8)

~~~
thesehands
Thanks for posting this. Super cool to see how they are suggesting subreddits.

------
Scaevolus
You can get similar raw data using these public bigquery datasets:
[https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit...](https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts)

------
nwsm
Are you breaking Reddit TOS by storing/hosting posts?

If someone deletes their post on Reddit it will still be stored and available
on your site.

~~~
sosorry44
Why would OP care about reddit TOS?

~~~
super-serial
Because violating copyright laws could get him sued and jeopardize his
startup?

