
Deep Learning on Title and Content Features to Tackle Clickbait - abhisvnit
https://www.linkedin.com/pulse/clickbaits-revisited-deep-learning-title-content-features-thakur
======
grabcocque
The problem is, we don't have a clear definition of what clickbait is.

 _noun, informal: (on the Internet) content whose main purpose is to attract
attention and encourage visitors to click on a link to a particular web page._

But that's basically everything on the web.

~~~
BooglyWoo
I'm not sure that quite addresses the problem here.

After all there is no clear definition of 'what dogs look like' (in the sense
of a collection of logical rules), but deep learning models excel at detecting
them, when provided with enough positive examples.

If it's possible for humans to agree on whether a given article is clickbait
or not, we should be able to put together an adequate dataset for training a
system to classify them too. From the linked article I am unable to discern
how the training dataset was labelled.

In other words, the fact that 'clickbait' is a nebulous concept shouldn't
preclude machine learning from being able to detect it.

~~~
astrodust
Just as "dogness" is a factor, so is "clickbaityness". You're right, this is
all about thresholds.

~~~
kristianc
I often wonder what Wittgenstein would have made of today's models of machine
learning / deep learning
[https://en.m.wikipedia.org/wiki/Family_resemblance](https://en.m.wikipedia.org/wiki/Family_resemblance)

~~~
BooglyWoo
Me too. My reading of the Blue and Brown books led me to believe
Wittgenstein's conception of meaning is inextricably tied up with the notion
of "learning" and exposure to language and its use, rather than being
contingent on 'hard' logico-mathematical derivations of formal semantics.

This contrast seems somewhat reminiscent of the complementary approaches of
hard-coded rule based AI vs machine learning.

------
visarga
I'd like a system to filter out fluff threads on reddit. It would reject easy-
consumption content such as images, gifs and short vids, or anything shorter
than 60 seconds; also, low quality comments (short, aggressive, memes, etc).
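
A minimal sketch of the hard-coded part of such a filter, before any machine learning is involved. The `Post` fields, the `kind` labels and the 15-word comment cutoff are assumptions made for illustration, not anything Reddit exposes in this form; detecting aggressive or meme comments would need a trained classifier and is left out.

```python
from dataclasses import dataclass

MIN_VIDEO_SECONDS = 60   # cutoff named in the comment above
MIN_COMMENT_WORDS = 15   # assumption: what counts as a "short" comment


@dataclass
class Post:
    kind: str                 # hypothetical labels: "image", "gif", "video", "text"
    duration_s: float = 0.0   # only meaningful for videos


def keep_post(post: Post) -> bool:
    """Reject images, gifs and videos shorter than the cutoff."""
    if post.kind in {"image", "gif"}:
        return False
    if post.kind == "video" and post.duration_s < MIN_VIDEO_SECONDS:
        return False
    return True


def keep_comment(text: str) -> bool:
    """Reject comments with fewer words than the assumed minimum."""
    return len(text.split()) >= MIN_COMMENT_WORDS
```

Rules like these only cover the mechanical criteria; they say nothing about whether a long text post is actually interesting.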

Reddit is a gold-mine of interesting content, but it is flooded with fluff and
garbage to the point where it becomes a problem to find the good parts.

I'm wondering why they don't use more machine learning magic on the site.
There are multiple machine learning papers based off the reddit comment
corpus.

~~~
make3
Your best bet is to filter out meme subs.

~~~
visarga
There is also a need to find interesting threads outside a known list of subs,
or to filter out some parts of otherwise good subs.

------
baxuz
You could add any title that's formulated as an imperative. "You won't
believe..." "Guess which..." "You should..."

Also titles that are formulated as a simple subject - predicate - object
sentence: "XY considered anti-pattern" "Trump is right" "Hitler did nothing
wrong" "Drunk girl shows tits" "Homeopathy is the future of medicine"

Same works if formulated as a question: "Is Trump right?" "Has Hitler done
nothing wrong?" "Is homeopathy the future of medicine?"

Bonus points for exclamation marks, pound signs and uppercase words.
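
A rough sketch of these surface rules as a scoring function. The opener list is just the three examples above, and the equal weighting of every cue is an assumption; the subject - predicate - object rule is left out because it would need a parser.

```python
import re

# Hypothetical cue list: only the three example openers from the comment above.
IMPERATIVE_OPENERS = ("you won't believe", "guess which", "you should")


def clickbait_score(title: str) -> int:
    """Count how many of the listed surface cues fire for a title."""
    lowered = title.lower().strip()
    score = 0
    if lowered.startswith(IMPERATIVE_OPENERS):
        score += 1
    if lowered.endswith("?"):
        score += 1
    # One "bonus point" per exclamation mark and per pound sign.
    score += title.count("!")
    score += title.count("#")
    # One point per fully uppercase word of two or more letters.
    score += len(re.findall(r"\b[A-Z]{2,}\b", title))
    return score


print(clickbait_score("You won't believe WHAT happens next!"))  # 3
print(clickbait_score("Is homeopathy the future of medicine?"))  # 1
```

Note that the uppercase rule also fires on ordinary acronyms, so a score like this is at best one weak feature among many.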

~~~
mtgx
Not all clickbait headlines are written like that.

For instance: "Russia hacked US power grid" doesn't have any of those, and yet
it was a completely clickbait/sensationalist/borderline fake news headline
from WashPost. How is AI going to deal with _those_?

[https://theintercept.com/2016/12/31/russia-hysteria-
infects-...](https://theintercept.com/2016/12/31/russia-hysteria-infects-
washpost-again-false-story-about-hacking-u-s-electric-grid/)

~~~
hyperpape
That wasn't clickbait. Arguably it was worse. "You'll never guess what happens
when she starts to sing!" isn't likely to contribute to increased military
tensions between nuclear powers.

To put it as a triviality: just because two things are bad doesn't mean they
have to be bad in the same way.

I also wouldn't classify that story as "fake news"[0]. Those were things like
"Revealed: Obama says Clinton would be terrible president", or "Revealed:
Trump under investigation by European Court for Human Rights". Those were
straightforward false claims, with zero actual sourcing, by people who knew
they were lying. This Washington Post article was shitty reporting, using thin
sources, that fit a currently popular hysteria. And it was completely
inaccurate. But the authors didn't sit down and say "what can we make up."
They got some sources and didn't do any due diligence, because it was too hard
to pass up on such a juicy story.

I'm not wedded to the idea that these articles aren't fake news, but I'm
confident it doesn't make sense to call them clickbait.

[0] Of course, this relies on the idea that fake news doesn't just mean "news
that is wrong", which has been with us forever, but more about a social media
driven trend within the past year or two.

------
joosters
Simple filter for tech articles:

$clickbait = /Deep/;

------
jj12345
Thanks for the nice, condensed article. Generating features from BeautifulSoup
isn't something that I've considered before.

I'm still going through Yoshua Bengio's new book on DL, but if anyone is free
to comment: what are the justifications for the general architecture? Why use
LSTMs with the GloVe embeddings?

~~~
volker48
Seems like everyone uses the GloVe embeddings for any text-based DL project.
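
For anyone who hasn't used them: a small sketch of what "using GloVe embeddings" amounts to on the input side. The pretrained files are plain text, one token per line followed by its vector, and a title becomes a sequence of those vectors, which is the kind of input a recurrent layer such as an LSTM consumes. The 3-dimensional sample vectors and the helper names here are made up for illustration, and this is not necessarily how the article's code does it.

```python
import io


def load_glove(handle):
    """Parse GloVe's text format: a token, then its space-separated floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embed_title(title, vectors, dim):
    """Turn a title into a list of vectors; unknown tokens map to zeros."""
    zero = [0.0] * dim
    return [vectors.get(token, zero) for token in title.lower().split()]


# Toy stand-in for a real GloVe file (real vectors have 50 to 300 dimensions).
SAMPLE = "you 0.1 0.2 0.3\nbelieve 0.4 0.5 0.6\n"
vectors = load_glove(io.StringIO(SAMPLE))
sequence = embed_title("You won't believe", vectors, dim=3)
print(sequence)
```

Starting from pretrained vectors means the model does not have to learn word similarity from a few thousand headlines, which is probably a large part of why they are the default choice.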

------
hikkigaya
All I see is that the author uses deep learning to distinguish posts published
by Buzzfeed, clickhole, upworthy and stopclickbaitofficial vs. the other
pages?

~~~
samirahmed
Yes. I am skeptical about how generalizable the final model is, given that
lots of features (numerical and text) are closely tied to the same domain.
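
One way to check this would be to hold out whole sources rather than random rows, so the model is always scored on a domain it never saw in training. A sketch, assuming each example is a dict with a `domain` key; the site names and titles below are placeholders, not the article's data.

```python
from collections import defaultdict


def leave_one_domain_out(examples):
    """Yield (held_out_domain, train, test) with no domain shared across the split."""
    by_domain = defaultdict(list)
    for example in examples:
        by_domain[example["domain"]].append(example)
    for held_out in sorted(by_domain):
        train = [
            example
            for domain, rows in by_domain.items()
            if domain != held_out
            for example in rows
        ]
        yield held_out, train, by_domain[held_out]


# Placeholder data for illustration only.
examples = [
    {"domain": "site-a", "title": "t1", "label": 1},
    {"domain": "site-a", "title": "t2", "label": 1},
    {"domain": "site-b", "title": "t3", "label": 0},
    {"domain": "site-c", "title": "t4", "label": 0},
]
for held_out, train, test in leave_one_domain_out(examples):
    print(held_out, len(train), len(test))
```

If accuracy drops sharply under this split compared with a random one, the model is mostly recognising the publisher rather than the clickbait.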

------
abhisvnit
code available here:
[https://github.com/abhishekkrthakur/clickbaits_revisited](https://github.com/abhishekkrthakur/clickbaits_revisited)

------
empath75
I'm not sure how he can define what clickbait is and what's not.

The NY Times isn't immune to publishing clickbait, and BuzzFeed sometimes
posts really solid journalism.

------
matrix2596
Crowdsourcing is an amazing thing to do.

