
How I trained fake news detection AI with 95% accuracy, and almost went crazy - thetall0ne
https://towardsdatascience.com/i-trained-fake-news-detection-ai-with-95-accuracy-and-almost-went-crazy-d10589aa57c
======
wadkar
The fakebox doesn’t detect fake news; it detects articles which are
factual/real, and everything else is labeled as “fake”.

Where’s the dataset? How did you verify the ground truth? Where are the
annotation/labeling guidelines?

What’s the definition of factual/real articles? The dataset appears to be
created by the author - which isn’t necessarily wrong, but to paraphrase Karl
Popper (in the context of human knowledge and scientific endeavors):

There are no ‘pure’ facts available; all observations are functions of
subjective factors such as interests, expectations, wishes etc.

[http://plato.stanford.edu/entries/popper/#GrowHumaKnow](http://plato.stanford.edu/entries/popper/#GrowHumaKnow)

~~~
saurabhn
I'm with @wadkar on this. I think the Fake News Challenge Stage 1 (FNC-1) was
a good step towards this effort. They acknowledge (almost) all of these
concerns and start with Stance Detection as their first stage. In this
problem, pairs of article headlines and body text were classified into
{Agrees, Disagrees, Discusses, Unrelated}.
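
For anyone unfamiliar with the setup, a rough sketch of that stage looks
something like the following. This is not the official FNC-1 baseline; the
file name and column names are made up, and it assumes pandas/scikit-learn:

```python
# Rough sketch of an FNC-1-style stance task: classify (headline, body) pairs
# into {agree, disagree, discuss, unrelated}. Not the official baseline.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed CSV layout (illustrative): headline, body, stance.
df = pd.read_csv("stances_with_bodies.csv")
text = df["headline"] + " [SEP] " + df["body"]   # naive pairing of the two fields

X_train, X_test, y_train, y_test = train_test_split(
    text, df["stance"], test_size=0.2, random_state=0, stratify=df["stance"])

vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

# Per-class precision/recall matters here, not a single accuracy number.
print(classification_report(y_test, clf.predict(vec.transform(X_test))))
```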

Constructive criticism to the OP: I'd suggest they read the nuances and
discussion around the Fake News Challenge [0] and then look into its datasets
and evaluation code [1], instead of hand-coding their own "biases" into a
{"Fake News", "Not-Fake-News"} binary classifier. Feel free to replace "Fake
News Challenge" with any other similar effort, so that the OP isn't taking on
the massive task of "Solving Fake News" all alone.

Disclaimer: I don't have any stake in FNC-1

References:

[0] [http://www.fakenewschallenge.org/](http://www.fakenewschallenge.org/)

[1]
[https://github.com/FakeNewsChallenge/fnc-1](https://github.com/FakeNewsChallenge/fnc-1)

~~~
wadkar
Thanks for the FNC links - quite interesting! This would be a nice
challenge/dataset for grad students to work on as a project in an ML/NLP class.

------
raker
This article is the 5%.

A more accurate way of detecting "fake news" would be interesting, but I fail
to see how such a thing could be designed, beyond simple detection of wishy-
washy and avoidant word patterns.

------
bagrow
Accuracy alone is not a sufficient measure of a classifier. Better to report
precision and recall, or any number of other composite measures.

[https://en.m.wikipedia.org/wiki/Evaluation_of_binary_classif...](https://en.m.wikipedia.org/wiki/Evaluation_of_binary_classifiers)
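
A toy example of why (all the numbers here are made up):

```python
# The same predictions viewed through accuracy vs. precision/recall.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth and predictions for a "fake" (1) vs "real" (0) classifier.
y_true = [0] * 90 + [1] * 10             # 90 real articles, 10 fake ones
y_pred = [0] * 90 + [1] * 1 + [0] * 9    # model catches only 1 of the 10 fakes

print(accuracy_score(y_true, y_pred))                 # 0.91 -- looks great
print(precision_score(y_true, y_pred, pos_label=1))   # 1.00 -- no false alarms
print(recall_score(y_true, y_pred, pos_label=1))      # 0.10 -- misses 9 of 10 fakes
```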

------
minimaxir
The OP does not state the label distribution of the training data; it's entirely
possible that the split is not balanced 50/50, which would make "95% accuracy"
misleading as an indicator of quality.
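
To illustrate with hypothetical numbers: on a 95/5 label split, a classifier
that just predicts the majority class already scores 95% without learning
anything:

```python
# If, hypothetically, 95% of examples were labeled "real", a model that always
# answers "real" would already hit "95% accuracy".
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X = [[0]] * 100                      # features don't matter for this baseline
y = ["real"] * 95 + ["fake"] * 5     # hypothetical 95/5 label split

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))   # 0.95, having learned nothing
```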

This is one of the reasons why I recommend that Medium thought pieces disclose
their data and code instead of just saying "I did AI magic!" to sell a product
(and they do charge for their product on their website).

------
richdougherty
> I found myself drifting in my own interpretation of fake news, getting angry
> as I came across articles that I didn’t agree with, fighting hard against
> the urge to only pick ones I thought were right. What was right or wrong
> anyway?

A good question and I'm not surprised he went a bit crazy.

[https://plato.stanford.edu/entries/truth/](https://plato.stanford.edu/entries/truth/)

> The problem of truth is in a way easy to state: what truths are, and what
> (if anything) makes them true. But this simple statement masks a great deal
> of controversy. Whether there is a metaphysical problem of truth at all, and
> if there is, what kind of theory might address it, are all standing issues
> in the theory of truth. We will see a number of distinct ways of answering
> these questions.

~~~
debt
What is "right" in news has become "a compelling narrative that appears
plausible to the reader."

What is "wrong" in news is "an implausible narrative relative to the reader."

Plato deals with truth, which is something entirely different; it's not in the
same realm as the current state of news.

To be more specific, news can be one of four things to a reader: plausible and
true, plausible and untrue, implausible and untrue, and implausible and true.

Most readers seem to be more concerned with a story's plausibility than with
its truthfulness.

Most importantly, though, readers no longer value the "truth" component of
news. They value whether or not the narrative aligns with their own view of
the world.

What truly happened doesn't matter to most people; what matters is that they
have a way to make sense of it for themselves, even if it's a partially or
completely false narrative.

------
thetall0ne
The model is not based on domains, just the text of the article. I can confirm
there was an even number of real and not-real news examples. The data set was
eventually broken into two categories: written with bias, or without bias. For
example, an NYT Opinion piece was considered not-real news.

------
txsh
He’s not detecting fake news. He’s detecting articles that match the writing
style of a handful of publications and labeling everything else “fake”.

~~~
maxk42
Yes, congratulations to this individual who has built a classifier that
classifies news from sources they like to peruse.

------
peterwwillis
What the....?

The author describes a "fake news detector AI" that is actually a "typically
legitimate source of news" data model combined with a fake-news domain
blacklist. It doesn't detect fake news. It detects whether a story possibly
came from a source you typically consider legitimate.

This article is fake news.

~~~
dantillberg
Yeah; I don't doubt that there's a lot of value in the collection of data they
built, but it's hard to judge the value of "95%" accuracy without comparison
to a baseline like `is_fake = !whitelist.includes(article_domain)`. That
whitelist is basically how our brains work currently, and my own (biased,
perhaps) accuracy rate is something close to 100%.
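
Roughly this kind of baseline, sketched out (the whitelist contents are just
placeholders):

```python
# Naive domain-whitelist baseline: anything not from a whitelisted domain is
# called "fake". The choice of domains is exactly where the bias lives.
from urllib.parse import urlparse

WHITELIST = {"nytimes.com", "reuters.com", "apnews.com", "bbc.co.uk"}  # placeholder list

def is_fake(article_url: str) -> bool:
    domain = urlparse(article_url).netloc.lower()
    domain = domain[4:] if domain.startswith("www.") else domain
    return domain not in WHITELIST

print(is_fake("https://www.reuters.com/some-story"))   # False
print(is_fake("https://totally-real-news.example/x"))  # True
```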

A "true fake news detector" would use only the text of the article -- without
the URL. And so I agree that this article is kind of like fake news of the
"misleading" variety. :)

~~~
eropple
Why would reputational analysis of sources not be part of analyzing news
trustworthiness?

~~~
wu-ikkyu
Just because someone is wrong most of the time doesn't mean they're wrong all
the time. It's an ad hominem.

~~~
eropple
OK. And? Is the point to be perfect, or is the point to be useful?

False fakes--and a wisely designed system would probably say "probably fake"
in the first place, not outright "fake"--are less damaging than false truths
overall.

~~~
wu-ikkyu
> False fakes--and a wisely designed system would probably say "probably fake"
> in the first place, not outright "fake"--are less damaging than false truths
> overall.

That depends on what your goal is: to discover groundbreaking truths (which
are almost always considered taboo/false at first) in order to improve
society, or merely to cover up apparent falsehoods as damage control.

In the latter case, such a system would be highly prone to enforcing the
preexisting biases and prejudices of its developers onto the users, making it
much more difficult for them to discover truths that the developers are
ignorant of.

------
tantalor
Where's the demo?

~~~
disconnected
This isn't obvious at all, but you need to create an account with them (you
only need an email address).

Then, on your account page, scroll down and there are instructions for
installing/running Fakebox (and the others).

The "demo" is actually a docker app that spins up a web application of some
sort. I'm firing it up as I type this, and I'll checking it out soon-ish.

------
mirekrusin
He needs to release/train at least 3 versions with whitelist-blacklist
variations for RT, Al Jazeera, and Fox News.

