
Show HN: An Open Source Tool to Combat Clickbait Links - goldMIT
http://links.spince.com/demo.html
======
tlb
It assumes "a relevant (i.e. non-clickbait) link would have its text appear
frequently on the actual page".

Is there empirical evidence for this claim? The examples featured in the
article seem mixed (some right, some wrong).

It's easy enough to test: collect human judgement scores for a few hundred
article links from several news sites (an afternoon's work) and compare to the
algorithm.
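
For what it's worth, the comparison could be as small as the sketch below. It assumes each link has a human yes/no "clickbait" label and a yes/no verdict from the tool; the function names and the boolean encoding are made up for illustration and are not part of the project.

```python
def agreement(human_labels, algo_labels):
    """Fraction of links where the algorithm matches the human judgement."""
    if len(human_labels) != len(algo_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labelled link")
    matches = sum(h == a for h, a in zip(human_labels, algo_labels))
    return matches / len(human_labels)


def precision_recall(human_labels, algo_labels):
    """Precision and recall for the 'clickbait' (True) class."""
    pairs = list(zip(human_labels, algo_labels))
    true_pos = sum(1 for h, a in pairs if h and a)
    false_pos = sum(1 for h, a in pairs if not h and a)
    false_neg = sum(1 for h, a in pairs if h and not a)
    # Guard against division by zero when a class never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Raw agreement alone can look good when most links are not clickbait, which is why precision and recall on the clickbait class are reported separately.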

~~~
softdev12
Thanks for the comment. There have been a lot of studies done using the
general TF-IDF approach. You can read about it on
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
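
As a rough illustration of the idea (not the extension's actual code), here is a minimal sketch that scores a link by how many of its words show up repeatedly on the target page. It uses plain term frequency only, without the IDF weighting; the stopword list, the `min_count` threshold, and the function names are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "for", "on", "with", "this"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def relevance_score(link_text, page_text, min_count=2):
    """Fraction of non-stopword link terms that occur at least min_count times on the page."""
    terms = [t for t in tokenize(link_text) if t not in STOPWORDS]
    if not terms:
        return 0.0
    counts = Counter(tokenize(page_text))
    hits = sum(1 for t in terms if counts[t] >= min_count)
    return hits / len(terms)
```

A score near 1.0 would suggest the link text describes the page, and a score near 0.0 would suggest it does not.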

There is also a new version planned using a different approach.

------
samsolomon
I'm not sure that term frequency is the right way to go about the problem of
clickbait. Frequently, titles and body text are meant to match up as a form of
SEO.

I'm not sure if this can be done through an extension, but an alternative way
to go about this is by measuring time spent on the page. If the page has X
words which takes Y minutes to read, but a certain number of people bounce
before that time, the score is lowered. The more users that stay on the page
for Y minutes, the higher the score.

There are a lot of assumptions in that solution, but it might be worth
considering.
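
A minimal sketch of that scoring, assuming dwell times per visitor are already being collected somehow (that part is hand-waved) and assuming a reading speed of 200 words per minute. All names and defaults here are hypothetical.

```python
def expected_read_seconds(word_count, words_per_minute=200):
    """Y: estimated time to read a page of X words, in seconds."""
    return word_count / words_per_minute * 60


def dwell_score(word_count, dwell_seconds, words_per_minute=200, min_fraction=1.0):
    """Fraction of visitors who stayed at least min_fraction of the expected reading time.

    Returns None when there are no visits to judge from.
    """
    if not dwell_seconds:
        return None
    threshold = expected_read_seconds(word_count, words_per_minute) * min_fraction
    stayed = sum(1 for seconds in dwell_seconds if seconds >= threshold)
    return stayed / len(dwell_seconds)
```

Lowering `min_fraction` (say to 0.5) would make the score more forgiving of people who skim.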

~~~
softdev12
Thanks for the suggestion. The code is open sourced on github. Feel free to do
a pull request with your approach and I'll take a look at it.

------
yakult
Oh, this is exactly what I wanted - I hope you make a Firefox addon soon.

How does your approach of looking for text frequency compare to, say,
pattern-matching existing clickbait titles from a database? Can the two
approaches be combined (say, by using Splice to generate a corpus, remove
false positives manually, then use it to train a pattern matcher?) Not having
to load the linked article has huge benefits on bandwidth, robustness, etc.
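
To make the pattern-matching idea concrete, here is a small sketch that checks a title against a handful of hand-written regexes without fetching anything. The patterns are invented examples, not a real database; a trained matcher would presumably replace the list.

```python
import re

# Hypothetical patterns; a real list would come from a curated corpus.
CLICKBAIT_PATTERNS = [
    re.compile(r"\byou won'?t believe\b", re.IGNORECASE),
    re.compile(r"\bthis one weird trick\b", re.IGNORECASE),
    re.compile(r"^\d+\s+(things|reasons|ways)\b", re.IGNORECASE),
    re.compile(r"\bwhat happen(s|ed) next\b", re.IGNORECASE),
]


def looks_like_clickbait(title):
    """Return True if the title matches any known clickbait pattern."""
    title = title.strip()
    return any(pattern.search(title) for pattern in CLICKBAIT_PATTERNS)
```

Since this only looks at the link text, it costs no extra requests, at the price of missing anything not covered by the patterns.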

~~~
softdev12
Thanks. This was built because someone posted an ASK HN a few weeks ago. They
wanted to know if there was a way to stop clickbait. The text frequency
approach seemed like the simplest and cleanest approach to build this in a
short timeframe.

Firefox addon is in the works. If you want to contribute code, go to github
and make a pull request.

------
rcthompson
So that means it's fetching the content (or at least the HTML portion) of
every eligible link? That seems like potentially a lot of network traffic for
what it does. Fine if you have a fast connection, I suppose.

~~~
fcanela
Can be even worse. I have seen some ugly corporate services using links with
GET request to perform delete operations and having serious data loss caused
by crawlers like this tool.

~~~
softdev12
This was definitely a consideration. It was why there is a blacklist feature
and why only links of a certain length and certain type are analyzed. Any
suggestions can be made on github via a pull request.
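
Roughly, that kind of pre-fetch filter could look like the sketch below. This is not the extension's code: the blacklisted host, the minimum word count, and especially the check for risky words in the URL are assumptions added for illustration.

```python
from urllib.parse import urlparse

# Hypothetical values; a real blacklist would be user-configurable.
BLACKLISTED_HOSTS = {"intranet.example.com"}
RISKY_WORDS = ("delete", "remove", "logout", "unsubscribe")


def should_fetch(url, link_text, min_words=4):
    """Decide whether a link is safe and worthwhile to fetch for analysis."""
    parsed = urlparse(url)
    # Only ordinary web links are analyzed.
    if parsed.scheme not in ("http", "https"):
        return False
    if (parsed.hostname or "") in BLACKLISTED_HOSTS:
        return False
    # Very short link text (e.g. "Home") is unlikely to be a headline.
    if len(link_text.split()) < min_words:
        return False
    # Assumption: skip URLs that look like they trigger an action via GET.
    target = (parsed.path + "?" + parsed.query).lower()
    return not any(word in target for word in RISKY_WORDS)
```

The risky-word check is only a heuristic; it cannot catch a destructive GET endpoint with an innocent-looking URL, so the blacklist is still needed.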

------
Mao_Zedang
It would be interesting to train a model using user-generated labels ("is this
title clickbait, yes/no"). Is this something machine learning could get very
good at?

~~~
rcthompson
"Avoid clickbait links with this one weird trick!"

~~~
stevetrewick
"Publishers hate it!"

------
spdustin
Why not train a network on ad unit headings from Outbrain and the like. That
seems like it'd be pretty spot-on. These in-situ pseudo-related-articles ad
units are, by their nature, click-bait. Aren't they?

~~~
softdev12
Thanks for the comment. There is a new version under development that will use
a different approach. Most likely a neural network.

------
thrilleratplay
Oh...nice try. I almost clicked on that link.

------
crasp
It seems to work pretty well. I can understand how this is far from perfect so
far but the thing that bothers me the most is that there is no apparent
caching mechanism and even though my internet connection is pretty fast it
will take a good 10-20 seconds to go through all the links. Once you have
determined a site to be clickbait or not, maybe cache it so I don't have to
wait 20 seconds every time I come back. Additionally, this will save a lot of
bandwidth.
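
Something along these lines, as a minimal sketch: an in-memory cache keyed by URL with an expiry time. The class name, the one-day default, and the `compute` callback are hypothetical; in an extension the entries would presumably live in browser storage instead.

```python
import time


class ScoreCache:
    """Remember scores per URL so repeat visits skip the network fetch."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # url -> (timestamp, score)

    def get_or_compute(self, url, compute):
        """Return the cached score for url, or call compute(url) and store it."""
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        score = compute(url)
        self._entries[url] = (now, score)
        return score
```

The expiry matters because a page's content (and so its score) can change after it was first checked.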

------
DanielBMarkham
Note: there is a form of writing where the headline introduces a weird or
interesting concept and then the writer makes the reader wait for it. When
done well, these can be some really good articles.

Not sure how to account for that. Just wanted to point out the lossiness of
the algorithm.

Looks like a cool app. Thanks!

------
outofstep
I like the idea behind using TF-IDF. Feels like a lot of unnecessary network
traffic though.

------
touristtam
The example shows the BBC with a green dot and NPR with a red one? O_O

BBC isn't as reliable as its reputation would have you believe ....

------
fcanela
Not an issue at all, but I am curious. Why does "Spanish" have the Mexico flag?
Was the Spanish language modeled with that specific variant?

~~~
nibnib
It's the largest Spanish-speaking population, maybe that's why?

