
Visualizing Clusters of Clickbait Headlines Using Spark, Word2vec, and Plotly - minimaxir
http://minimaxir.com/2016/08/clickbait-cluster/
======
131012
"Once the dictionary is created, we can average all the word vectors for a
given headline to get the numeric representation of the headline itself."

Can someone explain to me why this is useful? Aren't we losing a lot of
precision from the word2vec results?

And as a general question: is there any useful knowledge we can extract from
this visualization, apart from the observation that, most of the time, news
channels write about different things?

Don't get me wrong, I think the techniques displayed here are really cool, but
I have the feeling the conclusions are either absent or trite.

~~~
guidopallemans
> Aren't we losing a lot of precision from word2vec results?

Not really, because the vectors are very high-dimensional, and the words
that occur together in the same headline usually aren't close together in
the vector space.

So it's not really losing precision; it's more like combining the meanings of
the words (numerically).
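
As a rough illustration of the averaging step quoted above, here is a minimal
sketch in plain Python. The function name `average_headline_vector` and the
toy 3-dimensional vectors are made up for illustration; real word2vec vectors
come from a trained model and have far more dimensions, and this is not
necessarily how the article implements it.

```python
from typing import Dict, List


def average_headline_vector(
    headline: str, vectors: Dict[str, List[float]]
) -> List[float]:
    """Average the vectors of the words in a headline that have a vector."""
    # Lowercase and split on whitespace; words without a vector are skipped.
    words = [w for w in headline.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in headline")

    dim = len(vectors[words[0]])
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value

    # Element-wise mean: one vector with the same dimensionality as the words.
    return [t / len(words) for t in total]


# Hypothetical toy vectors (assumption: stand-ins for real word2vec output).
toy_vectors = {
    "cats": [1.0, 0.0, 0.0],
    "love": [0.0, 1.0, 0.0],
    "boxes": [0.0, 0.0, 1.0],
}

headline_vec = average_headline_vector("Cats love boxes", toy_vectors)
```

With these toy vectors, the headline ends up equally far from each of its
words, which is the "combining the meanings" effect in miniature.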

------
odbol_
All that in-depth explanation, and he forgot the first rule of data
visualization: always label your axes!

This should have been at the top, but was buried at the bottom of the article:
"The left side of the 2D representation represents the more serious headlines,
while the right side represents the more silly headlines."

~~~
minimaxir
The axes are _intentionally_ unlabeled. They don't represent anything, just
spatial coordinates.

The fact that the X-axis had a parseable interpretation was due to the
randomness of the clustering algorithm. (The sliding scale of seriousness is
not a hard rule, as there are many counterexamples; I noted the scale as a
quick observation.) It is possible that the chart could end up rotated if the
algorithm is run again.
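
To make the rotation point concrete, here is a small sketch in plain Python
showing that rotating a 2D layout changes every coordinate but leaves all
pairwise distances intact, so the picture carries the same information. The
points and the helper names `rotate` and `pairwise_distances` are made up for
illustration, and a real rerun of a randomized algorithm may differ by more
than a pure rotation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate(points: List[Point], angle_rad: float) -> List[Point]:
    """Rotate every point around the origin by the given angle."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points: List[Point]) -> List[float]:
    """Euclidean distance between every unordered pair of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


# Hypothetical 2D layout (assumption: stand-in for the chart's coordinates).
points = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.0)]
rotated = rotate(points, math.radians(90))

before = pairwise_distances(points)
after = pairwise_distances(rotated)
```

The coordinates in `rotated` differ from those in `points`, yet `before` and
`after` match up to floating-point error, which is why the axes themselves
have no fixed meaning.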

------
soared
It's interesting to hover over the blue BuzzFeed dots on the left side. Those
are the cases where BuzzFeed does actual reporting.

------
Aelinsaar
> _Coincidentally around the same time Facebook announced their anticlickbait
> initiative, Facebook open-sourced their fasttext project, which can quickly
> build models to classify text using some of the above example techniques.
> Hmmmmmm…_

There's an interesting notion: a clickbait filter.

