
HN is in the same cluster as 2ch, not Techcrunch, on Twitter - rabidsnail
https://www.hella.cheap/twitter-star-chart.html
======
bhouston
2D projections of complex multidimensional data are extremely unreliable when
it comes to what adjacency means. Most adjacencies in particular are artifacts
of the chosen projection method.
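
A rough way to see this effect, as a minimal sketch rather than anything from the linked post: compare each point's nearest neighbours before and after a projection. The data (random Gaussian points), the projection (keeping the first two coordinates), and the helper names are all assumptions made up for illustration.

```python
import math
import random

def nearest_neighbours(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result

def neighbour_overlap(high, low, k=5):
    """Average fraction of each point's k nearest neighbours shared by both spaces."""
    pairs = zip(nearest_neighbours(high, k), nearest_neighbours(low, k))
    return sum(len(a & b) for a, b in pairs) / (k * len(high))

# Hypothetical data: 200 Gaussian points in 50 dimensions.
rng = random.Random(0)
cloud = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(200)]

# Crude linear projection, assumed for illustration: keep the first two coordinates.
flat = [p[:2] for p in cloud]

print(neighbour_overlap(cloud, flat))
```

An overlap near 1 would mean neighbours in the 2D picture are mostly neighbours in the original space; a value near 0 means the adjacency on screen comes mostly from the projection.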

~~~
daniel-levin
This comment got me thinking: in some applications, Euclidean distance between
feature vectors acts as a good proxy for adjacency/similarity. For such
applications, an isometry from R^n to R^2 or R^3 should in principle preserve
the meaning of adjacency. A quick Google yields [0, 1] techniques for
quasi-isometric and isometric dimensionality reduction. This _should_ mitigate
artefacts of adjacency, or non-adjacency, as it were. In other words, you
might be able to actually pull off good 2D projections of high dimensional
data and still see meaningful relationships.

[0]
[https://en.wikipedia.org/wiki/Isomap](https://en.wikipedia.org/wiki/Isomap)

[1]
[https://www.aaai.org/Papers/AAAI/2007/AAAI07-083.pdf](https://www.aaai.org/Papers/AAAI/2007/AAAI07-083.pdf)

~~~
ecesena
Sammon mapping is another famous example, see [1] for instance for a nice
visualization.

[1]
[http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV09...](http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0910/henderson.pdf)

~~~
frozenport
>> Provides us with a measure of the quality of any given transformed dataset.
However, we still need to determine the optimal such dataset, in terms of
minimising E. Strictly speaking, this is an implementation detail and the
Sammon mapping itself is simply defined as the optimal transformation;

Somehow it's technically challenging to verify the content of this article.
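
For reference, the E in that quote is usually written as Sammon's stress: the squared differences between input-space and map-space distances, each divided by the input-space distance, normalised by the sum of input-space distances. Below is a minimal sketch that only evaluates E for a given mapping; it does not perform the minimisation, and the function name and example points are assumptions for illustration.

```python
import math
from itertools import combinations

def sammon_stress(original, projected):
    """Sammon's error E between pairwise distances in the two spaces."""
    total = 0.0      # sum of input-space distances (normalising constant)
    weighted = 0.0   # sum of (d* - d)^2 / d*
    for i, j in combinations(range(len(original)), 2):
        d_star = math.dist(original[i], original[j])   # distance in the input space
        d = math.dist(projected[i], projected[j])      # distance in the map
        if d_star == 0.0:
            continue  # coincident inputs would divide by zero; skipped in this sketch
        total += d_star
        weighted += (d_star - d) ** 2 / d_star
    return weighted / total

# Hypothetical example: a unit square lying in the z = 0 plane of 3D space.
square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
lossless = [p[:2] for p in square]   # dropping z keeps every distance
lossy = [p[:1] for p in square]      # dropping y and z collapses pairs of corners

print(sammon_stress(square, lossless))
print(sammon_stress(square, lossy))
```

A mapping that keeps every pairwise distance scores 0, and E grows as distances are distorted, with small input-space distances weighted most heavily.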

~~~
ecesena
I was referencing it mostly for the visualization of the "flower" that fails
with pca/linear mapping.

The original Sammon paper is here [1]. That said, from what I know, Isomap is
a more widespread tool - but I never found such a good visualization.

[1]
[http://theoval.cmp.uea.ac.uk/~gcc/matlab/sammon/sammon.pdf](http://theoval.cmp.uea.ac.uk/~gcc/matlab/sammon/sammon.pdf)

------
personjerry
I wonder if I could post a randomly generated graph, label it with HN-
interested labels arbitrarily, and get a serious talk started on HN about
nonexistent correlations.

------
hapless
TechCrunch reports on us. It is journalism for the spectators. The twitter
cluster of people sharing TC links is TC's audience, not participants in TC's
subject matter.

Why in blue hell would anyone on HN be sharing TC links? Intuitively it seems
more likely that people who share HN links are discussing these matters
directly.

------
bitbckt
Interesting parallel observation: when I worked for a regional newspaper some
years ago, we rolled out products for the same demo as "mommy blog Twitter".
We saw the same sort of isolated behavior - visitors to "mommy blog content"
almost never strayed onto our mainstream products.

The same sorts of products delivered to "puppy and kitty" people didn't have
the same effect, though the level of vitriol in the comments was similar.

~~~
madaxe_again
Ditto. Launched (well, we built - client project) a social network for moms
nearly a decade ago, and they were Not Interested in anything outside of the
core offering - even recipes, which you would have thought would be
interesting, weren't - until they rebranded along the lines of "recipes for
moms", which changed that interaction overnight.

Some demographics choose tighter filter bubbles for themselves than others,
and moms are likely up there, as the single most important thing to mothers
tends to be being a mother - it becomes an all-encompassing identity for many.

------
hkmurakami
Considering nicovideo is anti-establishment media (it's owned by Kadokawa,
which is an underdog media company with strong subculture roots) and that
2chan "summary sites" double as news sources for the anti-establishment these
days, the association seems apt.

------
newobj
This is amazing, one of my favorite articles on HN ever.

I'm really curious what the heck that "eye" is in the bottom right space of
the clusters. Some cluster so radically orthogonal to any other content it has
an order of magnitude more distance in differentiation?

~~~
rabidsnail
(original author here) it's a spambot network. If you click the link in that
post to the interactive version (this:
[https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.ht...](https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.html))
you can see for yourself.

------
stephenboyd
This is cool. How many sampled tweets did HN links appear in? How many sampled
tweets did you have overall?

I'm curious if a sampling error could explain why an English website like HN
would get placed with the Japanese language sites. StackOverflow isn't placed
by any related sites either.

If the weird results aren't from sampling artifacts, my best guess is that a
lot of spambots must be linking to multiple legit sites regardless of
relevance.

------
brownbat
I really hope someday we get spambots that start off by trying to make useful
contributions. Then later, after building a following, start advertising
scams.

I'm confident that, given the right incentives, spam kings could discover
conversational AI before any lab.

------
swerling
This is fantastic. Feature request: drag a rectangle over a group of dots, and
see them as a text list of websites. As is, it's hard to see all the sites that
are in a dense dot cluster.

------
TazeTSchnitzel
Quran quotes being grouped with archive.org might be explained by the Internet
Archive frequently being used to host Islamist materials.

~~~
runn1ng
Just today I wondered why so few journalists are picking up on the fact that
ISIS is using archive.org almost exclusively for uploading their beheading and
other PR videos.

------
i336_
The interactive version is powered by this dataset:
[http://pile-of-junk.s3.amazonaws.com/domain_similarity_tsne_...](http://pile-of-junk.s3.amazonaws.com/domain_similarity_tsne_10k.json),
processed by JavaScript inside the page:
[https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.ht...](https://pile-of-junk.s3.amazonaws.com/twitter_scatter_10k.html)

------
wodenokoto
> Japanese social media twitter (which I'm labelling as "2ch", though it's not
> just 2ch) is almost completely distinct from what I'm calling "upstanding
> japanese twitter" (links to mainstream news sites like news24)

I have no idea what the point of the headline is after reading the above part
of the post.

------
Ezhik
That's interesting. Never would've made the connection myself, although now
that I think about it, some of the most fascinating discussions I've read on
HN involved Japanese work culture.

------
ChuckMcM
This is some fascinating analysis. And like the author, I am amazed that
Twitter doesn't crack down harder on their spambots.

~~~
n0us
I've wondered that as well. I'm not "active" on Twitter but I log on
occasionally to see if there are any interesting tweets in my feed. Every time
I log on I have a new follower from penny stocks twitter, get rich quick
schemes, and various other fake profiles. This seems to stay stable at around
20 fake followers as old ones get erased and new ones follow.

It seems like amateurs are better at detecting spam than the entire
company, but I sometimes wonder if they know about it and just leave the spam
bots alone because once they crack down, new ones will pop up. Or if they keep
them around at a tolerable level that doesn't drive real users away but still
allows them to publish a higher "user count".

~~~
egypturnash
This may also be due in part to more active users of Twitter hitting the "report
spam" button on those spam bots. If a spambot tweets at me, I'll go do that.
I'm sure I'm not the only one, as I never see a spambot with more than a
handful of tweets showing up in my mentions.

So, crowdsourced spam detection.

------
surfmike
what is 2ch?

~~~
daodedickinson
Japanese predecessor of 4chan.

~~~
yawawort
What you're thinking of is Futaba (www.2chan.net). 2ch is text only and would
be closer to Reddit than 4chan (at least culturally).

------
Rayearth
So HN is close to nico (Japanese youtube) and pixiv (Japanese-centric art and
fanart site)? Interesting.

------
forrestthewoods
What are all of the other twitters? There is so much undocumented space! I
want to know what it all is!

------
simcop2387
Is the regex search in the demo not working for anyone else (tested both
Chrome and Firefox on Win7)?

~~~
rabidsnail
There's no UI for the case where there are no matches; it just does nothing.
Try searching for \.com or something.

Edit: I patched it so it displays an alert if there are no matches.
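
For anyone unsure why the dot is escaped: a bare `.` in a regex matches any character. Here is a minimal Python sketch of the same kind of search (the demo itself runs in JavaScript; the domain list and function name below are made up for illustration, not taken from the dataset).

```python
import re

def search_domains(pattern, domains):
    """Return the domains matching the regex, or an empty list if none do."""
    regex = re.compile(pattern)
    return [d for d in domains if regex.search(d)]

# Hypothetical sample of domains; the real dataset is not reproduced here.
domains = ["news.ycombinator.com", "techcrunch.com", "2ch.net", "telecoms.org"]

print(search_domains(r"\.com", domains))      # escaped dot: a literal ".com"
print(search_domains(r".com", domains))       # bare dot: any character, then "com"
print(search_domains(r"\.example$", domains)) # nothing matches: empty list
```

The unescaped pattern also picks up "telecoms.org", because "ecom" satisfies "any character followed by com".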

~~~
simcop2387
I see. That patch makes it a lot nicer to find out that none of the sites I
wanted to look for show up in the data :)

------
gohrt
why does the hella.cheap site have an SSL cert with an unknown authority?

~~~
tokenizerrr
It has a COMODO certificate. If you see otherwise you might be getting MITMd.

~~~
schoen
It has a valid Comodo certificate but forgot to include the full certificate
chain, which is probably now the #1 configuration error (I help do support for
Let's Encrypt and about 80% of "my cert doesn't work after issuance" problems
are that). These bugs are tricky because most browsers cache intermediate
certs and then forgive sites that don't send intermediates that the browser
knows about, so you can see an error in one browser or device and not another
because of different cert caches!
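
A quick sanity check, as a heuristic sketch only: count the PEM blocks in the certificate file the server is configured to send. One block usually means the leaf was installed without its intermediates. This does not validate the chain or its order, and the function names and placeholder PEM text are assumptions for illustration.

```python
import re

# Matches one PEM-encoded certificate block, including its body.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL
)

def count_certificates(pem_text):
    """Count the PEM certificate blocks in the given text."""
    return len(PEM_CERT.findall(pem_text))

def looks_like_leaf_only(pem_text):
    """Heuristic: a single block suggests the intermediates were left out."""
    return count_certificates(pem_text) == 1

# Placeholder bodies, not real certificates.
leaf = "-----BEGIN CERTIFICATE-----\nLEAF\n-----END CERTIFICATE-----\n"
intermediate = "-----BEGIN CERTIFICATE-----\nINTERMEDIATE\n-----END CERTIFICATE-----\n"

print(looks_like_leaf_only(leaf))                 # leaf alone
print(looks_like_leaf_only(leaf + intermediate))  # leaf plus intermediate
```

To run it on a real file, read the file's contents and pass the text to `looks_like_leaf_only`.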

~~~
kazazes
Wouldn't it be more reasonable for browsers to not cache them at all and
universally reject missing intermediate certificates? (IIRC, Chrome
doesn't mind but Firefox will give you the train conductor)

~~~
schoen
It would definitely eventually reduce the frequency of this configuration
mistake.

Firefox definitely does cache intermediates (I've seen it do so as recently as
today).

