This is an evil website. We won’t have any anonymity soon. The highest match is ...

woodruffw · on Nov 26, 2022

HN has an Algolia-based API. It’s also very easy to crawl.

I wouldn’t call this evil, however: it’s merely demonstrating a technique that you should be aware of, if you’re a privacy-conscious person. It looks like they also provide some resources for avoiding stylometric detection.

nanidin · on Nov 26, 2022

I would bet my bottom dollar that the likes of Reddit and Google already have models to turn a corpus of text into probable demographic data and models to measure the similarity of users.

weinzierl · on Nov 26, 2022

Please don't shoot the messenger. costco shared this voluntarily, and I can see no bad intention.

We should see it as an opportunity to learn how easy it is to associate different pseudonymous accounts. Nothing drives this point home better than a practical demo.

We can be pretty sure stylometry is used widely by bad actors already and we should not punish people who help to spread the word about these technical possibilities.

ghaff · on Nov 26, 2022

And this is actually quite a simple approach--which is interesting in and of itself. While there would be diminishing returns, there are a ton of other techniques you could use to make stronger inferences about similarity.
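For readers wondering what a simple approach could look like, here is a minimal sketch. It is not the site's actual method, which the thread does not describe. It assumes a common stylometric baseline: represent each author's text as relative frequencies of a few function words, then compare the vectors with cosine similarity. The word list and the names `FUNCTION_WORDS`, `profile`, and `cosine` are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list; real systems use far more features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "i", "for", "but", "not", "with", "you")


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

To rank candidate accounts, you would compute `profile` once per account over all of its comments and sort the others by `cosine` against the account of interest. With short texts the frequencies are noisy, so any match is a hint at most.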

JadeNB · on Nov 26, 2022

> This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?

I'd way rather have someone tell me "look at all the things I can find out about you" so that I can act accordingly (whatever that means!) rather than what we've mostly actually got, which is companies silently exploiting my data and doing everything they can to mumble reassuring but legally ineffective formulas assuring me that they deeply respect my privacy.

costco · on Nov 26, 2022

HN Firebase API. I just wrote a program in C++ with libcurl to get https://hacker-news.firebaseio.com/v0/item/1.json, https://hacker-news.firebaseio.com/v0/item/2.json, https://hacker-news.firebaseio.com/v0/item/3.json, ...
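A rough sketch of the same sequential crawl, written in Python with the standard library rather than the C++ and libcurl program described above, which is not shown in the thread. The URL pattern comes from the comment; the names `item_url`, `fetch_item`, and `crawl` are hypothetical, and the sketch assumes the API returns JSON `null` for IDs that have no item.

```python
import json
import urllib.request

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """Build the item URL for a numeric ID, following the pattern in the comment."""
    return f"{BASE_URL}/item/{item_id}.json"


def fetch_item(item_id, opener=urllib.request.urlopen):
    """Fetch one item and decode its JSON body (None if the body is `null`)."""
    with opener(item_url(item_id), timeout=10) as resp:
        return json.load(resp)


def crawl(start, stop, opener=urllib.request.urlopen):
    """Yield items with IDs in [start, stop), skipping IDs that return null."""
    for item_id in range(start, stop):
        item = fetch_item(item_id, opener)
        if item is not None:
            yield item
```

For example, `for item in crawl(1, 4): print(item.get("by"), item.get("type"))` would walk the first three IDs. A real crawl over every item would want retries, rate limiting, and parallel requests, none of which are shown here.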

jonas-w · on Nov 26, 2022

Why didn't you use the Google BigQuery dataset?

https://news.ycombinator.com/item?id=10440502

costco · on Nov 26, 2022

I was aware there was an HN dataset on BigQuery, but I had never used a library to work with it before, and when I played around on the website the posts I got were all from 2015 at the latest. It probably would've made my work easier, but there's not really anything I can do about it now.

ufmace · on Nov 26, 2022

I don't know that I'd call this evil. We have no idea who else is using this kind of technology but not making the results public. Better to know what's possible and take measures to make it less effective.

faeriechangling · on Nov 26, 2022

It’s just statistics. I recall that during his whistleblowing, Snowden intentionally took anti-stylometry measures.

vfinn · on Nov 26, 2022

Imagine using this across different platforms :/, let alone combining it with other techniques...

edit: maybe you'd catch some criminals if you tried to match Reddit against the dark web, for example