
This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?



HN has an Algolia-based API. It’s also very easy to crawl.

I wouldn’t call this evil, however: it’s merely demonstrating a technique that you should be aware of, if you’re a privacy-conscious person. It looks like they also provide some resources for avoiding stylometric detection.
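To illustrate the point about the Algolia-based API, here is a minimal sketch of pulling one user's comments. The endpoint, the `comment,author_<name>` tag filter, and the `comment_text` field reflect my understanding of the public HN Search API, not anything confirmed in this thread, and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public HN Search API (not confirmed in this thread).
API_BASE = "https://hn.algolia.com/api/v1/search_by_date"


def build_comment_url(username, page=0, hits_per_page=100):
    """Build the query URL for one page of a user's comments."""
    params = {
        # Assumed tag syntax: restrict to comments written by `username`.
        "tags": f"comment,author_{username}",
        "page": page,
        "hitsPerPage": hits_per_page,
    }
    return f"{API_BASE}?{urllib.parse.urlencode(params)}"


def extract_comment_texts(response):
    """Pull non-empty comment bodies out of a decoded JSON response."""
    hits = response.get("hits", [])
    return [hit["comment_text"] for hit in hits if hit.get("comment_text")]


def fetch_comments(username, page=0):
    """Fetch one page of comments. This performs a network request."""
    url = build_comment_url(username, page)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_comment_texts(json.load(resp))
```

Calling `fetch_comments("some_user")` would return a list of comment bodies (as HTML strings, if the API behaves as I assume). Looping over `page` gets you a full history, which is all the raw material a stylometry demo needs.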


I would bet my bottom dollar that the likes of Reddit and Google already have models to turn a corpus of text into probable demographic data and models to measure the similarity of users.


Please don't shoot the messenger. costco shared this voluntarily, and I can see no bad intention.

We should see it as an opportunity to learn how easy it is to associate different pseudonymous accounts. Nothing drives this point home better than a practical demo.

We can be pretty sure stylometry is used widely by bad actors already and we should not punish people who help to spread the word about these technical possibilities.


And this is actually quite a simple approach--which is interesting in and of itself. While there would be diminishing returns, there are a ton of other techniques you could use to make stronger inferences about similarity.
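To give a feel for how simple such an approach can be: the thread does not spell out what the site actually does, so the following is only a sketch of one common baseline, cosine similarity over function-word frequencies. The word list and all function names are my own assumptions, chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked list of function words. Real systems
# typically use hundreds of features, not fifteen.
FUNCTION_WORDS = [
    "the", "of", "and", "to", "a", "in", "that", "is",
    "it", "for", "but", "with", "as", "not", "on",
]


def style_vector(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target_text, corpus):
    """Rank accounts in `corpus` (name -> concatenated text) by similarity."""
    target = style_vector(target_text)
    scores = [
        (name, cosine_similarity(target, style_vector(text)))
        for name, text in corpus.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Feed `rank_candidates` the concatenated comments of one account plus a dictionary of everyone else's, and the top of the list is your "probable alt". Character n-grams, punctuation habits, or posting times would be the obvious next features to add for stronger inferences.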


> This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?

I'd way rather have someone tell me "look at all the things I can find out about you" so that I can act accordingly (whatever that means!) rather than what we've mostly actually got, which is companies silently exploiting my data and doing everything they can to mumble reassuring but legally ineffective formulas assuring me that they deeply respect my privacy.



Why didn't you use the Google BigQuery dataset?

https://news.ycombinator.com/item?id=10440502


I was aware there was an HN dataset on BigQuery, but I had never used a library to work with it before, and when I played around on the website, the posts I got were all from 2015 at the latest. It probably would've made my work easier, but there's not really anything I can do about it now.


I don't know that I'd call this evil. We have no idea who else is using this kind of technology but not making the results public. Better to know what's possible and take measures to make it less effective.


It’s just statistics. I recall that during his whistleblowing, Snowden intentionally took anti-stylometry measures.


Imagine using this across different platforms :/, let alone combining it with other techniques...

edit: maybe you'd catch some criminals if you tried to match Reddit against the dark web, for example



