Show HN: Using stylometry to find HN users with alternate accounts (stylometry.net)
676 points by costco on Nov 26, 2022 | 511 comments
Author here. This site lets you put in a username and get the users with the most similar writing style to that user. It confirmed several users who I suspected were alts and after informally asking around has identified abandoned accounts of people I know from many years ago. I made this site mostly to show how easy this is and how it can erode online privacy. If some guy with a little bit of Python, and $8 to rent a decent dedicated server for a day can make this, imagine what a company with millions of dollars and a couple dozen PhD linguists could do.

Here's Paul Graham:

https://stylometry.net/user?username=pg

Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)




Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.

The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.

The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.


Woof.

I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor to social media. Any time my account gathers enough upvotes I destroy it for another.

I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.

Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.


This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.


Exact same thing happened to me. Wild.


On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment) so seeing the list of users that are most similar to me was interesting. I didn't see any stark similarities based on a quick peek at their comments, but it was interesting.


Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!


FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.


What does the bolding indicate?


The explanation is here: https://news.ycombinator.com/item?id=33755466

As far as I’m concerned, it’s the killer feature of the app. The top 20 results may be noisy, but the bolded results have a signal to noise ratio close to infinity.


The precision of the bolded results looks like maybe 30% to me. Significantly better than the non-bolded, but nowhere near perfect precision.


False positives become an increasingly difficult problem the more potential authors you introduce. If I had written a fancier model it probably wouldn't be as much of a problem but what can you do.


Yes, this wasn't a criticism of the tool. It is crazy good.

But I don't think people should be making the assumption that bolded results are definite alts, which sillysaurus' comment reads like.


Hmm, that wasn’t my intent. I see this tool as a recommendation engine more than a doxxer. By “signal to noise ratio close to infinity,” I meant that if you visit one of the bolded accounts, they’ll probably sound a lot like you.

It’s one of those ideas that makes the tool substantially more effective, yet never would’ve occurred to me. It’s like the simplicity of pg’s “a plan for spam” algorithm: deceptively simple, but (like scrubbing dishes with fingers) works really well.


> I see this tool as a recommendation engine more than a doxxer.

That is absolutely all this will be used for. This is a dangerous tool that serves no real world purpose.


Of my top 20, 19 are bold, all are above 0.6, and I have no alts.


Vast majority of my top 20 were bold, except you funnily enough!

None of them are me (and you were the only one I recognised and thought "yeah, I can see where it gets it from"...)


I have 7 bolded names (0.53-0.62) in the top 20 list, and none are alts of mine.


I'm one of them and I can confirm. But then again that's what I'd say if I was.


Hi style-adjacent friend :-). Just briefly looking at your recent comment history, we seem to find different kinds of articles interesting, but maybe have a similar writing style.


Pretty much the exact same. (I do have a throwaway account but I rarely use it and it probably hasn't been used enough to qualify.)


The funny thing is that I thought of it while eating dinner last night :)


My results have 5 bolded users in my top 20, and I have 0 alt accounts.


Frankly similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)

https://news.ycombinator.com/item?id=17944293

The approach I took was a bit different, but also no ML required.

The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.

It’s a very small space to try to compare so simple methods will work fine.


Exactly. HN emphasizes long-form posts much more than other forums which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid tier gaming laptop so it's even easier. I was trying to think of applications of this kind of data and the only thing I could think of was moderation tools/detecting ban evaders but what you've done seems much more profitable lol.


It works like a charm for me too.

I put in my username and found my pre-echelon alt, possibilistic.

(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)


I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.


It is. The description on the about page is a little simplified but basically I look at the most common word and character ngrams of size 1,2,3 (200 each), put all the frequencies in an array and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
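
A minimal sketch of that pipeline, under stated assumptions: it uses plain Python rather than scikit-learn, a three-user toy corpus, and cosine similarity as the comparison (the linked function name is truncated, so that choice is a guess). The helper names `ngrams`, `count_features`, `build_vocabulary`, `vectorize`, `cosine` and `most_similar` are hypothetical and not taken from the site's code.

```python
from collections import Counter
import math


def ngrams(seq, n):
    """All contiguous n-grams of a sequence (list of words or a string)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def count_features(text):
    """Count word and character n-grams of sizes 1, 2 and 3."""
    text = text.lower()
    words = text.split()
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(("word", g) for g in ngrams(words, n))
        counts.update(("char", g) for g in ngrams(text, n))
    return counts


def build_vocabulary(all_counts, k=200):
    """Keep the k most common n-grams per (kind, size) bucket."""
    buckets = {}
    for counts in all_counts:
        for (kind, gram), freq in counts.items():
            bucket = buckets.setdefault((kind, len(gram)), Counter())
            bucket[(kind, gram)] += freq
    vocab = []
    for key in sorted(buckets):
        vocab.extend(feature for feature, _ in buckets[key].most_common(k))
    return vocab


def vectorize(counts, vocab):
    """Relative frequency of each vocabulary n-gram for one user."""
    total = sum(counts.values()) or 1
    return [counts[feature] / total for feature in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (made up): all comments of a user concatenated into one string.
comments = {
    "alice": "I think, honestly, that this is probably fine. I think so anyway.",
    "alice_alt": "Honestly, I think this is probably fine, and I think it works.",
    "bob": "lol no way!!! that benchmark is totally broken, rewrite it in rust",
}
counts = {user: count_features(text) for user, text in comments.items()}
vocab = build_vocabulary(counts.values())
vectors = {user: vectorize(c, vocab) for user, c in counts.items()}


def most_similar(user):
    """Other users ranked by cosine similarity, best match first."""
    scored = [(cosine(vectors[user], vec), other)
              for other, vec in vectors.items() if other != user]
    return sorted(scored, reverse=True)


print(most_similar("alice"))
```

On real data each user's text would be the concatenation of all their comments, and the ranking would be truncated to the top 20.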


Cool, I only skimmed the description; maybe I needed to read it more carefully.

Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.


sillysaurus3 was in mine. :) Clearly we're not the same.


> sillysaurus3

> sillysaurus2

Tbf a human could have found a bunch of them relatively easily


The method used, i.e. to calculate the cosine of the two authors' word vectors, is poorly suited for stylometric analysis because it is based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.

Also, the cosine of the vectors of word frequencies conflates author-specific vocabulary and topics; in other words, my account is grouped (with >51% similarity, according to the demo) with someone probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another topic, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).

Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).

This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...
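
One classic way to reduce the topic dependence described above is to count only function words, which carry little topical content. The sketch below is illustrative only: the word list is a tiny hand-picked assumption (real studies use a few hundred words), and `function_word_profile` is a hypothetical helper name.

```python
from collections import Counter
import re

# Small illustrative list; an assumption, not a canonical inventory.
FUNCTION_WORDS = ["the", "a", "an", "of", "to", "in", "and", "but", "or",
                  "that", "which", "is", "it", "i", "you", "not", "so", "as"]


def function_word_profile(text):
    """Relative frequency of each function word; topic words are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


# Same sentence frame, different topics: the profiles come out identical.
pets = function_word_profile("The cat and the dog")
chips = function_word_profile("The GPU and the compiler")
print(pets == chips)
```

Profiles like these can be compared with the same cosine measure; whether that beats plain word frequencies on HN data is untested here.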


> based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.

Interesting. I was expecting to be grouped with other Russian speakers and I am (based on some nicknames). But I thought the most telling feature will be exactly word order - it’s absolutely relaxed in Russian. Word frequencies? Well, probably the absence of articles, lol (but I swear to God that I often spend some extra time trying to insert as many articles in my texts as I could).


There’s https://en.wikipedia.org/wiki/Idiolect :

“Language consists of sentence constructs, choice of words, and expression of style. Accordingly, an idiolect is an individual's personal use of these facets. Every person has a unique idiolect influenced by their language, socioeconomic status, and geographical location.”


In practice a more complex approach will tend to require a greater amount of data per user, so in this specific case this simple approach is not too bad. Moreover, fake accounts are likely to talk about the same topics, so while this leads to false positives, it also makes it more likely that we find actual duplicates in the list.


Ha, gruseom shows up for pg, which is dang’s old account. A worthy successor.

This is a fascinating way to find similar HN users who aren’t the same person. It’s a surprisingly great recommendation engine. “If you like pg, you might also like…”

Sure, the privacy concerns are valid, but the cat’s out of the boot. Might as well enjoy the benefits.

montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567

Nicely done. One of the best hacks I’ve seen in a long time.


> motrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567

I had this hunch too. It's either pg or someone trying really hard to be pg.


I mean, this is HN -

> someone trying really hard to be pg

describes half the site.


> Someone who talks about ancient history, Occam’s razor, VCs and startups,

I think these are all common topics among HN readers and commenters.


Why would montrose be pg? The correlation is not that high. Looks like a few people have picked up pg's mannerisms.


Yeah, that score is only slightly higher than the highest one it shows for my account (which is also bold) - and unless my alter ego has been disguised so well it even managed to hide from myself, I'm pretty sure that isn't me :)


The score for montrose vs pg is lower than the score for someone most similar to me, who is definitely not me.

I think the similarity has to be in the high .80s to suspect that it's the same individual.


There are factors that make me think it is more likely than not (just scrolled through the comment history, don't feel like linking everything) that he is pg.

- Is bolded on pg's page

- Mentions yoga

- Talks about Lisp often

- Talks about YC often

- Talks about kids

- Links to Paul Graham's website

- Says he uses vi

- Writes exactly like you would expect pg to write


I agree that this person is trying very very hard to sound like pg! You could be right actually. Could still be a "wannabe" though.


I'm sophisticately sure they are not. They recommend a founder to ask users directly what they will pay for.

Is that what PG would say?


Of course. Why wouldn’t he? That’s sound advice.


YC startup videos recommend not asking users directly what they will pay for.

Users frequently say they will pay for something but back down against other things.



Wow, what an odd thing to get so worked up about.


> but the cat’s out of the boot

It's my first time hearing that variant. Usually it's "the cat's out of the bag" where I'm from.

Do you mean boot in the UK sense, what Americans would call the trunk of a car? Or do you mean a sturdy piece of footwear?

Obligatory xkcd https://xkcd.com/2390/


It’s a little writing trick I learned from (I think) Orwell. Any time you’re about to use a common metaphor, try to tweak it. You’ll catch readers off guard, which piques their curiosity.

It’s a fun game, too. I wish I’d used “the cat’s out of the hat,” but I didn’t think of it till later.


What you are describing is also known as an eggcorn.

https://en.wikipedia.org/wiki/Eggcorn


This is my all time favourite one of these:

https://thehabit.co/knowledge-is-power-france-is-bacon/

> When I was young my father said to me: “Knowledge is power, Francis Bacon.” I understood it as “Knowledge is power, France is bacon.”

> For more than a decade I wondered over the meaning of the second part and what was the surreal linkage between the two. If I said the quote to someone, “Knowledge is power, France is Bacon,” they nodded knowingly. Or someone might say, “Knowledge is power” and I’d finish the quote “France is bacon,” and they wouldn’t look at me like I’d said something very odd, but thoughtfully agree. I did ask a teacher what did “Knowledge is power, France is bacon” mean and got a full 10-minute explanation of the “knowledge is power” bit but nothing on “France is bacon.” When I prompted further explanation by saying “France is bacon?” in a questioning tone, I just got a “yes.” At 12 I didn’t have the confidence to press it further. I just accepted it as something I’d never understand.

> It wasn’t until years later I saw it written down that the penny dropped.


You left out the funniest thing - the guy/gal's nickname was "Lard_Baron"


Thank you! I was trying to find the original essay I learned it from. I’m now pretty sure it was by Poe, but all I can remember is the main advice: avoid common metaphors.

I vaguely remember one of the metaphors in the essay was about a chicken coop melting, or something like that. It was vivid enough to leave a big impression.


I remember this being from Politics and the English Language (https://www.orwellfoundation.com/the-orwell-foundation/orwel...):

“Dying metaphors. A newly invented metaphor assists thought by evoking a visual image, while on the other hand a metaphor which is technically ‘dead’ (e. g. iron resolution) has in effect reverted to being an ordinary word and can generally be used without loss of vividness. But in between these two classes there is a huge dump of worn-out metaphors which have lost all evocative power and are merely used because they save people the trouble of inventing phrases for themselves.”


Thank you so much! That’s the one.

(It’s remarkable how often a vague description can yield an HN comment with an answer from a clever sleuth like yourself. Much appreciated.)


That's neeto!

The 2nd example also loosely falls under the classification of malaphor.

https://en.m.wiktionary.org/wiki/malaphor


An eggcorn is a soundalike though, isn't it? Deliberately altering idioms to catch people's attention isn't an eggcorn IMO.


> An eggcorn is a soundalike though, isn't it?

Not necessarily, you might be thinking of malapropisms [1] but yes probably a closer word would be the general term: protologism.

Another commenter added some useful info on the evocative alteration of metaphors [2]

1: https://en.wikipedia.org/wiki/Malapropism

2: https://news.ycombinator.com/item?id=33757097


Yeah, it’s like shooting ducks in a barrel it works so well.

Easy to overuse then people just get annoyed though…kind of like commas, I suppose.


That reminds me of a PETA campaign on social media trying to get people to replace violent idioms with alternatives like "feeding a fed horse" and "there's more than one way to pet a cat."


I like mixing metaphors, in this case "the cat's out of the tube". ("the toothpaste's out of the bag" doesn't work as well though)


I love doing this too, it's fun to write.


There's a popular movie called "Puss in Boots". That's what I had to think of first.


It's a bit older than the movie or movies in general.

https://en.wikipedia.org/wiki/Puss_in_Boots


This is somewhat similar to how they ended up catching the Unabomber. The FBI were literally at a dead end. They ended up posting one of his letters/manifestos in the paper, somebody recognised a turn of phrase the Unabomber used that was unusual and reported it as possibly being their brother, the FBI investigated the lead and it led them straight to him.

Excerpts from wiki:

> Before the publication of Industrial Society and Its Future, Kaczynski's brother, David, was encouraged by his wife to follow up on suspicions that Ted was the Unabomber.[91] David was dismissive at first, but he took the likelihood more seriously after reading the manifesto a week after it was published in September 1995. He searched through old family papers and found letters dating to the 1970s that Ted had sent to newspapers to protest the abuses of technology using phrasing similar to that in the manifesto.[92]

> In early 1996, an investigator working with Bisceglie contacted former FBI hostage negotiator and criminal profiler Clinton R. Van Zandt. Bisceglie asked him to compare the manifesto to typewritten copies of handwritten letters David had received from his brother. Van Zandt's initial analysis determined that there was better than a 60 percent chance that the same person had written the manifesto, which had been in public circulation for half a year. Van Zandt's second analytical team determined a higher likelihood. He recommended Bisceglie's client contact the FBI immediately.[96]

> In February 1996, Bisceglie gave a copy of the 1971 essay written by Ted Kaczynski to Molly Flynn at the FBI.[87] She forwarded the essay to the San Francisco-based task force. FBI profiler James R. Fitzgerald[98][99] recognized similarities in the writings using linguistic analysis and determined that the author of the essays and the manifesto was almost certainly the same person. Combined with facts gleaned from the bombings and Kaczynski's life, the analysis provided the basis for an affidavit signed by Terry Turchie, the head of the entire investigation, in support of the application for a search warrant.[87]

https://en.m.wikipedia.org/wiki/Ted_Kaczynski


As I recall, one of the clinchers was his use of the phrase, "you can’t eat your cake and have it too" as opposed to the now-predominant variant "you can’t have your cake and eat it too."

I often wonder if stylometry can be used to positively identify a person based not on general word frequency, but on a single phrase or two which are rare in general but commonly used by the individual. In theory this could be relatively easy to find given a large corpus. You'd pick out the top few n-grams for short phrases by an individual and identify the ones which are most over-represented compared to the rest of the population.
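
A rough sketch of that idea, assuming word trigrams as the "short phrases" and a simple smoothed frequency ratio as the over-representation score. The function name `distinctive_phrases`, the `min_count` cutoff and the toy texts are all hypothetical choices for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def distinctive_phrases(user_texts, population_texts, n=3, top=5, min_count=2):
    """Rank a user's repeated n-grams by how over-represented they are."""
    user = Counter(g for t in user_texts for g in word_ngrams(t, n))
    pop = Counter(g for t in population_texts for g in word_ngrams(t, n))
    user_total = sum(user.values()) or 1
    pop_total = sum(pop.values()) or 1
    scored = []
    for gram, count in user.items():
        if count < min_count:  # ignore phrases the user wrote only once
            continue
        user_rate = count / user_total
        pop_rate = (pop[gram] + 1) / (pop_total + 1)  # add-one smoothing
        scored.append((user_rate / pop_rate, gram))
    return [gram for _, gram in sorted(scored, reverse=True)[:top]]


# Toy data: the user prefers the rarer "eat ... have" ordering.
user_texts = ["you can't eat your cake and have it too",
              "so eat your cake and have it later"]
population_texts = ["you can't have your cake and eat it too"] * 3

result = distinctive_phrases(user_texts, population_texts)
print(result)
```

In this toy run the phrase shared with everyone else ("your cake and") ranks last, while the reversed ordering floats to the top. A real corpus would need a larger `min_count` to keep one-off noise out.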


It was actually his brother.


So is the lesson you should have GPT rewrite your manifesto so as to obscure your personal idioms?


Or something purpose-built like Anonymouth (https://github.com/psal/anonymouth), although it seems to be both unique and dead.

Also interesting:

> Ross Ulbricht aka Dread Pirate Roberts, the mastermind behind the infamous Silk Road site which served as a black market for drugs, weapons and fake documents was also well aware of the potential danger of stylometry being used against him. At the time of his arrest in a San Francisco public library, the FBI captured images of his laptop screen as evidence. Guess what he had bookmarked — “Science of Stylometry.”

https://medium.com/svilenk/the-case-for-anonymity-12db114f0c...


I mean he used a forum account with an email that had his name in it.


That's the problem - it only takes a single slip and it is recorded forever. Perfect opsec is an impossibly high bar if you are maintaining an active online presence.


Only if you have a history of sending crazed writings/manifestos to newspapers and family.


The show “Manhunt: Unabomber” (Netflix) shows this whole story very well.


This is a super interesting tool for self reflection. Looking at the top 10 similar accounts to mine, it gives me an arms-length view of how other people probably interpret my tone.

I appear to be a well-educated, over-confident know-it-all.


My #3 match is cstross, and now I’m convinced that my life-long secret dream of being a successful sci-fi novelist is basically a matter of typing. (Ideas? Character development? Ruthless editing? Developing an audience? Having a publisher? What do I need of those when the Computer told me I’m practically a genius…)


I'd suggest giving the back story to Agent to the Stars by John Scalzi a glance.

http://www.scalzi.com/agent/

> In the summer of 1997, I was 28 years old, and I decided that after years of thinking about writing a novel, I was simply going to go ahead and write one. There were two motivations for doing so. First, I was simply curious if I could; I'd had up to that time a reasonably successful life as a writer, but I'd never written anything longer than ten pages in my life outside of a classroom setting. Two, my ten-year high school reunion was coming up, and I wanted to be able to say I'd finished a novel just in case anyone asked (they didn't, the bastards).

> In sitting down to write the novel, I decided to make it easy on myself. I decided first that I wasn't going to try to write something near and dear to my heart, just a fun story. That way, if I screwed it up (which was a real possibility), it wasn't like I was screwing up the One Story That Mattered To Me. I decided also that the goal of writing the novel was the actual writing of it -- not the selling of it, which is usually the goal of a novelist. I didn't want to worry about whether it was good enough to sell; I just wanted to have the experience of writing a story over the length of a novel, and see what I thought about it. Not every writer is a novelist; I wanted to see if I was.


Same. Looking through some of the handles on my list tells me that I come across like a not-particularly-well-educated McSmug that needs to take a good long look at myself. Wouldn’t be so bad if I wasn’t reading the posts thinking I definitely could see myself writing this.

This was certainly eye-opening.

Update: It’s actually a little strange that reading through some of the matches it’s not just style that overlaps but perspectives in quite a few cases too. I’m definitely not the unique little snowflake that some others are finding themselves to be.


I also enjoyed reading one of my style-partner’s posts.

The most noticeable similarity is that we both clearly have strong opinions about some things, and like to share information, but also like to be clear about our unknowns or opinions. So, lots of “sounds like,” “probably,” “could be” and so on.

The downside is, I guess, this could be seen as a bit weasel-word-y or indirect.


> like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.

Commonly called just “hedging” like hedging your bets.


That’s a kinder description than I gave it in my next paragraph, so thanks I suppose.

I do think it is an under-emphasized aspect of honesty, though, that we should be clear about our level of experience/understanding. Especially online — people like to discuss things, even (especially?) when we are just getting started. So if we’ve picked up opinions through osmosis and we start repeating them without testing them, we’re really just amplifying some possibly-incorrect viewpoint (and if we’ve picked it up, there’s a good chance it is already widespread in the community, which is bad if it is wrong).

And I mean, more concretely a measurement is not complete without the error bars!

Often this doesn’t really matter, because it is just chit-chat anyway. But it is nice to keep in mind.


> we should be clear about our level of experience/understanding

there are many languages that encode this info as mandatory grammatical affixes, it's called evidentiality.


I hadn’t heard of that. Neat!

I find it interesting that the first example they use in the Wikipedia article is Turkish. I’ve only met a couple Turks, but they were all quite good engineers. I wonder to what extent embedding this kind of information in the language helps organize your thoughts.


> I appear to be a well-educated, over-confident know-it-all.

Don't we all?


I hate us insufferable nerds!


> over-confident know-it-all.

I’m pretty sure participation in HN is a 99% sure filter for being called this many times in one’s life.


That's what we all come to HN for...


we must be a good match


I'd love a version of this where you enter two usernames and get a match score.


After a few tries on boring accounts, I thought to try the account of somebody who was notorious for an incident outside of HN, and had a (deservedly) bad time at HN for a couple of years before the account went dark.

And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).

Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.


I keep no alternate accounts, but this tool reports best matches for me that appear to be Slavic or just Russian - and I am Russian. Best match score in my list is just above 0.5. There are some clearly alternate accounts on the list, their match scores with this tool are well above 0.7.

It is probable that persons of same cultural origin will have similar writing style and vocabulary. It is also probable that persons of same cultural origin would have same relationships with the world as a whole, they would like same things and dislike other same things.

So, in my opinion, it is possible that you have found not only alternate accounts (score above 0.7), but accounts of people with same cultural origin (ones that are around 0.6).


My highest was 0.41 and the person writes nothing like me. I guess I'm a unique snowflake after all.


I was curious about this, my highest match was 0.47 and I have no alts, maybe I'm also a unique snowflake, or haven't said anything noteworthy enough to have been deepfaked yet ;).


my second highest hit (ie, third in the list) is gwern at 0.45 who i'm fairly sure is not me.


I was actually just looking at near hits for gwern and found what's almost definitely a defunct alt for him.


Well, it's certainly NOT me, that's for sure.

On an unrelated topic, I'm starting a service to write comments in the style of others to provide plausible deniability for other alt accounts. Rates negotiable.


I have a few in the low 0.5's and, honestly, they seem cool and I want to meet them.


I don't have any alternate accounts here either and my writing style is apparently nearly the same as a high profile account that I recognize and has many points. I wouldn't say this is a highly accurate thing.


There're 19 other accounts this tool finds similar to me. Those are not my accounts. The scores range from 0.46 to 0.56.


I think people are sort of confused at what this tool is supposed to be which I will concede is partially my fault. The results of this tool are by themselves not indicative of having an alternative account. It generates the 20 most similar users for every single user on the site, regardless of whether they have an alt or not (there's obviously no way for me to know that for every single user). In your case further investigation would reveal that none of those accounts are yours.


It is a fun tool, I can assure you. It's just that people have found a use case you hadn't foreseen yourself.

I think your tool should have internal embeddings for each user. Also, most probably your tool uses cosine similarity for the search.

Thus, I would like to suggest a feature: recognize simple arithmetic operations over user's embeddings, such as "thesz - 2 * patio11". It will make things even more fun, this way we can find users who are like me and much not like patio11. Even simple additions and subtractions would suffice.

(an idea is taken from properties of word2vec embeddings)
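
A small sketch of what such a query could look like, assuming each user already has a fixed-length frequency vector and that ranking is by cosine similarity. The usernames, the three-dimensional vectors and the helper names `combine` and `nearest` are made up for illustration.

```python
import math


def combine(vectors, terms):
    """Weighted sum of user vectors, e.g. [(1, "user_a"), (-2, "user_b")]."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for weight, name in terms:
        for i, x in enumerate(vectors[name]):
            out[i] += weight * x
    return out


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vectors, query, exclude=(), k=3):
    """Usernames ranked by cosine similarity to the query vector."""
    scored = [(cosine(query, vec), user)
              for user, vec in vectors.items() if user not in exclude]
    return [user for _, user in sorted(scored, reverse=True)[:k]]


# Hypothetical 3-dimensional style vectors.
vectors = {
    "user_a": [0.6, 0.3, 0.1],
    "user_b": [0.1, 0.2, 0.7],
    "user_c": [0.7, 0.3, 0.0],
    "user_d": [0.3, 0.3, 0.4],
}

# "user_a - 2 * user_b": like user_a, strongly unlike user_b.
query = combine(vectors, [(1, "user_a"), (-2, "user_b")])
matches = nearest(vectors, query, exclude={"user_a", "user_b"})
print(matches)
```

Frequency vectors are not trained embeddings, so there is no guarantee the arithmetic is as meaningful as it is for word2vec; this only shows the mechanics.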

Your tool is thought provoking. What I discovered with it made me think about my use of language and what other languages (body, imagery, etc) I use differently because of who I am. Which made me think about my favorite underrated superhero Cypher [1] - would his innate ability to understand languages make him best detective ever?

[1] https://en.wikipedia.org/wiki/Cypher_(Marvel_Comics)

Thank you!


Really cool idea. I'd need to upgrade the VPS though so all the vectors would fit in memory but it probably wouldn't be too hard (right now I'm just storing a map of username string -> array of 20 username strings because my VPS only has 512mb RAM). I'll think about if I can do this in a way that is more resource conservative.
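
The trade-off described here can be sketched as follows: compute the neighbours once, offline, and keep only the resulting names. This is an assumption about the general shape, not the site's actual code; `precompute_neighbors` and the toy vectors are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def precompute_neighbors(vectors, k=20):
    """Map each username to its k most similar usernames (names only)."""
    table = {}
    for user, vec in vectors.items():
        scored = ((cosine(vec, other_vec), other)
                  for other, other_vec in vectors.items() if other != user)
        table[user] = [name for _, name in heapq.nlargest(k, scored)]
    return table


# Toy vectors; the real ones would have hundreds of dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
table = precompute_neighbors(toy_vectors, k=1)
print(table)
```

The lookup table is small and needs no vectors at serving time, but arbitrary queries such as vector arithmetic are impossible without keeping the vectors around, which is the memory cost mentioned above.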


Fwiw, and as gp mentioned, > 0.7 seems more likely to be alt territory.


You are fools, one and all! This tool's only purpose, is to tag people who use it!

Now they know just who cares about which alternate accounts. They know!

They freaking know, man!

You have all fallen for their ploy. Fools!


I have no alternate accounts and visited the site out of curiosity, because I used to work in a domain like this.

What I found was worth visiting the site. Somehow notably many accounts with (relatively) high similarity to mine share at least one of my personal traits.

Which is fascinating, to me.

And I think it is worth noting for others - what and how you write can disclose who you are.


It knows my IP now.

(Or does it?)


It offers no privacy policy, so can't tell.


.6 is high confidence? I did my own username, wondering what it would return, since I know I don’t have any alt accounts. The top results are in the .6-.7 range. If they aren’t alt accounts, is it just coincidence that we have similar writing styles?


I think so.

A funny thought — my “matches” cap out at around .56. Having false positives* in a tool like this might feel like a “bad result” but actually I think it just means that if someone were running this sort of tool across the whole internet, I’d be relatively easy to correlate, while your identity would be intermingled with your .6-.7 partners.

*actually they aren’t really even false positives because the tool doesn’t promise to detect alts in the first place, just find similar styles.


> but this functionality turned out way creepier than I thought the moment I tried it

Hopefully this raised awareness means that people who actually need anonymity will be more likely to know to take precautions.


Genuinely asking, what way is there to combat this? Is there a tool that takes out stylistic elements of your comment?


The site mentions a service called Quillbot which apparently does just that. https://stylometry.net/avoid


This is the million dollar question. I think the goal of "anonymity for most intents and purposes" is worthy, it's been how I've enjoyed HN and Reddit, but I also know that it was just a matter of time before stylometry and other meta-analysis of post history become 10 second tools for everyone. Now the cat is out of the box.

I've been thinking about this a bit, and I've landed on the view that having a stable identifier across ALL comments & posts is a poor default. We still probably want some coherence, at minimum within a thread, e.g. to follow a back-and-forth. The site itself may also use a stable identifier for abuse prevention. But there's no reason one should have the same username externally traceable for posts about completely different topics.

In practice, this could be done with low friction pseudonym creation, which all ties to the same account privately.


One way would be to run such a tool before posting and then, based on the results, tweak the post and repeat until the similarities are not statistically significant. Or, instead of tweaking, start posting under a new throwaway account. But this won't save you when some new way to analyze style appears in the future. Moreover, there are other types of metadata, such as timestamps, which can be taken into account to narrow down the search space a bit. And obviously the more you write, the harder it is to control these things.


I wonder if gpt3 has a use case here?


[flagged]


You know everyone is going to put your username in that tool after a rant like that.

If ever there were a good use for a throwaway account I’m thinking this is it…


0.6 isn't much. I have 3 matches above 0.6, and they're not me. 20 or so over 0.5.


I get one 0.68 match, which... fair enough. It is an account I've abandoned some years ago, no secrets there.

No other hits above 0.5, so I guess that either makes me pretty unique as a commentator or my English is broken in a unique way.


That's why you manually evaluate the matches. And like I wrote in that comment, I did that manual eval, and these clearly are alts of that main account, not spurious. Narrowing down the pool of accounts you'd need to do this kind of manual evals for by a factor of 100000 is a pretty significant change in capabilities.


Could you elaborate on why it's obvious why you won't name the account?


Maybe to avoid attracting any extra attention to this user? Also, as someone who’s read HN for a few years, it only took me 2 guesses to find an account that the above comment describes (and not necessarily the same person).


It was a classy move by jsnell, too. Thank you.

(I don’t know who the comment is talking about, which is how it should be. There’s no need to blow someone’s cover in a highly visible way. Even if they were satan, they’d still be welcome on HN as long as they’re writing substantive, interesting comments that follow the guidelines.)


Such quality comments would track with most thorough Satan representations.


They obviously don't want it to be known, seeing as they've got alts to post under and avoid going into too much detail. Being able to go out and do your own research is different than posting the information open for everyone to see at a glance.

I would say it's obvious why one might respect that wish (do unto others...), but I'm also aware that my and my culture's sense of privacy goes further than many others'.


MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The top one seems an odd one out in that case?


Usernames aren't random enough to be safe as a simple MD5. Perhaps with a strong bcrypt, but similar to PIN codes, it might be better to give partial information like "is the second character an ...", assuming nobody else made similar statements. Or give the first ~two hex characters of the hash, so that it would match 1/16² (1 in 256) of the usernames. I'm sure there's also a clever way for a zero-knowledge proof here, probably something with Diffie-Hellman using the name as your random integer or something, but I'm too sick to think about this stuff right now. Privately sharing data publicly is hard.


Another problem is that it's a small set. If you had a list of all HN users, you could compute md5 for all of them in seconds.
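To make the point concrete, here is a minimal Python sketch of why an unsalted MD5 of a username hides very little. The candidate list and names are made up for illustration; the same loop over a real list of a few hundred thousand usernames would finish in well under a second on ordinary hardware. It also shows the "first two hex characters" idea from upthread, which only narrows the pool to roughly 1 in 256.

```python
import hashlib

def md5_hex(name: str) -> str:
    """Hex MD5 digest of a username (no salt, as in the comment above)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()

def find_username(target_hash, candidates):
    """Return the first candidate whose MD5 equals target_hash, else None."""
    for name in candidates:
        if md5_hex(name) == target_hash:
            return name
    return None

# Hypothetical candidate list standing in for "all HN usernames".
candidates = ["alice", "bob", "carol"]
target = md5_hex("carol")

found = find_username(target, candidates)

# Publishing only a 2-hex-character prefix leaves many candidates in play.
prefix_matches = [n for n in candidates if md5_hex(n).startswith(target[:2])]
```

If the username is not in the candidate list at all, `find_username` returns `None`, which is what happened further down this thread with the two datasets.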


I think the intention of the post not mentioning the handle was just to prevent old discussions from flaring up or so? The post doesn't really contain any new information on the person that would be worth obscuring. So I just thought I'd hash it to prevent that. But it seems I actually screwed up the hashing so I will leave it at that.


Good point - I've been running john on that md5 for a couple minutes :)


Why use John? Just run down the list of Hacker News usernames; it'll take less time. (Or, better still, don't; just because the privacy's theoretically compromised doesn't mean we have to exploit that.)


I don't think there's a public list of all HN usernames is there?

Found this, it includes 250k usernames, but it's not there. https://www.kaggle.com/datasets/hacker-news/hacker-news-corp...


The username in question isn't in this dataset but maybe it was created in the past 10 days, as the max(timestamp) is Nov 16th, 2022.

https://console.cloud.google.com/marketplace/details/y-combi...


It isn't there, and given the "story" it happened years ago so it should be there, so I guess we've been played.


Unintentionally played I might add... But I will leave it at that.


> quick browse of the comments of the recently active ones, they look really likely to be alts.

Hmm isn't a spot check of comments somewhat tautological, since that is how the tool identifies alts (rather than something like IP address or time of day)? If this had been promoted as "find accounts with similar writing style to yours" would people immediately assume alts?


I would presume that OP is referring to the actual content of the comments. This just does stylometric analysis, which looks at word choice, but not what the arrangement of the words means.

If some accounts are found to be stylometrically similar, and then a visual inspection also shows them all stating similar opinions, that latter piece of data is a strong signal.


It would be nice to make the names clickable.

I don't think the list of pg alternate accounts is accurate. I checked a few. They have many one-liners, which is typical of pg, but the topics and style don't look similar.

I searched a few more and got better results. :)

I searched myself (and I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.


> I searched myself (and I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.

It's based purely off frequency of the 200 most common English 1 word phrases, 2 word phrases, 3 word phrases, 1 character sequences, 2 character sequences, and 3 character sequences. Topic does not really have anything to do with it. If I had more time I probably would've done a smarter model that accounted for things like that.
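For the curious, here is a rough Python sketch of that kind of model. It is not the site's code: the vocabulary here is taken from the toy corpus itself rather than from a list of common English phrases, the usernames and comments are invented, and the similarity is plain cosine similarity over relative frequencies.

```python
from collections import Counter
from math import sqrt

# The six feature kinds: word and character 1-, 2- and 3-grams.
KINDS = [(kind, n) for kind in ("w", "c") for n in (1, 2, 3)]

def grams(text, kind, n):
    """Word ("w") or character ("c") n-grams of a lowercased text."""
    seq = text.lower().split() if kind == "w" else list(text.lower())
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def build_vocab(texts, k=200):
    """Top-k n-grams of each kind (assumption: counted over this corpus)."""
    vocab = []
    for kind, n in KINDS:
        counts = Counter()
        for text in texts:
            counts.update(grams(text, kind, n))
        vocab += [(kind, n, g) for g, _ in counts.most_common(k)]
    return vocab

def vectorize(text, vocab):
    """Relative frequency of every vocabulary n-gram in one user's text."""
    counts = {key: Counter(grams(text, *key)) for key in KINDS}
    totals = {key: sum(c.values()) or 1 for key, c in counts.items()}
    return [counts[(kind, n)][g] / totals[(kind, n)] for kind, n, g in vocab]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Invented users and comments, purely for illustration.
comments = {
    "alice": "I think that the tool is pretty good, to be honest.",
    "alice_alt": "I think that the site is pretty good, to be honest.",
    "bob": "lol nope!!! never gonna happen!!!",
}
vocab = build_vocab(list(comments.values()))
vecs = {user: vectorize(text, vocab) for user, text in comments.items()}
sim_alt = cosine(vecs["alice"], vecs["alice_alt"])
sim_other = cosine(vecs["alice"], vecs["bob"])
```

In this toy data `sim_alt` comes out higher than `sim_other`, and since nothing in the features looks at topic, two people who merely share filler phrases can score as close as a real alt.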


One is also a mathematician. It's trivial that we overuse some technical words even if it's unnecessary.

Another is from Argentina, so I guess the native language leaks, for example using words derived from Latin that are not idiomatic.

And there are a few more whom it is an honor to be "confused" with, but I have no clue why.


Cool stuff, thank you for sharing your findings!

I don't do throwaway. I either post or STFU. I also STFU on darknet. It's why I found it fun to read/lurk on things like I2P back when it was new. And I know that on a pseudonymous account it is only a matter of time until it can be linked to another pseudonymous account. It would not surprise me if stylometry was used on Dread Pirate Roberts or the people behind The Pirate Bay or the people behind Wikileaks (Assange's sockpuppet accounts). It could also have been used to verify afterwards instead of beforehand. Though with TPB, since it was on clearweb, an advanced adversary could have used a correlation/timing attack to figure out who wrote what.

I'm having fun recognizing other Dutch people through their usage of the English language. For example, a distinctive word I see Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a red flag that the person is native Dutch. I wonder if there are stylometry tools available for figuring out if someone used a physical vs touchscreen keyboard (I used Glider to write this post, spellchecker unavailable).

And yes, organizations like secret service and police should use such tools as well. It is a known tool, why not use it for good? As with any tool, it can be used for good and evil. On HN this could be useful for the mod team (AFAIK nowadays only dang) to find banned people's sockpuppets. Cross-community could also be a fun project: find a HN user's Twitter or Reddit account. And I hope this method is also used to find Russian trolls on social media.


Most people greatly underestimate the power of linkage attacks on anonymity. And it doesn't even take fancy ML. In the context of healthcare records, I like to trot out this 25 year old example of an MIT grad student and the then-governor of MA.

https://ischoolonline.berkeley.edu/blog/anonymous-data/


The top hit on my list looked familiar. I looked at their recent comments and saw a discussion between that user and me. We were quoting each other directly throughout.

I wonder if this explains our similarity. And if so, could we tweak the algo by e.g. removing text that is prefixed with ">"?
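A preprocessing step along those lines could be as small as the Python sketch below. It assumes the usual HN convention of starting quoted lines with ">" and would miss quotes formatted any other way (italics, quotation marks, and so on); the function name is just illustrative.

```python
def strip_quotes(comment: str) -> str:
    """Drop lines that start with '>' (ignoring leading whitespace)."""
    kept = [line for line in comment.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()

example = "> quoted text here\n\nMy own reply.\n>another quote\nMore of mine."
cleaned = strip_quotes(example)
```

Running this before counting word frequencies would keep one user's phrasing from being credited to whoever quoted them.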


The scary thing is that once you have this data, finding HN matches for individual targeted users on other sites becomes trivial, even if those sites are harder to scrape. I bet most people here have an anonymous Reddit account, for example. If you wanted to know who was behind a particular Reddit account, you could feed it into something like this and compare the results with HN, where accounts are less likely to be anonymous. Or build a database based on blogs, Github comments, etc.

Also, since this uses only word frequency, there are probably relatively easy improvements to make that would make it even more powerful, like looking at particular runs of words that are unique. Some expressions or figurative language only show up in combinations of words, and tend to be highly style specific.


I could have used a part of speech tagger, looked at time of day a user posts, capitalization, spelling errors, etc. From what I understand the state of the art is lightyears ahead of this, there are even companies with actual linguists who will act as expert witnesses in court to say stuff like "we can say with 95% certainty that xyz authored this email." Honestly it's kind of scary. There are papers that talk about cross platform authorship attribution, one I think did it with Twitter, Blogspot, G+ and had pretty good results.


Thus proving the only actually anonymous community in practice is 4chan, and that’s why it’s so toxic.


If you define “toxic” as “people disagreeing with you”, sure. That was what the entire internet was like until maybe 2005.


I'm old enough to remember when 4chan was self identifying as the Internet's hate machine, before xkcd referenced it as such: https://xkcd.com/591/

Sometimes people insist that's all role-play and irony; others insist that if it ever was, it certainly isn't now.

But regardless, I remember pre-2005, and it wasn't all like what I saw the two times I looked at 4chan. Bits were. Bits were much worse. But mostly, mostly, people were kinder… at least, unless political tribalism came up.


“People disagreeing with you” describes almost none of the conversation on 4chan


Forget the alternate accounts — if two users are close in style, there’s a decent chance they should be friends. This is an HN friendship machine.


It would be convenient if the usernames linked to the comment pages on Hacker News (to avoid having to copy/paste and URL hack, which is made even slightly more annoying because for some reason when I tap and hold the usernames to copy them your markup--I haven't looked at why yet--is causing an extra space character to get copied on the left).


This is interesting.

I'm 0.566 correlated with logfromblammo -- and while we are definitely not the same person, I could easily imagine writing a sentence such as:

"For some bizarre reason, management has not yet assigned a task to their programmer underlings to automated themselves out of existence. I can't imagine why."

which is theirs, not mine, from about a year ago. I like that.

On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.


> On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.

This is due to the Firebase API not updating when users ask the admins to move their comments to another account.


Yeah, I got a good match with my previous nick here. Which to me proves the tool works well.


I had a similar experience finding my most likely alt (.50, suggesting I am a unique snowflake, as I have always thought :-). My most likely alt certainly writes in a style I appreciate and on subjects I often mention.


How about this for countermeasure:

As you're typing out a comment the software gives you a list of accounts you're becoming similar to. That way you can adjust your writing as you type.


Someone linked it in the thread: https://github.com/psal/anonymouth


Forget countermeasures, go covert. Write a comment, have the comment be rewritten before submission in order to resemble a targeted account.


Sounds great, except there are many different similarity measures. Which one does the algorithm use?


Why not all of them? Which metrics are closer would tell you which aspects of your writing you need to focus on.


This found an alt that I created specifically to see if I could write artificially to defeat this kind of analysis. I have seen other tools like it posted to HN, but none before had found that account. I guess I need to up my game.


If you don't mind sharing, are you "writing artificially" purely in your head, or are you using techniques like intermediate translations?


No mechanical means, but I have referred to a thesaurus occasionally. Mostly I tried to change my sentence structure, not just words. It requires actually thinking differently, in a way. Which makes it difficult to know how well I'm communicating.


I imagine this would be quite difficult in practise, due to all the subliminal factors behind a person's writing choices.

For example, as somewhat illustrated here, your personal vocabulary is a kind of fingerprint. As you mention, using a thesaurus can somewhat alleviate that, but if a thesaurus is only changing a small % of your words, then it will only have a suitably small % effect upon analysis.

To go yet further might (I suspect!) entail methods such as directly lifting and using other people's sentences to convey your own thoughts. But even then, "your own thought patterns" are still informing the manner of the post, to some extent, so over time increasingly robust analysis may still find patterns to hook into.


I wonder if someone will come up with a Grammarly-like tool which you can feed with sample writings to help you increase/lower the similarity score of a new text you are writing.



That post was actually what motivated me to make this. I'm on your email list :)


WOW! It's such a pleasure for me


Ahhh, does anyone remember this hacking crew who leaked BLUEETERNAL and other NSA tools and exploits? Shadowbrokers.

They were always communicating in some kind of meme-russian, and their texts were funny to read. [1]

I believe their writing mostly defeated this kind of analysis, at the cost of looking like idiots (which was probably the reason no one sent them crypto-dollars to buy that stuff exclusively).

Here's an excerpt:

"Attention government sponsors of cyber warfare and those who profit from it !!!!

How much you pay for enemies cyber weapons? Not malware you find in networks. Both sides, RAT + LP, full state sponsor tool set? We find cyber weapons made by creators of stuxnet, duqu, flame. Kaspersky calls Equation Group. We follow Equation Group traffic. We find Equation Group source range. We hack Equation Group. We find many many Equation Group cyber weapons. You see pictures. We give you some Equation Group files free, you see. This is good proof no? You enjoy!!! You break many things. You find many intrusions. You write many words. But not all, we are auction the best files."

[1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...


*EternalBlue


Have you tried including parts of speech (for example, as bigrams and trigrams) as part of the features considered in your model? I’ve had great success with stylometry that goes beyond TF-IDF with bags of words; including grammar patterns was shockingly good.

(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)

Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
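A minimal Python sketch of the POS-gram idea, under some loud assumptions: the input is already tagged by whatever part-of-speech tagger you prefer, the tag names below are illustrative, and the TF-IDF weighting mentioned above is left out, so this only produces raw relative frequencies.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Contiguous n-grams over a sequence of part-of-speech tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def pos_gram_profile(tagged, sizes=(2, 3)):
    """Relative frequency of POS bigrams and trigrams for one text.

    `tagged` is a list of (word, tag) pairs; the words are discarded so
    only the grammatical pattern remains.
    """
    tags = [tag for _, tag in tagged]
    counts = Counter()
    for n in sizes:
        counts.update(pos_ngrams(tags, n))
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}

# Hand-tagged example sentence; a real pipeline would use a tagger here.
tagged = [("I", "PRON"), ("really", "ADV"), ("like", "VERB"),
          ("this", "DET"), ("tool", "NOUN")]
profile = pos_gram_profile(tagged)
```

Because the words themselves are thrown away, swapping vocabulary with a thesaurus leaves this profile unchanged, which is presumably why it catches people that word frequencies miss.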


> Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.

That is a very good idea and when I update the site that will almost certainly be included :) Any other tips? Been reading papers for ideas and I think I may have to ditch the cosine similarity and go for something fancier soon. Thank you


How long until this becomes the algorithm for a dating site?

“Find hot single women who write just like you”


This seems like a great way to hire freelance copywriters/ghost writers too. I would absolutely hire someone I knew could match my tone well for writing generic unattributed copy.


Wouldn't be surprised if dating sites already used similar algorithms.


Do dating sites really use clever algorithms to match up people together? I was under the impression that, the less likely you are to meet your perfect match, the more you're going to use the app.

In my experience I don't see a relevant list of potential matches aside from gender and age preference, it's all completely random, even frequently I see people outside the settings I've specified (i.e. men or older women).


Wouldn't be surprised if most of the women on a specific dating site had very high similarity scores.


This is one reason why I like legal doctrines such as "beyond a reasonable doubt." Even a 0.9 match in a tool like this could be a coincidence, if there are millions of users. But that won't stop people from casually believing "aha it must be an alt account", based on some anecdata.

It's so easy for something like this to be turned into a tool for a witch hunt, targeting innocents.


But a 0.8 or 0.9 match and something like Tor usage could be enough to justify a warrant. That's why I'm not sure I want to open source the code because I don't want to normalize this.


Keep in mind the potential to create false accusations by fabricating similar looking accounts.


Hmmm, doesn't seem to work. But you have convinced me (and many others?) to search for our alts consecutively, so now you do know who has alts?


I wonder what's a reasonable threshold for "probably the same person". I've never had an alt on HN, and when I searched myself, it found 3 other users above 0.6, none of whom I've ever heard of before.


If it's >0.9 you can almost guarantee it's an alt, but I've seen certain matches at 0.6. The problem is writing styles change over time. Another idea I had was converting the scores, which are just cosine similarity scores, into percentiles (so 0.99 would be the 99th percentile of certainty) to make them more human-interpretable.
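The percentile idea could look something like this Python sketch. The reference scores are invented; in practice they would be similarity scores between pairs of accounts, ideally pairs known to be different people, and the choice of that reference population is an assumption that changes what the percentile means.

```python
from bisect import bisect_left

def percentile_rank(score, reference_scores):
    """Percentage of reference scores strictly below `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Made-up reference distribution of cosine similarity scores.
reference = [0.21, 0.25, 0.30, 0.34, 0.36, 0.40, 0.45, 0.49, 0.56, 0.66]

rank = percentile_rank(0.6, reference)  # 9 of 10 reference scores are lower
```

With this toy distribution a raw 0.6 lands at the 90th percentile, which reads very differently from "0.6 out of 1".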


I make new accounts every so often and the accounts of mine that it found have a score of around 0.3. I'm not actively trying to defeat stylometry but it's possible I just have a particularly unremarkable writing style.


Well I must be stereotypical myself because it found me at 0.8 !


The people at 0.4-0.6 with me do share some interests. That's cool on its own.


>The problem is writing styles change over time.

It would be interesting if we could plot the writing style divergence over time.


I got matched with my old account with a score of only 0.45


I have no alts. The highest match for me is about 0.66.


Interesting. The highest non-me account is under 0.4 on my page. I do not believe that I have such a unique writing style - especially since half my posting is on mobile and therefore possibly slightly different than my desktop posts.


My closest is 0.4879. I know I tend to be wordy but I thought I had a pretty generic style as well. This is definitely a fascinating demonstration.


Feeling better about my high of 0.49 now


0.6 is not high enough to indicate an alt


Oh wow, it's really sure that I'm stavrosk, which I am:

https://stylometry.net/user?username=stavros

The next person is 30% less certain, that's huge! This would basically identify any alt I might have with near certainty.


Funny thing is, it thinks I'm you, but it doesn't think you're me!

https://stylometry.net/user?username=rogual

I'd have thought this stylometry thing would be commutative.


I guess it's a multidimensional space, so you can have someone closer to you than me, but they aren't also closer to me than you. Basically, they're close to you, but on the "other side" of me, I guess?


Don't need multiple dimensions for that.

0.1, 0.2, 0.3, 1.0, 2.0

To 2.0, 1.0 is closest.

To 1.0, 0.3, 0.2 and 0.1 are closer.
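The same example as a few lines of Python, using absolute difference on a line as a stand-in for stylometric distance:

```python
def nearest(point, points):
    """Closest other point to `point` by absolute difference."""
    others = [p for p in points if p != point]
    return min(others, key=lambda p: abs(p - point))

points = [0.1, 0.2, 0.3, 1.0, 2.0]

nearest_to_2 = nearest(2.0, points)  # 1.0
nearest_to_1 = nearest(1.0, points)  # 0.3, not 2.0
```

The distance between 1.0 and 2.0 is the same in both directions; it is only the "who is my nearest" ranking that is one-sided.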


Thanks, seems obvious when you put it like that.


The word you are looking for is "symmetric".


stavrosk doesn't have any posts/comments? What's it using to match?


It's my old username.


Huh... seems there are some inconsistencies between what's presented on news.ycombinator.com and the Firebase API. Glad it matches for you though :)


I guess they just didn't go back and reparse, not a big problem. I don't think people change their username frequently :P


This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?


HN has an Algolia-based API. It’s also very easy to crawl.

I wouldn’t call this evil, however: it’s merely demonstrating a technique that you should be aware of, if you’re a privacy-conscious person. It looks like they also provide some resources for avoiding stylometric detection.


I would bet my bottom dollar that the likes of Reddit and Google already have models to turn a corpus of text into probable demographic data and models to measure the similarity of users.


Please don't shoot at the messenger. costco shared this voluntarily and I can see no bad intention.

We should see it as an opportunity to learn how easy it is to associate different pseudonymous accounts. Nothing drives this point home better than a practical demo.

We can be pretty sure stylometry is used widely by bad actors already and we should not punish people who help to spread the word about these technical possibilities.


And this is actually quite a simple approach--which is interesting in and of itself. While there would be diminishing returns, there are a ton of other techniques you could use to make stronger inferences about similarity.


> This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?

I'd way rather have someone tell me "look at all the things I can find out about you" so that I can act accordingly (whatever that means!) rather than what we've mostly actually got, which is companies silently exploiting my data and doing everything they can to mumble reassuring but legally ineffective formulas assuring me that they deeply respect my privacy.



Why didn't you use the google bigquery?

https://news.ycombinator.com/item?id=10440502


I was aware there was a HN dataset on BigQuery but I had never used a library to work with it before and when I played around on the website the posts I got were all from 2015 at the latest. It probably would've made my work easier but there's not really anything I can do about it now.


I don't know that I'd call this evil. We have no idea who else is using this kind of technology but not making the results public. Better to know what's possible and take measures to make it less effective.


It’s just statistics. I recall that during his whistleblowing, Snowden intentionally took anti-stylometry measures.


Imagine using this across different platforms :/, and let alone using different techniques in addition...

edit: maybe you'd catch some criminals if you tried to match reddit against dark web for example


Interesting that the OP doesn't come up in the search: https://stylometry.net/user?username=costco


Their first comment and submission were 4 hours ago. Text on the page is accurate it seems.


Not surprising considering the account had no activity before today.


My nearest match is only at 0.406. It'd be interesting to see who the most unique commenters are, but it's also quite possible it wouldn't be flattering.


0.35 is my nearest. In hopes of lowering it even further, here are some nonsensical opinions never expressed on HN before: 1) Programming peaked with COBOL 2) Paul Graham is responsible for 90% of SIDS cases 3) There's no reason to use car when cdr exists.


0.2506 is my nearest match


That's the lowest I've seen yet. You must write uniquely :)


I have no alternative accounts besides making a single throwaway account to post one "Ask HN" five years ago, but I have a decent number of matches above 0.5. I think this is due to the relatively uniform style of "who is hiring" posts, since my matches did that in a similar way for other companies. I made many of those for about two years when I was at a start-up.


On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?

Also think it's probably poor form to list users as examples without their permission.


> On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?

Yes.

> This may be out of line but isn't pg on here with a different username, Levenshtein distance of one that's not included? Or is that just a very motivated 13yo account who writes a lot of admin-esque comments.

What other pg account are you referring to? I want to see it so I can see what my algorithm missed.

> Also think it's probably poor form to list users as examples without their permission.

You're right. I'll remove that - I just wanted some examples especially for people on phones who don't feel like typing. Thanks for the feedback.


> However, using automated methods like machine translation services do not appear to be a viable method of circumvention.

https://www.whonix.org/wiki/Stylometry


It found my old account (ara4n; i lost the password) at 0.63. More amusingly it found my cofounder too, who hardly ever posts here (at 0.48)


> ... This site works primarily by analyzing for each user the frequencies of the most common words and phrases in the English language. Accordingly, the easiest way to avoid being identified is to simply use different words than you ordinarily would when writing. More sophisticated models than the one I made can use punctuation, comma usage, and capitalization to identify you so try alternating those as well. Services like Quillbot can help you with this but depending on your circumstances you may not want to send your writings to a third party service.

HN offers many other threads which could be tied together, including:

- time of posting

- ratio of replies to top-level comments

- comments being mainly upvoted or downvoted

- sentiment (mostly angry, dismissive, questioning, etc.)

- most common topics (keyword analysis of post being replied to)

- ratio of new posting to post replies

- first-to-comment on a post

- lone comment on a post

- etc...

It seems very likely that sooner or later every pseudonym for posting content will get discovered and linked. The lesson here is don't post anything that would cause you undue shame or harm if linked directly to your legal name.


Well now I'm self conscious about my closest match being an 0.34 when so many other people are reporting much closer matches with accounts that aren't alts. Do I write weirdly?


Same for me, the closest match is 0.36. But I expected that because I don't speak English very well so the pool of candidates is small.


.31 here! I'm a non-native speaker tho, so it wouldn't surprise me if I had weird speaking habits


My closest is 0.40, so I’m right there with you.

Native English speaker as well.


0.36 here! Out of curiosity, are you a native speaker?


I am, yes.


0.39 for myself, I’m a non-native speaker.


What does the bold signify? For example when I search for dang (https://stylometry.net/user?username=dang) the 4th most likely user is not bold whereas the 16th is?


Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always).
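Spelled out as a Python sketch (not the site's code; the `sims` table, usernames, and scores are invented, and `k` is set to 1 in the demo so the effect shows up with three users instead of twenty):

```python
def top_k(user, sims, k=20):
    """The k users most similar to `user`, best match first."""
    ranked = sorted(sims[user], key=sims[user].get, reverse=True)
    return ranked[:k]

def is_mutual(user1, user2, sims, k=20):
    """True if each user appears in the other's top-k list (shown in bold)."""
    return user2 in top_k(user1, sims, k) and user1 in top_k(user2, sims, k)

# Hypothetical symmetric similarity scores between three users.
sims = {
    "a": {"b": 0.6, "c": 0.5},
    "b": {"a": 0.6, "c": 0.7},
    "c": {"a": 0.5, "b": 0.7},
}

a_b_bold = is_mutual("a", "b", sims, k=1)
b_c_bold = is_mutual("b", "c", sims, k=1)
```

Here "b" is the top match for "a", but "b" has a closer match in "c", so only the b/c pair would be bold even though every score is the same in both directions.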


Huh, that's a somewhat non intuitive property.


It is a bit, but if stylometric equality was a thing you'd expect it to be symmetric, so if stylometric similarity is a thing....


And this is why I’m a reader and not a poster on HN :)

The second that I found out that requesting deletion of an account and its posts needed a MANUAL request to a single user (dang) I noped out so fast

But happy that the rest of you are still happy to contribute :)


I really liked the informative and straight-to-the-point about page - describing how the algorithm works in a way that is easy to understand. All the important details are summarised there. Well done!

Edit: From the "How to avoid .." page, there is the following sentence:

> Also, most authorship identification algorithms have poor accuracy when working with small amounts of words. This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.

Can you clarify what this means and why it would result in a ban?


> Can you clarify what this means

Imagine that for every new comment you want to post you would create a brand new account which you would use precisely once and never again. Then the stylometry would have just a few words and wouldn’t have enough corpus to get a reliable signature. If a lot of people do this it would be hard to figure out which account belongs with which human. ( Of course if you alone do this, your messages will stick out like a sore thumb. See xkcd 1105 )

> why it would result in a ban?

Because this practice is especially discouraged in the guidelines: “please don't create accounts routinely. HN is a community—users should have an identity that others can relate to.”


At the same time, HN doesn't let you delete comments.

Maybe with some GDPR magic.


Not sure what your point is, or how it connects with my comment. Care to elaborate?


Your comment quotes an HN guideline, and my point relates to it. Some users may feel the need to create throwaway accounts in order to post comments that in an alternative reality they could post under their primary account and later delete if desired. It may not stop a scrupulous collector of data, but such a scenario may not be the object of their worry.

Taking this to its logical conclusion, a user may opt to always post under a throwaway account, to avoid any possible tainting associated with a primary account.


> Can you clarify what this means and why it would result in a ban?

I have seen dang respond to users multiple times asking them to stop making new accounts especially but not always if it's to avoid rate limiting. I don't know if there's an official policy but it's definitely something I recall.


Just a heads up for everyone who doesn't want their alt accounts linked: maybe don't use this tool to see if it works.

Unless the author would run this against all HN user accounts, no need to flag the ones "of interest".


Have you done any data analysis on distributions of similarity? How similar you'd expect any 2 people to be given English focused around tech? Or any other interesting stats you'd like to share?

Very nice clean site, great work.


What match level would you expect to see between two randomly chosen individuals?


It's accurate enough that I had to create a new account now :)

I guess it's difficult to evade, as the word frequency certainly catches everything about the countries I frequently refer to, programming languages, interests etc.


Similar to how they make adversarial fashion[0][1] in order to not be tracked by face ID AI, I wonder if we can make adversarial stylometry tools to run your comments through in order to anonymize them

.. [0] https://hackaday.com/2022/10/20/render-yourself-invisible-to...

.. [1] https://adversarialfashion.com/


OP links to a paraphrasing tool on their website.


This is absolutely bonkers. I tried it with my alt and it got my original correct! So I'm writing this comment with a fresh account which hopefully will not get correctly linked too lol


Did something similar in 2018 (still running locally) which could unmask anyone

https://twitter.com/austingwalters/status/104189476543920128...

Made both Metacortex.me and insideropinion.com

The idea being you don’t actually need an active directory. It would drop in, figure out all the users (provided one account was on the AD) and would monitor everyone’s skill sets, morale, schedule, etc. Worked super well for what it was / is.


Neat work!

Out of curiosity: do you filter sentences that begin with ‘>’, indicating a block quote from another user? That might improve the accuracy a little here, if you don’t already.


Yep!


Perhaps explain in the about what you filter out? Along with what the bolding means?! Do you filter out anything else (like spaced/indented/monospace text/code, or even quoted text, which is often not written by the user?). Super thanks for this - interesting!


Turns out that there may have been some glitches in the way I was filtering lines beginning with >. For explanation of bolding see https://news.ycombinator.com/item?id=33755466. I didn't attempt to filter anything else out though filtering out code would probably help a little bit.
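
For anyone curious what the '>' filtering might look like, here is a minimal sketch. It is not the site's code. The function name `strip_quoted_lines` is made up, and the `html.unescape` step rests on my assumption that the raw comment text may still be HTML-escaped (so a quote marker shows up as `&gt;`), which would be one way a naive filter misses quotes.

```python
import html


def strip_quoted_lines(comment: str) -> str:
    """Drop lines that start with '>' (ignoring leading whitespace)."""
    # Assumption: the input may be HTML-escaped, so turn '&gt;' back into '>'.
    text = html.unescape(comment)
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


comment = "> Can you clarify what this means\nImagine a brand new account per comment."
print(strip_quoted_lines(comment))
```

This only catches the common "> quote" convention. Quotes marked some other way (italics, quotation marks, "<<") would still end up in the corpus.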



In this particular case, it seems to be picking up the stock moderation responses as it looks like sctb was a moderator account until 2019.


I don't have an alt but it would be cool to meet my stylometry-neighbors. I'm curious whether the writing similarity translates to oral communication too


I tried dang's old account (gruseom) expecting to see his dang account listed. Nothing. Tried dang, sctb (a previous admin) was listed as closest match.

I wouldn't rely on these results

https://stylometry.net/user?username=gruseom

https://stylometry.net/user?username=dang


> I wouldn't rely on these results

You picked a user who posts a massive volume of repeat, template-y comments and found their former colleague who also posted piles of repeat, template-y comments, that being part of both of their jobs.


There are a few close matches to dang's style of template-y comments in the results. Afaik none of the listed accounts are Daniel.

I picked dang as he is the figurehead of hn, and didn't want to inadvertently reveal some other user's identity.


> There are a few close matches to dang's style of template-y comments in the results.

At least the #1 close match (sctb) was a comoderator with dang, so they were kind of alts as the official voice of HN.


Writing "antirez" shows accounts with Spanish names (none is mine). I guess Italian and Spanish speakers write English very similarly, but on HN there are a lot more Spanish speakers than Italian ones so that's what I get.


It seems the accuracy for nonnative speakers is not nearly as good as it is for native speakers. The algorithm could definitely use some work.


Tried my account thinking "I don't have any alts" but it turns out I do! In 2018 I changed my username from "cbr" to "jefftk" and it pulled that right up: https://stylometry.net/user?username=jefftk


Rebrand it as a soulmate-finder?


Well done, it found my ancient old account.


I only got 0.9999999999999992 for myself :(


Naturally Born Imposter


Honeypot to see what accounts are tested in sequence?

;-)


I turned off nginx logging if that makes you feel any better. Of course there's no way for you to verify that because I'm just a random guy on the internet but I will tell you that I am a civic minded citizen who is concerned about privacy and the Internet.


Only half kidding, but if I were state intel it’s what I’d be doing. :D


Ingenious idea. At the very least, this is just about finding people who write like us, the same way we seek those with similar tastes (music...)

How long before large commercial indexers start offering an efficient (AI based ?) stylometry to agencies and states ?

wait... do you think the NSA is already doing this?


They would be silly not to (apart from creepy profiling of the entire global population, you also get to potentially identify bots). We all have mannerisms that can easily 'betray us' online. I honestly thought my writing style was more unique, but as it turns out it is somewhat common.


It isn't writing style, but more of phrase selection. If you lean on the same phrases (n-grams), then you will be very very close in a high dimensional space. Colloquialisms are the biggest tell, you should eschew them.
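
To make the "close in a high dimensional space" idea concrete, here is a rough sketch: count word n-grams per user and compare the count vectors with cosine similarity. I am assuming bigrams and plain cosine similarity for illustration; the helper names are mine and this is not necessarily what the site does.

```python
import math
import re
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams in lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


user_a = ngram_counts("at the end of the day it is what it is")
user_b = ngram_counts("it is what it is at the end of the day")
user_c = ngram_counts("functional programming rewards careful type design")

# Shared stock phrases pull a and b together; a and c share no bigrams.
print(cosine_similarity(user_a, user_b) > cosine_similarity(user_a, user_c))
```

Each distinct n-gram is one dimension, so two people who lean on the same stock phrases end up pointing in nearly the same direction even if they write about different things.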


> I honestly thought my writing style is more unique

You just showed another possible use case for this kind of tool: "How unique is my writing style?"


Stylometry is an old hat technique; you can assume that intelligence services around the globe regularly apply it.

(Statistical stylometry is a little newer and more rigorous than manual stylometry, which essentially involved a human being's judgement call around the similarity of documents.)


What about "deep learning" stylometry?



I don't know, but it wouldn't surprise me if someone has tried to apply ML to stylometry. Statistical stylometry is already pretty effective, as demonstrated by this site.


Site down? I'm keen to see if it catches my alts.


Apologies for the downtime. Something crashed while I was asleep, should be working now. Not really sure how because the log indicates that uwsgi "gracefully exited," but I'm looking into it.


Same here, 502 consistently.


Since it looks for similar word usage, false positives seem to appear more often when specific topics are talked about, like stocks or crypto.

Does this ignore stop words? Or do all words have the same weighting? I wonder if only focusing on stop words would give a more accurate measure. Maybe we are more comfortable with certain stop words more than others?

https://en.wikipedia.org/wiki/Stop_words

"Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant."


All words have the same weighting. I don't ignore stop words, in fact most of the ngrams I use are composed almost entirely of stop words. Maybe it'd be more effective if I ignored them.
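
The opposite experiment from the parent comment (keeping only stop words) is easy to sketch too. The stop list below is a tiny made-up sample, and `stop_word_profile` is a hypothetical helper, not anything from the site.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real lists are much longer.
STOP_WORDS = ("the", "a", "an", "of", "to", "in", "and", "but", "that", "it", "is", "i")


def stop_word_profile(text: str) -> list[float]:
    """Relative frequency of each stop word, in STOP_WORDS order."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in STOP_WORDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(STOP_WORDS)
    return [counts[w] / total for w in STOP_WORDS]


profile = stop_word_profile("The cat sat in the hat and it is a hat")
print(dict(zip(STOP_WORDS, profile)))
```

Since topic words like "stocks" or "crypto" never enter the profile, this kind of feature should be less prone to the topic-driven false positives mentioned above, though whether it is more accurate overall would need testing.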


1. Interesting. I was kinda expecting to be grouped with other Russian speakers, and I am (based on some nicknames). Probably the frequencies of “the” and “a” are telling. But I swear to God that I sometimes spend some extra time trying to insert as many “the” and “a” in my texts as I can.

2. There is a Russian mnemonic verse, which can’t be properly translated to English, at least it’s beyond my humble capabilities. It goes:

“Это я знаю и помню прекрасно:

Пи многие знаки мне лишни, напрасны”

The number of letters in each word gives you the digits of pi: 3,1415… The meaning is: “I know and remember perfectly: too many signs (positions) of pi are useless and impractical”. Sometimes it’s nice to remember both things.
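
A quick way to check the mnemonic is to count the letters per word in Python. The variable names are mine; the verse is the one quoted above.

```python
verse = "Это я знаю и помню прекрасно: Пи многие знаки мне лишни, напрасны"

# Count only letters in each word so punctuation does not inflate the digits.
digits = [sum(ch.isalpha() for ch in word) for word in verse.split()]
print(digits)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
```

Read as 3.14159265358, which matches pi to eleven decimal places.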


Nice work! Thank you, of course I plugged in the obvious HN usernames

Edit to add;

Would be nice to have the https://news.ycombinator.com/user?id=username links included.


And perhaps rounding to 3 or 4 decimal places?


Amazing and I thought my doxxing tool was terrifying - https://news.ycombinator.com/item?id=32278871

I am afraid to combine all these methods


Yea.. i guess it's time to stop bothering with alt accounts/etc. I'll just make one account, maybe differently named on different services (makes scraping just a _pinch_ easier) but aside from that all i can do is modify/remove old posts.

Bit of a shame for useful posts/discussions.. but the internet is getting really.. finger print laden.


Incredible! There was a very active throwaway account here a while back that I always enjoyed interacting with. I suspected the person had more than one account and this found one that is incredibly close, down to the topics.


I checked a few random user names and I am confused.

- Why is the author costco[0] not in this lookup?

[0]: https://stylometry.net/user?username=costco


- Their first comment and submission were 4 hours ago.

- The text on that page is accurate it seems.


I played a little bit with it and it is baffling how well it finds accounts of people that know each other in real life. So it's not only good for finding alternate accounts but could be used to find peer groups.


Interesting, they are trading phrase-grams (just made that up) or lingo. That is really cool.


This doesn't seem to include text from submissions.

I ran it on Brian Armstrong's temp account from here, and it said it didn't write 10,000 characters:

https://news.ycombinator.com/item?id=3754664

EDIT: Or maybe it's something else because Brian only wrote less than 6k characters. But then why can my account be looked up?

Also, I would guess quoted replies are included, which muddies the analysis. Seems to be a very naive implementation. Much more can be done, but this was probably just a quick project.


Quoted replies shouldn't be included unless there's a bug on my end. Submission text is not included though I probably should have.


How much should we fear de-anonymisation ?

A lot of the discussion in the thread is about "how can we prevent this". I would like to know why we should not embrace this and similar technologies?

The benefits in my view are large - online behaviour tracks back to real life - and epidemiologically speaking, millions of test subjects across every question are invaluable - from traditional medicine to "mass psychology recommendations"

I can guess some downsides (hiding from abusive exes) but am interested in studies, surveys, reports etc - any HN thoughts welcome


Fear it happening or fear its consequences? Doxxing already happens all the time, but the main tools are things like account names or image search; this sort of tool could take it to a new level. A simple experiment would be to run this same algorithm against another site (say Twitter or Reddit) and see if it can reliably pick out the same people's accounts there. Once anyone on the internet can quickly/easily draw that sort of connection it would require incredible diligence to avoid de-anonymization while still maintaining any sort of "real self" presence on the internet. How much we should fear the consequences probably depends a lot on how marginalized you are within your society, but since just revealing your gender is enough to invite harassment in many forums I'm not optimistic.


>online behaviour tracks back to real life

This is good to you?

Okay, let's just make it like China or SK where your login is your citizen ID and if you write bad things the bad word police will take you away.

Also, no, I have no alts.


So I am asking because my views are only challenged inside my own head, hence the need for external thoughts.

But firstly the "governments will come and do bad things" argument - yes this is clearly and obviously a major problem - but not one solvable by technology in any way. Fixing violent dictatorships is an IRL problem - one that requires enormous effort and sacrifices (see Ukraine for obvious example). We cannot pretend that a browser extension or a ground up rewrite of Twitter will defeat Putin or would have stopped Hitler.

As for "free" countries (something like 120+ have open free elections), we still have online abuse for voicing opinions that some people don't like (anything from pro/anti Trump to LGBT and bitcoin etc). Those are real consequences but rarely government inspired and honestly I suspect we need better support for police in prosecuting such things - I mean a death threat is a death threat.

In general my view seems to be we should have the same protections online as we do offline - and if those protections are "in theory only" that requires us to use our voting and other political power to change it - not to obfuscate IP addresses or so on.

The upside of tech is so great it is worth spending IRL to defend against the downsides


I am of the generation and mindset that online abuse is not real. Straight up. Log out, turn off the screen and watch Netflix, take a walk and calm down, block the offending user. It's not real.

>I suspect we need better support for police in prosecuting such things

We do see that! But mostly people on Facebook. Here we have had judgements of people who posted threats on Facebook because it is tied to your real name.

And yes, abuse is part of the "fun". Under your system, my 10-year-old League and CoD chats would have me locked up.

>I mean a death threat is a death threat.

Is it? I would find it more concerning if someone on the street tells me he is going to kill me than a kid on xbox live.

NOW there is a difference with systematic stalking and harassment online, if I would get bombarded with DMs and messages to kys. I don't know how to solve that. But a one-off comment is NOT equivalent. Then it feels like I'm just old? At 31? Is it really so serious?


This is almost certainly going to be decided by the "reasonable person" test - and if you were on the jury it's going to have to be a higher bar than I, but I suspect there will be some offences we will both agree on.

My main point is not that we need to lock up everyone who makes a threat, but that we as a society will have to adjust our standards to the new normal.

Once upon a time every conversation was fleeting, every discussion in a pub or bar was ephemeral. Even Einstein and Dirac would walk home chatting without fear of being overheard. Then someone imagined it would be wonderful for the whole world to hear the erudite wisdom of those two geniuses of our age - and Facebook and Twitter and social media made it possible for every conversation in every bar to be captured and recorded and published - and we found out that Dirac and Einstein were just sledging each other and most other conversations globally were worse.

The new normal is that, like speeding, most evenings, conversations in most bars actually broke quite a lot of laws, from hate speech to sexual threats and basic politeness. And now the police can hear them as can everyone else - and discretion does not work on this scale - we either enforce the laws or change them.

That's a conversation for each judiciary - and likely to be either a balkanisation of the social media world, or a race to the top (we can all have twitter as long as we all behave to the standards of the highest / politest society). I am not sure where I stand on that.

Is it serious - hell yes. We are looking at a global technology with global benefits for all humankind - and if we want to communicate globally we need to agree what the standards for behaviour are on this virtual stage - from contract law to human rights and freedom of speech. We are inevitably going to build closer contacts - Brexit is a salutary lesson - and how we deal with freedom of speech online is just part of the jigsaw - but a telling part.


> I am of the generation and mindset that online abuse is not real. Straight up. Log out, turn off the screen and watch Netflix, take a walk and calm down, block the offending user. It's not real.

Until people can pierce the veil of your pseudonymity (which isn't all that hard depending on the platform and the person) and it isn't just online abuse and harassment anymore. "Tied to your real name" includes "tied to enough information about you that someone with plenty of free time can sift through various databases and piece it together" and most people have absolutely no idea how many such databases there are, and how much piecing someone can do.

I'll say something tangential: Even if we both agree that one-off assholes are largely inconsequential, and I think we do, such assholery has a broken window effect on a platform, where people see all the assholes running free and decide that it's either a place for them to be assholes or a place they should stay away from to avoid assholes.


What could possibly be the harm in allowing people to harass others based on posts they made decades ago? What could possibly be harmful in making a person who for whatever reason has changed their online identity easier to track? What could be remotely harmful about allowing Marlboro to find the accounts of ex-smokers? What could be the harm in tracking underaged users site by site?

I'm sure this is completely harmless and will not harm society.


I think this might be old age creeping up on me but I find it harder and harder to work backwards through "argument by sarcasm" to arrive at what you meant. I think clearly you are heartfelt in your views that having your identity online be a real one is bad - but I am not sure if that is because of posts you made years ago being linked back to you or nefarious advertising ?

The old posts issue is interesting - do you mean that there are posts from years ago you would find upsetting to be linked to you? Is this because you have changed your mind (a normal process society needs to understand) or because you said things thinking you were anonymous that you would not have said under your real name? Far less of a social issue I think.

It does make for some interesting thoughts if we made everyone post under their real name.


My view isn't that accounts tied to real people are bad. It's that your lack of ability to think of cases where what you propose could be harmful points to a total lack of critical thought on your part.

The point that I am making is that it's incredibly easy to decipher why "track everyone under every identity they choose" can go wrong and lower the quality of discussion, and specifically, that it's so easy that the fact you can't think of a single reason why it's a bad idea to completely eradicate privacy is telling.

If I can find an alt of yours saying that you've quit smoking and then push tantalizing ads to you, you're going to bring me a better return than blind-firing into the American public.

If someone is looking for people who are easy to manipulate in borderline-illegal fashion (let's say, sex crimes), it's a cheat code if they see some throwaway account on HN comment on a post about the treatment of youth, "As a present high school student, I disagree with your statement because..." and track it back to a minor.


I disagree that tracking leads to lower quality discussion - for example I know my name and identity is tied to this account and instead of responding to "lack of ...thinking" with an insult I am forced to come up with intelligent responses (now you try ... it's really annoying isn't it :-)

I also explicitly ask for real life examples and studies of harm - I can imagine and create examples but I much prefer the real world to my imagination as a guide. We learnt that as the basis of science.

I also think there is a difference between privacy and secrecy. You seem to conflate the two - if your actions online were secret then advertisers would not send you smoking ads. Secrecy is probably impossible - privacy is merely the politeness of our neighbours. And at scale politeness is enforced - by social norms and sometimes legal measures. We are seeing this come in (GDPR) but it's hard to have legal enforcement before the social norms have arrived.

On the smoking ad front, Gabriel Weinberg's main argument is that searching for "red men's trainers" should be enough to serve ads without having to know if I am a 20 something graduate in Wisconsin or a middle aged bloke in London. And I suspect he is right within a few percentage points.

As for online grooming - yeah this is a huge danger. Every parent's nightmare. And still absolutely something that needs to be enforced in the real world. And may need extra police and social resources. But if we want to stop predators reaching out to vulnerable children then it requires co-ordination among many groups online and offline - funding, political will, training, education over many years.

There will be no quick fixes for the problems tech is bringing - but I remain optimistic that the cost benefit ratio is worth it and that we can vote for and require change to defend against the dangers

which takes me back to my point - what are the real world examples of dangers so we can make sensible policy


Amusingly can't run it on the author since not enough comments


I have only ever had a single account but it returned 19 possibles with no confidence above .54 but 11 bolded. My own account was listed at the top with a confidence of .9999.


Yeah, I have a bunch of bolded mutuals but none above 0.45. I think I have had one or two alts in the past, but probably they didn't make the 10000 word threshold for inclusion (nor can I remember their names to check if they work in inverse).


Why are some users bold?


Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always). I should probably explain that on the site.
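
In code, the bolding rule described above could look roughly like this. The names `top_k` and `is_mutual` and the toy scores are made up for illustration, and I use k=1 instead of 20 to keep the example small.

```python
def top_k(similarities: dict[str, float], k: int = 20) -> list[str]:
    """Names of the k most similar users, best first."""
    return sorted(similarities, key=similarities.get, reverse=True)[:k]


def is_mutual(scores: dict[str, dict[str, float]], user1: str, user2: str, k: int = 20) -> bool:
    """True if each user appears in the other's top-k list (shown in bold)."""
    return user2 in top_k(scores[user1], k) and user1 in top_k(scores[user2], k)


# Hypothetical pairwise similarity scores.
scores = {
    "alice": {"bob": 0.9, "carol": 0.5},
    "bob": {"alice": 0.9, "carol": 0.8},
    "carol": {"bob": 0.8, "alice": 0.5},
}

print(is_mutual(scores, "alice", "bob", k=1))  # True
print(is_mutual(scores, "carol", "bob", k=1))  # False
```

Note that the scores themselves are symmetric here, but the top-k lists are not: bob is carol's closest match while carol does not make bob's list, which is how a user can be bold on one page and not on another.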


Instead of making it binary, you could use a gradient indicating the strength of the mutual correlation (like how HN colors downvoted comments).


The non-bold are dead accounts I think


It isn't due to a mere property of the user, as, for example, cushman is not bold as the #2 result for tptacek but is bold as the #2 result for icambron.


FYI, the GP said above that bold usernames are those for which symmetry holds (ie they're both in each other's top 20).


Good point.


The bias is interesting here.

https://stylometry.net/user?username=nickstinemates

Number 2 for me is someone I worked closely with for a few years, and then putting his name into this results in all of the people we worked with for a few years. So it seems content>style, or, we are all more alike than we thought.


I'd be very curious to know if these algorithms can link very different types of text. I'm not surprised that my style is "derivable" on HN, but what if you included my slash-fic pieces, my research papers, etc, would it still "catch" me?

Also, talk about a chilling effect. I was already vaguely aware of this, and now I'm overthinking every word I'm thinking/typing.


I'm gathering that they just took a bag-of-words approach to this; basically comparing word frequencies. Writing across content types (fiction vs technical writing for example) will probably show different word frequencies, especially technical jargon, and so on. More sophisticated approaches are possible.

And yes, potentially very chilling. If you want to post truly anonymously, you might want to run your words through some kind of filter first.


Oh god, that thing starts with direct focus on the search field, opening it showed a bunch of old nicknames, I thought it was the result of some study.


The top hit for me, though not a very high correlation (0.3 ish), is to my surprise someone I have met. I don't appear on their top 20 though.


Can we find Satoshi with this?



I interviewed years ago with someone who let me know that they use a pseudonym as an employee and their chosen name even got posted as the author for articles they wrote for the company. They were very concerned about their privacy.

I know their blog, which is their HN username, and this tool found their other account.

Perhaps ironically, this person stood out a lot because of this and I didn't forget them.


It's funny that I only match at 0.9999999999999982 with myself while all other username I tried matched with themselves at 1.0 ^^.
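
A likely explanation is floating point rounding. Assuming the score is something like a cosine similarity (my guess, not confirmed by the site), comparing a vector with itself computes dot / (sqrt(dot) * sqrt(dot)), and the rounded square roots do not always multiply back to exactly the same number.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vector = [0.1, 0.2, 0.3]
raw = cosine_similarity(vector, vector)
print(raw)            # exactly 1.0 or a hair off, depending on rounding
print(round(raw, 4))  # 1.0
```

Rounding to a few decimal places for display, as suggested elsewhere in the thread, would hide the noise.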



Huhu


Sticking myself in (I haven't ever had another account) my closest match (at 0.43) is the maintainer of an Open Source project which I have occasionally commented about. They are also British, as am I.

My guess is that as they commonly mention the project and I have on a number of occasions, that has formed the link. Plus maybe usage of common British terms, but that seems far less significant.

It's super interesting!

It would be good if there were more controls to filter the type of words and language that are used for the matching algorithm. So you could say exclude words not in the dictionary. I wonder how that would affect my link with this other person.


That’s why I always use throwaway :) everywhere. Reddit. HN. Twitter. Everywhere. I’ll spam every site with my throwaways.

Long live throwaways.


That’s the point of this post, that you are not safe by throwaways at all, because all of your throwaways can be linked together purely by your textual style.


No they can’t. If you only have a small amount of text to work with, stylometry is unreliable.


> This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.

Is this? I thought that it was ok to have throwaway accounts, as long as they're not specifically to avoid a ban or something like that.


I find this tool to be disturbing. It is reality so I accept it. But I'm going to make effort to change my style between accounts.

A question for the author (costco): You created that account in 2019 but you didn't post or submit a single thing until 4 hours ago. Why did you create an account almost 3 years ago for no purpose?


Alone out here in the 0.30s. the three times I've used a throwaway account, they've been for a single post on a single topic, so no surprise they did not get picked up by this analysis I guess.

Does a low correlation with other users imply higher susceptibility to de-anonymization if I were using alts regularly?


Probably. It means your writing is more unique, so an alt would be another "very unique" account that is similar only to yours.


There's someone (michaelmior if you're around!) with a false positive 0.46 match to me.

Maybe we could be friends :)


Not sure if that is a false positive. It just lists the top 20 accounts ranked by similarity score. Under 0.8 or so is unlikely to be a 'positive'.


This needs to exclude who’s hiring post because it confuses me with a few of my wonderful former colleagues!


Well the only solution is to have too many alts so that nobody can believe you can possibly have that many


Wow. This is insane, it found my old accounts. So throwaway obviously (because I'm a bit of an asshole) but this really is amazing. It also highlighted another account that's not me, but looking through their comments i don't see any resemblance to me either.


I've complained a lot about Haskell and now it thinks I like Haskell =(

Needs sentiment analysis IMO, otherwise you'll get "Here's a bunch of people who are JUST LIKE YOU", except they use a similar grammar style but hold opposite opinions on the same nouns.


It just thinks you engage a lot with Haskell. These are people with whom you have something to talk about. :)


Serves you right for disparaging The One True Language!

Ok, fine, we'll present Idris with a fig leaf.


I have two accounts. This one, “soneca”, that is my first one and most active by far, and another one that I use sometimes mostly for Show HN and few comments.

When I searched the other one, “soneca” was the first guess, with 0.4.

But when I searched “soneca”, the other one was not in the top 20.


Those interested in the implications of this kind of analysis might enjoy the book The Secret Life of Pronouns http://secretlifeofpronouns.com/


Thank you for this.. I thought I was being careful but evidently it's not enough. It found 13 of my previous accounts with the topmost being 0.4937 and lowest one being 0.3616 bold. All the bold ones were right, some correct matches weren’t bold.


Seems pretty spot-on to me. I tried it with two accounts I was already certain were alts - based on other factors like favorite topics and common enemies as well as style/tone - and the top hits for both were the ones I would have expected.


Very interesting, .59 is my lowest, .64 is my highest match, none of these accounts are one of my alts. Though to be fair the handful of times I've used a throwaway I used it for a single comment so I didn't give it much to go off.


Anything like this for Reddit?

Would translating to another language and back defend against this algorithm?


> Anything like this for Reddit?

No but it would be easily adaptable especially given that Pushshift is archiving every Reddit comment. Based on some of the feedback I'm getting here I don't know if I should open source this even though it really wasn't that hard to make.

> Would translating to other language and back defend against this algorithm?

Yes. But then you have to send your original comment to a translation company so there are privacy concerns there too.


> Based on some of the feedback I'm getting here I don't know if I should open source this even though it really wasn't that hard to make.

I'd say you should. I'd rather see this as being publicly and freely available to everyone rather than some shady "Big Tech" analytics company.

If the "weapons" exist, I would feel more comfortable knowing everyone can access them, not just an elite that can use it for their own (selfish) purposes.


I am genuinely torn, because my initial reaction was almost the exact opposite, but the comparison to a weapon does ring true. And there is indeed an argument to be made for level playing field. At the very least, maybe counter-measures can be developed.


People don't usually understand privacy risks till their own curtains fall down.


I wouldn't worry about that too much as someone's already done something similar for reddit (https://towardsdatascience.com/using-nlp-to-identify-reddito...), and has released their code publicly (https://github.com/jabraunlin/reddit-user-id)

Given the technique used, I don't see why something simple and local wouldn't defeat it? The "easiest" technique would be to use this weighting as a negative metric in rewriting.


> But then you have to send your original comment to a translation company so there are privacy concerns there too.

There are modern offline translation systems available such as Bergamot https://browser.mt/


Trailing (and probably leading, didn't check) spaces confuse the user lookup.


I wonder how much this can be improved if metadata is taken into account as well. Especially the distribution of common post dates and times modulo a week, which also exposes in which timezone somebody probably lives.
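
To make the "post times modulo a week" idea concrete, here is a minimal sketch. It assumes you already have each account's comment times as Unix timestamps; the function names are made up for illustration and nothing here is taken from the site's code. Each account becomes a 168-bin hour-of-week histogram, and two histograms are compared with cosine similarity.

```python
import math
from collections import Counter
from datetime import datetime, timezone

def hour_of_week_histogram(timestamps):
    """Return a normalized 168-bin histogram (7 days x 24 hours, UTC)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        # Monday 00:00 UTC is bin 0, Sunday 23:00 UTC is bin 167.
        counts[dt.weekday() * 24 + dt.hour] += 1
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(168)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A peak that sits in the same bins for two accounts hints at a shared timezone and routine, which is weak evidence by itself but could be combined with a word-based score.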


On one hand, thank you for showing us all how easy it is to make something like this. No doubt organizations with more resources already have more sophisticated systems in the same vein.

On the other hand, can we agree that this product is unethical?

In many cases, when a person uses an alt, it is a direct and strong signal that they do not wish their other posts to be associated.

So this product is circumventing the explicit will of the person, and making it available to anyone with zero effort i.e. there is no barrier to getting this info.

I met someone about 10 years ago who said they built this at a university. And their argument also was "actually this enhances privacy because it lets you know something something something". And yet their research grants were coming from one source only.

It can be used for good, but most often it won't.


<< On the other hand, can we agree that this product is unethical?

It does create a high level of discomfort, because it illustrates well what privacy advocates try talking about to the population at large, but all that said.. how is it any different from regular scraping and analyzing it any other way?

This is a real question.


It's different because you're removing all barriers to access and making it easy and convenient to stalk/dox people.

Imagine you get the urge to track someone, but in order to do that you have to spend a week writing some new software. That's a barrier. And because of it you may change your mind because it's a lot of work with little payoff.

But if that info is just one click away, it's a whole different ballgame.


> On the other hand, can we agree that this product is unethical?

No.


A fun exercise would be to find all accounts that suddenly stopped posting around today and correlate them with new accounts created around today.

All those scared folks who naively think that it's not too late yet. Busted.


502 Bad Gateway


Apologies for the downtime. It is up again and I'm looking into why uwsgi crashed.


The asymmetry is interesting. I have no alts but of course it nonetheless reported accounts similar to mine.

Then running the most similar account to mine did not put me in their top 20.
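
The asymmetry falls out of how "top N" lists work even when the score itself is symmetric. A small sketch with made-up accounts and made-up scores (nothing here comes from the site): `a` has `b` as its best match, but `b` has closer neighbours, so `a` drops out of `b`'s short list.

```python
def top_matches(sims, user, k):
    """Return the k accounts most similar to `user`, best first."""
    ranked = sorted(sims[user].items(), key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in ranked[:k]]

# Hypothetical symmetric similarity scores between four invented accounts.
sims = {
    "a": {"b": 0.60, "c": 0.40, "d": 0.30},
    "b": {"a": 0.60, "c": 0.70, "d": 0.65},
    "c": {"a": 0.40, "b": 0.70, "d": 0.50},
    "d": {"a": 0.30, "b": 0.65, "c": 0.50},
}

print(top_matches(sims, "a", 1))  # b is a's closest match
print(top_matches(sims, "b", 2))  # a is not among b's two closest
```

So an account in a sparse region of "style space" can point at someone in a crowded region without the reverse being true.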



Very cool.


I’m guessing that a small corpus for a given account doesn’t produce a very good score? I’ve done throwaways a couple times in the past and this has not “outed” them.


I've only had one account here. The highest match has a 0.624 score and the lowest a 0.572. I'm not sure if that means I'm unique or common but I'd like to know.


One way to get around this legitimately would be by posting a lot of quotes/lyrics/excerpts and the like thus fooling the algorithm unless it had a way to filter them out


This has been a great way to find people whose commentary I enjoy!


We knew this was possible and was coming, and it has probably been around for a few years. Fascinating from a technology perspective, terrifying from a long-term privacy perspective.


It's moments like this I'm proud to have my insanity on full display without obscurity. Was surprised to see a bunch of ~30% matches despite not having any alts.


My runner-up has a rating of 0.42378790667730715

C'mon guys, work harder. That's not even close! :-D

Btw, I myself am only at 0.9999999999999999 so I guess I need to work harder at being myself.


I tried it on a few user-ids that I strongly suspected were owned by the same person. My hunches stand corroborated. Not sure who is corroborating whom though, me or the script.

Good job.


Oddly, I am not an exact match to myself.

> Most likely candidates:

skymarshal: 0.9999999999999997

The other few usernames I tested (pg, dang, some random ones from this thread) all matched themselves at 1.0.


I had hard time to understand some comments made by my closest match. I guess this is good reality check. I need to learn how to write more legible posts now.


Sorry, what did you mean? :P


It didn't find my alt, but the second match is one of my twitter mutuals - I wonder if we've inadvertently borrowed style quirks from each other.


I wasn't aware this was even a thing! Scary stuff. 2 alts are listed but not with any great accuracy, so easy to dismiss. What an interesting topic.


Does anyone here have a reasonably wide variety of similarity ratings? I'd love to see the difference between a 0.2 and a 0.8 for the same account.


Interesting; I must have a fairly unique style as there are no matches over 0.40 for me.

I’m a native English speaker as well, so I’m unsure how to feel about that.


> I made this site mostly to show how easy this is and how it can erode online privacy

looks like it can indeed

> Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)

How surprising that someone might object to being included in a demonstration of the erosion of privacy!

Is the site opt-in or opt-out?


I doubt they asked 78k users for permission when there's no standardized way of reaching out if you're not a site admin. It's opt out if anything.


You opt into making your writing publicly available when making posts on this site. I’m not sure what Ycombinator’s user agreement* says about this, but it is pretty obvious that they haven’t done anything to prevent it (and it isn’t clear what they could do).

* and I mean the author of the tool is here making posts, so I guess they have agreed to the TOS, but clearly someone who hasn’t agreed to it could also make this tool and scrape out publicly available posts without agreeing to anything.


Is it weird that my rating is very low compared to alternative options? I have no alts, but I'm curious how similar others might write to me.


What is the threshold to be reasonably confident that two accounts are from the same individual?

I have only ever had one account here, and the closest match is at 0.47.


ive had maybe a hundred throwaway accounts on HN over the past ten years. generally, i make an account, say something that is apparently wildly offensive to someone else, get flagged and down-voted and then muted or hell-banned. then i make another account because i never did anything wrong and start the process over again. ive emailed the admins, tried to reason with the admins, it never does any good. the power is held by power-users who flag people -- most of the power of an admin at the end of the day but without any of the accountability. as long as they are following the mainstream dogma, its all good.

anyway, this app was able to identify a lot of my accounts. but a lot of the matches werent me. bold matches were almost all me. but i know there are many more matches than those that were listed. it mainly showed my most recent accounts.

i think most people would get a sick feeling in their stomach if they tried this app. i dont think people are prepared for a world where you can type someones name into an app like this and produce everything ever recorded online that was created by that person. not only this but everything highlighted and summarized to answer any question about that person. this is what advanced ai will bring us. an information implosion where the planet-sized ocean of data that is just floating all around us suddenly and violently coalesces into the objects of our new societal calculus. violent is a good word. and this is just the change that one can see coming with ai.


You are definitely right. Part of the reason I chose the 10,000 character minimum was so that people using throwaways in the true sense would be entirely excluded. I don't plan on keeping this up forever and I too would not feel comfortable if this was deployed at scale.


Would you be open to open sourcing the code when you decide to shutdown the service?


You really don't need advanced AI to do it. Just a bunch of scrapers and some run of the mill statistics. And guess what, it's been done by many companies already. They just don't care to create such a site.


you have no idea what im talking about. you dont realize how much data is out there. you dont comprehend how much smarter than you something can be.


pretty cool- i think there should be a term for two accounts that have each other as the top most similar account. kinda sad i dont have one :(


We’re pretty close me and you — closer than my actual alts


hello friend! but... id never use an m dash


Well… I would never use a lowercase word after an exclamation point!

…Because I’m on mobile


Stylotwins?


Make a fundraiser and start doing it for other sites.


It would be possible for Reddit because Pushshift.io archives all the comments there and Reddit is still pretty small. I'd probably need to make things a lot faster. Doing it on a specific subreddit would be very feasible. I'll think about it but I don't actually know if I really want to do that because for instance I've been banned from subreddits before but I don't want a ban from when I was 12 years old to follow me around forever because my writing style hasn't changed. Moderation is the most obvious application of this kind of software.


> I'll think about it but I don't actually know if I really want to do that because for instance I've been banned from subreddits before but I don't want a ban from when I was 12 years old to follow me around forever

Insightful that your personal experience and impact on you personally affects your decision. I invite you to think about the impact of the products you build in your CS career by putting yourself in the shoes of other people as well.

Some products should not be built, even though it's easy to build them.


What other easily-built products do you think should not exist?


Clicking on my top match (0.61) - I can see the similarity. I also note they quote the same way, with a > symbol. I wonder if that helps!


Inserting random Unicode blank, 1/4, 1/2, or zero space characters into your writing may help thwart it too, if you are paranoid


Would thwart this tool, presumably, but not anything which considered spacing ("do they use double space after a sentence?") and punctuation, etc., as markers.
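
It is also worth noting how cheaply a tool could undo that kind of perturbation before counting anything. A minimal sketch using only the standard library; `normalize_text` is a hypothetical helper, not something the site is known to run:

```python
import re
import unicodedata

def normalize_text(text):
    """Strip invisible format characters and fold exotic spaces to plain ones."""
    # Unicode category "Cf" covers zero-width space/joiner, BOM, soft hyphen, etc.
    visible = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # NFKC maps compatibility spaces (thin space, en quad, no-break space) to U+0020.
    folded = unicodedata.normalize("NFKC", visible)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", folded).strip()
```

Note that collapsing whitespace also throws away the double-space-after-period signal, so a tool that wanted spacing as a feature would measure it before this step.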


Huh, that’s how I signal my KGB handler…


Very cool! And really a shame that you’re not allowed to delete an old alt account or comments on HN! It follows you forever apparently.


All false positives for me - I want to reach out to the accounts that talk similar to me and see if we make good friends


Maybe this is a good tool to find new friends. :P


How do you protect yourself from impersonators?


So what are some good tools to obfuscate style?


It found my “alternate” account. If someone puts my username in, it’s not hard to figure out which alternate is mine.


No alt, and the highest match is 0.36

And that account's last several comments were flagged as dead.

I'm a native speaker, but my english succcccks.


Funny thing would be to find most unique user account stylistically.

Which user has lowest best match?

Mine is 0.58 so I'm really not that unique.


Fractionally more unique with a best match of 0.547.


would probably work better with case and punctuation preserving n-grams, sentence length, paragraph length and use of whitespace stats.

also maybe a tf-idf vector of top n words per user.

also could maybe do a same phrase analysis across the corpus to find some hand picked features.

timestamps could be interesting.

or, of course, let the machine do it with comment2vec.
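
For the tf-idf suggestion, a rough standard-library sketch, assuming each user's comments have been concatenated into one string. The names and the tiny tokenizer are illustrative assumptions. Note that tf-idf rewards rare words, which is a different signal from the common-word frequencies the site describes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; apostrophes are kept so "don't" stays one token."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs_by_user, top_n=5):
    """Map each user to their top_n words by tf-idf weight."""
    counts_by_user = {u: Counter(tokenize(t)) for u, t in docs_by_user.items()}
    n_users = len(counts_by_user)
    # Document frequency: in how many users' text does each word appear?
    doc_freq = Counter()
    for counts in counts_by_user.values():
        doc_freq.update(counts.keys())
    vectors = {}
    for user, counts in counts_by_user.items():
        total = sum(counts.values())
        weights = {
            word: (count / total) * math.log(n_users / doc_freq[word])
            for word, count in counts.items()
        }
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        vectors[user] = dict(top)
    return vectors
```

A word used by every user gets weight zero, so with a large corpus the vectors end up dominated by each user's pet vocabulary.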


I was curious to use this on myself to see if anyone writes like me. Closest was a .51 confidence, so I guess not?


This is cool!

If an account returns a high score for many accounts, does that also mean they’re relatively less original in style?


It puts almost all of my old accounts decently near the top, but my original account is almost comically low.


Cool! I wonder if it could be run backwards, to identify the users on hackernews with the most unique voices.


This is creepy.


I think the word you are looking for is uncanny


My alt accounts (not really, all below 0.5) seem to also be European or German Firefox users. Good for us ;)


Obviously the next thing to do is make this a popup on someone's account name when you hover over it.


This is super impressive!

Is there a common open source library (Python, JS, whatever) that implements something like this?


> imagine what a company with millions of dollars and a couple dozen PhD linguists could do.

Could they do much better?


How much writing do you need to analyze results? Would changing account every X sentences eliminate this?


Current minimum is 10000 characters. In my own tests accuracy was still pretty good at 3-5000 but I instituted the 10000 minimum to reduce false positives. Yes it would, if you read the advice page on avoiding detection that is one of the things I recommend. Unfortunately HN moderators do not really like that.
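
For anyone reproducing this, the minimum-length cutoff is a one-line filter. A sketch, assuming comments are already grouped per user; only the 10000 figure comes from the comment above, the names are invented:

```python
MIN_CHARS = 10_000  # cutoff mentioned above; the variable name is an assumption

def eligible_users(comments_by_user, min_chars=MIN_CHARS):
    """Return the users whose combined comment text reaches min_chars."""
    return {
        user
        for user, comments in comments_by_user.items()
        if sum(len(comment) for comment in comments) >= min_chars
    }
```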


I have no alts, but to those of you compared to me by this engine : "Hey, good lookin'!"


wow, this is way off on me, didn't find my alts and the bolded accounts on my list are from different countries, use language I'd never use (cusses) and I see I've downvoted some of them...

I'd love to have the experience and or apparent wealth my "alts" have


This is great.

One funny thing though, while your example says 1.0, for my own account it says 0.99lotsof9s4


I like the way some usernames are only 0.9999999 correlated with themselves.

Perhaps 6 or 7 digits is enough?


This found an old account that I forgot I even had but with a lot of false positives. Neat!


I have no alternate accounts, and all my matches are below 0.4 for whatever it’s worth.


Interesting, but it gave me 20 accounts, and I know that I only have this one.


Sorry for any misunderstanding, read https://news.ycombinator.com/item?id=33756725


Sounds like a nice tool to find friends. You locate people who might think like you.


Strip leading/trailing white space from the name if it says no match.


I would have expected to be a closer match to myself.

> uberduper: 0.9999999999999991


Well, one of the closest on my list is my twin, so there's that.


Love a little NLP project on a public dataset - thanks for sharing!


Would this work for Fernando Pessoa and all his heteronyms? :)


I’d like to request the author takes this offline please until the implications can be thought through.

This is breaking anonymity that people incorrectly thought would not be revealed.

For some it might be awkward, others it might be quite problematic.


I would agree with you but the genie is out of the bottle already. Nigh everyone can and could have reproduced these results, especially given that archive.org and similar things exist.

So, I don’t think it causes any new harm, if anything it gives you future risk aversion.


This is nothing new, e.g:

Analyzing stylistic similarity amongst authors

https://news.ycombinator.com/item?id=10050603

http://markallenthornton.com/blog/stylistic-similarity/

37 points by lingben on Aug 12, 2015


This is not complex and is a well known method that state actors have been using for quite a long time. Governments have FAR more advanced ways to track you than this, but it's good for people to realize it exists.


Found my phone account; I'm quite impressed, really !


Haha, you got me and my main account. That's spooky.


I'm tempted to use it to find likeminded friends :)


This could be a good idea for identifying bots.


Not sure GPT3, at least if prompted right, would have a clearly identifiable style. It could probably detect converted call centers in Russia or Cambodia where 50 employees post on 10000 accounts, though.


at what threshold does it consider an account an alt?


There is no threshold. This site does not make any call as to whether a user is an alt or not. It just gives the users with the most similar word choice and from there it is up to you to decide (is there a very specific detail that both accounts mention, do they post at similar times, etc). I will say bolded accounts are substantially more likely to be alts though. But obviously it is not guaranteed that every user has an alt.


Joke's on you, this is my one and only account.


Are short sentences better for anonymity?


Well, interesting. This is one of the reasons we have the GDPR. @costco, if I were to make a GDPR erasure request, would you service it?

And I'm no lawyer, but it seems like there's also an outside chance of a breach of section 171 here as well, which is a criminal offence committed by a person who reidentifies de-identified data.

Plus - the laws have extraterritoriality. Vanishingly unlikely that you'd actually be pursued for it, but it's worth bearing in mind when you munge people's personal data.


It's an EU law.


With extraterritoriality. And if identifying people in this way is covered (I'm not a lawyer, I'm not claiming it definitely is), then it's also possible that EU citizens using the tool are committing a criminal offence.

The law seems to only apply where the deidentification has been made by the data controller, but HN admins changing someone's username, for example, if they ever do, would count. A person then using the tool to match another non-anonymous username to that account would seem to be caught.

Important to stress how much of a technicality this is, but that sort of thing can be interesting sometimes.


Wow... that's shockingly effective


Welp, so much commenting for me then.


Site seems to have been down when you commented this. If you want to try again it is up again :)


What's a high correlation number?


Are you going to try it on Twitter?


Now I can find my HN doppelganger


heh, I looked up the top bold hit for my name and they really do sound a bit like me (:


writing from throwaway:

Holy shit, it works really, really good. It found all of my older accounts.


What algorithm is being used?


It's described here: https://stylometry.net/about


I changed my nickname so my employer can't find me here. I'm not amused by this.


If this basic implementation can catch you, I’d consider it a friendly reminder that changing your account name is not a very effective means of adding privacy.


New account, then translate your comments to Spanish and then back to English using Google translate.


The website is down...


Apologies for the downtime. It is up again and I'm looking into why uwsgi crashed.


Now do one for reddit


why is my username not exactly equal to 1? https://stylometry.net/user?username=julienreszka


Python/floating point rounding error. It doesn't mean anything.
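
To illustrate, here is a small sketch that assumes the score is a cosine similarity (a guess, the exact formula is not stated in this thread). The dot product and the product of the two norms are each rounded separately, so a vector compared with itself can land a hair away from 1.0.

```python
import math

def cosine(a, b):
    """Plain cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec = [0.1, 0.2, 0.3, 0.4, 0.7]  # arbitrary example values
self_score = cosine(vec, vec)

print(self_score)                     # 1.0 or very slightly off, depending on the values
print(math.isclose(self_score, 1.0))  # the comparison that actually matters
print(round(self_score, 6))           # rounding for display hides the noise
```

Rounding to six or so digits before display would make every account match itself at 1.0 without changing any ranking in practice.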


does it use the most used words or least used?


Possibility to hide user comments in profile should be optional.


didn't find a single one of my alts. nice


I obviously don't expect you to help me but do they have at least >10000 characters written and are you varying your writing style in any way?


Of the top ten accounts listed for my name two of them are me.


nice one. are you using gpt3 under the hood?


I'm not that smart - my site is basically just doing some calculations on word frequencies. You can read https://academic.oup.com/dsh/article-abstract/17/3/267/92927... and https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53... and https://news.ycombinator.com/item?id=33755898 for more information.
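
For readers who want a feel for "calculations on word frequencies", here is a sketch of one classic measure of that kind, Burrows' Delta: take the most frequent words in the whole corpus, z-score each word's relative frequency across texts, and average the absolute differences. Whether the site computes exactly this is an assumption; the function names, the tokenizer, and the use of population standard deviation are simplifications of mine.

```python
import re
import statistics
from collections import Counter

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def z_scores(texts, n_words=150):
    """Z-score the relative frequencies of the corpus's n_words most common words."""
    corpus = Counter()
    for text in texts.values():
        corpus.update(words(text))
    vocab = [w for w, _ in corpus.most_common(n_words)]
    freqs = {}
    for name, text in texts.items():
        tokens = words(text)
        counts = Counter(tokens)
        freqs[name] = [counts[w] / (len(tokens) or 1) for w in vocab]
    columns = list(zip(*freqs.values()))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 when a word's frequency never varies, to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return {
        name: [(f - m) / s for f, m, s in zip(row, means, stdevs)]
        for name, row in freqs.items()
    }

def delta(z, a, b):
    """Mean absolute z-score difference; lower means more similar."""
    return statistics.mean(abs(x - y) for x, y in zip(z[a], z[b]))
```

Usage would look like `z = z_scores(texts)` followed by `delta(z, "user1", "user2")`. Unlike the scores shown on the site, this one is a distance, so 0 means identical.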


As you mention on the site, you don't do punctuation. But I'm guessing there are some pretty good fingerprints like:

two spaces after a period

Whether someone uses an em-dash/single hyphen/double hyphens (which may correspond to house style they're used to)

Whether they use semi-colons

(Presumably harder) but consistent substitutions like loose for lose, break for brake, etc.

Use of accents
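
Most of the markers in the list above are cheap to count. A rough sketch; the feature names and regular expressions are illustrative assumptions, and the homophone substitutions (loose/lose) are left out because they need context to detect:

```python
import re

def punctuation_features(text):
    """Count a few punctuation and spacing habits in a block of text."""
    # Sentence boundaries: end punctuation, one or more spaces, then more text.
    boundaries = len(re.findall(r"[.!?] +\S", text))
    # Boundaries with exactly two spaces after the punctuation.
    double_spaced = len(re.findall(r"[.!?] {2}\S", text))
    return {
        "double_space_ratio": double_spaced / boundaries if boundaries else 0.0,
        "em_dashes": text.count("\u2014"),
        "double_hyphens": text.count("--"),
        "spaced_hyphens": text.count(" - "),
        "semicolons": text.count(";"),
        # Non-ASCII letters as a crude proxy for accent use.
        "non_ascii_letters": sum(1 for ch in text if ch.isalpha() and ord(ch) > 127),
    }
```

In practice the raw counts would be divided by text length so that prolific and occasional commenters are comparable.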


I manually determined there was an individual posing as two people (playing both the antagonist and the adversary) because they consistently misspelt certain words such as "definitely" as "defiantly".

Fingerprinting certain linguistic traits and mapping them to time zones, as well as confirming that there is a partial overlap in posts but never an exact one, worked exceedingly well. Someone can't easily maintain a fluent conversation with themself on two accounts, but they can get close, either through unnatural delays between sentences or by just never interacting with the "other" party at the same time.


Simplicity is the greatest form of sophistication! Great work!

One small nit from a user experience point of view..: it'd be easier on the eyes if you just truncated those cosine similarity scores (or whatever score you're using) after the, say, 5th digit. Showing the entire float is kinda messy to my eyes.


Don’t sell yourself short. Simplicity is smart. It’s astonishing how often the simplest thing turns out to be exponentially more effective than the so-called smart thing.

I can’t get over how phenomenal this is. Please put every one of your side project ideas into production!


I am curious whether it could pick GPT3 out of the crowd.


It's easy to write complicated systems; it takes a genius to make them simple.


cool and thanks for the clarification. i ask that mainly because of the request limit of openai, which is something that makes many scalable ideas unfeasible


we leave fingerprints everywhere


ColinWright is Dang?

Woah


totally on spot

my current and my old account


w


Wow... how !


[deleted]


Over in the D language forums, we welcome people who post under a pseudonym, and our policy is we won't allow attempts to unmask them.

This is to protect high profile users who are secretly enjoying programming in D rather than the language they are supposed to use.

And, of course, to protect users who feel they might be discriminated against if their background was known.


It's very important for those people to be aware of these style analysis attacks! Glad this post is raising awareness.


What's up with cluster of users like:

j_s,password4321,carolinew,colinwright,kuharich etc.

https://stylometry.net/user?username=j_s https://stylometry.net/user?username=carolinew https://stylometry.net/user?username=colinwright https://stylometry.net/user?username=password4321

Lowest match for j_s is 0.80 and all but one is black.


On a cursory glance it looks like a cluster of users that post links, especially with italicized quoted excerpts.


Most likely candidates:

    pg: 1.0
    montrose: 0.604073065373204
    mattmaroon: 0.5900372458160795
    natsu: 0.5519832271289953
    rauljara: 0.5418566694533273
    waterlesscloud: 0.5378996309342633
    damoncali: 0.5292014150349463
    gruseom: 0.5290151637991445
    kemiller2002: 0.5254174524920762
    jfengel: 0.5231938496089998
    jamesaguilar: 0.5229081613163672
    houseabsolute: 0.5219738531025365
    danssig: 0.5195368367601849
    austenallred: 0.519343009683366
    loewenskind: 0.5177030083877397
    baguasquirrel: 0.5153841099708854
    asdfasgasdgasdg: 0.5146704002447524
    aptwebapps: 0.5144149629369845
    allenbrunson: 0.512802806408646
    danielweber: 0.5123620795710832


[flagged]


Not to diminish one bit how you're feeling, but the bright side is: Today you know this is easily done (information you didn't have yesterday), that the creator had no intention of "outing" you specifically, and that you can take steps to obfuscate this specific aspect of your posts that connects your public alts.


If you want to ask HN to remove your data, send a message to hn@ycombinator.com.


[flagged]


Yes, sadly. In this case, it'd be an arsehole move, but good point.


Not today.

You fail, I win.


Nice. Just out of curiosity are you taking any countermeasures or varying your writing style across accounts in any way?


My second closest match was 0.35 but searching people where they have matches 0.5-0.75 I suspect that's mostly to do with number of posts leading to better statistics.


yeah I vary my writing styles. Much of the stuff I post through this account is controversial, to say the least. So I have to take "measures".



