Hacker News new | comments | show | ask | jobs | submit login
Anagram Scoring (plover.com)
378 points by panic 216 days ago | hide | past | web | 106 comments | favorite



Ahh... so much poetic inspiration in this post!

"Timesaving negativism" said the earringed grenadier, with dispensable piebaldness, "can be the intermediator for chronic endometritria!"

"Ah nay!" the cinematographer megachiropteran replied! "it is only with maths--integrals, triangles, that can cause such excitation, can intoxicate!"

But I, though their admirer, now married had to disagree: "it would be a series of masculine calumnies to give such clitoridean, directional advice to women! We cannot precipitate such peripatetic thoughts!"


True then, as I believe best anagrams transcend mere good manipulation of dictionary entry,

a letter rearrangement goes beyond the plain, ordinary substitutions of vain academic men.


In case anyone is wondering, the two lines above are anagrams of each other. It's pretty impressive, any hints on the method you use to come up with this?


Work on both sides at once. I knew the phrase had to contain the word "anagram", so I started with that. At first, it was "better anagram", and so it looked like this:

  better anagram lerrneet
  letter rearrangement ba
So then I think of a word that both uses up the excess letters and can reasonably fit into the phrase. Whatever letters in the word that are not in the stock for that side get added to the other side. Sometimes I add words specifically to up vowel count or supply low frequency letters needed for words that fit that half of the phrase.

At some point, I had an excess 's', so "better" got turned into "best", and I had to use up its excess "ter", which got stuck with an excess 'u' to make "true". I also had an excess 'n', so a "the" became "then". Those somewhat awkward words go in last, which is where that "True then, as" came from. Getting rid of the last letters on both sides is the hardest part, because there is rarely a perfect word that uses up all your letters, just like in Scrabble. Sometimes you just have to unravel a few words and try different ones in their place. In the end, I just ran out of time to work on it.

It's a bonus if I can get near synonyms on both sides, like "transcend" and "goes beyond"


You might like the anagrammatic sonnet "Washington Crossing the Delaware", in which every line is an anagram of the title and the story of the titular event is told: https://en.wikipedia.org/wiki/Washington_Crossing_the_Delawa...


In the same vein, I wrote a Twitter bot that finds and retweets pairs of tweets that form anagrams:

https://twitter.com/anagrammatweest

It's a clone of anagramatron, the original anagram Twitter bot that went inactive almost a year ago.

I use a combination of Damerau–Levenshtein distance, longest common sequence, length, number of English words, and number of different words to score and rank anagrams to figure out which ones to retweet.


Amazing how apt some of them seem, it looks human-curated.

This one feels so right -> These hoes so trifling

no idea...lost -> desolation

the skeleton war -> take shelter now


It took me a while to realize which tweets go together, particularly since Twitter shows only the original tweet time, not when they were retweeted. Maybe you could retweet them as links?


They're also posted to tumblr, which can make it easier to see the pairs:

http://anagrammatweest.tumblr.com/


LCS = Longest Common Subsequence. In case someone eants to look it up.


Super fun!

The list of examples in the 7-11 range seemed to have a high number of pairs that have strangely coincidental meanings somehow. I thought maybe that was because they are hand-picked, but looking at the complete list, I honestly see more of those than I'd expect. I'm not sure why it's surprising, there's no actual pattern in a pair, and it's not surprising that lots of pairs of words have some way to connect them. But it still feels surprising when looking at them. It's like a lot of high scoring pairs of words share some kind of root, even though high scoring should mean high dissimilarity.

Anyway, great post. Weekend coding projects looking for anagrams and palindromes and boggle solutions are often rewarding for me, above and beyond just being fun.


I bet that the "strangely coincidental meanings" are your human brain doing pattern-matching, rather than any etymological similarities. Which I think is why the anagrams are so fun in the first place!


Words with similar letter distributions will tend to be from the same root languages, which frequently maps to category of discourse (scientific, sophisticated, coarse, exotic, etc.)


Totally, I think you're right. I suspect there are lots of infrequent and different ways to associate two words, and my brain conflates them all into one concept. It would be surprising if a random dissimilar anagram was frequently a pair of synonyms, but it might not be surprising that a pair has some association of some kind, where there are tens or hundreds of different ways to associate words.

Probably not a perfect analogy, but this is vaguely reminding me of Ramsey's theorem from combinatorics; the idea that if you have enough stuff, some of it must be structured.


Love this guy's blog. His article about Your Age as a Fraction [0] caught my eye a couple years ago and I've been hooked ever since. I love checking the script to see how far into my year I am.

[0]: http://blog.plover.com/math/age-fraction-2.html


Here is a program I wrote a while ago that is vaguely related to that:

https://gist.github.com/jedberg/61b2587b298af92adc5f985b11a2...

Basically, if you run "watch ./remaining.py" on your command line, you can see a running countdown of how long is left in the day, month, year, and your life.


This is a fun read, I've always enjoyed anagrams and palindromes. The "best anagram" was cinematographer megachiropteran.

However, it comes across as a little "smarter than thou" with the section:

"The thing you do not want to do is to compute every permutation of the letters of each word, looking for permutations that appear in the word list. That is akin to sorting a list by computing every permutation of the list and looking for the one that is sorted. I wouldn't have mentioned this, but someone on StackExchange actually asked this question."

That statement makes sense to me when I think about it, but it wouldn't have appeared as totally obvious to me from the outset. And I can't see casting derision at someone because they asked this question.


I think that paragraph perfectly demonstrates why all the commenters on Why I Don’t Talk to Google Recruiters (https://news.ycombinator.com/item?id=13696004) who insist that programmers don't need to know algorithms are wrong. It should be immediately obvious to anyone with a cursory understanding of algorithms that computing all the permutations of every word is much more expensive than necessary, and from there it's not all that hard to figure out that sorting the letters produces the desired effect.


Is it though? There's only 230k words in that list, and most words aren't that long. Building a trie and then trying all the permutations of all of the words seems like it would finish in a reasonable amount of time. Even then O(n^2) approach of comparing pairwise would probably get you to a reasonable time. Especially since this is a one-time thing and doesn't get run over and over.

And your argument is kind of silly when in the very same article when he uses brute force to see how many segments. Shouldn't he have found the most optimal algorithm for that instead of brute forcing it?


Sorting the words is significantly easier in virtually every programming language than testing all of the permutations. So the "it should finish in a reasonable amount of time" isn't a good excuse, since you're doing more work than necessary in order to implement the suboptimal solution.


The observation that two words are anagrams if they contain all the same letters isn't a huge leap from two words that contain all the same letters are anagrams.

That being said: Sorting isn't required. (An O(kn) solution exists to beat your O(nk log k))


It's not entirely clear that the O(nk) solution actually wins because k is small in this case, and so log(k) is very small. Unless you do some careful bit packing, the O(nk) algorithm will use more memory, which could make it slower because of increased cache misses.


The O(nk) solution has fewer branches (O(1) versus O(log k)), and even if you use a naïve implementation, O(nk) will still fit into L1 (assuming n is bigger than k).


> O(nk) will still fit into L1

Not even close. The lexicon is 230k words, times 26 letters is nearly 6MB. The i7 has a 64kB L1 cache and a 246kB L2 cache.

I think the only way to resolve this would be to actually do the experiment. I wouldn't bet my life savings on the outcome either way, but I'll give you even odds at low stakes that sorting is faster.


Yeah, there are other solutions, such as allocating an array of 26 letters (ignoring accents) and using that to count, but sorting is easy and is generally fast enough.


Sorting is O(kn) where k is the largest number. (See radix sorting)


It also requires O(k) memory which is certainly larger than L1, causing two stalls instead of one.


In this case, k is 26. Or ASCII value of 'z'. Either way, very cheap.


Map a-z to the first 26 primes, multiply and then quicksort your dictionary.


Limited to around 11 character words/phrases. You can do better.


yeah, there's no reason to sort, you can just hash. but I'm not sure I can improve the asymptotic complexity of a comparison


It's probably obvious enough once you've buried your head in this train of thought continuously for weeks. Thank god we don't have to do that because this helpful person has just told us.


It did not come across that way to me at all. I thought it was a great read through and through.


If it's not obvious, it's probably because you've slightly misunderstood the problem somewhere. You have a finite list of items, and you want to create matches between them based on how many of each letter they have. Now it's simple, right?


My intermediate steps were: "word equivalence classes" and "[flawed] fundamental theorem of arithmetic^Wspelling", and then you have it.


First time I found about this same approach to discover anagrams was in John Bentley's Programming Pearls, 1st Edition (which I believe came out in 1986, and was in itself a collection of columns from previous years).


So the author ends up scoring anagrams by the minimal number of 'chunks' such that one word can be rearranged into the other, and points out that this maps to the NP-complete independent set problem.

Another way you can do it is to score each pair by the length of the longest shared substring, which can be done in linear time with a suffix trie or quadratic with dynamic programming. Under that metric the 15-length word is no longer 'best', you'd end up picking the longest pair that fully permutes the other word.


Sorting by edit distance might be interesting, too.


I also immediately thought of edit distance as the obvious metric, and it has a reasonably simple and efficient algorithm too:

https://en.wikipedia.org/wiki/Levenshtein_distance

But then i realised that it was only obvious to me because i spent years hunched over a computer crunching DNA sequences.

I'd also be interested in normalising edit distance by length somehow. It's no coincidence that higher-scoring words tend to be longer.


I was curious, so I rescored his list using Levenshtein. No normalization for long words though:

https://gist.github.com/anonymous/431b163b2a2d532bfd0a3bdcc7...

I sorted it reverse so that the interesting pairs would be on top.

Compare to his original scored list, here: http://pic.blog.plover.com/lang/anagram-scoring/anagrams-sco...


Thanks for doing the legwork!

Sadly, it turns out our genius idea doesn't work so well - the infamous cholecystoduodenostomy/duodenocholecystostomy is in joint fourth place, and the scarcely any better duodenopancreatectomy/pancreatoduodenectomy is joint second.

mjd's key insight was that good anagrams are ones where all the letters are thoroughly scrambled; both of the cases above are ones where large blocks of letters remain intact. I don't know if you've had scrambled eggs where large lumps of white or yolk remain intact, but i have, and they're vile.

So, anyway, perhaps a better edit distance metric is one which allows cut-and-paste of whole blocks as an edit, since it would give those words much lower scores. My years of DNA-hunching immediately led me to think of sequence alignment algorithms, because that kind of cut-and-paste is exactly what happens in DNA:

https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorit...

https://en.wikipedia.org/wiki/BLAST

But i'm not sure those algorithms are much good, because they model the editing as deleting, inserting, and modifying runs of letters, rather than moving them around.

Next, i thought of compression, because that often works by finding repeated runs of characters, and also i don't really know any other areas of string algorithms. What if you concatenated two strings, compressed them using LZ77, and measured the amount of compression? Or what if you did it with something simpler and more local, like Predictor [1]? Could you use the Burrows–Wheeler transform?

[1] https://tools.ietf.org/html/rfc1978 - your daily obscure blast from the past!


I had the best luck with "Anagrams by percent of string not part of longest common substring, then length"

The top-to-bottom ranking isn't quite right...need some better way to push down short words, but it does seem to put most of the good stuff near the top.

https://gist.github.com/anonymous/ec28b75fb6c841b56def8eefec...


I wonder how kendall-tau distance (effectively, the number of bubble-sort swaps required) would work as a scoring metric.


I was always fond of "presbyterians" "britneyspears".


Not to get political, but I thought as far as name anagrams go

Rex Tillerson -> Risen Ex Troll

is pretty amazing. Not sure how it would score with the OP's metric though - typing this as I'm on the ski lift lol


From http://wordsmith.org/anagram/hof.html

Clint Eastwood = Old West Action

Madam Curie = Radium came

It's good fun to go find anagrams of friends and family. One of my nephew's comes out Alpaca Tits. He's not real happy about it but the rest of us think it's pretty good.


alec guiness = genuine class

not original , but certainly on topic.


Jeremy Irons -> Jeremy's iron

(courtesy of Lisa Simpson)


Too many E's. Drop the middle initial.


Ah sorry knew that looked off


Shitposting -> Top insights


More interesting is that you made a novelty account just to post this.


His blog title is an anagram of "code overuse ruins fetish".


"Fetish overuse ruins code"?


Winner of most metal anagram goes to "incorporate procreation"


Interesting article! The metric you use does eliminate triviality, but it sometimes uses very obscure (and arguably uninteresting words), such as calumnies, ivoriness, coprophagist, etc. That's what you describe as Webster's Second jargon that nobody knows".

It would be interesting if you could adapt your metric to account for general prevalence of the word in English. Scan a giant subsection of say Wikipedia, and assign a frequency to each of the 234,000 words in a map, giving unseen words an infinitely small frequency, and then use the sum or multiple of the frequencies of each of the anagrams to bring out some truly interesting ones!


I would strongly argue that coprophagist is a very interesting word, and should be less obscure. But then again, that might just be my juvenile sense of humor.


Calumnies and coprophagist aren't particularly obscure. I've come across both.

I think you have to discriminate between slightly obscure or archaic words that anyone familiar with a reasonable range of the literary canon would know, and truly uninteresting words that even a highly educated and well-read person wouldn't know.

There are better corpuses than Wikipedia that could be used for this purpose, like the British National Corpus

http://www.natcorp.ox.ac.uk/


Fun stuff - especially debugging the time printout ;) A really ancient but still functioning site I always go to is wordsmith.org. He calls it "Internet anagram server" == "I rearrangement servant" !


Why didn't he just calculate the Levenshtein distance instead of coming up with that custom difference calculation method?


The Levenshtein distance from duodenocholecystostomy to cholecystoduodenostomy is actually pretty high: 14. Transposing two contiguous chunks of letters has a high edit distance if both chunks are long.

A "good" anagram doesn't just have letters being moved a long distance, but also requires that the movements of letters be independent, something that Levenshtein distance doesn't measure.


The article gives an example, but to make it a bit more explicit: Levenshtein distance is a poor measure for the intuitive notion of how "non-trivial" an anagram is. Moving a chunk of letters together results in a high edit distance, but low complexity.

If you rank the entire list of anagrams by edit distance, the highest pair is "anatomicophysiologic" and "physiologicoanatomic", with an edit distance of 16 but a chunk score of only 3.


Because his distance measure is anagram specific and thus more appropriate than a general one.

(Technically speaking, a proper distance measure requires a distance of 0 for equal strings, so perhaps he should pick the number of splitting points rather than the number of resulting fragments.)


I wrote a javascript game based on anagrams as part of a 'get to know your team' exercise at a previous job. It was a weekend project.

The game asks the user to guess the name of a team member from anagrams of that name displayed for short time periods on screen.

Click the 'automate with answers' button for a nice animation of rearranging the letters.

http://tomcdonnell.net/submodules/anagram_game

Here is the tool I used to find the anagrams:

http://tomcdonnell.net/submodules/anagram_finder


Those of those who are word freaks already knew about MEGACHIROPTERAN CINEMATOGRAPHER, as this pair is mentioned in the best seller Word Freak from ~15 years ago :)

Some of my favorite pairs are descriptive ones, like GULLIBLE BLUEGILL and HAPPIEST EPITAPHS.


http://www.anagrammy.com/anagrams/faq6.html

18 letters

    conversationalists = conservationalists
17 letters

    basiparachromatin = marsipobranchiata
    Charles Holding, 1971 
16 letters

    thermonastically = hematocrystallin
    John Edward Ogden, 1978 
15 letters

    nonuniversalist = involuntariness
    refragmentation = antiferromagnet
    megachiropteran = cinematographer
    centauromachias = marchantiaceous
    Eric Albert, c. 1986


Of course, the key idea of the linked article is that it's easy to find long anagrams, but it's hard to find interesting ones. The anagram "megachiropteran = cinematographer" is ranked higher than "conversationalists = conservationalists" because of the more complex rearrangement.


I just experimented with building n-grams for each anagram candidate and calculating the dissimilarity of pairs within each group, then multiplying this score by the length of the word.

The results look pretty decent, but I get "BASIPARACHROMATIN" vs "MARSIPOBRANCHIATA" as number 1.

https://gist.github.com/aclemmensen/725f60a0746382521faefa0d...


That's a lot of fun. I'd like to see someone go further with some of the suggestions the author mentions at the end, like weighting the scores by word familiarity and word length.


I did something like that for multiword anagrams: scoring them by their probability in a word-bigram model of English. It's expensive because I had it pick the best permutation of the words, by brute force. It's far from a measure that could really judge "this is a great anagram", but there are so astronomically many boring ones to filter out that it's still a big help.


Kind of related: The anagrams problem is probably my favourite interview question to dole out. Always shows completely different ways of thinking and normalising things.


I think the "best" one would be the word(s) with the most letters that are both common, and also somehow related.


"nonuniversalist" and "involuntariness", listed in another comment, both have a theological aspect. While not terribly common, they're not hard to understand either.


A scoring rule I'd like to try maximizes the product of the lengths of the chunks / length of the word. This makes individual letters shuffling less interesting and many medium chunks most interesting, which matches my intuitions.


He makes a good point that the "scoring function" for what we think of as a good anagram is actually the more important.

In general just generating the solution space is actually much easier than coming up with a satisfying scoring function.


If you like anagrams and a bit of a challenge, you should check out https://twitter.com/anagramdaily


They've taken a lot of liberties with the "Daily" in "Anagram Daily." Fifteen days between today's anagram and the last one!

Plus one that is marked easy: MYTH. Maybe I'm just an idiot, but I cannot think of an anagram for MYTH.


My /usr/share/dict/words does not have an anagram for "myth" in its 99171 words, which is already larger than almost everybody's vocabulary.


SCOWL[0] has 675491 words. There's no anagram of "myth" there either.

[0] http://wordlist.aspell.net/


MILK is also listed as easy, and I'm unable to find an anagram for it.


for a real workout, check out https://aerolith.org/

it's extremely fun, though it's scrabble-focused, so you'll see a lot of weird words that you are expected just to have memorised.


Thanks! This is my site :) At some point I'll take the time to make it more beginner-friendly and have a better landing page/etc. Any suggestions welcome :)


bring back the common words challenge :)


You could also use their Levenshtein distances.


See this comment (https://news.ycombinator.com/item?id=13698252) by lmkg for an explanation of why Levenshtein distance isn't good.


Here's the list, sorted with Levenshtein: https://gist.github.com/anonymous/431b163b2a2d532bfd0a3bdcc7...

Probably not the best approach, but it did unearth some interesting pairs.


I'm going to use his "coprophagist topographics" pair the next time I write a post about readers of fake news.


computers have taken a bit of the fun out of searching for interesting anagrams :) my favourite one i found "by hand" (mostly a mental tic where i randomly anagram words i encounter) was {antipyretic, pertinacity}, followed by the fact that cremate, cremating and cremation all have nontrivial anagrams.


I really enjoyed this.

Nice solution to put the (sorted by letters) words into a hash table to find the anagrams.


Wondering if there is class of poetry where the rhyme is via palindrome rather than phonetic.


unrelated but reminded me of another non-rhyme-based form of poetry, https://en.wikipedia.org/wiki/Ci_(poetry)


senorita = not a word. That would be "señorita", with an "ñ". It doesn't really work as an anagram for "notaries" because "ñotaries" is also definitely not a word.


SENORITA is acceptable in the English-language Scrabble dictionary. Its other anagram is NOTARISE.


Huh, that's extremely surprising! TIL.


I guess that really depends on whether you define ñ and n as two seperate letters or the same letter with an accent of some kind


Interesting. It doesn't "feel" like any other accented letter. If I read, say, "despues" instead of "después", I'll know what it means, and I'll think "oh, forgot an accent". I can't think of any words in Spanish where the presence of absence of an accent will make a lot of difference.

Ñ, on the other hand, does. Substituting a N for a Ñ usually doesn't work, and it feels like a typo, not like a forgotten accent. There are also many words where the only difference is a N or a Ñ which have significantly different meaning (off the top of my head: año/ano, moño/mono).


I agree with you, and I was worried about this point myself while I decided to include the example in the article. In Spanish “notaries” is certainly not an anagram of “señorita”. I eventually decided along the lines of cdelsolar’s comment above: in English, it is also correct to spell “señorita” without the tilde.

I think the reason that ñ feels different to you than é is that while the é has never been more than a variation on an ‘e’, the ñ is actually an abbreviation for ‘nn’. For example, “año” is derived from Latin “anno” and “cañon” was originally from Latin “canna”. So I think the correct way to handle ñ may be to treat it like nn. (An analogous strategy for German, which people might find less surprising, would be to treat ‘ö’ and ‘ü’ as if they were ‘oe’ and ‘ue’; in older times in English, it was considered correct to equate “w” with “uu” in anagrams.)

On this plan, I find (in English)

      señoritas nestorian
but unfortunately nothing in Spanish. (There might be one though; I did not try hard and my Spanish data is of very poor quality.)

Also unfortunately, “señorita” is an exception, and was _never_ spelled with a double “n”. So for this example, the “nn” equivalence is less defensible.

Thank you for bringing up this point.


Thanks for the thorough explanation! I had never made the connection between "ñ" and "nn".

Still unsure about "in English, it is also correct to spell “señorita” without the tilde" - it doesn't seem to be an English word at all, can't find many references online, Google Translate corrects it to "señorita", neither Cambridge or Oxford dictionaries have it...


No triplets or quadruplets?


would be fun to measure the word/context similarity for each pair and order by that..


I gave that a shot using Wordnet. It's flawed for a few reasons. A lot of the more unusual words in his list aren't in Wordnet. I also had the stem the words first, and had to pick just the most used sense if the word had multiple senses.

So, the list is only those pairs that made it through the processing.

https://gist.github.com/anonymous/e69f448fa08481603d1ca77ea7...

Would have been better if I had also weighted using his approach, as pairs like "theater theatre" end up on top.

Still interesting though. It found these, which I liked:

  0.33  swinger wingers
  0.33  parrot raptor
  0.33  borsht broths


I also tried a variation based on "Anagrams by Percent of String Not Part of Common Substring, then Length"

It's flawed too, but does put some good ones at the top.

https://gist.github.com/anonymous/ec28b75fb6c841b56def8eefec...


it's not "flirting geisha abandons the menu"?


predators = teardrops


tldr: cinematographer megachiropteran (the latter is a giant bat)


For anagram scoring, why not just determine the overlap between both words in terms of shared bigrams (i.e., two-letter substrings), relative to their length? Has anyone tried that? Is it equivalent to what mjd did?




Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: