
Anagram Scoring - panic
http://blog.plover.com/lang/anagram-scoring.html
======
michaelleland
Ahh... so much poetic inspiration in this post!

"Timesaving negativism" said the earringed grenadier, with dispensable
piebaldness, "can be the intermediator for chronic endometritria!"

"Ah nay!" the cinematographer megachiropteran replied! "it is only with maths
--integrals, triangles, that can cause such excitation, can intoxicate!"

But I, though their admirer, now married had to disagree: "it would be a
series of masculine calumnies to give such clitoridean, directional advice to
women! We cannot precipitate such peripatetic thoughts!"

~~~
logfromblammo
True then, as I believe best anagrams transcend mere good manipulation of
dictionary entry,

a letter rearrangement goes beyond the plain, ordinary substitutions of vain
academic men.

~~~
pierrec
In case anyone is wondering, the two lines above are anagrams of each other.
It's pretty impressive, any hints on the method you use to come up with this?

~~~
logfromblammo
Work on both sides at once. I knew the phrase had to contain the word
"anagram", so I started with that. At first, it was "better anagram", and so
it looked like this:

    
    
      better anagram lerrneet
      letter rearrangement ba
    

So then I think of a word that both uses up the excess letters and can
reasonably fit into the phrase. Whatever letters in the word that are not in
the stock for that side get added to the other side. Sometimes I add words
specifically to up vowel count or supply low frequency letters needed for
words that fit that half of the phrase.

At some point, I had an excess 's', so "better" got turned into "best", and I
had to use up its excess "ter", which got stuck with an excess 'u' to make
"true". I also had an excess 'n', so a "the" became "then". Those somewhat
awkward words go in last, which is where that "True then, as" came from.
Getting rid of the last letters on both sides is the hardest part, because
there is rarely a perfect word that uses up all your letters, just like in
Scrabble. Sometimes you just have to unravel a few words and try different
ones in their place. In the end, I just ran out of time to work on it.

It's a bonus if I can get near synonyms on both sides, like "transcend" and
"goes beyond"

~~~
madcaptenor
You might like the anagrammatic sonnet "Washington Crossing the Delaware", in
which every line is an anagram of the title and the story of the titular event
is told:
[https://en.wikipedia.org/wiki/Washington_Crossing_the_Delawa...](https://en.wikipedia.org/wiki/Washington_Crossing_the_Delaware_\(sonnet\))

------
briandrupieski
In the same vein, I wrote a Twitter bot that finds and retweets pairs of
tweets that form anagrams:

[https://twitter.com/anagrammatweest](https://twitter.com/anagrammatweest)

It's a clone of anagramatron, the original anagram Twitter bot that went
inactive almost a year ago.

I use a combination of Damerau–Levenshtein distance, longest common sequence,
length, number of English words, and number of different words to score and
rank anagrams to figure out which ones to retweet.

~~~
pimlottc
It took me a while to realize which tweets go together, particularly since
Twitter shows only the original tweet time, not when they were retweeted.
Maybe you could retweet them as links?

~~~
briandrupieski
They're also posted to tumblr, which can make it easier to see the pairs:

[http://anagrammatweest.tumblr.com/](http://anagrammatweest.tumblr.com/)

------
dahart
Super fun!

The list of examples in the 7-11 range seemed to have a high number of pairs
that have strangely coincidental meanings somehow. I thought maybe that was
because they are hand-picked, but looking at the complete list, I honestly see
more of those than I'd expect. I'm not sure why it's surprising, there's no
actual pattern in a pair, and it's not surprising that lots of pairs of words
have _some_ way to connect them. But it still feels surprising when looking at
them. It's like a lot of high scoring pairs of words share some kind of root,
even though high scoring should mean high dissimilarity.

Anyway, great post. Weekend coding projects looking for anagrams and
palindromes and boggle solutions are often rewarding for me, above and beyond
just being fun.

~~~
CGamesPlay
I bet that the "strangely coincidental meanings" are your human brain doing
pattern-matching, rather than any etymological similarities. Which I think is
why the anagrams are so fun in the first place!

~~~
JCzynski
Words with similar letter distributions will tend to be from the same root
languages, which frequently maps to category of discourse (scientific,
sophisticated, coarse, exotic, etc.)

------
NoahTheDuke
Love this guy's blog. His article about Your Age as a Fraction [0] caught my
eye a couple years ago and I've been hooked ever since. I love checking the
script to see how far into my year I am.

[0]: [http://blog.plover.com/math/age-
fraction-2.html](http://blog.plover.com/math/age-fraction-2.html)

~~~
jedberg
Here is a program I wrote a while ago that is vaguely related to that:

[https://gist.github.com/jedberg/61b2587b298af92adc5f985b11a2...](https://gist.github.com/jedberg/61b2587b298af92adc5f985b11a20642)

Basically, if you run "watch ./remaining.py" on your command line, you can see
a running countdown of how long is left in the day, month, year, and your
life.

------
grepthisab
This is a fun read, I've always enjoyed anagrams and palindromes. The "best
anagram" was cinematographer megachiropteran.

However, it comes across as a little "smarter than thou" with the section:

"The thing you do not want to do is to compute every permutation of the
letters of each word, looking for permutations that appear in the word list.
That is akin to sorting a list by computing every permutation of the list and
looking for the one that is sorted. I wouldn't have mentioned this, but
someone on StackExchange actually asked this question."

That statement makes sense to me when I think about it, but it wouldn't have
appeared as totally obvious to me from the outset. And I can't see casting
derision at someone because they asked this question.

~~~
eridius
I think that paragraph perfectly demonstrates why all the commenters on Why I
Don’t Talk to Google Recruiters
([https://news.ycombinator.com/item?id=13696004](https://news.ycombinator.com/item?id=13696004))
who insist that programmers don't need to know algorithms are wrong. It should
be immediately obvious to anyone with a cursory understanding of algorithms
that computing all the permutations of every word is much more expensive than
necessary, and from there it's not all that hard to figure out that sorting
the letters produces the desired effect.

~~~
geocar
The observation that two words are anagrams if they contain all the same
letters isn't a huge leap from two words that contain all the same letters are
anagrams.

That being said: Sorting isn't required. (An O(kn) solution exists to beat
your O(nk log k))

~~~
lisper
It's not entirely clear that the O(nk) solution actually wins because k is
small in this case, and so log(k) is _very_ small. Unless you do some careful
bit packing, the O(nk) algorithm will use more memory, which could make it
slower because of increased cache misses.

~~~
geocar
The O(nk) solution has fewer branches (O(1) versus O(log k)), and even if you
use a naïve implementation, O(nk) will still fit into L1 (assuming n is bigger
than k).

~~~
lisper
> O(nk) will still fit into L1

Not even close. The lexicon is 230k words, times 26 letters is nearly 6MB. The
i7 has a 64kB L1 cache and a 246kB L2 cache.

I think the only way to resolve this would be to actually do the experiment. I
wouldn't bet my life savings on the outcome either way, but I'll give you even
odds at low stakes that sorting is faster.

------
smallnamespace
So the author ends up scoring anagrams by the minimal number of 'chunks' such
that one word can be rearranged into the other, and points out that this maps
to the NP-complete independent set problem.

Another way you can do it is to score each pair by the length of the longest
shared substring, which can be done in linear time with a suffix trie or
quadratic with dynamic programming. Under that metric the 15-length word is no
longer 'best', you'd end up picking the longest pair that fully permutes the
other word.

~~~
castratikron
Sorting by edit distance might be interesting, too.

~~~
twic
I also immediately thought of edit distance as the obvious metric, and it has
a reasonably simple and efficient algorithm too:

[https://en.wikipedia.org/wiki/Levenshtein_distance](https://en.wikipedia.org/wiki/Levenshtein_distance)

But then i realised that it was only obvious to me because i spent years
hunched over a computer crunching DNA sequences.

I'd also be interested in normalising edit distance by length somehow. It's no
coincidence that higher-scoring words tend to be longer.

~~~
tyingq
I was curious, so I rescored his list using Levenshtein. No normalization for
long words though:

[https://gist.github.com/anonymous/431b163b2a2d532bfd0a3bdcc7...](https://gist.github.com/anonymous/431b163b2a2d532bfd0a3bdcc711f7b3)

I sorted it reverse so that the interesting pairs would be on top.

Compare to his original scored list, here:
[http://pic.blog.plover.com/lang/anagram-scoring/anagrams-
sco...](http://pic.blog.plover.com/lang/anagram-scoring/anagrams-scored.txt)

~~~
twic
Thanks for doing the legwork!

Sadly, it turns out our genius idea doesn't work so well - the infamous
cholecystoduodenostomy/duodenocholecystostomy is in joint fourth place, and
the scarcely any better duodenopancreatectomy/pancreatoduodenectomy is joint
second.

mjd's key insight was that good anagrams are ones where all the letters are
thoroughly scrambled; both of the cases above are ones where large blocks of
letters remain intact. I don't know if you've had scrambled eggs where large
lumps of white or yolk remain intact, but i have, and they're vile.

So, anyway, perhaps a better edit distance metric is one which allows cut-and-
paste of whole blocks as an edit, since it would give those words much lower
scores. My years of DNA-hunching immediately led me to think of sequence
alignment algorithms, because that kind of cut-and-paste is exactly what
happens in DNA:

[https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorit...](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)

[https://en.wikipedia.org/wiki/BLAST](https://en.wikipedia.org/wiki/BLAST)

But i'm not sure those algorithms are much good, because they model the
editing as deleting, inserting, and modifying runs of letters, rather than
moving them around.

Next, i thought of compression, because that often works by finding repeated
runs of characters, and also i don't really know any other areas of string
algorithms. What if you concatenated two strings, compressed them using LZ77,
and measured the amount of compression? Or what if you did it with something
simpler and more local, like Predictor [1]? Could you use the Burrows–Wheeler
transform?

[1] [https://tools.ietf.org/html/rfc1978](https://tools.ietf.org/html/rfc1978)
\- your daily obscure blast from the past!

~~~
tyingq
I had the best luck with "Anagrams by percent of string not part of longest
common substring, then length"

The top-to-bottom ranking isn't quite right...need some better way to push
down short words, but it does seem to put most of the good stuff near the top.

[https://gist.github.com/anonymous/ec28b75fb6c841b56def8eefec...](https://gist.github.com/anonymous/ec28b75fb6c841b56def8eefec813656)

------
siegelzero
I was always fond of "presbyterians" "britneyspears".

------
stephancoral
Not to get political, but I thought as far as name anagrams go

Rex Tillerson -> Risen Ex Troll

is pretty amazing. Not sure how it would score with the OP's metric though -
typing this as I'm on the ski lift lol

~~~
te_platt
From
[http://wordsmith.org/anagram/hof.html](http://wordsmith.org/anagram/hof.html)

Clint Eastwood = Old West Action

Madam Curie = Radium came

It's good fun to go find anagrams of friends and family. One of my nephew's
comes out Alpaca Tits. He's not real happy about it but the rest of us think
it's pretty good.

~~~
knodi123
alec guiness = genuine class

not original , but certainly on topic.

------
funposting
Shitposting -> Top insights

~~~
jrockway
More interesting is that you made a novelty account just to post this.

------
abainbridge
His blog title is an anagram of "code overuse ruins fetish".

~~~
jessaustin
"Fetish overuse ruins code"?

------
drumttocs8
Winner of most metal anagram goes to "incorporate procreation"

------
dd367
Interesting article! The metric you use does eliminate triviality, but it
sometimes uses very obscure (and arguably uninteresting words), such as
calumnies, ivoriness, coprophagist, etc. That's what you describe as Webster's
Second jargon that nobody knows".

It would be interesting if you could adapt your metric to account for general
prevalence of the word in English. Scan a giant subsection of say Wikipedia,
and assign a frequency to each of the 234,000 words in a map, giving unseen
words an infinitely small frequency, and then use the sum or multiple of the
frequencies of each of the anagrams to bring out some truly interesting ones!

~~~
pavel_lishin
I would strongly argue that coprophagist is a very interesting word, and
should be less obscure. But then again, that might just be my juvenile sense
of humor.

------
hakcermani
Fun stuff - especially debugging the time printout ;) A really ancient but
still functioning site I always go to is wordsmith.org. He calls it "Internet
anagram server" == "I rearrangement servant" !

------
Darthy
Why didn't he just calculate the Levenshtein distance instead of coming up
with that custom difference calculation method?

~~~
lmkg
The Levenshtein distance from duodenocholecystostomy to cholecystoduodenostomy
is actually pretty high: 14. Transposing two contiguous chunks of letters has
a high edit distance if both chunks are long.

A "good" anagram doesn't just have letters being moved a long distance, but
also requires that the movements of letters be independent, something that
Levenshtein distance doesn't measure.

------
tomcdonnell
I wrote a javascript game based on anagrams as part of a 'get to know your
team' exercise at a previous job. It was a weekend project.

The game asks the user to guess the name of a team member from anagrams of
that name displayed for short time periods on screen.

Click the 'automate with answers' button for a nice animation of rearranging
the letters.

[http://tomcdonnell.net/submodules/anagram_game](http://tomcdonnell.net/submodules/anagram_game)

Here is the tool I used to find the anagrams:

[http://tomcdonnell.net/submodules/anagram_finder](http://tomcdonnell.net/submodules/anagram_finder)

------
cdelsolar
Those of those who are word freaks already knew about MEGACHIROPTERAN
CINEMATOGRAPHER, as this pair is mentioned in the best seller Word Freak from
~15 years ago :)

Some of my favorite pairs are descriptive ones, like GULLIBLE BLUEGILL and
HAPPIEST EPITAPHS.

------
olegkikin
[http://www.anagrammy.com/anagrams/faq6.html](http://www.anagrammy.com/anagrams/faq6.html)

18 letters

    
    
        conversationalists = conservationalists
    

17 letters

    
    
        basiparachromatin = marsipobranchiata
        Charles Holding, 1971 
    

16 letters

    
    
        thermonastically = hematocrystallin
        John Edward Ogden, 1978 
    

15 letters

    
    
        nonuniversalist = involuntariness
        refragmentation = antiferromagnet
        megachiropteran = cinematographer
        centauromachias = marchantiaceous
        Eric Albert, c. 1986

~~~
richdougherty
Of course, the key idea of the linked article is that it's easy to find long
anagrams, but it's hard to find _interesting_ ones. The anagram
"megachiropteran = cinematographer" is ranked higher than "conversationalists
= conservationalists" because of the more complex rearrangement.

------
aclemmensen
I just experimented with building n-grams for each anagram candidate and
calculating the dissimilarity of pairs within each group, then multiplying
this score by the length of the word.

The results look pretty decent, but I get "BASIPARACHROMATIN" vs
"MARSIPOBRANCHIATA" as number 1.

[https://gist.github.com/aclemmensen/725f60a0746382521faefa0d...](https://gist.github.com/aclemmensen/725f60a0746382521faefa0df9ecd8a1)

------
baddox
That's a lot of fun. I'd like to see someone go further with some of the
suggestions the author mentions at the end, like weighting the scores by word
familiarity and word length.

~~~
abecedarius
I did something like that for multiword anagrams: scoring them by their
probability in a word-bigram model of English. It's expensive because I had it
pick the best permutation of the words, by brute force. It's far from a
measure that could really judge "this is a great anagram", but there are so
astronomically many boring ones to filter out that it's still a big help.

------
joeskyyy
Kind of related: The anagrams problem is probably my favourite interview
question to dole out. Always shows completely different ways of thinking and
normalising things.

------
legohead
I think the "best" one would be the word(s) with the most letters that are
both common, and also somehow related.

~~~
gpvos
"nonuniversalist" and "involuntariness", listed in another comment, both have
a theological aspect. While not terribly common, they're not hard to
understand either.

------
JCzynski
A scoring rule I'd like to try maximizes the product of the lengths of the
chunks / length of the word. This makes individual letters shuffling less
interesting and many medium chunks most interesting, which matches my
intuitions.

------
arthurjj
He makes a good point that the "scoring function" for what we think of as a
good anagram is actually the more important.

In general just generating the solution space is actually much easier than
coming up with a satisfying scoring function.

------
afandian
If you like anagrams and a bit of a challenge, you should check out
[https://twitter.com/anagramdaily](https://twitter.com/anagramdaily)

~~~
grepthisab
They've taken a lot of liberties with the "Daily" in "Anagram Daily." Fifteen
days between today's anagram and the last one!

Plus one that is marked easy: MYTH. Maybe I'm just an idiot, but I cannot
think of an anagram for MYTH.

~~~
jerf
My /usr/share/dict/words does not have an anagram for "myth" in its 99171
words, which is already larger than almost everybody's vocabulary.

~~~
jwilk
SCOWL[0] has 675491 words. There's no anagram of "myth" there either.

[0] [http://wordlist.aspell.net/](http://wordlist.aspell.net/)

------
gens
You could also use their Levenshtein distances.

~~~
eridius
See this comment
([https://news.ycombinator.com/item?id=13698252](https://news.ycombinator.com/item?id=13698252))
by lmkg for an explanation of why Levenshtein distance isn't good.

------
pklausler
I'm going to use his "coprophagist topographics" pair the next time I write a
post about readers of fake news.

------
zem
computers have taken a bit of the fun out of searching for interesting
anagrams :) my favourite one i found "by hand" (mostly a mental tic where i
randomly anagram words i encounter) was {antipyretic, pertinacity}, followed
by the fact that cremate, cremating and cremation all have nontrivial
anagrams.

------
timlod
I really enjoyed this.

Nice solution to put the (sorted by letters) words into a hash table to find
the anagrams.

------
11thEarlOfMar
Wondering if there is class of poetry where the rhyme is via palindrome rather
than phonetic.

~~~
philsnow
unrelated but reminded me of another non-rhyme-based form of poetry,
[https://en.wikipedia.org/wiki/Ci_(poetry)](https://en.wikipedia.org/wiki/Ci_\(poetry\))

------
ggambetta
senorita = not a word. That would be "señorita", with an "ñ". It doesn't
really work as an anagram for "notaries" because "ñotaries" is also definitely
not a word.

~~~
cdelsolar
SENORITA is acceptable in the English-language Scrabble dictionary. Its other
anagram is NOTARISE.

~~~
ggambetta
Huh, that's extremely surprising! TIL.

------
dzhiurgis
No triplets or quadruplets?

------
lun4r
would be fun to measure the word/context similarity for each pair and order by
that..

~~~
tyingq
I gave that a shot using Wordnet. It's flawed for a few reasons. A lot of the
more unusual words in his list aren't in Wordnet. I also had the stem the
words first, and had to pick just the most used sense if the word had multiple
senses.

So, the list is only those pairs that made it through the processing.

[https://gist.github.com/anonymous/e69f448fa08481603d1ca77ea7...](https://gist.github.com/anonymous/e69f448fa08481603d1ca77ea76d458a)

Would have been better if I had also weighted using his approach, as pairs
like "theater theatre" end up on top.

Still interesting though. It found these, which I liked:

    
    
      0.33  swinger wingers
      0.33  parrot raptor
      0.33  borsht broths

~~~
tyingq
I also tried a variation based on "Anagrams by Percent of String Not Part of
Common Substring, then Length"

It's flawed too, but does put some good ones at the top.

[https://gist.github.com/anonymous/ec28b75fb6c841b56def8eefec...](https://gist.github.com/anonymous/ec28b75fb6c841b56def8eefec813656)

------
jepler
it's not "flirting geisha abandons the menu"?

------
platz
predators = teardrops

------
javajosh
tldr: cinematographer megachiropteran (the latter is a giant bat)

------
texttheater
For anagram scoring, why not just determine the overlap between both words in
terms of shared bigrams (i.e., two-letter substrings), relative to their
length? Has anyone tried that? Is it equivalent to what mjd did?

