"Timesaving negativism" said the
with dispensable piebaldness,
"can be the intermediator for chronic endometritria!"
"Ah nay!" the cinematographer megachiropteran replied!
"it is only with maths--integrals, triangles,
that can cause such excitation, can intoxicate!"
But I, though their admirer, now married
had to disagree: "it would be a series of masculine calumnies
to give such clitoridean, directional advice to women!
We cannot precipitate such peripatetic thoughts!"
a letter rearrangement goes beyond the plain, ordinary substitutions of vain academic men.
better anagram lerrneet
letter rearrangement ba
At some point, I had an excess 's', so "better" got turned into "best", and I had to use up its excess "ter", which got stuck with an excess 'u' to make "true". I also had an excess 'n', so a "the" became "then". Those somewhat awkward words go in last, which is where that "True then, as" came from. Getting rid of the last letters on both sides is the hardest part, because there is rarely a perfect word that uses up all your letters, just like in Scrabble. Sometimes you just have to unravel a few words and try different ones in their place. In the end, I just ran out of time to work on it.
It's a bonus if I can get near synonyms on both sides, like "transcend" and "goes beyond"
It's a clone of anagramatron, the original anagram Twitter bot that went inactive almost a year ago.
I use a combination of Damerau–Levenshtein distance, longest common subsequence, length, number of English words, and number of distinct words to score and rank anagrams and decide which ones to retweet.
This one feels so right -> These hoes so trifling
no idea...lost -> desolation
the skeleton war -> take shelter now
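A rough Python sketch of that kind of composite scorer. The two distance functions are standard textbook algorithms; the weighting in `rank_score` is invented for illustration and is not the bot's actual formula:

```python
def osa_distance(a, b):
    # Damerau-Levenshtein distance, optimal string alignment variant
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def lcs_length(a, b):
    # longest common subsequence, the classic quadratic DP
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rank_score(a, b):
    # made-up weighting for illustration only: reward long, heavily
    # scrambled pairs; penalise pairs that share long subsequences
    return osa_distance(a, b) + len(a) - lcs_length(a, b)
```

The real bot presumably tunes the weights against examples it wants to keep or drop.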
The list of examples in the 7-11 score range seemed to have a surprisingly high number of pairs with strangely coincidental meanings. I thought maybe that was because they were hand-picked, but looking at the complete list, I honestly see more of those than I'd expect. I'm not sure why it's surprising: there's no actual pattern within a pair, and it's not surprising that lots of pairs of words have some way to connect them. But it still feels surprising when looking at them. It's as if a lot of high-scoring pairs share some kind of root, even though a high score should mean high dissimilarity.
Anyway, great post. Weekend coding projects looking for anagrams and palindromes and boggle solutions are often rewarding for me, above and beyond just being fun.
Probably not a perfect analogy, but this is vaguely reminding me of Ramsey's theorem from combinatorics; the idea that if you have enough stuff, some of it must be structured.
Basically, if you run "watch ./remaining.py" on your command line, you can see a running countdown of how long is left in the day, month, year, and your life.
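A minimal sketch of what such a script might look like (hypothetical, not the actual remaining.py; it covers the day/month/year rows, and the "your life" row would just need a configured end date):

```python
#!/usr/bin/env python3
# sketch of a remaining.py suitable for `watch ./remaining.py`
from datetime import datetime, timedelta

def remaining(now=None):
    # time left in the current day, month, and year
    now = now or datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    end_of_day = midnight + timedelta(days=1)
    # first day of the next month (roll the year over in December)
    next_month = midnight.replace(year=now.year + (now.month == 12),
                                  month=now.month % 12 + 1, day=1)
    end_of_year = datetime(now.year + 1, 1, 1)
    return end_of_day - now, next_month - now, end_of_year - now

if __name__ == "__main__":
    day, month, year = remaining()
    print(f"day:   {day}\nmonth: {month}\nyear:  {year}")
```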
However, it comes across as a little "smarter than thou" with the section:
"The thing you do not want to do is to compute every permutation of the letters of each word, looking for permutations that appear in the word list. That is akin to sorting a list by computing every permutation of the list and looking for the one that is sorted. I wouldn't have mentioned this, but someone on StackExchange actually asked this question."
That statement makes sense to me when I think about it, but it wouldn't have appeared as totally obvious to me from the outset. And I can't see casting derision at someone because they asked this question.
And the argument is kind of silly when, in the very same article, he uses brute force to count how many segments each pair splits into. Shouldn't he have found the optimal algorithm for that instead of brute-forcing it?
That being said: Sorting isn't required. (An O(kn) solution exists to beat your O(nk log k))
Not even close. The lexicon is 230k words; times 26 letters, that's nearly 6MB. The i7 has a 64kB L1 cache and a 256kB L2 cache.
I think the only way to resolve this would be to actually do the experiment. I wouldn't bet my life savings on the outcome either way, but I'll give you even odds at low stakes that sorting is faster.
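For what it's worth, the O(kn) alternative presumably means replacing the sorted-letters signature with a 26-slot letter count, which takes O(k) per word instead of O(k log k). Both keys group anagrams identically; the question above is only which is faster in practice:

```python
def sorted_key(word):
    # O(k log k) per word: the letters in sorted order
    return ''.join(sorted(word))

def count_key(word):
    # O(k) per word: a 26-slot letter histogram is an equivalent signature
    counts = [0] * 26
    for ch in word:
        counts[ord(ch) - ord('a')] += 1
    return tuple(counts)
```

Two words are anagrams exactly when their keys are equal, under either scheme.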
Another way you can do it is to score each pair by the length of the longest shared substring, which can be computed in linear time with a suffix tree or in quadratic time with dynamic programming. Under that metric the 15-letter word is no longer 'best'; you'd end up picking the longest pair where one word thoroughly permutes the other.
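A sketch of the quadratic DP version, keeping only the previous row:

```python
def longest_common_substring(a, b):
    # cur[j] = length of the common suffix of a[:i] and b[:j];
    # the answer is the largest such value seen anywhere
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Lower is better for anagram quality: a long shared substring means a big block of letters survived the scramble intact.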
But then i realised that it was only obvious to me because i spent years hunched over a computer crunching DNA sequences.
I'd also be interested in normalising edit distance by length somehow. It's no coincidence that higher-scoring words tend to be longer.
I sorted it in reverse so that the interesting pairs would be at the top.
Compare to his original scored list, here: http://pic.blog.plover.com/lang/anagram-scoring/anagrams-sco...
Sadly, it turns out our genius idea doesn't work so well - the infamous cholecystoduodenostomy/duodenocholecystostomy is in joint fourth place, and the scarcely any better duodenopancreatectomy/pancreatoduodenectomy is joint second.
mjd's key insight was that good anagrams are ones where all the letters are thoroughly scrambled; both of the cases above are ones where large blocks of letters remain intact. I don't know if you've had scrambled eggs where large lumps of white or yolk remain intact, but i have, and they're vile.
So, anyway, perhaps a better edit distance metric is one which allows cut-and-paste of whole blocks as an edit, since it would give those words much lower scores. My years of DNA-hunching immediately led me to think of sequence alignment algorithms, because that kind of cut-and-paste is exactly what happens in DNA:
But i'm not sure those algorithms are much good, because they model the editing as deleting, inserting, and modifying runs of letters, rather than moving them around.
Next, i thought of compression, because that often works by finding repeated runs of characters, and also i don't really know any other areas of string algorithms. What if you concatenated two strings, compressed them using LZ77, and measured the amount of compression? Or what if you did it with something simpler and more local, like Predictor? Could you use the Burrows–Wheeler transform?
 https://tools.ietf.org/html/rfc1978 - your daily obscure blast from the past!
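The LZ77 idea is easy to prototype with zlib, since DEFLATE is LZ77-based. This is the standard "normalized compression distance" trick, not anything from the article: strings that share material compress much better concatenated than unrelated strings do.

```python
import zlib

def csize(data: bytes) -> int:
    # compressed size at maximum compression level
    return len(zlib.compress(data, 9))

def ncd(a: str, b: str) -> float:
    # normalized compression distance: near 0 means knowing one string
    # makes the other nearly free to compress; near 1 means unrelated
    x, y = a.encode(), b.encode()
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

For short single words the zlib header overhead dominates, so this is probably more interesting on longer strings or whole word lists.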
The top-to-bottom ranking isn't quite right... I need some better way to push down short words, but it does seem to put most of the good stuff near the top.
Rex Tillerson -> Risen Ex Troll
is pretty amazing. Not sure how it would score with the OP's metric though - typing this as I'm on the ski lift lol
Clint Eastwood = Old West Action
Madam Curie = Radium came
It's good fun to go find anagrams of friends and family. One of my nephew's comes out Alpaca Tits. He's not real happy about it but the rest of us think it's pretty good.
Not original, but certainly on topic.
(courtesy of Lisa Simpson)
It would be interesting if you could adapt your metric to account for the general prevalence of each word in English. Scan a giant subsection of, say, Wikipedia, assign a frequency to each of the 234,000 words in a map, giving unseen words a vanishingly small frequency, and then use the sum or product of the frequencies of each anagram pair to bring out some truly interesting ones!
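A sketch of that idea; the function names and the negative-log-frequency weighting are my own invention (summing logs is equivalent to taking the product of the raw frequencies, just numerically safer):

```python
import math
from collections import Counter

def build_frequencies(corpus_text):
    # relative frequency of each lowercase word in the corpus
    counts = Counter(corpus_text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def obscurity(pair, freqs, floor=1e-12):
    # sum of negative log frequencies: higher means the pair is rarer.
    # unseen words get a tiny floor frequency instead of zero
    return sum(-math.log(freqs.get(w, floor)) for w in pair)
```

Whether you then want the most familiar pairs (low obscurity) or the weirdest ones (high obscurity) on top is a matter of taste.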
I think you have to discriminate between slightly obscure or archaic words that anyone familiar with a reasonable range of the literary canon would know, and truly uninteresting words that even a highly educated and well-read person wouldn't know.
There are better corpora than Wikipedia that could be used for this purpose, like the British National Corpus.
A "good" anagram doesn't just have letters being moved a long distance, but also requires that the movements of letters be independent, something that Levenshtein distance doesn't measure.
If you rank the entire list of anagrams by edit distance, the highest pair is "anatomicophysiologic" and "physiologicoanatomic", with an edit distance of 16 but a chunk score of only 3.
(Technically speaking, a proper distance measure requires a distance of 0 for equal strings, so perhaps he should pick the number of splitting points rather than the number of resulting fragments.)
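The fragment count being discussed is the minimum common string partition: the fewest pieces one word can be cut into so the pieces rearrange into the other. It's NP-hard in general, but brute force with pruning is fine for single word pairs. A sketch (it assumes the two inputs really are anagrams):

```python
def chunk_score(a, b):
    # smallest number of fragments a can be cut into such that the
    # fragments, reordered, spell b; exponential worst case, but fast
    # enough for word-length inputs thanks to the pruning
    best = [len(a)]  # one fragment per letter always works for anagrams

    def solve(rem_a, rem_b, pieces):
        if pieces >= best[0]:
            return  # can't beat the best partition found so far
        if not rem_a:
            if not rem_b:
                best[0] = pieces
            return
        # try every prefix of rem_a, longest first, at every place
        # it occurs in rem_b
        for k in range(len(rem_a), 0, -1):
            piece = rem_a[:k]
            idx = rem_b.find(piece)
            while idx != -1:
                solve(rem_a[k:], rem_b[:idx] + rem_b[idx + k:], pieces + 1)
                idx = rem_b.find(piece, idx + 1)

    solve(a, b, 0)
    return best[0]
```

Under this metric a trivial anagram like "ab"/"ba" scores 2, while a thorough scramble needs many more cuts.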
The game asks the user to guess the name of a team member from anagrams of that name displayed for short time periods on screen.
Click the 'automate with answers' button for a nice animation of rearranging the letters.
Here is the tool I used to find the anagrams:
Some of my favorite pairs are descriptive ones, like GULLIBLE BLUEGILL and HAPPIEST EPITAPHS.
conversationalists = conservationalists
basiparachromatin = marsipobranchiata
Charles Holding, 1971
thermonastically = hematocrystallin
John Edward Ogden, 1978
nonuniversalist = involuntariness
refragmentation = antiferromagnet
megachiropteran = cinematographer
centauromachias = marchantiaceous
Eric Albert, c. 1986
The results look pretty decent, but I get "BASIPARACHROMATIN" vs "MARSIPOBRANCHIATA" as number 1.
In general just generating the solution space is actually much easier than coming up with a satisfying scoring function.
Plus one that is marked easy: MYTH. Maybe I'm just an idiot, but I cannot think of an anagram for MYTH.
it's extremely fun, though it's scrabble-focused, so you'll see a lot of weird words that you are expected just to have memorised.
Probably not the best approach, but it did unearth some interesting pairs.
Nice solution to put the (sorted by letters) words into a hash table to find the anagrams.
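In Python the whole trick fits in a few lines: every word is bucketed by its sorted letters, and any bucket with more than one word is a set of mutual anagrams.

```python
from collections import defaultdict

def find_anagram_sets(words):
    # words that are anagrams of each other share a sorted-letters key
    buckets = defaultdict(list)
    for w in words:
        buckets[''.join(sorted(w))].append(w)
    return [group for group in buckets.values() if len(group) > 1]
```

This is linear in the number of words (times the per-word sorting cost), versus the hopeless all-permutations approach quoted above.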
Ñ, on the other hand, does. Substituting an N for an Ñ usually doesn't work, and it feels like a typo, not like a forgotten accent. There are also many pairs of words whose only difference is N versus Ñ and which have significantly different meanings (off the top of my head: año/ano, moño/mono).
I think the reason that ñ feels different to you than é is that while the é has never been more than a variation on an ‘e’, the ñ is actually an abbreviation for ‘nn’. For example, “año” is derived from Latin “anno” and “cañon” was originally from Latin “canna”. So I think the correct way to handle ñ may be to treat it like nn. (An analogous strategy for German, which people might find less surprising, would be to treat ‘ö’ and ‘ü’ as if they were ‘oe’ and ‘ue’; in older times in English, it was considered correct to equate “w” with “uu” in anagrams.)
On this plan, I find (in English)
Also unfortunately, “señorita” is an exception, and was _never_ spelled with a double “n”. So for this example, the “nn” equivalence is less defensible.
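If you did want the nn treatment, it's a one-line tweak to the usual sorted-letters anagram key (a sketch; extending it to ö→oe and ü→ue for German would work the same way):

```python
def anagram_key(word):
    # expand ñ to nn before sorting, per the historical-abbreviation
    # argument: ñ began life as a scribal abbreviation for double n
    return ''.join(sorted(word.lower().replace('ñ', 'nn')))
```

Under this key "año" and "anno" become anagrams of each other, while "ano" stays distinct.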
Thank you for bringing up this point.
Still unsure about "in English, it is also correct to spell 'señorita' without the tilde" - it doesn't seem to be an English word at all; I can't find many references online, Google Translate corrects it to "señorita", and neither the Cambridge nor the Oxford dictionary has it...
So, the list is only those pairs that made it through the processing.
Would have been better if I had also weighted using his approach, as pairs like "theater theatre" end up on top.
Still interesting though. It found these, which I liked:
0.33 swinger wingers
0.33 parrot raptor
0.33 borsht broths
It's flawed too, but does put some good ones at the top.