When Pfiffer told me he would post that on HN, my reaction was something like "Fine by me, but I'm pretty sure I'm going to get schooled." Looks like I was right.
I got schooled, I'm learning, next time I'll be more clever about it, thank you!
Aw guys, don't go posting on their site (which I won't link) with devil names. Everyone knows communities turn to shit when they get too big, and 300 different users all posting with blank icons is going to kill the fun for them.
I'm not even a member of Merveilles but that makes me sad for them.
Well, for what is ostensibly a secretive, millennial "circle of artists and wizards", that is an awfully non-cryptographic, weak hash function. Apparently variants of it are commonly used in some JavaScript circles to mimic a popular implementation of Java's String.hashCode() method [1,2,3].
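For reference, that family of functions is just a polynomial rolling hash. A minimal Python sketch of the Java-style variant (assuming the usual multiplier of 31 and signed 32-bit wrap-around):

def java_hash(s):
    # Java's String.hashCode(): h = 31*h + c for each char, in 32-bit arithmetic
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return h - 2**32 if h >= 2**31 else h  # reinterpret as signed 32-bit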
You expected something more esoteric? I think it's a matter of perspective. Sure, as a crypto challenge, it's weak. But if you think of it as our standard UI for configuring one's user icon...
I don't think it's going to be a problem. Before anyone can post anything to merveill.es, they have to figure out how the heck anyone posts anything to merveill.es, which will take some lateral thinking.
Also, this hash function is linear so it has the equivalent substrings property.
One can take advantage of this to generate preimages even faster. Take a preimage and find m equivalent strings for each of n substrings of it. Replacing those substrings with their equivalents gets you m^n (m to the n-th) preimages that hash to the same value.
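To make that concrete with a Java-style polynomial hash: any two equal-length chunks with the same hash value are interchangeable at any position without changing the full string's hash ("Aa" and "BB" are the classic colliding pair, since 65*31+97 == 66*31+66). A sketch of the blow-up, assuming that style of hash:

from itertools import product

def h32(s):
    # unsigned variant of the polynomial hash above
    r = 0
    for c in s:
        r = (31 * r + ord(c)) & 0xFFFFFFFF
    return r

# n = 3 substring positions, m = 2 interchangeable chunks at each position
chunks = [('Aa', 'BB')] * 3
for s in (''.join(p) for p in product(*chunks)):
    print s, h32(s)  # all 2**3 = 8 strings print the same hash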
You can brute force it insanely fast by iterating from the rightmost character and caching the entire hash prefix.
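One way to realize that caching, sketched in Python (assuming a polynomial-style hash, where the hash of prefix + c depends only on the prefix's hash and c):

def search(prefix_hash, prefix, depth, target, alphabet, hits):
    # extend the cached prefix hash by one character instead of rehashing the string
    for c in alphabet:
        h = (31 * prefix_hash + ord(c)) & 0xFFFFFFFF
        if h == target:
            hits.append(prefix + c)
        if depth > 1:
            search(h, prefix + c, depth - 1, target, alphabet, hits)

hits = []
search(0, '', 7, 666, 'abcdefghijklmnopqrstuvwxyz', hits)  # all strings up to length 7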
This hash doesn't scramble input very well, so you can fiddle with individual characters to converge on any desired hash value.
Record the "closest" hash value to your target generated by this loop, apply (only) that character change, repeat. If the hash value stops converging, add a character. This naive version pretty often does the job in 5 full iterations or so, which means (5 * len(s) * len(alphabet)) = maybe ~3000 total hashes to get a solution.
best = (abs(hash(s) - 666), s)  # hash() here means the Merveilles string hash, not Python's builtin
for i in xrange(len(s)):
    for c in alphabet:
        copy = s[:i] + c + s[i+1:]  # strings are immutable, so rebuild with one character swapped
        diff = abs(hash(copy) - 666)
        best = min(best, (diff, copy))  # keep the closest candidate and the change that produced it
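Wrapping that in the record/apply/repeat cycle described above might look like this (a sketch; as above, hash means the string hash under attack, not Python's builtin):

def crack(s, alphabet, target=666, max_rounds=50):
    for _ in xrange(max_rounds):
        prev = abs(hash(s) - target)
        best = (prev, s)
        for i in xrange(len(s)):
            for c in alphabet:
                copy = s[:i] + c + s[i+1:]
                best = min(best, (abs(hash(copy) - target), copy))
        if best[0] == 0:
            return best[1]             # exact preimage found
        if best[0] >= prev:
            s = best[1] + alphabet[0]  # stopped converging: add a character
        else:
            s = best[1]                # apply only the single best change
    return None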
> I'm half tempted to buy a few hours of highcpu AWS compute power and get it done nowish instead ... I set myself a $50 spending limit, which gave me about 24 hours of compute on an instance with 32 virtual cores
The price of a c3.8xlarge (32 cores, 60 GB of RAM) is currently $0.28 an hour in us-west-2. You could get about 178 hours of compute for your $50 budget.
This is for a spot instance, which could get stopped at any moment if the spot price increases; that would be pretty bad for this task, as I could lose computation results. The on-demand price is about $1.60 an hour. But wait! I'm not using the same dollar: I live in New Zealand, which makes that about NZ$1.90 an hour, or NZ$45 for 24 hours.
WOW, my C++ solution is horrible. It's as though I'd just ignored everything I've learned about Doing Things Right in post-2010 C++. Such is hacking, I guess.
on my 4-core MBP (2.6ghz ivy bridge) i can manage ~1.8 billion hashes per second.
i could parallelize with OpenCL, but i think this is enough. after a few minutes, i get ARbyhlf as a valid name (although i don't know if it's actually valid... but it definitely might be)
Most of the time is spent on generating the random string, which was because I very quickly realised looking through the entire possibility space in order would be unproductive. Granted, my entire approach was unproductive, as shown by linuxbochs, pedrox, and others below :)
I actually did some micro-benchmarking in the midst of all that and found that hashing a single, static string took something on the order of 20 ns. That's half a billion hashes per second, or 2 billion on four cores. Add a little overhead for string generation, and yeah, we get the same number.
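That measurement is easy to reproduce; a Python sketch (merveilles_hash and the input string are hypothetical stand-ins, and pure Python will clock far slower than the C++ figures quoted):

import timeit
setup = 'from __main__ import merveilles_hash; s = "devilsname"'
secs = min(timeit.repeat('merveilles_hash(s)', setup=setup, number=1000000, repeat=5))
print '%.1f ns per hash' % (secs / 1e6 * 1e9)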
> Most of the time is spent on generating the random string
when i took out string generation, i went from 1.2 GH/s to 1.8 GH/s (although, i'm assuming my string generation was much simpler)
> I very quickly realised looking through the entire possibility space in order would be unproductive
you say that, but it's not as unproductive as you might think. assuming the hash function is uniformly distributed (hah), there are ~4 billion (2^32) hash values a string can map to. if we can check 2 billion a second, we should see a match for a fixed target every couple of seconds. maybe my search-space pruning was particularly bad, because i see on average ~4-10 matches a minute, and they are usually quite clustered together. this suggests the hash is not uniformly distributed (although a cursory glance at the function should make that obvious to people with a maths background).
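the back-of-envelope, for anyone following along (treating the hash as uniform over 32 bits):

space = 2 ** 32  # possible hash values
rate = 2e9       # hashes checked per second
print space / rate, 'seconds per expected hit'  # ~2.1s, i.e. ~28/minute vs. the ~4-10 observed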
anyways, it was good fun to poke around :) cheers for sharing!
Example: http://bochs.info/img/mutation-20140606-024906.png
One could definitely optimize this to be less destructive and produce more pronounceable results. It's basically two pieces: an engine for suggesting mutations, and a simple algorithm to score and pick mutations. Changes to either half (vowel distribution, ngrams, etc) could result in better strings.
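A minimal sketch of that two-piece structure (the scoring here is hypothetical; a real vowel-distribution or ngram model would slot into score(), and hash() again stands for the string hash under attack):

VOWELS = set('aeiou')

def mutations(s, alphabet):
    # piece 1: suggest candidate mutations (here, every single-character substitution)
    for i in xrange(len(s)):
        for c in alphabet:
            yield s[:i] + c + s[i+1:]

def score(s, target=666):
    # piece 2: score candidates; hash distance dominates, with a crude
    # pronounceability bonus for vowel/consonant alternation
    alternating = sum(1 for a, b in zip(s, s[1:]) if (a in VOWELS) != (b in VOWELS))
    return abs(hash(s) - target) * 1000 - alternating

def step(s, alphabet):
    # greedily pick the best-scoring mutation for this round
    return min(mutations(s, alphabet), key=score)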
(fyi, this kind of attack is a big reason to use cryptographic hashes: http://en.wikipedia.org/wiki/Cryptographic_hash_function)