Hacker News new | past | comments | ask | show | jobs | submit login
Engineyard Challenge Python Code (fpaste.org)
23 points by bcl on July 15, 2009 | hide | past | favorite | 27 comments

I dont wish to be, umm, mean. But I know of at least 3 companies (ours included) that could run this in a few minutes.

this is my code (using mighty, our python based cluster service):

  import mighty

  word_generator = mighty.word_generator("wordlist",(mighty.APPEND_NUMERALS,5))
  phrase_generator = mighty.word_concentate(word_generator,' ')

  job = mighty.match_to_hash('PROVIDEDHASH',mighty.SHA1,phrase_generator.iterate_all)
  job.priority = mighty.CRITICAL
Done. Takes about 5 minutes (I wont enter - it's not fair).

EDIT: added CRITICAL as the priority or it takes longer :)

EDIT: ok so the code is a bit the wrong side of the tracks. It wouldnt take long to refactor it :)

This isn't actually correct, is it? Assuming an average 6-letter word length, there are about 10^67 valid combinations that your iterate_all() can go through: (((2^6) * 1000) ^ 12) * (94 ^ 5) = 3.5 * 10^67

Since this is much larger than the total hash space (2^160 = 10^48), you can hope you'll get a collision after the number of tries equal to half of the hashspace (10^48 / 2). That means you'll have to generate 10^47 hashes. At 5 minutes of runtime, you'll have to generate 10^45 hashes per second.

Bottom line: if you can run this on 200,000 cores, you'll have to generate 10^40 hashes per second per core, and that seems impossible to me.

Weeell we have 20,023 computers (that fluctuates a tad). That's probably about 30,000 cores in the end.

I cant iterate the whole lot - but I can cover a substantial portion more of the keyspace in 5 mins than most could in the 30hrs. I bet 5 minutes will easily get me damn close :)

I'm seeing about 1 million hashes per second on a single core. So, you can do about 10^11 hashes per second on 30,000 cores. At that rate, it will only take you 10^48 years to cover the keyspace.

You won't even cover a fraction of a percent of the keyspace before the sun burns out.

I think it unlikely you have a compute cluster that can generate 3.3 million, billion, billion, billion SHA1 hashes per second (edit: and that assumes not adding the random 5 characters to the end). The actual contest will have a word list of a thousand words.

Also, (a) you're allowed to permute case of letters, exploding the search space even more, which is handy because (b) I presume the target sha1 will come from words not on the word list, making an exact match unlikely (hence score based on hamming distance).

Theoretically I'm having around 500 CPUs with 3gHz each and 2000GB RAM at my disposal. Maybe I should give it a try. Hint: University Cluster =)

Anyway thanks for the code examples, I'm pretty sure I can learn something from them.

500 cores gives you no really higher chance of getting a good combination over someone with one PC.

Yes you get 500 times the attempts in the 30hr period. But if you run the figures it's something like 0.00000001% difference in the amount of keyspace you can cover :)

AKA just run it on a couple of PC's an hope you get lucky!

I have an account on this: http://www.cse.illinois.edu/turing/

Had very similar thinking, but decided against it because of usage policies. :-p

Apparently, I am not the only one who is into using the university cluster for my own personal projects ;-)

Be careful. This is not "exactly" legal, and it could hurt you. Make sure you also throw some legitimate computational work at the clusters so you have an alibi at least.

Here's a first pass at a python implementation for the Engineyard Challenge (http://www.engineyard.com/blog/2009/programming-contest-win-...).

Note that it doesn't do any random chars at the end of the string and doesn't change the case of the words.

Ahh, I see that you're generating an entirely new random string each iteration. Interesting.

I've done mine in Pike, which will be faster (though not as fast as C).


I'm being lazy about the search, depending on the random distribution to keep work from overlapping too much instead of taking the time to keep track of what has been done, or implementing a systematic search of the possible combinations.

My thought is that there are so many possibilities and so few CPU cycles available to me that making random guesses is a better way to approach it. Kinda like winning the lottory, except harder.

I've started mine in erlang. I'm going to run 10 dollars worth of ec2 time and see how far I get.

I've got mine in C. I probably won't run it, just wrote it for fun. Honestly, if you were thinking of spending a cent on this competition, I think you'd get a better ROI buying a SuperLOTTO ticket.

Lotto tickets aren't nearly as much fun as coding, so what's another 10 bucks.

Personally I'd run through a thousand bucks of computer time to play around with something or test a hunch before considering a lotto ticket.

Lotto tickets are taxes for people that are not good in mathematics.

or they just play for fun

Alas, being good in mathematics spoils the fun.

Well, keep in mind that you only have to have the best score out of all the entrants, so your odds aren't that bad.

I'm getting about 250k sha1 hamming distance calculations per core per second, on an Intel(R) Xeon(R) CPU 3050 @ 2.13GHz (from /proc/cpuinfo). This is a full implementation... I can just drop in the word list and challenge phrase on game day.

Intel(R) Xeon(R) CPU 5148 @ 2.33GHz gives me 300k/sec.

Anyone else have a quick & dirty benchmark?

I'm getting more than 1M SHA1 + Hamming Distance calcs/core/sec on a Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz, and I haven't really optimized beyond the obvious.

It's a brutally simple algorithm, really. XOR + bitcount, which only iterates for each 1 bit, so the lower the total hamming #, the faster it finds it.

It's in C, so everything works in unsigned long blocks. Compiled with -O3 it's:

real 0m0.968s

for 1M each on two forked processes, and 80 lines of code.

Oooog. Boy is my face red. I was recalculating the main SHA1 each time through the loop. A minor refactoring mistake with a major impact...

With the base string hash pulled out of the loop, it's looking more like 1.7M SHA1 + HD calcs/core/sec.

Yeah, mine's Perl, and running on a slower processor, so the difference feels about right.

Of course my main calculation is only one line:

$hamming_dist = unpack("%160b*", sha1($attempt_phrase) ^ $challenge_sha);


I found this academic paper on finding collisions in SHA1, but unfortunately I'm not smart enough to figure out what it's telling me. http://people.csail.mit.edu/yiqun/SHA1AttackProceedingVersio...

Is this supposed to run on Unladen Swallow? Or does the GIL not come into play here?

It uses the multiprocessing package, which is new in Python 2.6. From the docs:

The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. http://docs.python.org/library/multiprocessing.html

multiprocessing is also available in 2.5, which is what I used, just do: easy_install multiprocessing

and here's a probably slower, brute force perl variety: http://gist.github.com/149173

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact