Some math about the Engineyard contest

sophacles · on July 17, 2009

So here is my general thinking on this contest, with an eye on computational efficiency:

1. SHA-1 is done in 64 byte chunks, with the result of one chunk being the base of the next. Padding and length only factor in at the last chunk. This means you can half the number of computations (or even 1/3 it...) by computing the first block of a "prefix string" then going from there.

2. Cuda is good on integer ops. SHA-1 is integer ops.

3. All the combinations of <= 5 printable chars can be represented in <60 MB meaning they don't have to keep being generated.

Not saying anything the article says was wrong per-say Im saying that large amounts of computation can be factored out from the naive approach, large enough to be meaningful that is.

antirez · on July 17, 2009

1) you are right, I and a lot of other guys are already doing this.

2) Indeed Cuda can be used with success, and this is probably the way to go. Unfortunately I don't have any board at hand, but I'll buy one later in order to experiment with this technology.

3) This is not going to save you any time. Basically the part that alters the five chars at the end is nothing compared to the rest of the computation.

There are other tricks that can be used, like to compute a fast hamming distance, just xor the bytes of the two strings and count the bits using a lookup table. Or even use some specific asm instruction for bit counting. But again unless you are doing this stuff of the xor in a very slow way this is a negligible amount of work: I commented out the hammer distance computation in my code and the numbers are exactly the smae, 3.5 million comparison per second per core more or less.

Another obvious optimization is of course to avoid doing too much permutations of the words. For every permutation there are a lot of tries to do just altering the final five chars.

Basically there is no way to win without to resort to fast hardware and parallelization, being it a large set of computers, CUDA, or a very lucky run that outputs in 30 hours an entry with an hamming distance of 31 or 32 just... against the probabilities ;)

The only reason I'm running the contest is that it's funny and a very good way to teach something interesting to the programmers working in my company. We care a lot about having fun programming.

profquail · on July 17, 2009

I've done some work with CUDA, so I'll just tell you ahead of time: it will probably be decently fast, but not nearly the kind of speeds you'd get from something like matrix multiplication. The problem is going to be that you need to fit the data to hash in "shared memory" (which is sort of like an L2 cache on the GPU). The standard shared memory size in CUDA is 16kb. Also, because of the way that data must be allocated in shared memory, you're not going to be able to get many string permutations into shared memory at once.

However, I think a good solution (if you're going to use CUDA, which I was thinking about doing as well) might be to load one permuation of words from the dictionary into the shared memory, and then have each thread compute one permuation of the various capitalizations for that string. This way you're mostly working out of the "cache" and that's about the best thing you can do on CUDA.

sophacles · on July 17, 2009

I thought that shared mem of 16kb was on a per core basis? It's very possible I misunderstood, in which case: crap! Either way the, the algorithm I have in mind looks pretty matrixy, if you want I can keep you updated as I test things out.

profquail · on July 17, 2009

It's 16kb per multiprocessor. I believe there are 16 cores per multiprocessor, and multiple threads are scheduled at one time in 'blocks'.

I actually started a thread in the CUDA general discussion forum, perhaps you could come by there and discuss it:

http://forums.nvidia.com/index.php?showtopic=102228

sophacles · on July 17, 2009

Actually, if anyone has a good cuda card and wants to team up on the code writing email me: sophacles@gmail.com. We an use different random seeds when it comes time to running the stuff or whatever... I just think it would be fun!

notaddicted · on July 17, 2009

## cross posted to blog as well.

Assume the xor of your test hash with their desired hash is just 160 random bits. Furthermore, choose a certain number N, and assert: they must be white!! What is the probability that this is true:

  (0.5 for each bit)^(number of bits you chose).

So given 160 random bits, what are the chances of there being atleast N bits set to zero?

  (0.5^N)*(ways to choose N from 160).

  (0.5^N)*(160!/(N!*(160-N)!)).

define d = 160 - N

  0.5^(160-d)*(160!/((160-d)!*d!))

So I think that you are miscalculating by a bit, that is why & how.

NOTE: difference is in the 0.5^(here)

jibiki · on July 17, 2009

He was computing the probability of a Hamming distance of exactly D instead of at most D. To get the "at most" version, you could just add up the exact ones, but it's better to just use the central limit theorem instead.

As for your math, I think you are over counting a little :). The problem is, 100...0 is counted by both ??0...0 and ?0?00...0, where ? marks a digit that you've allowed to be free, and 0 marks a "chosen" 0.

profquail · on July 17, 2009

I'm going to try to crank out a little program to work on this over the weekend (what's the worst that could happen, I wasted a couple of hours?).

Unless you have some really serious hardware at your disposal, or you have a good knowledge of how to attack the SHA-1 hash algorithm, the best bet is going to be to just have the computer try random combinations of the words with various capitalizations (plus the extra couple of letters you're allowed to put at the end). Best case scenario, you get lucky and happen to find a fairly well-scoring string that you can submit.

sync · on July 17, 2009

here's that program for you, in shitty perl: http://gist.github.com/149173

jpwagner · on July 17, 2009

Rule 5 from the contest:

  "5.  In the case of a tied Hamming distance
      (entirely possible), the winner will be chosen
       by lottery among people with the best distance"

This says: don't take me seriously.

Hopefully for their sake they get 1000s of people running implementations and by chance someone will get 19 and win. They won't even have the best algorithm, but it will test permutations randomly.

I wrote up a solution, but I prefer scratch off tickets.

JimmyL · on July 17, 2009

What would you suggest they do? Your follow-on indicates that you think ties should be adjudicated on the basis of the "best" algorithm, but how do you objectively define that?

sophacles · on July 17, 2009

A simple criterion would be "capable of calculating the most results", which can be objective and meets some definitions of "best algorithm".

antirez · on July 17, 2009

A simpler and less biased way IMHO is that if two teams will end with a result of, for instance, 30, but one team was able to find two or more strings, this is the winner.

jpwagner · on July 17, 2009

Here are some thoughts:

First one to hit Hamming distance of 5.

Lowest average Hamming distance over top n trials.

Most trials in x hours.

...

JimmyL · on July 17, 2009

Something like throughput in trials on some standardized machine could be an interesting criteria - it would test both your hashing algorithm and your permutations.

Having said that, I'm betting that the EY people chose this particular problem since highly-optimized algorithms are easily available to all - meaning that the amount of HW one can throw at the problem (and luck) will likely be the determining factor in who wins.

Which HW should someone use? Well, EY has a cloud computing program, and it's right there....

That is, I think, the point of this contest. Make a problem that's hardware-intensive, so that people will try deploying it on the EY cloud platform and get them into the proverbial front door of the product family. They're not interested in novel algorithms; they're interested in getting people onto their platform, and getting them used to it.

grandalf · on July 17, 2009

Uh I don't think EY offers a cloud solution that would make this easy... so I am doubtful that was their goal.

It seems that their goal (by having defined a specific dictionary) was to have lots of people tweeting those dictionary words to @engineyard.... sort of a way to get better twitter-driven pagerank.

JimmyL · on July 17, 2009

Interesting...I think you may be right.

I originally though this - http://www.engineyard.com/private-cloud - was a cloud environment a la EC2 based on the title (which would work with my theory), but reading more about it makes it seem like it's still optimized for Rails, and hence not for this.

grandalf · on July 18, 2009

Yeah, one of EY's services actually is semi-managed EC2 and they advertise it as such.

preview · on July 17, 2009

That limits the allowed solutions to those that are purely software. What about a hardware/software hybrid or hardware-only solution (e.g. PCI Express card with one or more FPGAs)?