

My Engineyard Contest Odyssey - wooby
http://alan.dipert.org/post/146555473/my-engineyard-contest-odyssey

======
jcsalterego
Looking at the C code, here are a few suggestions:

- __builtin_popcount() and __builtin_popcountll() are great gcc builtins
which perform bit population counts, useful for Hamming distances (after the
XOR operation, of course); see the sketch after the memcpy snippet below. [1]

- avoid scanf(), which is costly. My approach was to generate a header file
with the char * words[] array, and also an int words_len[] array with, you
guessed it, the pre-computed lengths. Then take the memcpy() route and you're
in a much better position.

- speaking of memcpy, gcc provides one that should be optimal in most cases,
but for really short words, it might be easiest to write your own. I came
across a good article on it [2], but this is what I ended up using:

    
    
      #include <stddef.h>   /* for size_t */
      
      /* Byte-by-byte copy for very short words; the prefetch hints that
         dst is about to be written (second arg = 1) and should stay in
         cache (third arg = 3). */
      void
      c_memcpy (char *dst, const char *src, size_t len)
      {
        __builtin_prefetch(dst, 1, 3);
        while (len--) {
          *dst++ = *src++;
        }
      }
    

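To make the popcount point concrete, here's a toy sketch (the values are
placeholders, not anything from the contest): XOR two hash fragments so the
differing bits are set, then let the builtin count them.

    
      #include <stdint.h>
      #include <stdio.h>
      
      /* Hamming distance between two 64-bit hash fragments: XOR marks the
         differing bits, __builtin_popcountll() counts them. */
      static inline int
      hamming64 (uint64_t a, uint64_t b)
      {
        return __builtin_popcountll (a ^ b);
      }
      
      int
      main (void)
      {
        uint64_t target    = 0x0123456789abcdefULL;  /* made-up values */
        uint64_t candidate = 0x0123456789abcdeeULL;
      
        printf ("distance: %d\n", hamming64 (target, candidate)); /* prints 1 */
        return 0;
      }
    
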
[1] <http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Other-Builtins.html>

[2] <http://www.embedded.com/columns/technicalinsights/19205567>

~~~
wooby
Thanks for the tips. gcc builtins are definitely the next level for me.

I'm curious about your scanf point though. I'm using scanf/strdup before the
check loop to load the words into memory. Are subsequent memcpy calls on an
implicitly loaded char *dict[] array faster on non-malloc'd memory for some
reason?

~~~
jcsalterego
Unfortunately, I have no empirical data to support the claim that hard-coding
in a payload has an advantage over dynamic allocation. As an aside, the Python
dictionary (hash table, for you rubyists/perlfaces) works this way, as it
over-allocates to allow growth, unless you specify its maximum size
(__slots__, I think?).

My hope and understanding is that the compiler can lay out and align static
data optimally, whereas something like malloc() has to allocate on the heap
_as best it can_, with the open possibility of other data needing to be on the
heap as well.

The point I was driving at, and what I tried to do, was to hard-code in as
many known variables as possible such that the compiler could do its thing and
try to be as efficient as it could with the given data. This includes the word
list, the list of lengths, the target hash -- none of which has to be done by
hand, but rather by a higher-level script run before compilation.
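
For illustration (this isn't my actual contest code; the words and hash are
made up), such a script might emit a header along these lines, handing the
compiler the word list, the lengths, and the target as static data it can
place and align at build time:

    
      /* words.h -- hypothetical output of a pre-compilation generator */
      #ifndef WORDS_H
      #define WORDS_H
      
      #include <stdint.h>
      
      static const char *words[]     = { "cloud", "ruby", "hash" };
      static const int   words_len[] = { 5, 4, 4 };
      static const int   words_count = 3;
      
      /* target hash fragment, placeholder value */
      static const uint64_t target_hash = 0x0123456789abcdefULL;
      
      #endif /* WORDS_H */
    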

Good times.

~~~
wooby
That makes sense; I might test it sometime.

------
nkurz
Great writeup, thanks! I'm confused by your conclusion, though: "I wish I
would have identified Rackspace initially as the platform of choice."

From the results of the contest, I would have concluded that the platform of
choice was CUDA. Do you think otherwise? Is there a reason you'd choose to use
a flotilla of Rackspace servers for a problem like this rather than a graphics
card? If you are going to be programming in C anyway, it seems like a much
simpler approach.

~~~
wooby
If I had been using just my own machine, or had spent a few grand to put
together my own CUDA farm, CUDA would have been great.

I think using cloud resources was the way to go though, because it's easy to
search for keys concurrently, and you can summon an effectively unlimited
amount of computing power pretty cheaply. You can add as many cloud servers as
you want to your cluster without worrying about locking or divvying up work,
and the number of keys your cluster is able to check per second increases
linearly with the number of machines.
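
As a rough sketch of what I mean (not my actual code; the key layout and
numbers are made up), each server can simply walk its own disjoint slice of
the key space, so no coordination is needed:

    
      #include <stdint.h>
      
      /* Hypothetical: each of n_servers machines checks the keys where
         key % n_servers == server_id, so the work partitions itself and
         throughput scales linearly with the number of machines. */
      static void
      search_slice (uint64_t server_id, uint64_t n_servers, uint64_t max_key)
      {
        for (uint64_t key = server_id; key < max_key; key += n_servers) {
          /* hash the key and compare it against the target here */
          (void) key;
        }
      }
      
      int
      main (void)
      {
        search_slice (0, 400, 1000000ULL);  /* server 0 of 400, made-up bound */
        return 0;
      }
    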

The winning team, awesome guys with a cluster of 8 very powerful computers and
a CUDA program, were able to check just over 400 million keys per second. If I
had used 400 Rackspace cloud servers, it would have cost me about $100 to run
them for the duration of the contest, and I would have achieved similar
speeds.

~~~
bwag
Looks like the winning team was able to generate more than 400 million keys
per second.

"With all the machines running we got up to a total of 3079431974
hashes/second" (via <http://www.win.tue.nl/cccc/sha-1-challenge.html>)

~~~
wooby
Uh oh - I missed a decimal place. 3 billion/sec would mean about 3,000 of the
cheapest Rackspace cloud machines, which could have been operated for 18 hours
for about $810. I'm not sure Rackspace lets you have that many, though.

------
deno
I'd like to see CUDA Cloud Hosting. The contest's results show that it'd be an
awesome solution for heavy computational problems.

