I remember thinking during level1, "Wait, they have miners going for each user? No way... these have to be premined somehow" (wrong, though it makes sense now why it looked that way). I recall, after thinking that, looking at the timestamps and seeing that they didn't exactly align with when the git pull happened. I would check, do some work, check, do some work, etc., and a new commit would come in that looked to be only a few seconds after the last commit, although it clearly wasn't available to me at that time.
"One consequence of this architecture was that submitting Gitcoins was decently slow — we weren't maintaining persistent connections to the backend gitcoin server, so there was a decent amount of overhead. We compensated for this by tuning the difficulty to ensure the time to mine a coin was large compared to the time to complete a push."
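The difficulty tuning described above comes down to simple arithmetic: with a Gitcoin-style target of d leading hex zeros on a commit hash, the expected number of attempts per coin scales as 16^d, so the organizers could push expected mining time well above the push overhead. A minimal sketch, where the function and the example numbers are illustrative assumptions rather than the contest's actual parameters:

```python
# Rough arithmetic behind "tune the difficulty": each hash attempt
# succeeds with probability 16**-d for d leading hex zeros, so the
# expected number of attempts per coin is 16**d.
def expected_mining_seconds(leading_hex_zeros, hashes_per_second):
    return 16 ** leading_hex_zeros / hashes_per_second

# e.g. 7 hex zeros at 1 MH/s -> roughly 268 seconds per coin on average,
# comfortably larger than the few seconds a git push takes
t = expected_mining_seconds(7, 1_000_000)
```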
Interestingly, this is the reason Bitcoin's block interval is 10 minutes and not less: it gives plenty of time for new blocks to propagate to the majority of the network.
It's not currently a big issue since the maximum block size is 1 MB (50% propagation typically takes less than 10 seconds: http://bitcoinstats.com/network/propagation/) but to scale to Visa volumes block sizes will need to be increased to hundreds of megabytes.
It's also a reason why the quicker confirmation times of Litecoin and Dogecoin aren't necessarily a good idea.
Thanks to Stripe for running the ctf! It was great fun.
I'd be interested to know how much of the load was generated by folks looping 'git push' on level4 in order to find a test case the benchmark bombed on (or just to get a lucky roll of the goraft race condition). The variability in scoring on that level was huge.
That was a lot more sophisticated than I expected! By the way, I think I was responsible for about 10k of those 640k pushes :)
I was really hoping for an overview of the solutions people found for each level, though. Going through individual write-ups to find them all is a little tedious.
Here's the rough list of what I was aware of:
level0 (take input of words from stdin, output to stdout whether they exist in the dictionary):
a) Read file into a hash set, test each word as it comes in.
b) Read file into a bloom filter
c) Realize that it is always the same dictionary and pre-compile it into a hash set/bloom filter. The fastest solutions were generally of this type, written in highly optimized C with a ~600kB bloom filter compiled into the binary.
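Approaches (b) and (c) can be sketched as follows. This is a toy bloom filter, not any contestant's actual code: the bit-array size matches the ~600 kB figure above, but the hash scheme (deriving k indices from one SHA-256 digest) and the hash count are my own assumptions.

```python
import hashlib

M = 4_800_000  # bits (~600 kB, the rough size mentioned above)
K = 7          # number of hash functions (illustrative choice)

def _hashes(word):
    # Derive K bit indices from a single SHA-256 digest; any fast
    # hash family would do in practice.
    digest = hashlib.sha256(word.encode()).digest()
    for i in range(K):
        yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % M

def build_filter(words):
    bits = bytearray(M // 8)
    for w in words:
        for h in _hashes(w):
            bits[h // 8] |= 1 << (h % 8)
    return bits

def maybe_contains(bits, word):
    # False positives are possible; false negatives never are.
    return all(bits[h // 8] & (1 << (h % 8)) for h in _hashes(word))

bits = build_filter(["apple", "banana", "cherry"])
```

The pre-compiled variant in (c) amounts to serializing `bits` into the binary at build time, so the program starts with the filter already in memory.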
level1 (gitcoin mining):
I'm not very familiar with this one, but I believe near the end several people had written custom GPU miners in OpenCL and were running them on small clusters of EC2 GPU instances.
level2 (DDoS prevention):
I am also unfortunately not very familiar with the high scoring solutions to this level, since I never really figured out how idle time was calculated. However, the one common and easy solution was:
a) Keep a list of requests per IP and ban any IP after it made more than n requests (4 was a good number).
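The ban-after-n-requests approach is just a counter per IP. A minimal sketch, using the threshold of 4 mentioned above (the class and method names are mine):

```python
from collections import Counter

class SimpleBanlist:
    """Allow each IP a fixed number of requests, then reject it."""

    def __init__(self, limit=4):
        self.limit = limit
        self.counts = Counter()

    def allow(self, ip):
        # Count the request and ban the IP once it exceeds the limit.
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit

bl = SimpleBanlist(limit=4)
```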
level3 (text search):
a) Shell out to 'grep -r -n query path', parse out the relevant part of the path/line number and submit (surprisingly, this worked to get a score just above passing if you tried a few times).
b) The intended solution was probably to build text search with a trigram index/suffix array/etc. I know a few people did this with reasonable success.
c) A faster solution was to realize that the tester only ever queried words that were in the dictionary. You still had to return results for words in which the query was a substring, but that is a finite set which could be pre-computed during indexing (or even before uploading your solution, since the dictionary was known in advance). The fastest (legitimate) solutions were of this type: single-node, single-threaded (because requests were sequential) TCP servers written in C, in which the answer to every possible query (only 250k possibilities) was pre-computed and stored in memory during indexing. It was just a matter of running this until you got a sufficiently good run, since the bottleneck was always the ~1.68ms of latency the test framework added to each request.
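The pre-computation in (c) can be sketched like this. The corpus and dictionary here are toy stand-ins; the real solutions did the same mapping over the known 250k-word dictionary:

```python
from collections import defaultdict

dictionary = {"cat", "catalog", "dog"}          # toy dictionary
corpus = {("a.txt", 1): "the catalog lists a dog"}  # (path, line) -> text

# Index build: for every dictionary word, record every location whose
# token contains it as a substring, so each query is one dict lookup.
index = defaultdict(set)
for (path, lineno), line in corpus.items():
    for token in line.split():
        for q in dictionary:
            if q in token:
                index[q].add((path, lineno))

def search(query):
    # Every answer was computed at index time; lookups are O(1).
    return sorted(index.get(query, ()))
```

A real server would hold `index` in memory and answer each TCP request with a single hash lookup, which is why the framework's added latency became the bottleneck.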
level4 (distributed consensus):
a) Just proxy all requests to a hard-coded master node running a single in-memory SQLite instance. You would occasionally lose the connection to it, but as long as your queries were fast, you would score enough points early in the round to pass even after the single-point-of-failure tester was added.
b) Implement Raft. Conveniently, the example code was laid out suspiciously similarly to the open-source goraft implementation, which served as a drop-in solution. Minor tweaks were necessary to account for the test framework pinging non-leader nodes (which meant you just had to proxy those requests to the leader).
c) The most clever solution I was aware of, which I believe was discovered by the highest-scoring player (xthexder), was to abuse the underlying HTTP library in the Octopus test framework. To understand how it worked, you must first understand that the test framework used a Stripe-developed system called 'Octopus' which would run your nodes in isolated containers that could only communicate with each other over specifically designated Unix sockets. Octopus would do nasty things to these sockets, such as adding latency, severing connections between certain nodes, or partitioning whole sets of nodes from each other. However, the nodes were always running, and they always had a connection to Octopus. So if you could convince Octopus to carry your message to the other nodes for you, you wouldn't have to worry about any of this silly distributed-consensus logic.
The way xthexder discovered to do this was to send 302 redirects back to Octopus with the query and the location of the hard-coded leader node (which was running a custom in-memory SQL database optimized for the very few queries that Octopus ever used). I don't remember the exact details (read his write-up for more info!), but it required some trickery to get Octopus to respect the redirects. Once it did, you were able to get a perfect connection to a master node with zero intra-node traffic (which would normally reduce your score).
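The redirect trick, in skeleton form, is just a follower that answers every request with a 302 pointing at a hard-coded leader, so the harness's own HTTP client carries the query there. Everything here (the leader address, the handler shape) is an illustrative assumption, not Octopus's actual wire format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

LEADER = "http://leader-node"  # hypothetical hard-coded leader address

def redirect_location(path):
    # Preserve the original path and query string when pointing at the leader.
    return LEADER + path

class RedirectToLeader(BaseHTTPRequestHandler):
    # A follower node: every request gets a 302 to the leader, so the
    # test harness's HTTP client delivers the query itself, with no
    # intra-node traffic from us.
    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", redirect_location(self.path))
        self.end_headers()
```

This only works if the client follows redirects, which is exactly the HTTP-library behavior the trick exploited.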
I also discovered a couple really dirty tricks for level0 and level3:
a) The test framework actually used predetermined test cases, so that if you failed the test, you could rerun the exact same test locally and see where your code went wrong. The test cases were randomly selected after you pushed your code, so you couldn't hard-code the solutions into your code (that would be cheating!). Except there was a weakness: the build phase ran on a container that allowed external network connections, so that you could download packages from common package repositories like npm. The test framework would also output the selected test case to you (but not to your code) prior to running the build phase. This meant you could have your build script make a request to your own server, which would hold the request open until you read the test case from the test framework output, ran it locally, packaged up the answers, and gave them to your server to return back to your running solution on the test container. This allowed your solution, running on the test container, to have the answers to level0 and level3 loaded in memory before the test framework even sent its queries.
b) Method 'a' was patched over the weekend by moving the test-case disclosure to after the build phase, so I went looking for a workaround. What I found was just about as good. It turns out there was a fixed number of test cases for each level (I believe it was 1000 for all levels). All of them were publicly available on AWS S3; you just needed their names, which I got by continually running a git push, parsing the test case, aborting, then running git push again. I ran at about 30 cases per minute, which meant running through all 1000 only took ~30 minutes on average. I downloaded every test case, found the ones with the highest scoring potential (generally the ones with the worst benchmark score) and hard-coded the answers to those in my solution. I then had a script to run 'git push', parse out the test case it would run, abort, and rerun until it got one of the test cases I had hard-coded. It's important to note that the remote server still reported the chosen test case prior to finishing the test run, so it only took a couple of seconds to decide whether to let the run finish, rather than waiting a minute or two for it to run fully. With 10 hard-coded test cases (out of 1000 possible) and checking one possibility every 2 seconds, it only took an average of ~3 minutes to get a run where the code running on the test server would once again have the answers loaded in memory before the test framework sent a query.
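The retry loop in (b) can be sketched like this. The output format matched by the regex and the "good" case names are my assumptions, not the harness's actual format:

```python
import re
import subprocess

GOOD_CASES = {"case-017", "case-402"}  # hypothetical precomputed cases

def pick_case(line):
    # Parse an assumed "test case: <name>" line from the push output.
    m = re.search(r"test case:\s*(\S+)", line)
    return m.group(1) if m else None

def push_until_good_case():
    while True:
        proc = subprocess.Popen(
            ["git", "push", "origin", "master"],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
        )
        for line in proc.stdout:
            case = pick_case(line)
            if case is None:
                continue
            if case in GOOD_CASES:
                proc.wait()       # a hard-coded case: let the run finish
                return case
            proc.terminate()      # wrong case: abort the push and retry
            break
        else:
            proc.wait()
```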
Start-up time was already the bottleneck for all of the top solutions. Reducing it was the primary goal, and the top solutions were heavily optimized for it, down to compiler configurations. One of the primary reasons to compile in a bloom filter instead of a hash set was that it made your binary ~2 MB smaller, which was a significant advantage in start-up time.
I was #6 using a bloom filter. I don't think anybody properly combined a libc start-up optimization with a bloom filter, because the numbers I saw were significantly better than those required to score in the top few.
Well, you just use system calls and import functions from other shared libs dynamically. There's really not much more to it.
The stdlibs are supposed to make your life more comfortable; of course it's not that comfy without them.
Thank you for posting the summary and breakdown of all the levels. I started and stopped at level1 because the pushes were broken, and I never got back around to it (I believe it was while they were fixing the level0 stuff).
Thanks. One more question: the write-up mentions that Stripe is hiring remote folks in US time zones. Is that for all engineering positions? I could be missing something, but the jobs page doesn't mention anything about remote positions.