Hard as it may be to believe, there's a difference between things that don't interest you personally and things that are junk. And the existence of things that don't interest you somewhere does not automatically mean that everything there is crap.
Don't see the sort of kits you like being made? Join us! Innovate, and design your own!
tl;dr: You've set up the challenge in such a way that demonstrating any of the threat models against which client-side crypto is weak would first require compromising other layers of security that are out of scope for the challenge.
Enough parts for 20 boards cost me about 100 pounds; the board fabrication cost about $60, but a lot of that was because I wanted it in a rush. All up, that works out to about 8 pounds, or 13 USD per board. It'd be somewhat cheaper - around $10-$11 - if you're not in a rush for the boards, and of course cheaper again in larger quantities.
I probably overdid the solder a bit, but there's no way most of the pads had enough solder on them for SOICs without my adding any. I've soldered SOICs before, but this is by far the largest volume I've done in a sitting.
As I mentioned briefly in the post, I used the autorouter largely because of time constraints - with 200-odd wires to route and very little time to do them in if I wanted the PCBs back in time, I decided to give it a go. That said, I was impressed by the quality of the results - I'd be surprised if I could find any easy-to-remove vias left in its solution. The tool is also very good for assisted hand routing, since it lets you nudge traces around without ripping them up every time.
I got them done at Hackvana (hackvana.com, or #hackvana on irc.freenode.net); similar prices to Seeed, and excellent customer service.
I'm not sure about the maximum speed. The flipflop, for instance, has a delay of 14ns; if we take that as the average, each slice has to go through 5 parts (input mux, LUT, flipflop, async select mux, output enable) for a total delay of 70ns, so in theory one slice could do about 14MHz. Since it's a ripple counter, I guess we should divide that by the number of slices, so 4MHz seems like a reasonable upper bound.
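For anyone who wants to play with the numbers, here's that back-of-the-envelope estimate as a rough sketch. The 14ns per part and 5 parts per slice are from above; the slice count of 4 is only an assumption I've plugged in for illustration, and `max_clock_mhz` is a made-up helper name.

```python
def max_clock_mhz(part_delay_ns, parts_per_slice, slices):
    """Rough upper bound on clock rate for a ripple chain of slices.

    Assumes every part has the same propagation delay and that the
    delays simply add up along the chain (a simplification).
    """
    slice_delay_ns = part_delay_ns * parts_per_slice
    total_delay_ns = slice_delay_ns * slices
    # 1 / ns gives GHz, so multiply by 1000 to get MHz.
    return 1000.0 / total_delay_ns


# One slice: 5 parts at 14ns each is 70ns.
single = max_clock_mhz(14, 5, 1)

# Ripple counter: delays accumulate across slices (4 is an assumed count).
chain = max_clock_mhz(14, 5, 4)

print(f"one slice: {single:.1f} MHz, four slices: {chain:.1f} MHz")
```

With four slices this lands a little under 4MHz, which is where the rounded figure comes from; a different slice count scales it proportionally.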
I'm guessing you're talking about the SRAM-based alternative I discussed at the end? Yes, that would be an option, as would an SD card. 64Mb would be enough for plenty of 256 kilobit slices.
I do prefer the idea of using EEPROM if I were going down that route, though, so it'd be more like a CPLD than an FPGA. I just need a good way to load the only remaining bit of discrete state - the output enables - on startup. My best idea thus far is to dedicate half the EEPROM to configuration data, store the latch states at address 0, and use two RC networks to create rising edges, first on a register latch pin and then on the highest address pin, to latch in the config before enabling the EEPROM in 'operating mode'.
Have you read the papers I linked in detail? Some of them, such as HyperLogLog, provide corrections to give better estimates for small sets, and although I can't follow the proof in its entirety, they claim to be more efficient than the alternatives, including the one you propose.
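To make the small-set correction concrete, here's a minimal sketch of HyperLogLog as I understand it from the published description: when the raw estimate is small and some registers are still empty, it falls back to linear counting. The class name, the choice of SHA-1 as the hash, and the default of 1024 registers are my own assumptions for illustration, and I've left out the large-range correction since a 64-bit hash makes it unnecessary in practice.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch; not tuned or production-ready."""

    def __init__(self, p=10):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: use linear counting on empty registers.
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

With only 100 distinct items in 1024 registers, the estimate comes from the linear-counting branch rather than the raw harmonic mean, which is the part that helps with small sets.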