How well does a system like this work for bootstrappers on a tight budget? By the nature of the design, it seems you're going to need quite a few more servers than a simple LAMP-like setup. Between Hadoop, Cassandra, Storm, web servers and the like, you're looking at roughly 10 server instances right out of the gate.
I ask because I'm intrigued by this kind of design, but not the server cost that seems to be associated with it for a newly launched (and potentially unproven) product.
If you have big data, you're going to need lots of servers anyway, and I think there's no better way to manage that data than with the techniques I talk about in the book.
While I think these techniques can scale down, the current crop of Big Data technologies (esp. Hadoop) doesn't scale down very well. That is, they have a lot of fixed overhead for small amounts of data. So while these techniques can work for "small data", it's going to be relatively more costly. For big data, the overhead is amortized. In the future, I do see scaling down as an important evolution for these technologies.
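To make the amortization point concrete, here is a rough sketch with made-up numbers. The fixed overhead and per-GB cost below are purely hypothetical (not measurements of Hadoop or any other tool); the only point is that a fixed cost dominates at small data sizes and nearly vanishes at large ones.

```python
def cost_per_gb(fixed_overhead, per_gb_cost, data_gb):
    """Average cost per GB when a fixed overhead is spread across data_gb."""
    return fixed_overhead / data_gb + per_gb_cost


# Hypothetical figures, for illustration only (arbitrary cost units).
FIXED_OVERHEAD = 100.0  # assumed cost of running the cluster at all
PER_GB_COST = 0.5       # assumed marginal cost of processing one GB

for size_gb in (1, 100, 10_000):
    avg = cost_per_gb(FIXED_OVERHEAD, PER_GB_COST, size_gb)
    print(f"{size_gb:>6} GB -> {avg:.2f} per GB")
```

With these assumed numbers, the average cost per GB falls from 100.50 at 1 GB to 0.51 at 10,000 GB, which is the sense in which the overhead is "amortized" for big data.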
Can you recommend some tools for someone starting down this path? I'm comfortable with apt-get and mildly capable with the AWS console, but I'm a bit daunted by the idea of attempting to automatically spin up 2-3 servers, have them configure themselves, and then have them form up a little Hadoop cluster. The "set up your own single-node Hadoop cluster on Ubuntu" guides I've skimmed have a sizeable number of configuration details that are completely opaque to an outsider.
Not being huge into Java isn't helping either. Would I be better served by biting the bullet and doing things in Java initially, or can I skip right to Jython, JRuby, Clojure or something similar?
I'm a big fan of Pallet for infrastructure management (https://github.com/pallet/pallet). That's what we used for all our infrastructure on AWS at BackType, and my team has continued to use it to manage our machines within the Twitter datacenter. Pallet has a steep learning curve, but it's worth it.
Sam wrote the pallet-hadoop tool, which can spin up Hadoop clusters at the click of a button (https://github.com/pallet/pallet-hadoop). That said, if you're on AWS you're better off just using EMR.
You don't need to use Java. I do everything in Clojure (using Cascalog and Storm's Clojure DSL).
The one thing that makes me mildly uncomfortable about Pallet is that, in the end, it's just another "run these shell scripts to set up your server" system. I find I prefer tools like Puppet or Chef, extended to deal with AWS (cluster-chef, for example).