
If you have big data, you're going to need lots of servers anyway, and I think there's no better way to manage that data than with the techniques I talk about in the book.

While I think these techniques can scale down, current Big Data technologies (esp. Hadoop) don't scale down very well. That is, they have a lot of fixed overhead for small amounts of data. So while these techniques can work for "small data", the cost per unit of data is going to be relatively higher. For big data, the overhead is amortized. In the future, I do see scaling down as an important evolution for these technologies.
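To put rough numbers on the amortization point, here is a minimal sketch that assumes a job's cost is a fixed startup overhead plus a per-GB processing cost. The function name and both constants are hypothetical, picked only for illustration; they are not measurements of Hadoop or any other system.

```python
def cost_per_gb(data_gb, fixed_overhead_s, per_gb_s):
    """Seconds spent per GB when a job pays a fixed overhead plus a linear cost."""
    return (fixed_overhead_s + per_gb_s * data_gb) / data_gb


# Hypothetical values (assumptions, not benchmarks).
FIXED_OVERHEAD_S = 30.0  # job scheduling, task startup, etc.
PER_GB_S = 2.0           # actual processing time per GB

if __name__ == "__main__":
    for size_gb in (0.1, 1, 10, 100, 1000):
        per_gb = cost_per_gb(size_gb, FIXED_OVERHEAD_S, PER_GB_S)
        print(f"{size_gb:>7} GB -> {per_gb:8.2f} s/GB")
```

Under these assumed numbers the fixed overhead dominates the per-GB cost for small inputs and becomes negligible for large ones, which is the sense in which big data amortizes it.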



Can you recommend some tools for someone starting down this path? I'm comfortable with apt-get and mildly capable with the AWS console, but I'm a bit daunted by the idea of attempting to automatically spin up 2-3 servers, have them configure themselves, and then have them form up a little Hadoop cluster. The "set up your own single-node Hadoop cluster on Ubuntu" guides I've skimmed have a sizeable amount of configuration details that are completely opaque to an outsider.

Not being huge into Java isn't helping either. Would I be better served by biting the bullet and doing things in Java initially, or can I skip right to Jython or JRuby or Clojure or something?


I'm a big fan of Pallet for infrastructure management ( https://github.com/pallet/pallet ). That's what we used for all our infrastructure on AWS at BackType, and my team has continued to use it to manage our machines within the Twitter datacenter. Pallet has a steep learning curve, but it's worth it.

Sam wrote the pallet-hadoop tool which can spin up Hadoop clusters at the click of a button ( https://github.com/pallet/pallet-hadoop ). Although if you're on AWS you're better off just using EMR.

You don't need to use Java. I do everything in Clojure (using Cascalog and Storm's Clojure DSL).


The one thing that makes me mildly uncomfortable about Pallet is that, in the end, it's just another "run these shell scripts to set up your server" system. I find I prefer tools like Puppet or Chef, extended to deal with AWS (cluster-chef, for example).




