If you're trying to scale an Internet-style app on one of these machines, you might need to expand past one machine after a while. By staying on one machine, you avoid all the complexity needed in your software to coordinate between multiple machines; if you outgrow a single box, you'll have to add that complexity anyway. So what do 10 beefy boxes buy you as opposed to 1000 smaller ones? There is of course an operational/DC/power cost to running more boxes, but I think most shops consider that an easily solvable problem. For example, a maxed-out POWER7 box from IBM will give you 256 processors and all the memory and I/O trimmings you need. If you need more than 256 processors or more RAM than fits locally, you'll pay the software complexity cost anyway.
What you're really paying for when buying a 256-processor POWER7 box is the fact that the interconnect (and therefore the time to acquire a lock or update data from another node) is much faster and more reliable than commodity networks/kernels/stacks.
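To put rough numbers on it (both latency figures below are ballpark assumptions of mine, not measured POWER7 or network specs):

```python
# Back-of-envelope: lock acquisitions per second over a coherent
# interconnect vs. a commodity network. Both latency figures are
# ballpark assumptions, not measured numbers for any specific box.
CC_NUMA_REMOTE_NS = 500       # remote cache line over a big-box interconnect
COMMODITY_RTT_NS = 100_000    # Ethernet round trip through a kernel TCP stack

numa_rate = 1e9 / CC_NUMA_REMOTE_NS
net_rate = 1e9 / COMMODITY_RTT_NS

print(f"big box:   ~{numa_rate:,.0f} acquisitions/sec per thread")
print(f"commodity: ~{net_rate:,.0f} acquisitions/sec per thread")
print(f"ratio:     ~{numa_rate / net_rate:.0f}x")
```

With those assumptions the big box wins by a couple of orders of magnitude on any workload that's lock- or round-trip-bound.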
I had the opportunity to try out Google's C++ implementation of MapReduce way back (about 6 years ago). Those jobs ran on fairly impoverished processors, essentially laptop grade. I have done work on Yahoo's Hadoop setup as well; those used high-end multicore machines provisioned with oodles of RAM (I don't think I should share more than that). Being generous, Hadoop ran 4 times slower as measured by wall-clock time. Not only that, Hadoop required about 4 times more memory for similarly sized jobs. So you ended up requiring more RAM, running for longer, and potentially burning more electricity. This is by no means a benchmark or anything like that, just an anecdote.
That Hadoop would require much more memory did not surprise me; that was expected. What was really surprising was that it was so much slower. The JVM is one of the best-optimized virtual machines out there, but its view of the processor is very antiquated, and it does not surface those hardware-level advances to the programmer. You pay for a hot-rod machine but run it like an old faithful Crown Victoria.
Four times might not seem like much (and remember, I'm being generous), but it makes a big difference when you can make multiple runs through the data in a single day and change the code/model between runs. Debugging and ironing out issues is a lot more efficient.
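To make the iteration point concrete, here's a toy calculation; the 2-hour baseline job length is a made-up assumption, and only the 4x factor comes from what I saw:

```python
# Toy calculation: how a 4x wall-clock slowdown compounds into fewer
# experiment iterations per day. The 2-hour baseline is an assumption;
# only the 4x slowdown factor comes from my anecdote above.
BASELINE_HOURS = 2.0
SLOWDOWN = 4.0
WORKDAY_HOURS = 10.0

runs_fast = int(WORKDAY_HOURS // BASELINE_HOURS)               # 5 runs/day
runs_slow = int(WORKDAY_HOURS // (BASELINE_HOURS * SLOWDOWN))  # 1 run/day

print(f"fast stack: {runs_fast} code/model iterations per day")
print(f"slow stack: {runs_slow} iteration(s) per day")
```

Five tries at the model per day versus one is the difference between iterating and waiting.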
I think MapReduce gave Google a significant competitive advantage over the rest, and probably still does.
Facebook's business model involves getting 1 billion people to post a ton of stuff inside Facebook, costing them about $2/user/year in infrastructure and $3.50/user/year in other costs, while making about $7/user/year in advertising revenue, yielding about $1.50/user/year in profit. So cutting costs on that $2 makes them significantly more profitable.
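Running those numbers, with a hypothetical 25% infrastructure cost cut added for illustration:

```python
# Per-user/year economics from the figures above (all USD).
USERS = 1_000_000_000
INFRA, OTHER, REVENUE = 2.00, 3.50, 7.00

profit = REVENUE - INFRA - OTHER                 # $1.50/user/year
print(f"profit: ${profit:.2f}/user/year, ${profit * USERS / 1e9:.1f}B total")

# A hypothetical 25% cut to the $2 infrastructure line:
profit_cut = REVENUE - INFRA * 0.75 - OTHER      # $2.00/user/year
print(f"with 25% cheaper infra: ${profit_cut:.2f}/user/year "
      f"(+{(profit_cut / profit - 1) * 100:.0f}% profit)")
```

A quarter off the infrastructure line is a third more profit, which is why the engineering effort pays for itself.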
Bloomberg is probably a better example of a company that builds "optional" technology in-house just to be awesome, though. They're not at the scale of FB (where "traditional" solutions break down), but from what I've seen, they do a lot of interesting work in-house because their staff want to do it, and because it lets them keep really top-quality staff in a highly competitive market.
I'm not so sure about that. Bloomberg processes an incredible amount of data, and they have strict latency requirements. In many cases, traditional solutions would in fact break down under those requirements.
Is this really true? It seems to me that these challenges exist entirely because they collect a whole bunch of information that people would rather they didn't, and routinely rebuke them for collecting. It's like working on drone targeting problems that are made more difficult because children move more unpredictably (or more quickly, etc.) than adults. "Yeah, but the math is insane!"
You may dislike my analogy, but it only appears "trivial and unimportant" because it's the most mundane aspect of a larger unsavory project.
Yes this is a shameless plug for the company I love working for, but I think it addresses Skywing's point. We are one of the few companies at this scale that are completely location agnostic, and we hire by trial (can you do the job), not by credentials.
Anyone else feeling this way, drop me a line chris.hoult at datasift dot com.
Any other ways? I don't believe there is a VM for this sort of "experimentation".
PS: you calculated using 400 million.
Considering my latest fb backup was ~18MB (unzipped), of which 14MB was pictures, this doesn't sound too unreasonable to me. If anything, it sounds very conservative. If I were actively using fb for photos, I'd easily have at least 100 times as many, maybe 200 times as many. Not to mention that the single (short) video I've uploaded is 1.4MB.
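Extrapolating from my own numbers (the 150x multiplier splits the difference between my 100x and 200x guesses, and the one-heavy-user-in-ten fraction is pure assumption):

```python
# Back-of-envelope from my own backup. The 150x photo multiplier splits
# my 100x-200x guess; the one-heavy-user-in-ten fraction is pure assumption.
BACKUP_MB = 18
PHOTOS_MB = 14
OTHER_MB = BACKUP_MB - PHOTOS_MB              # 4 MB of non-photo data

heavy_user_mb = OTHER_MB + PHOTOS_MB * 150    # ~2.1 GB per heavy photo user
print(f"heavy photo user: ~{heavy_user_mb / 1024:.1f} GB")

total_pb = heavy_user_mb / 1024**2 * 1e9 / 10 / 1024
print(f"1B users, 10% heavy: ~{total_pb:,.0f} PB before replication")
```

Even under those conservative assumptions you're looking at hundreds of petabytes before any replication.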
So maybe that's not as stupid as it sounds. You could also have some level of caching, e.g. I visited your pictures yesterday, so people could get them from me for some time if you're away.
I definitely see potential in this kind of "distributed but controlled" storage mechanism.
Similarly, I like the idea of having a local backup drive, and another copy at a friend's place (possibly encrypted).
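A minimal sketch of the encrypted-copy-at-a-friend's idea, assuming Python's cryptography package; the filenames are placeholders:

```python
# Minimal sketch: encrypt the backup before it leaves your house,
# using the `cryptography` package (pip install cryptography).
# "backup.tar" and "backup.tar.enc" are placeholder filenames.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # keep this key at home, not at the friend's
fernet = Fernet(key)

with open("backup.tar", "rb") as src:
    encrypted = fernet.encrypt(src.read())   # fine for modest files; reads all into RAM

with open("backup.tar.enc", "wb") as dst:    # this file goes to the friend
    dst.write(encrypted)

# Recovery: fetch backup.tar.enc back home and run fernet.decrypt(encrypted)
# with the key you kept.
```

The point is that the friend only ever holds ciphertext, so you get the redundancy without having to trust them with the contents.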
A better bet might be for them to scale up in Utah and harvest methane from waste to generate power.
Utah's power grid is primarily coal-fired, but they use a fair bit of natural (methane) gas as well. Utah has a couple of major hydroelectric dams (Glen Canyon, Flaming Gorge), and the use of solar and wind is on the rise.
The NSA is building a major data warehouse in Utah; one of the considerations would have to have been cheap power. I'm guessing the Columbia River produces cheaper electricity than pretty much anywhere else, but Utah has very diverse (and affordable) power production overall.
Failed on humor the first time; trying again.
Actually, all tech can be criticised for banality. Think of the telephone - 'they put in all those cables just so she can talk to mother...'. Or television - 'they dug up the street just so they could lay those fibre optic cables so your gran could watch the wrestling...'. Or even the trains - the train stopping at my home town does seem a waste of time; I can't believe they bother when you look at who gets on.
The telephone was the first infrastructure to provide real-time voice communication. It enables families to stay in contact, but it also enables economic growth and a more effective society writ large.
Television is now a mindless wasteland of race-to-the-bottom drivel, but there are newer networks that haven't yet succumbed to it, mostly on digital cable. I only have the respect for science and technology that I do because I grew up watching The Magic School Bus and Bill Nye the Science Guy. My parents watched the moon landing on television.
Facebook does not do anything novel, nor has it ever been used for anything terrifically insightful. It provides some social value and exists for that reason, but it is clearly not equivalent to all other technologies.
You should really separate the application of the technology from the technology itself. Your last sentence could be said in exactly the same way about the telephone: "it provides some social value and exists for that reason". But that is clearly a ridiculous statement to make about the telephone.