A study of jobs submitted to the Yahoo! cluster showed that the median job involved 12GB of data.
There's really nothing wrong with that at all, because breaking on 64MB blocks, that 12GB can be processed in parallel, which means turning an answer around really quick on that 12GB, say 30 seconds or so. Usually the work can be scheduled on machines that already have the necessary input, so the network cost is low.
Now, it might not be worth it for one hacker to build a Hadoop cluster to do that one job, but if you have a departmental-wide or company-wide cluster you can just submit your jobs, get quick answers, and let somebody else sysadmin.
Sure the M/R model is limited, but it's a powerful model that is simple to program. You can write unit tests for Mappers and Reducers that don't involve initializing Hadoop at all, and THAT speeds up development.
Yes, it is easy to translate SQL jobs to M/R, but M/R can do things that SQL can't do. For instance, an arbitrary CPU or internet intensive job can be easily embedded in the map or in the reduce, so you can do parameter scans over ray tracing or crack codes or whatever.
I built my own Map/Reduce framework optimized for SMP machines and ultimately had my 'shuffle' implementation break with increasing input size. At that point I switched to Hadoop because I didn't plan to have time to deal with scalability problems.
Looking at the median is not very interesting, since jobs in these environments are always heavily skewed. You have those 5-10% jobs that are seceral orders if magnitude beyond those 13GB, and those are the ones you run the cluster for.