

Are simplified hadoop interfaces the next web cash cow? - davidedicillo
http://brianbreslin.com/are-simplified-hadoop-interfaces-the-next-web-cash-cow/

======
stephenjudkins
He's right that Hadoop needs to be simpler, easier and faster to use than it
is right now. And I don't doubt there is value in being someone who puts a
friendlier layer on top of it.

However, I don't think doing so would be as huge or game-changing as he
argues. Any company that has somehow aggregated petabytes of data probably has
some reasonably competent developers and admins. Dealing with the inevitable
scale-related issues will either burn someone out or make them competent. For
these developers, getting Hadoop running now is probably _easier_ than doing
whatever else they've been doing to keep aggregating these petabytes of data.
I went from knowing little about Hadoop to running relatively complex queries
in a matter of hours through Amazon EMR and Hive (a SQL-ish layer on top of
Hadoop). Basically, I gzipped and uploaded the data to an S3 bucket,
downloaded a client to easily use the EMR API, wrote a simple Hive query
derived from examples, and it just worked. All this was much easier than what
I've had to do to keep a MySQL cluster or web application running.
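For the curious, the workflow above looks roughly like this as a sketch (the
data file, bucket name, and table schema are all hypothetical, and the actual
S3 upload is left as a comment since it needs AWS credentials):

```shell
# 1. gzip the raw data -- Hive reads gzipped text files directly.
printf 'alice\t3\nbob\t5\n' > pageviews.tsv
gzip -f pageviews.tsv

# 2. Upload it to an S3 bucket, e.g.:
#      aws s3 cp pageviews.tsv.gz s3://my-bucket/input/

# 3. Write a simple Hive query (modeled on the stock examples) to run
#    as an EMR job step over the uploaded data:
cat > query.hql <<'EOF'
CREATE EXTERNAL TABLE pageviews (username STRING, views INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/input/';

SELECT username, SUM(views) FROM pageviews GROUP BY username;
EOF
```

The point stands: for someone already comfortable at the command line, this is
far less work than standing up a Hadoop cluster by hand.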

~~~
brianbreslin
well my thesis was based on the premise that those who would need a UI for
hadoop-scale queries wouldn't be the ones doing the data collection, i.e. the
marketers or research scientists who want to run queries against existing data
sets, or data sets in use by others in their companies.

imagine someone in say the sports division at aol wanting to know patterns for
users across aol who have viewed auto related info or something.

the idea is to not have to bother the sophisticated db admins, and to let end
users figure out the data themselves.

------
jluxenberg
Amazon's Elastic MapReduce does a decent job of this, and there's a GUI for it
in the form of the Console (<http://console.aws.amazon.com>):
<http://aws.amazon.com/elasticmapreduce/>

~~~
seldo
EMR merely abstracts away the problem of configuring and hosting hadoop; the
actual business of crunching data is still very much in the build-it-yourself
realm.

There is a huge, huge market for a value-added service on top of EMR to help
people understand their data.

------
jhammerb
from my perspective, a single user interface paradigm is unlikely to cover the
variety of use cases that the full suite of hadoop platform services (flume,
sqoop, hdfs, mapreduce, hive, pig, oozie, hbase) provides.

at cloudera, we've built and released into open source a user interface
construction kit and an environment to house these various "applications".
that way we can incorporate any metaphor you'd like: file browser,
spreadsheet, yahoo pipes, etc.

see <http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-hue> and
<http://www.cloudera.com/blog/2010/07/developing-applications-for-hue> for
details.

------
cageface
Interesting idea. How do you solve the problem of marshaling tera/petabytes of
customer data into a system like this?

~~~
brianbreslin
I honestly don't know. Beyond physical hardware transfers (shipping a disk), I
don't think the average person has a connection that could upload terabytes of
data.
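A quick back-of-envelope calculation supports this (both figures below are
illustrative assumptions, not measurements of anyone's actual connection):

```shell
# Time to upload 1 TiB over a 10 Mbit/s residential uplink.
bytes=$((1024 * 1024 * 1024 * 1024))   # 1 TiB
bits_per_sec=$((10 * 1000 * 1000))     # 10 Mbit/s uplink
secs=$((bytes * 8 / bits_per_sec))
echo "$((secs / 86400)) days"          # prints "10 days"
```

So even a single terabyte ties up the line for over a week, before you get to
"petabytes".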

You'd have to assume much of the data is already in the cloud. I'm no expert
on the matter, but I'm sure someone on HN could figure out the solution. I was
merely proposing the idea.

~~~
jacquesm
Tanenbaum's station wagon always seems to be one step ahead of fixed-line
capacity.

------
vannevar
Those interested might want to check out Cascading
(<http://www.cascading.org/>), an open-source project tackling this very
problem.

~~~
earl
Cascading is nice, don't get me wrong, but you still know you're writing
MapReduce. And you're still far more involved in the details of moving data in
and out than I would like.

