
The Commoditization of Massive Data Analysis  - makimaki
http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html
======
fauigerzigerk
I find this MapReduce vs RDBMS controversy so odd, as there are very few
tasks for which both could be used.

MapReduce is used to implement massively parallel batch pipelines. RDBMS are
used either for interactive OLTP systems or for data warehouses meant to
support queries that may not be known upfront.

Google uses MapReduce in a way similar to traditional ETL pipelines, not for
query processing. You could fill any relational data warehouse using
MapReduce. So why does this debate refer to SQL at all?
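To make the ETL comparison concrete, here is a minimal in-process sketch of the map/shuffle/reduce pattern, using made-up log records (the data and function names are illustrative, not Google's actual pipeline):

```python
from collections import defaultdict

# Hypothetical log records: (url, bytes_served) pairs, the kind of raw input
# an ETL-style job would aggregate into warehouse summary rows.
records = [("/a", 120), ("/b", 300), ("/a", 80), ("/c", 50), ("/b", 100)]

def mapper(record):
    url, nbytes = record
    yield url, nbytes          # emit key/value pairs

def reducer(key, values):
    return key, sum(values)    # aggregate all values for one key

# Shuffle phase: group mapper output by key.
groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)

# Reduce phase: one summary row per key, ready to bulk-load into a warehouse.
rows = sorted(reducer(k, vs) for k, vs in groups.items())
print(rows)  # [('/a', 200), ('/b', 400), ('/c', 50)]
```

The output rows could be loaded into any relational warehouse, which is the point: MapReduce fills the store, it doesn't replace querying it.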

The answer is stream processing, that is, querying, filtering, and analysing
data without ever storing it. It's about making decisions based on data as it
flows through the system. The data isn't stored in any index; it's thrown away
as soon as possible. Google doesn't do that as far as I know, at least not in
its search engine. Algorithmic trading applications do that. Processing sensor
data could conceivably work that way as well.
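A toy sketch of that sensor-data case (the stream, threshold, and seed are invented for illustration): each reading is acted on as it arrives and then discarded, keeping only a running aggregate rather than an index.

```python
import random

# Hypothetical sensor stream: readings arrive one at a time.
def sensor_stream(n, seed=42):
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.uniform(15.0, 35.0)   # e.g. temperature readings

THRESHOLD = 30.0
alerts = 0
for reading in sensor_stream(1000):
    if reading > THRESHOLD:             # decide on the datum as it flows past
        alerts += 1                     # keep only the running count
    # the reading itself is now gone; nothing is stored or indexed

print(f"{alerts} readings exceeded {THRESHOLD}")
```

Neither a batch pipeline nor an RDBMS fits this shape well, which is why it is the one place the two debates genuinely meet.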

That blank stare from most companies is not surprising at all. There is just
no sensible reason why anyone should think about using MapReduce where RDBMS
are used today. That's just never going to happen because it's nonsense in 99%
of all cases. MapReduce will be used more, but it will be used for new tasks,
not as a replacement for anything.

~~~
sh1mmer
What about things like Hadoop HBase or Bigtable? Surely those are aiming to
replicate an RDBMS-like environment, but on a massively more scalable platform.

~~~
fauigerzigerk
BigTable is totally unrelated to MapReduce though. It gains scalability by
stripping 90% of the query and indexing features of RDBMS. It may be a
reasonable tradeoff for a datastore that's supposed to support one single
relatively simple application that's hit by millions of people. Not many apps
work that way though, and those that do may be equally well served by a plain
file system.

------
sh1mmer
I was at Mashup Camp this week and one of the guys from IBM pointed me at a
library called Flare (<http://flare.prefuse.org/>) which has a number of
prefabricated visualizations you can run on data.

The discussion we had was along the lines of using tools like Yahoo! Pipes or
IBM's Mashup Hub to combine data in new and interesting ways and then throw it
at the visualization library to see which representation fit the output.

I think this would be absolutely applicable to processing data with
Hadoop/Pig. By trying a number of standard displays to represent data it would
be easy for non-technical folks to find new and interesting trends.

------
aneesh
Analysis will definitely get easier and easier to automate. However, I am
skeptical that it will ever be truly commoditized.

Technology advances won't drive out human experts; they will just drive them
higher up the chain. No longer do human experts have to do such mundane tasks
as calculating averages. They are free to focus on more complicated aspects
like building predictive models and programming the behavior of systems that
analyze data. In the future, the list of "mundane" things they don't have to
do will keep growing, and they can focus instead on controlling the behavior
of the systems that do the analysis.

