
Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System - posharma
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
======
_dark_matter_
>“We don’t really use MapReduce anymore”

This is not even true. They have recently published research that involved
using MapReduce in their own systems. Example:
[http://research.google.com/pubs/pub41376.html](http://research.google.com/pubs/pub41376.html)

~~~
kiyoto
As someone who has one foot in marketing and one foot in development (I am a
developer evangelist), here is what's happening.

1\. Google's infrastructure is evolving atop MapReduce (FlumeJava/MillWheel).

2\. Google's PR decided to call it "not using MapReduce anymore" because in
marketing, "beyond <current fad>" sounds really cool.

3\. The rest of the world's PR/press/marketing fall for Google's clever PR.

Either way, it is great to see Google making its core technology accessible as
part of their PaaS =)

~~~
rsync
I know what I'm supposed to do when my taxi driver starts recommending stocks
to me ...

But what do I do when people start describing themselves as "developer
evangelists" ?

That has to be some kind of sell signal, right ?

~~~
sanswork
The title has existed for years now at various companies (most with an open
source program have them), so whatever you were supposed to do, you're too late
now.

------
basyt
It's fundamentally the same thing as MapReduce, isn't it? Can someone explain
the differences to me please? There isn't much of use in the article.

~~~
dyoo1979
You'll probably want to read the FlumeJava paper.
[http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...](http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf)

Citation:
[http://dl.acm.org/citation.cfm?id=1806638](http://dl.acm.org/citation.cfm?id=1806638)

The key word is _pipeline_. If you have some analysis that runs in several
stages, you'll be taking the output of one stage, and connecting it to the
next. If you want to compose multiple phases, chained together, raw MapReduce
isn't going to help you very much with the chaining.

What's described in the paper is a way to do the chaining in a nice way. The
system will take care of writing the raw MapReduces for you. But it'll also do
a lot of work on the interconnections between your stages as well.
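
To make the chaining point concrete, here is a minimal in-memory sketch in Python. It is not FlumeJava's API; `map_reduce`, `run_pipeline`, and the stage functions are hypothetical names made up for illustration. It only shows the manual plumbing you end up writing when the output of one raw MapReduce has to feed the next, which is the part a pipeline system handles for you.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce pass (hypothetical helper, not a real API).

    mapper(record) returns a list of (key, value) pairs.
    reducer(key, values) returns one output record per key.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Stage 1: classic word count.
def split_words(line):
    return [(word, 1) for word in line.split()]


def sum_counts(word, ones):
    return (word, sum(ones))


# Stage 2: group words by how often they occurred.
def key_by_count(word_and_count):
    word, count = word_and_count
    return [(count, word)]


def collect_words(count, words):
    return (count, sorted(words))


def run_pipeline(lines):
    # The "chaining" is done by hand: stage 1's output is stage 2's input.
    word_counts = map_reduce(lines, split_words, sum_counts)
    return map_reduce(word_counts, key_by_count, collect_words)


if __name__ == "__main__":
    print(run_pipeline(["a b a", "b c a"]))
```

With two stages this is manageable. With a dozen stages, intermediate files, and retries, wiring each hand-off yourself is the cumbersome part the paper is addressing.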

------
wyager
>Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System

 _Buzzword... overload!_

~~~
msane
Hey sometimes you just have to do whatever it takes to synergize global
channels on virtual platforms. You know, really aggregate extensible markets
with repurposed leading-edge metrics.

~~~
dclusin
But how will these leading-edge metrics enable us to deliver paradigm shifting
solutions to our customers while simultaneously reducing costs and increasing
operational efficiency?

~~~
source99
While making the world a better place!

------
jey
Is this basically like Apache Spark in its programming model?

~~~
espeed
Yes, it's like Spark ([http://spark.apache.org/](http://spark.apache.org/))
and SparkStreaming
([http://spark.apache.org/streaming/](http://spark.apache.org/streaming/))
combined.

Here are the relevant papers...

* FlumeJava (iterative, data-parallel pipelines like Spark): [http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...](http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf)

* MillWheel (fault-tolerant stream processing like SparkStreaming): [http://research.google.com/pubs/pub41378.html](http://research.google.com/pubs/pub41378.html)

Pointers to the IO blog posts...

* "Reimagining developer productivity and data analytics in the cloud" [http://googlecloudplatform.blogspot.com/2014/06/reimagining-...](http://googlecloudplatform.blogspot.com/2014/06/reimagining-developer-productivity-and-data-analytics-in-the-cloud-news-from-google-io.html)

* "Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service" [http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-g...](http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-google-cloud-dataflow-a-cloud-native-data-processing-service.html)

The Dataflow-specific talks at Google IO 2014...

* Big data, the Cloud way: Accelerated and simplified [https://www.youtube.com/watch?v=Y0Z58YQSXv0](https://www.youtube.com/watch?v=Y0Z58YQSXv0)

* The dawn of "Fast Data" [https://www.youtube.com/watch?v=TnLiEWglqHk](https://www.youtube.com/watch?v=TnLiEWglqHk)

* Predicting the future with the Google Cloud Platform [https://www.youtube.com/watch?v=YyvvxFeADh8](https://www.youtube.com/watch?v=YyvvxFeADh8)

* Keynote (starts at Urs Hölzle's segment on Google Cloud) [https://www.youtube.com/watch?v=wtLJPvx7-ys#t=6932](https://www.youtube.com/watch?v=wtLJPvx7-ys#t=6932)

~~~
jey
Cool. Does this mean Google is moving toward languages that allow for
easier use and serialization of closures than C++ and Java do? (For example,
Spark uses Scala natively.)

~~~
espeed
Dataflow is language agnostic. The Java API is being released first, and more
languages will follow.

------
entrusted
The main takeaway from this article is that the author, Yevgeniy Sverdlik, has
demonstrably never worked with distributed computing systems.

The rest is buzzwords propping up sweeping ridiculous conclusions.

------
miralabs
"said it got too cumbersome once the size of the data reached a few
petabytes."

I don't think there are a lot of companies where data would get this huge.
Does anyone have any idea how large a typical warehousing database is?

~~~
jamesaguilar
What do you mean, warehousing? Like, item tracking inside an actual warehouse?
Hard to imagine spending more than a kB per unique item -- more per SKU, but
less per individual object -- so even if you have 1M items being tracked, the
total size would only be a gigabyte. Even if you had a billion unique things
in your store, the resulting database would still fit on a single flash drive.
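
The back-of-envelope arithmetic can be checked with a short Python sketch. The 1 kB per item figure is the rough guess from above, decimal units (1 kB = 1000 bytes) are an assumption, and the function names are made up for illustration.

```python
BYTES_PER_ITEM = 1_000  # assumption: roughly 1 kB per tracked item


def database_size_bytes(item_count, bytes_per_item=BYTES_PER_ITEM):
    """Estimated total size for item_count items."""
    return item_count * bytes_per_item


def human_readable(size_bytes):
    """Format a byte count using decimal units, capped at PB."""
    for unit in ("B", "kB", "MB", "GB", "TB", "PB"):
        if size_bytes < 1000 or unit == "PB":
            return f"{size_bytes:g} {unit}"
        size_bytes /= 1000


if __name__ == "__main__":
    for items in (1_000_000, 1_000_000_000):
        print(items, "items:", human_readable(database_size_bytes(items)))
```

Under those assumptions a million items comes to about a gigabyte and a billion items to about a terabyte, still three orders of magnitude short of a petabyte.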

~~~
seanp2k2
Given the context, it seems like "warehouse" in parent's comment was more
specifically "data warehouse":
[http://en.m.wikipedia.org/wiki/Data_warehouse](http://en.m.wikipedia.org/wiki/Data_warehouse)

While I couldn't quickly find anything that speaks to any kind of average or
normal size of a data warehouse, this article mentions Facebook's being around
300PB: [https://code.facebook.com/posts/229861827208629/scaling-the-...](https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/)

------
neckbeard
For those that missed it, similar discussion on this a couple of days ago:
[https://news.ycombinator.com/item?id=7947782](https://news.ycombinator.com/item?id=7947782)

~~~
dang
Yes, it's the same story, so we'll call this thread a dupe.

------
t1m
Is it just me, or is using 'cloud' in their product names just shark jumping?

Seriously.

------
capkutay
This sounds more like Google App Engine's shot at Amazon Kinesis than anything
else.

------
CHY872
Sounds a bit like Dryad?

------
jpgvm
Yay, marketing, yay.

------
infocollector
More snake oil? :-)

