
Google proposes its Dataflow batch/stream tech to the Apache Incubator - crb
https://wiki.apache.org/incubator/DataflowProposal
======
fhoffa
Note that this proposal is being backed not only by Google, but also by Cloudera,
Data Artisans, Talend, Cask, PayPal, ...

Some other posts on the announcement:

[http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and...](http://googlecloudplatform.blogspot.com/2016/01/Dataflow-and-open-source-proposal-to-join-the-Apache-Incubator.html)

[http://blog.cloudera.com/blog/2016/01/spark-dataflow-joins-g...](http://blog.cloudera.com/blog/2016/01/spark-dataflow-joins-googles-dataflow-sdk/)

[http://data-artisans.com/dataflow-proposed-as-apache-incubat...](http://data-artisans.com/dataflow-proposed-as-apache-incubator-project/)

[http://blog.cask.co/2016/01/cask-anticipates-googles-dataflo...](http://blog.cask.co/2016/01/cask-anticipates-googles-dataflow-to-flourish-in-apache/)

~~~
ericand
Also Talend blog: [https://www.talend.com/blog/2016/01/20/talend-joins-google-t...](https://www.talend.com/blog/2016/01/20/talend-joins-google-to-propose-dataflow-as-an-asf-incubator-project)

------
mindprince
> While Google has previously published papers describing some of its
> technologies, Google decided to take a different approach with Dataflow.
> Google open-sourced the SDK and model alongside commercialization of the
> idea and ahead of publishing papers on the topic.

A large number of ASF projects in the Big Data space are inspired by Google's
publications. Good to see Google finally taking the lead and coming out with
code.

------
melted
Seems like this would duplicate a rather large chunk of Apache Crunch, which
implements Google Flume nearly exactly as far as public API is concerned. As
far as I can tell, Google Dataflow is also a variation on top of Google Flume.
It would be helpful if they could elucidate why this project would not be
redundant under the Apache umbrella.

~~~
nl
Apache explicitly allows multiple overlapping projects. See, for example,
Apache Storm/Spark/Flink.

Also worth noting is that the implementation matters at least as much as the
API.

In this case there are substantial differences: Dataflow is a DSL, not a Java
API, and it is designed for streaming data. It is unclear if Crunch handles
the streaming case, but it talks a lot about Map/Reduce, which makes me think
streaming isn't its primary use case.

~~~
mey
Apache Ant + Ivy vs Apache Maven vs Apache Buildr

Apache Tapestry vs Apache Wicket vs Apache MyFaces

those jump to mind, but I'm sure there are plenty of other overlaps in the
Apache sphere. They certainly embrace it.

------
sysk
Can anyone ELI5 what it means for an open source project to become an Apache
project? Why doesn't Google just push the code to GitHub?

~~~
oh_sigh
It's about who maintains it.

~~~
sysk
Does that mean that Google will stop maintaining the project?

~~~
chimerasaurus
There are several organizations included on the proposal, including Google,
who will still be actively involved in the project, if accepted.

------
Wonnk13
What are the best resources to learn about streaming, dataflow, etc.? Not
necessarily the Google implementations, but the core concepts backing them.

~~~
harlanji
Some of the best material I've read recently comes from the Confluent blog,
esp. Martin Kleppmann. The views are tilted toward Kafka and Samza since the
founders are the same people, though they are both Apache projects. The
article that blew my mind was "Turning the Database Inside Out":
[https://martin.kleppmann.com/2015/03/04/turning-the-database...](https://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html).
It doesn't encompass the full space, but the architectural
implications when combined with CQRS/Event Sourcing models are huge.

Samza's architecture and API embody a lot of the important ideas at a lower
level than Storm; while it may not be the easiest to use in practice, the
concepts and documentation translate.

~~~
Wonnk13
Cool, thank you! Def need to pick up a copy of Kleppmann's book.

------
xcelq
Can we hope to see a Google-like search engine open-sourced? I'm just waiting
for this day to happen.

------
ericand
An O'Reilly post also released today references the Apache Dataflow submission:
[https://www.oreilly.com/ideas/the-world-beyond-batch-streami...](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)

~~~
spenrose
The author is one of the Dataflow committers.

------
obulpathi
It would be awesome to have the code portable across various big data engines.

------
BenoitP
Where does Dataflow stand? Is it only a wrapper, trying to define a standard
API for combining stream producers, datastores, and stream engines?

~~~
vgt
It's a Batch+Stream unified processing model, an SDK. Idea is you can code up
your pipeline in Dataflow and have your choice of where to run it - Spark,
Flink, etc.
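
To make the "code it once, pick where it runs" idea concrete, here is a toy sketch in plain Python. It is not the Dataflow SDK: `Pipeline`, `BatchRunner`, and `StreamingRunner` are hypothetical names invented for illustration, and the two runners are simple stand-ins for engines like Spark or Flink. The only point is that the pipeline definition never mentions the engine.

```python
from typing import Any, Callable, Iterable, Iterator, List


class Pipeline:
    """Engine-agnostic description of a computation (hypothetical, not the Dataflow SDK)."""

    def __init__(self) -> None:
        # Each transform maps one input element to zero or more output elements.
        self.transforms: List[Callable[[Any], Iterable[Any]]] = []

    def flat_map(self, fn: Callable[[Any], Iterable[Any]]) -> "Pipeline":
        self.transforms.append(fn)
        return self

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        return self.flat_map(lambda x: [fn(x)])

    def filter(self, pred: Callable[[Any], bool]) -> "Pipeline":
        return self.flat_map(lambda x: [x] if pred(x) else [])


class BatchRunner:
    """Stand-in for a batch engine: runs each stage over the whole bounded dataset."""

    def run(self, pipeline: Pipeline, source: List[Any]) -> List[Any]:
        data = list(source)
        for transform in pipeline.transforms:
            data = [out for item in data for out in transform(item)]
        return data


class StreamingRunner:
    """Stand-in for a streaming engine: pushes each element through all stages as it arrives."""

    def run(self, pipeline: Pipeline, source: Iterable[Any]) -> Iterator[Any]:
        for element in source:
            current = [element]
            for transform in pipeline.transforms:
                current = [out for item in current for out in transform(item)]
            yield from current


# The pipeline is written once, with no reference to any runner.
pipeline = (
    Pipeline()
    .flat_map(str.split)                 # line -> words
    .map(str.lower)                      # normalize case
    .filter(lambda word: len(word) > 3)  # drop short words
)

lines = ["Batch and Stream", "one unified MODEL"]

batch_out = BatchRunner().run(pipeline, lines)
stream_out = list(StreamingRunner().run(pipeline, iter(lines)))

print(batch_out)   # ['batch', 'stream', 'unified', 'model']
print(stream_out)  # same elements, produced incrementally
```

The sketch sticks to element-wise transforms so both runners trivially agree. Aggregations over unbounded input need extra concepts such as windowing, which this toy example leaves out.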

Google Cloud Dataflow is a fully-managed service that executes Dataflow
pipelines and has nice value adds on top like fault tolerance and
auto-optimization.

(Disclosure: I work on BigQuery, not Dataflow)

