
Dataflow/Beam and Spark: A Programming Model Comparison - vgt
https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
======
samuell
I'm quite disturbed that the big G hijacked the dataflow term to suddenly mean
_their specific - rather involved, I must say - dataflow-based programming
model_. The real dataflow [1] is a much broader term that doesn't prescribe
specifics like programming semantics.

Seems as if they're trying to ride the wave of the recent upsurge of interest
in dataflow in general (with sub-fields such as flow-based programming, and
implementations like Akka Streams, etc.). That's OK, but hijacking a term for
the whole field is not.

[1]
[https://en.wikipedia.org/wiki/Dataflow](https://en.wikipedia.org/wiki/Dataflow)

~~~
lern_too_spel
It's called Google Cloud Dataflow. Your complaint is like saying Google Cloud
Platform is hijacking the meaning of Platform.

~~~
samuell
Yes, but if you look at the use of the word in the linked post, you will see
that plain "Dataflow" is almost exclusively used - even in the title.

------
nl
I think that this is useful.

One question though. Is there a reason you can't use the Spark window
functions? [https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html](https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html)

~~~
glogla
I haven't studied the code very carefully, but I think this is mostly a PR
piece. Not only is using Spark from Java not a very good idea, but the
"canonical" way to do transformations like that is using DataFrames or
Datasets (the latter arrived in Spark 1.6 and provide some improvements over
DataFrames).

Take it with a grain of salt.

~~~
takidau
You're not going to get clean out-of-order processing semantics with any mode
of Spark transformations. If you actually take the time to read the article,
there's a section discussing the Java/Scala angle. The difference in code size
is really secondary (though it is a difference). The main point is the
difficulty of maintaining and evolving your pipeline over time using Spark,
given the way important concepts become conflated with its API (any version of
it). This all comes across much more clearly to those who actually take the
time to read the words.

~~~
glogla
> If you actually take the time to read the article, there's a section
> discussing the Java/Scala angle.

They claim this isn't about the length of the code, yet they select the most
verbose way to use Spark and proudly display how long it is.

I mean, sure, the lack of event-time-based processing is a known limitation of
Spark (and a pretty annoying one - though it is supposedly being worked on),
but there are ways to write about it without code made to look bad on purpose.

EDIT: come to think of it, this whole article is "Spark Streaming can't do
event time" written in thousands of words, with contrived examples attached.

~~~
vgt
I think the length of the code is a side effect of the primary argument, and
"can't do event time" is one of its symptoms. Neither is the primary argument
of the blog post.

The primary argument is demonstrated through color-coding the different
logical bits, which end up being clearly portable and elegantly distinct in
Dataflow.

This is demonstrated in two ways:

1\. The "juicy value add" code that does the aggregation is labeled yellow
and doesn't change across any of the Dataflow samples. With Spark, it needs
to be rewritten for every use case. The same holds for the other colors.

2\. In Dataflow all the colors are separate. This makes expressing your logic
easier. In Spark, the colors mix in dramatic ways with every demonstrated use
case.

As Tyler said, all this is described in the blog post itself, but I don't
blame you for missing it, since it's a really long post :)
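The separation described above can be sketched in a toy Python example (hypothetical, not the Beam or Spark API): the aggregation (the "what") stays fixed, while the windowing strategy (the "where") is a pluggable function, so switching use cases never touches the value-add code.

```python
from collections import defaultdict

def sum_per_window(events, window_fn):
    """The aggregation ('what') is written once; windowing ('where') is plugged in."""
    totals = defaultdict(int)
    for event_time, value in events:
        totals[window_fn(event_time)] += value
    return dict(totals)

fixed_10s = lambda t: t // 10   # fixed (tumbling) 10-unit windows
global_window = lambda t: 0     # classic batch: one global window

events = [(1, 5), (4, 7), (13, 2)]
print(sum_per_window(events, fixed_10s))      # {0: 12, 1: 2}
print(sum_per_window(events, global_window))  # {0: 14}
```

When windowing is entangled with the aggregation code instead, every new use case forces a rewrite of both - which is the "colors mixing" problem.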

------
teraflop
I really like the Dataflow programming model, but this feels to me like an
apples-to-oranges comparison.

Spark Streaming is a fully open-source project; although the Dataflow SDK is
also OSS, my understanding is that the released version can only handle
bounded datasets. Support for streaming (which is the major innovation, IMO)
is only available in the form of stubs that call out to Google's paid,
proprietary Dataflow service.

It's totally fine to compare an open-source project with a proprietary
alternative, but I think it's odd that this article opens by talking about how
the Dataflow SDK is being opened, and then spends all its time talking about
proprietary features.

~~~
azurezyq
According to the proposal
([https://wiki.apache.org/incubator/BeamProposal](https://wiki.apache.org/incubator/BeamProposal)),
an OSS implementation of streaming is on the way, via Apache Flink and others.
The blog is just arguing that the model itself is superior, regardless of OSS
or not.

~~~
teraflop
Huh, I can't find anything to that effect in the proposal. It does mention the
existence of runners for Spark and Flink, but doesn't say they'll be getting
streaming support. I had assumed it was unlikely for the same reason that this
article talks about (lack of support for first-class processing by event
time).

But assuming it's true, it's very welcome news! I'll be keeping a close eye on
future releases.

~~~
takidau
Flink's event-time support is coming along nicely. Their first round of true
event-time support came in November
([https://flink.apache.org/news/2015/11/16/release-0.10.0.html](https://flink.apache.org/news/2015/11/16/release-0.10.0.html)),
and much more is on the way. Flink will be an excellent platform for Beam,
both batch and streaming.

As I understand it, Spark has event-time support coming soon as well. I think
basic stuff is landing in 1.7. Not sure precisely what they have planned, but
I can only imagine that Spark will also become an excellent platform for
executing streaming Beam pipelines in due time. In the meantime, the streaming
runner for Spark can either target those features which Spark does support
well (i.e., processing-time windowing, in this case), or try to emulate those
it doesn't (such as how it was done in the article).

------
massemphasis
Did anyone read this? There were so many buzzwords and so much bullshit I
couldn't slog past the first page.

Who do they write these things for, anyway? It's not like we're college
admissions officers or professors - just give us the straight deal. Unless
you're pitching to schools and naive undergrads, of course.

------
TheLogothete
Hey, this is kind of off-topic, but I figured it's still appropriate to ask:
How come Google Cloud Storage can be used instead of HDFS? I'm comparing
google/amazon/azure right now. Both Amazon and Azure have 2 types of storage
options - the regular object storage (S3a and Blobs) and block storage (S3 and
Data Lake Store). S3 and DLS can act as the file system for Hadoop themselves
(meaning you can let the data sit there and fire up clusters just for
processing when needed), but they cannot interface with tools like the regular
storage.

Meanwhile, Google's storage is like regular object storage, but you can run
map/reduce (dataproc) and Spark on it.

~~~
nda
GCS can be used instead of HDFS; Dataproc ships with the GCS connector
preinstalled ([https://github.com/GoogleCloudPlatform/bigdata-interop](https://github.com/GoogleCloudPlatform/bigdata-interop)).
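For reference, a minimal `core-site.xml` fragment registering the GCS connector on a plain Hadoop cluster (a sketch based on the bigdata-interop project's documented property names; `your-project-id` is a placeholder, and on Dataproc this is already configured for you):

```xml
<configuration>
  <!-- Register gs:// as a Hadoop filesystem backed by the GCS connector -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>your-project-id</value> <!-- placeholder -->
  </property>
</configuration>
```

With this in place, jobs can read and write `gs://bucket/path` wherever they would use an `hdfs://` path, which is what lets data sit in GCS while clusters are spun up only for processing.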

~~~
TheLogothete
That... was kind of my point.

------
throw42
Spark is dead.

~~~
abiox
long live _____ !

------
dang
Url changed from [https://cloud.google.com/blog/big-data/2016/02/comparing-the-dataflowbeam-and-spark-programming-models](https://cloud.google.com/blog/big-data/2016/02/comparing-the-dataflowbeam-and-spark-programming-models), which points to this.

~~~
vgt
Thanks! The original URL is a new developer-focused Big Data blog by Google
and Google customers:

[https://cloud.google.com/blog/big-data/](https://cloud.google.com/blog/big-data/)

