
A Scala API for Google Cloud Dataflow - Mullefa
https://github.com/spotify/scio
======
samuell
It's interesting that Cloudera went the opposite way from Spotify and fitted
Google's Java API on top of Spark (so they changed the backend instead of the
"frontend") [1].

[1] [https://github.com/cloudera/spark-dataflow](https://github.com/cloudera/spark-dataflow)

~~~
sinisa
Scio author here.

A bit of background: Spark and Flink are both frameworks with their own
execution engines. Scalding is tightly coupled with Cascading + Hadoop as its
execution engine (Tez support is also WIP). The Dataflow Java SDK/Apache Beam,
on the other hand, is designed to be a simple abstraction with pluggable
engines, and the Cloud Dataflow service is just one of many possible runners.

Right now there are:

\- local runner

\- Dataflow runner, fully managed service in GCP

\- Spark runner

\- Flink runner

Scio wraps the Dataflow Java SDK (Apache Beam) and can potentially leverage
any available runner.
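
To make the "pluggable engines" idea concrete, here is a minimal sketch in plain Java. It is not the Dataflow or Beam API: `Pipeline`, `Runner`, and `LocalRunner` are hypothetical names invented for illustration. The point is only that the pipeline describes the steps while a separately chosen runner decides how to execute them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class RunnerSketch {
    // Hypothetical pipeline: an ordered list of transforms with no
    // knowledge of how or where they will be executed.
    static final class Pipeline {
        final List<Function<String, String>> steps = new ArrayList<>();

        Pipeline apply(Function<String, String> step) {
            steps.add(step);
            return this;
        }
    }

    // Hypothetical runner interface: each engine would supply its own
    // implementation (local, managed service, Spark, Flink, ...).
    interface Runner {
        List<String> run(Pipeline pipeline, List<String> input);
    }

    // Simplest possible runner: applies every step in-process, in order.
    static final class LocalRunner implements Runner {
        @Override
        public List<String> run(Pipeline pipeline, List<String> input) {
            List<String> output = new ArrayList<>();
            for (String element : input) {
                String value = element;
                for (Function<String, String> step : pipeline.steps) {
                    value = step.apply(value);
                }
                output.add(value);
            }
            return output;
        }
    }

    public static void main(String[] args) {
        // The pipeline definition stays the same whichever runner is used.
        Pipeline pipeline = new Pipeline()
                .apply(String::trim)
                .apply(String::toUpperCase);

        Runner runner = new LocalRunner(); // swap in another Runner here
        System.out.println(runner.run(pipeline, Arrays.asList(" scio ", "beam")));
    }
}
```

Swapping the engine in this sketch means changing only the line that constructs the `Runner`; a wrapper such as Scio would sit on the pipeline-building side and leave execution to whichever runner is chosen.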

------
lucdurette
Interesting project, glad to see more and more organisations using Scala for
data projects.

------
wiradikusuma
scio is also the name for a portable molecular sensor:
[https://www.consumerphysics.com/myscio/scio](https://www.consumerphysics.com/myscio/scio)

~~~
hnbroseph
this is cool! thanks for mentioning it, i may have to grab myself a unit or
two.

------
anacleto
Is this native? Or just a Scala wrapper?

~~~
samuell
I would expect it to be as native as Google's own Java API [1], though it is
still just the API, not the actual backend.

[1]
[https://github.com/GoogleCloudPlatform/DataflowJavaSDK](https://github.com/GoogleCloudPlatform/DataflowJavaSDK)

~~~
sinisa
Correct, it's a thin Scala wrapper with some additional features. Execution is
delegated to Dataflow/Beam.

------
ecesena
Any plan to port it to Beam?

~~~
sinisa
Scio author here. Yes, as soon as Beam finishes bootstrapping.

