

IBM Invests to Help Apache Spark - rshaban
http://bits.blogs.nytimes.com/2015/06/15/ibm-invests-to-help-open-source-big-data-software-and-itself/

======
rpalmaotero
For everyone who wants to start working with Spark and Big Data, I recommend
enrolling in this MOOC published by UC Berkeley on edX:
[https://www.edx.org/course/introduction-big-data-apache-
spar...](https://www.edx.org/course/introduction-big-data-apache-spark-uc-
berkeleyx-cs100-1x)

~~~
TDL
Thanks for that link. I've been looking for something like this to get a
better working understanding of Spark.

------
zaroth
I read about an interesting technique, an "all-or-nothing tracker" in a blog
post from an Apache Spark engineer.

You dispatch n jobs, where n is quite large, and you want to know: have all n
jobs completed, or have fewer than n completed? How do you answer that with a
small, fixed number of bytes, with very high probability of being correct?

Give each job a random 128-bit ID number. XOR each ID into a running value as
you start each job, and XOR the same ID into that value as each job completes.
If all the jobs have completed, the result is 0. The chance of zero turning up
while some jobs are still incomplete is negligible.
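A minimal sketch of the trick in Python (the class and method names here are my own, purely for illustration):

```python
import secrets


class AllOrNothingTracker:
    """Tracks whether every dispatched job has completed, in O(1) space."""

    def __init__(self):
        # XOR of the IDs of all jobs dispatched but not yet completed
        self.state = 0

    def dispatch(self):
        job_id = secrets.randbits(128)  # random 128-bit ID for this job
        self.state ^= job_id
        return job_id

    def complete(self, job_id):
        # XORing the same ID a second time cancels it out
        self.state ^= job_id

    def all_done(self):
        # 0 means every dispatched job has reported completion; a false
        # positive requires outstanding IDs to XOR to zero (~2^-128 odds)
        return self.state == 0
```

Since XOR is commutative and associative, completions can arrive in any order, and the tracker never needs to store the set of outstanding IDs.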

The technique is mentioned here under 'Lineage Tracking':
[https://highlyscalable.wordpress.com/2013/08/20/in-stream-
bi...](https://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-
processing/) but there's a better blog post I remembered reading but can't
find at the moment...

~~~
nl
This is kind of like a Bloom filter:
[https://en.m.wikipedia.org/wiki/Bloom_filter](https://en.m.wikipedia.org/wiki/Bloom_filter)

~~~
sappapp
Both concepts make use of the mathematics behind xor logic gates.

~~~
snissn
I've used Bloom filters a bit, but I'm not sure how Bloom filters use XORs;
can you explain that for me? Thanks!

------
mark_l_watson
Spark is a great technology, for sure. I was hesitant to get into Spark
because I have lots of experience writing Hadoop map reduce apps. Then I
decided a while back to base all of the machine learning examples in my
current book project on Spark and MLlib and I am happy with that decision.

As the article mentioned, IBM certainly did validate the Linux "market." When
people would ask me what was great about Linux I used to just say that IBM was
investing billions in Linux, and that was an acceptable answer for people.

~~~
TallGuyShort
Curious about what your concerns with Spark were? I work for a company that
supports Spark development, but I don't work closely with that project, so my
opinion is not sufficiently well-informed, and obviously biased.

As far as I know, virtually any MapReduce job can be rather trivially
translated to Spark's .map() and .reduce() operations. And the downsides are:
its model hasn't yet been proven at the largest scales MapReduce has been used
at, and possibly the use of Scala (although Java / Python bindings are
obviously available). Were there any other major factors in your hesitance?
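As a rough illustration of the shape of that translation (plain Python standing in for Spark's RDD API, since no cluster is assumed here), the classic word-count MapReduce job becomes a flatMap to (word, 1) pairs followed by a reduce per key:

```python
from collections import defaultdict
from functools import reduce

lines = ["to be or not to be", "to spark or not"]

# Hadoop's map phase ~ Spark's flatMap + map: emit (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]


def reduce_by_key(pairs, fn):
    """Hadoop's shuffle + reduce phase ~ Spark's reduceByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


counts = reduce_by_key(pairs, lambda a, b: a + b)
```

In actual PySpark the same pipeline would be expressed with `flatMap`, `map`, and `reduceByKey` on an RDD, but the logical structure is identical.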

~~~
mark_l_watson
I didn't have concerns about Spark, rather I already felt comfortable with
Hadoop.

Another issue is that I am sort of retired now. I still accept small
consulting jobs and do a lot of writing but my technology choices have shifted
to fun things like Pharo Smalltalk, Haskell, etc.

~~~
boothead
If you're interested in this domain and Haskell and have some free time, you
might find hailstorm interesting [1]

From what I understand, it's an implementation of Apache Storm in Haskell.

[1] [http://hailstorm-hs.github.io/hailstorm/demo/#1](http://hailstorm-
hs.github.io/hailstorm/demo/#1)

~~~
rshaban
Neat! Thanks for sharing this, it's a great overview.

------
century19
"IBM said it will put more than 3,500 of its developers and researchers to
work on Spark-related projects."

That is impressive. I wonder how that will be split among core contributors,
consultants, etc.

------
rshaban
A link to the IBM press release:
[http://www-03.ibm.com/press/us/en/pressrelease/47107.wss](http://www-03.ibm.com/press/us/en/pressrelease/47107.wss)

"At the core of this commitment, IBM plans to embed Spark into its industry-
leading Analytics and Commerce platforms, and to offer Spark as a service on
IBM Cloud. IBM will also put more than 3,500 IBM researchers and developers to
work on Spark-related projects at more than a dozen labs worldwide; donate its
breakthrough IBM SystemML machine learning technology to the Spark open source
ecosystem; and educate more than one million data scientists and data
engineers on Spark."

------
jamesblonde
It will be interesting to see if IBM now bets big on their IPython kernel for
Spark - [https://github.com/ibm-et/spark-kernel](https://github.com/ibm-
et/spark-kernel). I've looked at it, and it's way behind Zeppelin and even
Spark-Notebook. An Eclipse for Spark as a Notebook-style IDE would be a game-
changer.

------
itaysk
It's even more interesting to observe the dynamics in this increasingly open
source world of software.

By deciding to sponsor Spark, I think IBM is becoming practically its owner,
without having had to do anything prior to this move. Does it mean it is
possible today to "acquire" a technology project by naming your own price?

~~~
DannoHung
Doesn't Databricks actually employ most of the core committers for Spark?

~~~
agibsonccc
[https://cwiki.apache.org/confluence/display/SPARK/Committers](https://cwiki.apache.org/confluence/display/SPARK/Committers)

------
nl
SystemML (which is one of the technologies they are donating) looks very
interesting:

 _Declarative large-scale machine learning (ML) in SystemML aims at flexible
specification of ML algorithms and automatic generation of hybrid runtime
plans ranging from single node, in-memory computations to distributed
computations on MapReduce or Spark. ML algorithms are expressed in an R-like
syntax, that includes linear algebra primitives, statistical functions, and
ML-specific constructs._

[http://researcher.watson.ibm.com/researcher/view_group.php?i...](http://researcher.watson.ibm.com/researcher/view_group.php?id=3174)

------
motdiem
The Register's title [1] is a bit more brutal - I wonder how this investment
will spread among committers, tooling etc.... From an open source platform
perspective, it raises interesting questions in terms of finding the right
balance for management, as well as sustainability of the project.

[1]
[http://www.theregister.co.uk/2015/06/15/ibm_backs_apache_spa...](http://www.theregister.co.uk/2015/06/15/ibm_backs_apache_spark/)

~~~
agibsonccc
There's also competition for databricks itself.

[http://venturebeat.com/2015/06/14/ibm-
spark/](http://venturebeat.com/2015/06/14/ibm-spark/)

The Spark ecosystem itself has a lot of players now.

Hortonworks/Cloudera/MapR in their Hadoop distros; Typesafe:
[https://www.typesafe.com/community/other-projects/apache-
spa...](https://www.typesafe.com/community/other-projects/apache-spark)

------
baldfat
Has anyone tried the new R implementation? I read about Spark and I find it very
interesting but I have no need for it with my datasets currently.

------
itkawrje
Here's another course on Spark Fundamentals:
[http://bigdatauniversity.com/bdu-wp/bdu-course/spark-
fundame...](http://bigdatauniversity.com/bdu-wp/bdu-course/spark-
fundamentals/)

There's a second, more advanced course too: [http://bigdatauniversity.com/bdu-
wp/bdu-course/spark-fundame...](http://bigdatauniversity.com/bdu-wp/bdu-
course/spark-fundamentals-ii/)

------
wzymaster
I strongly recommend the Spark course at [http://bigdatauniversity.com/bdu-
wp/bdu-course/spark-fundame...](http://bigdatauniversity.com/bdu-wp/bdu-
course/spark-fundamentals-ii/), a very good course with a Docker-based
hands-on practice lab!

------
rshaban
Some interesting perspective from The Register: "Up until last year, Spark had
just 465 contributors."

