
Microsoft announces major commitment to Apache Spark - RmDen
https://blogs.technet.microsoft.com/dataplatforminsider/2016/06/06/microsoft-announces-major-commitment-to-apache-spark/
======
insulanian
Usually when I hear about Apache Spark, Scala gets mentioned too. Does Scala
have the best support for it, and what's the situation on the .NET side?
(F#?)

~~~
nchammas
First-class language support in Apache Spark:

    
    
      * Scala
      * Python
      * Java
      * R
    

All these languages are equal, but Scala tends to be more equal than others in
some areas of the API. I also believe R is mostly restricted to the DataFrame
API.

Third-party language support:

    
    
      * Clojure [0]
    

To develop on Spark in a new, non-JVM language, you'd need a bridge to Java.
That's how PySpark works [1], and I believe R follows a similar pattern.

[0] [https://github.com/yieldbot/flambo](https://github.com/yieldbot/flambo)

[1]
[https://cwiki.apache.org/confluence/display/SPARK/PySpark+In...](https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals)

~~~
jbooth
More like 3 tiers of support:

Scala -- native API

Java -- Java wrappers for the Scala API, slightly clunky but same implementation

Python+R -- janky process involving forking a process and feeding strings back
and forth between the JVM and Python/R via pipes
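
To make that third tier concrete, here is a toy sketch of the fork-and-pipe pattern. It is not Spark's actual code; the worker script, the `run_through_worker` helper, and the `doubled` field are all made up for illustration. The parent serializes records to strings, pipes them to a separate interpreter process, and reads the results back from its stdout.

```python
import json
import subprocess
import sys

# Hypothetical worker: reads one JSON string per line from stdin,
# transforms the record, and writes one JSON string per line to stdout.
WORKER_SOURCE = r"""
import json, sys
for line in sys.stdin:
    record = json.loads(line)
    record["doubled"] = record["value"] * 2
    sys.stdout.write(json.dumps(record) + "\n")
"""


def run_through_worker(records):
    """Start a child interpreter, pipe serialized records in, read results back."""
    payload = "".join(json.dumps(r) + "\n" for r in records)
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=payload,        # sent over the child's stdin pipe
        capture_output=True,  # results come back over the stdout pipe
        text=True,
        check=True,
    )
    return [json.loads(line) for line in proc.stdout.splitlines()]


results = run_through_worker([{"value": 1}, {"value": 21}])
```

Every record pays for a serialize/deserialize round trip in each direction, which is presumably where the "janky" feeling comes from.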

~~~
nchammas
You're right about Python and R having to pass data back and forth to the JVM
for certain operations, but also keep in mind that native code still runs in
the native interpreter. That means you have access to the full ecosystem of
the native language.

For example, if I want to convert an RDD of JSON strings into Python
dictionaries:

    
    
        import json
        # rdd is assumed to be an existing RDD of JSON strings
        rdd_dict = rdd.map(json.loads)
    

Same goes for any external Python libraries I install on the cluster and want
to use in my Spark job. You can even run your Python code on PyPy [4]!
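
Because the mapped function is ordinary Python, it can be written and tried out locally before it goes anywhere near a cluster. A minimal sketch, assuming a hypothetical `parse_record` helper and a plain list standing in for the RDD:

```python
import json


def parse_record(line):
    """Parse one JSON string; return None for malformed input."""
    try:
        return json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# A plain list stands in for the RDD of JSON strings here.
lines = ['{"id": 1}', 'not json', '{"id": 2}']
parsed = [d for d in map(parse_record, lines) if d is not None]

# With a real RDD (assumed to be named `rdd`), the same function would be
# passed along unchanged:
#   rdd.map(parse_record).filter(lambda d: d is not None)
```

Returning None and filtering afterwards is just one way to handle bad rows; raising and letting the job fail is equally valid depending on the data.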

For me, working in Python generally feels like a first-class experience on
Spark. There are areas -- like GraphX [0], certain niche features [1] -- where
Scala is definitely easier to work with, but with time that is becoming less
[2] and less [3] true thanks to the DataFrame API.

[0] [https://spark.apache.org/graphx/](https://spark.apache.org/graphx/)

[1]
[http://stackoverflow.com/q/23995040/877069](http://stackoverflow.com/q/23995040/877069)

[2]
[https://github.com/graphframes/graphframes](https://github.com/graphframes/graphframes)

[3]
[http://stackoverflow.com/a/37150604/877069](http://stackoverflow.com/a/37150604/877069)

[4]
[https://github.com/apache/spark/pull/2144](https://github.com/apache/spark/pull/2144)

------
elcapitan
Can't wait until they install it on my computer!

~~~
DoofusOfDeath
I assume that Microsoft considers the Windows EULA to let them use your
computer as an Azure compute node.

~~~
mistermann
This is actually a rather interesting idea... allowing users to voluntarily
participate as a compute node would be extremely cool.

~~~
pvelagal
Users can contribute idle CPU cycles to a variety of projects:
[http://www.hyper.net/dc-howto.html](http://www.hyper.net/dc-howto.html).
AFAIK, a lot of folks have been doing this for decades.

------
peterwwillis
Is it just me, or does any technology that Microsoft supports late in the game
give you sort of a queasy feeling?

~~~
dethswatch
No, but I view MS as a collection of employees desperate to learn what they
need for the next job now.

Hence, the total apparent lack of desire to do anything new and innovative
("Let's copy AWS! Genius!").

~~~
sargun
Funny story from my time at Microsoft (this wasn't representative of the
entire population, just a few folks). This was circa 2013:

We were in a meeting talking about infrastructure and testing. I said that we
used EC2 for some of our infrastructure, and we were thinking of moving some
of our local dev infra to AWS. Someone in the room asked me what that was. I
responded with "Amazon Web Services" and they countered, "You mean the company
that sells books and stuff?" -- they had never heard of AWS or EC2.

This comes from someone who had been with the organization for some time. He
was very well respected by a lot of folks. He had more awards in his office
than I could count. I still think he's a great guy.

The Microsoft monoculture hurt quite a bit.

~~~
ethbro
Cheap shot, but this is the company that created TFS in 2005(?). Which as near
as I can tell is an answer to the question "How do we reimplement SVN in an MS
environment?" Which seems like something no one should have been asking for by
that time.

~~~
sargun
Microsoft's dev tools were straight up amazing. As someone who didn't get to
take advantage of many of them, I was always envious of engineers using Visual
Studio and being able to debug code like it was magic.

Although TFS may seem terrible, it worked for super large organizations and
big code bases in a way that SVN and Git just didn't.

~~~
ethbro
I wouldn't say I worked with a super large codebase at the customer who used
it, but the only compelling feature seemed to be "it played nice with MS
infrastructure."

Teams that had issues grasping the fundamentals of VCS in general still had
issues. And in return you got the lesser compatibility and greater number of
code warts that came with proprietary enterprise software over open source.

But I'm curious, what specifically worked for you in TFS that wasn't in the
SVN ecosystem? It's likely the client was unaware of a lot of features.

~~~
sargun
I never got to use TFS, but I'd sum up the benefits with /integration/. The MS
dev ecosystem is amazing. I write software now for Linux on Mac OS X. I use
IntelliJ, NetBeans, VirtualBox, and a whole bunch of bash to tie it all
together.

If I want to do a deploy of my Lambda software, I edit some Python in
IntelliJ, then run a build and upload it to Lambda using some bash scripts,
and test it using the GUI in Chrome.

If I'm editing C code, I edit it in NetBeans over NetBeans' SSH integration
into VirtualBox. I then go compile it using make on the remote machine, and
load it. I have a bunch of bash scripts to do testing on that remote machine
that are executed via SSH.

I have several more duct-taped integrations. With VS, this is all in one
system.

------
ssahoo
Not a surprise, since
[https://www-03.ibm.com/press/us/en/pressrelease/47107.wss](https://www-03.ibm.com/press/us/en/pressrelease/47107.wss)

