
.Net for Apache Spark Preview - affogarty
https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/
======
vvladymyrov
I've seen the announcement about .NET interop support in Apache Spark some time
ago. The benchmarks are interesting and tell the story: in a few cases it is
faster than Python, but slower than native (for Spark) Scala/JVM. Maybe with
Arrow interchange Python's performance would increase (and likewise for other
interop layers that use Arrow - i.e. for .NET).

But performance is not the only thing - there is also the ability to debug
issues. For that you still need to dig into the Spark core, which is in Scala.

This .NET implementation could be a "gateway drug" toward moving your
production code to Scala/JVM.

It happened to me with PySpark - the majority of tasks at hand can be solved
with PySpark. But digging into issues and stack traces brought me to the Scala
internals of Apache Spark. As a result, in cases where Python-specific
libraries are not needed and high performance is, I would write Spark programs
in Scala from the beginning.

~~~
bunderbunder
On a somewhat related note, for my purposes the real deciding factor in
sticking with Scala/JVM for (production) Spark work is testability: With that
setup, it's dead easy to fire up a local Spark context, run unit tests against
it, and keep the tests running reasonably fast.
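For reference, that setup can be sketched roughly like this - a hypothetical
example, assuming Spark SQL and ScalaTest are on the test classpath (the suite
name and the tiny DataFrame are invented for illustration):

```scala
// Minimal sketch: run Spark inside the test JVM itself via "local[*]" -
// no cluster, no external processes, and the session is reusable across tests.
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class WordCountSpec extends AnyFunSuite {

  private lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local[*]")      // in-process local mode
    .appName("unit-tests")
    .getOrCreate()

  test("groupBy/count over a tiny in-memory DataFrame") {
    import spark.implicits._
    val counts = Seq("a", "b", "a")
      .toDF("word")
      .groupBy("word")
      .count()
      .as[(String, Long)]
      .collect()
      .toMap
    assert(counts("a") == 2L)
    assert(counts("b") == 1L)
  }
}
```

Because the whole thing lives in one JVM, the suite runs under any plain test
runner with no cluster setup, which is what keeps the feedback loop fast.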

~~~
MichaelRys
These are all good points. Debuggability and general support for the
development lifecycle are important. We are definitely working on providing
first-class development experiences for .NET developers. .NET for Apache Spark
is already available as a NuGet package for local install. We are currently
working on adding support to VS Code, Visual Studio, etc. Feel free to tell
us your preferred dev platform. [Disclaimer: I am the Program Manager for the
.NET for Apache Spark effort]

~~~
bunderbunder
Thanks for the response!

FWIW, I was speaking specifically to being able to run Spark, and manage its
lifecycle, all inside the same process as the unit test code. Which is
something that I'll openly concede isn't much more than a fun party trick for
most people's purposes, but it does happen to serve me well.

In a past life, I was involved in data engineering at a .NET shop, and being
able to migrate parts of our process to something like Spark without having to
rewrite or otherwise severely damage it would have made me very happy. Even
better if I could stay inside Visual Studio, and bang on it from an F#
interactive session.

Wild speculation, but if you can produce type providers that know how to tame
`DataSet[Row]`, you might have some nonzero number of F# hipsters like me
kissing your feet.

(Or not. Like I said, my perspective on Spark is unusual.)

~~~
MichaelRys
Thanks... More idiomatic F# support is on the roadmap.

------
flowerlad
Python is a second-class citizen in the world of Spark. There are perf issues
for UDFs. Some functions are only available through Scala. Python support for
new features is always late. And so on. It is good to have .NET support, but I
will stick with Scala for the same reasons I switched from Python to Scala.

~~~
pjmlp
That is exactly why I learned to stick with platform languages for production
code, even when more interesting ones try to plug into them.

FFI, extra debugging layers, and weaker tooling integration don't pay off for
a couple of language-feature bullet points.

------
kenhwang
What's probably more interesting is how similar .NET, Scala, and Python are in
query performance. Not sure if that can be attributed to great Python
performance or really bad Scala/.NET performance.

~~~
imbac82
[https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/#performance](https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/#performance)

~~~
aeroevan
Python 2.7? Also, the new Apache Arrow integration changes the Python
performance characteristics a lot. I wonder if they are using Arrow for their
JVM <-> CLR interop; if not, that would probably be a good idea.

~~~
polskibus
It would be interesting to hear about this from MS. Do you know of other
settings/configurations/features that could greatly influence the result of
such a comparison?

~~~
MichaelRys
Since we started replying to these points on the GitHub thread at
[https://github.com/dotnet/spark/issues/45](https://github.com/dotnet/spark/issues/45),
I suggest continuing the discussion there. As mentioned there, we want to be
transparent about the benchmark code and systems we use. We are currently
working on Arrow support so that we can compare fairly.

------
tombert
How is this different from Mobius [1]?

[1] [https://github.com/Microsoft/Mobius](https://github.com/Microsoft/Mobius)

~~~
jongalloway2
Here's my understanding:

- Mobius is .NET Framework / Mono based and its cross-platform support isn't
great; .NET for Apache Spark is .NET Core / .NET Standard and built with
cross-platform support as a primary concern

- Mobius only targets up to Spark 2.0, while Spark LTS is up to 2.4 now

- .NET for Apache Spark is built to take advantage of .NET Core performance
improvements, showing big advantages over the Python and R bindings,
especially when user-defined functions are a major factor

- .NET for Apache Spark is driven by lessons learned and customer demand,
including major big-data users inside and outside Microsoft

Disclaimer: I know people who worked on this and helped from the .NET
Foundation side, but the above is my unofficial summary from READMEs and such.

