.Net for Apache Spark Preview (microsoft.com)
65 points by affogarty on Apr 24, 2019 | 21 comments

I saw the announcement about .NET interop support in Apache Spark some time ago. The benchmarks are interesting and tell the story: in a few cases it is faster than Python, but slower than Scala/JVM, which is native for Spark. Maybe with Arrow interchange Python's performance would increase (and so would that of other interop layers that use Arrow, e.g. .NET).

But performance is not the only thing - there is also the ability to debug issues. For this you still need to dig into the Apache Spark core, which is written in Scala.

This implementation in .NET would be a "gateway drug" for moving your production to Scala/JVM.

It happened to me with PySpark - the majority of tasks at hand can be solved with PySpark. But digging into the issues and stack traces brought me to the Scala internals of Apache Spark. As a result, in cases where Python-specific libraries are not needed and high performance is, I would write Spark programs in Scala from the beginning.

Same, I recently moved an ML pipeline from PySpark to pure Python because of the debuggability issue. The data science team, who managed the project, were experts in Python but relatively weak in Scala/Java. There were many issues where an improper data type would blow up pickling on the Java side and return absolutely cryptic errors. It was also difficult to do any sort of integration testing or profiling on the code - the move off of Spark originally started as a way to do integration testing and profiling.

Indeed the real sad part is you can’t lead teams there early (premature optimization). Everybody seems to make the same rough transition on their own.

On a somewhat related note, for my purposes the real deciding factor in sticking with Scala/JVM for (production) Spark work is testability: With that setup, it's dead easy to fire up a local Spark context, run unit tests against it, and keep the tests running reasonably fast.

For Python guys - pyspark is now also installable from pip as a package (it includes some .jars, so it is a ~100 MB Python package). My team installs pyspark as a package for local unit tests.

These are all good points. Debuggability and general support for the development lifecycle are important. We are definitely working on providing first-class development experiences for .NET developers. .NET for Apache Spark is already available as a NuGet package for local install. We are currently working on adding support to VS Code, Visual Studio, etc. Feel free to tell us your preferred dev platform. [Disclaimer: I am Program Manager for the .NET for Apache Spark effort]

Thanks for the response!

FWIW, I was speaking specifically to being able to run Spark, and manage its lifecycle, all inside the same process as the unit test code. Which is something that I'll openly concede isn't much more than a fun party trick for most people's purposes, but it does happen to serve me well.

In a past life, I was involved in data engineering at a .NET shop, and being able to migrate parts of our process to something like Spark without having to rewrite or otherwise severely damage it would have made me very happy. Even better if I could stay inside Visual Studio, and bang on it from an F# interactive session.

Wild speculation, but if you can produce type providers that know how to tame `Dataset[Row]`, you might have some nonzero number of F# hipsters like me kissing your feet.

(Or not. Like I said, my perspective on Spark is unusual.)

Thanks... More idiomatic F# support is on the roadmap

Python is a second class citizen in the world of Spark. Perf issues for UDFs. Some functions are only available through Scala. Python support for new features is always late. And so on. It is good to have .NET support but I will stick with Scala for the same reasons I switched from Python to Scala.

That is the exact reason that I learned to stay with platform languages for production code, even if there are more interesting ones trying to plug into it.

The costs of FFI, extra debugging layers, and weaker tooling integration aren't worth a couple of language-feature bullet points.

What's probably more interesting is how similar .NET, Scala, and Python are in query performance. Not sure whether that should be attributed to great Python performance or to really bad Scala/.NET performance.

Most of PySpark is simply telling the JVM what to do; it's not actually running Python directly. UDFs are where the real differences are, and they mentioned that CLR UDFs serialize the Spark Rows 2x faster than Python, but it's not clear if they were using Apache Arrow-enabled pandas UDFs, which are 3x-100x faster:


Python 2.7? Also, the new Apache Arrow integration changes the Python performance characteristics a lot. I wonder if they are using Arrow for their JVM <-> CLR interop; if not, that would probably be a good idea.

Your post made me curious and I raised the issue with MS at https://github.com/dotnet/spark/issues/45. I hope that benefits the community and gets MS on the right track (by finally supporting Arrow).

It would be interesting to hear about it from MS. Do you know of other settings / configurations / features that could greatly influence the result of such a comparison?

Since we started replying to these points on the GitHub thread at https://github.com/dotnet/spark/issues/45, I suggest continuing the discussion there. As mentioned there, we want to be transparent with the benchmark code and systems we use. We are currently working on Arrow support to compare fairly.
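
To make the Arrow point a bit more concrete, here is a toy sketch in plain Python (no Spark involved) of why shipping data across a runtime boundary in batches tends to beat doing it one row at a time. `Boundary`, `apply_per_row` and `apply_batched` are hypothetical names made up for illustration, and pickle merely stands in for whatever wire format a real interop layer uses. Spark's actual serializers and Arrow's columnar format are far more sophisticated than this; the sketch only counts crossings.

```python
import pickle
from typing import Any, Callable, List


class Boundary:
    """Toy stand-in for a JVM <-> worker boundary (an assumption for
    illustration): every crossing serializes and deserializes the payload."""

    def __init__(self) -> None:
        self.crossings = 0

    def send(self, payload: Any) -> Any:
        self.crossings += 1
        return pickle.loads(pickle.dumps(payload))


def apply_per_row(values: List[int], fn: Callable[[int], int],
                  boundary: Boundary) -> List[int]:
    """Ship each value to the worker and each result back, one at a time."""
    out = []
    for v in values:
        arg = boundary.send(v)               # engine -> worker
        out.append(boundary.send(fn(arg)))   # worker -> engine
    return out


def apply_batched(values: List[int], fn: Callable[[int], int],
                  boundary: Boundary) -> List[int]:
    """Ship the whole batch once in each direction."""
    batch = boundary.send(list(values))             # engine -> worker
    return boundary.send([fn(v) for v in batch])    # worker -> engine


def double(v: int) -> int:
    return v * 2


values = list(range(1000))

per_row_boundary = Boundary()
per_row_result = apply_per_row(values, double, per_row_boundary)

batched_boundary = Boundary()
batched_result = apply_batched(values, double, batched_boundary)

print("same result:", per_row_result == batched_result)
print("crossings per row vs batched:",
      per_row_boundary.crossings, batched_boundary.crossings)
```

Both paths compute the same thing; the difference is only in how many times the boundary is crossed, which is the kind of fixed overhead a fair benchmark would need to account for.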

A large percentage of Spark code is really just assembling lego blocks. The built-in blocks are themselves all written in Java or Scala, and the performance of the code that stacks them together is negligible.

It's mainly when you start writing custom UDFs (IOW, fabricating your own lego blocks) that platform interop and the performance of your language of choice become a big deal.
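
The lego-block picture can be sketched with a toy model in plain Python. `ToyFrame` and `ToyEngine` are hypothetical names, not Spark APIs: the frame only records a plan, the engine (standing in for the JVM, though here it is of course also Python) executes it, and a counter tracks how often the engine has to call back into user code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Step = Tuple[str, Any]


class ToyFrame:
    """Driver-side handle: each method only appends a step to the plan."""

    def __init__(self, plan: Tuple[Step, ...] = ()) -> None:
        self.plan = plan

    def filter_gt(self, column: str, threshold: float) -> "ToyFrame":
        # Built-in block: described by data, no user code attached.
        return ToyFrame(self.plan + (("filter_gt", (column, threshold)),))

    def filter_udf(self, fn: Callable[[Row], bool]) -> "ToyFrame":
        # Custom block: carries an opaque user function.
        return ToyFrame(self.plan + (("filter_udf", fn),))


@dataclass
class ToyEngine:
    """Stand-in for the engine that owns the data and runs the plan."""

    rows: List[Row]
    callbacks: int = 0  # times the engine had to call user code

    def run(self, frame: ToyFrame) -> List[Row]:
        out = self.rows
        for op, arg in frame.plan:
            if op == "filter_gt":
                column, threshold = arg
                out = [r for r in out if r[column] > threshold]
            elif op == "filter_udf":
                kept = []
                for r in out:
                    self.callbacks += 1  # one boundary crossing per row
                    if arg(r):
                        kept.append(r)
                out = kept
            else:
                raise ValueError(f"unknown step: {op}")
        return out


rows = [{"id": i, "score": i * 10} for i in range(5)]

builtin_engine = ToyEngine(rows)
builtin_result = builtin_engine.run(ToyFrame().filter_gt("score", 20))

udf_engine = ToyEngine(rows)
udf_result = udf_engine.run(ToyFrame().filter_udf(lambda r: r["score"] > 20))

print("same result:", builtin_result == udf_result)
print("callbacks builtin vs udf:", builtin_engine.callbacks, udf_engine.callbacks)
```

The two queries return the same rows, but only the UDF version makes the engine call user code for every row, which is where interop cost and language speed would start to matter.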

How is this different than Mobius [1]?

[1] https://github.com/Microsoft/Mobius

Here's my understanding:

- Mobius is .NET Framework / Mono based and x-plat isn’t great, .NET for Apache Spark is .NET Core / .NET Standard and built with x-plat as a primary concern

- Mobius only targets up to Spark 2.0; while Spark LTS is up to 2.4 now

- .NET for Apache Spark is built to take advantage of .NET Core performance improvements, showing big advantages over Python and R bindings, especially when user defined functions are a major factor

- .NET for Apache Spark is driven by lessons learned and customer demand, including major big data users inside and outside Microsoft

Disclaimer: I know people that worked on this and helped from .NET Foundation side, but the above is my non-official summary from readme's and stuff.

From the github repo: https://github.com/dotnet/spark#inspiration-and-special-than...

>> Mobius: C# and F# language binding and extensions to Apache Spark, a pre-cursor project to .NET for Apache Spark from the same Microsoft group.
