But performance is not the only consideration; there is also the ability to debug issues. For that you still need to dig into the Apache Spark core, which is written in Scala.
This .NET implementation would be a "gateway drug" for moving your production code to Scala/the JVM.
That's what happened to me with PySpark: the majority of tasks at hand could be solved with PySpark, but digging into issues and stack traces led me into the Scala internals of Apache Spark.
As a result, in cases where Python-specific libraries are not needed and high performance is needed, I would write Spark programs in Scala from the beginning.
FWIW, I was speaking specifically to being able to run Spark, and manage its lifecycle, all inside the same process as the unit test code. Which is something that I'll openly concede isn't much more than a fun party trick for most people's purposes, but it does happen to serve me well.
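For the curious, the "Spark in the same process as the test" trick looks roughly like this in Spark's native Scala API (a minimal sketch, assuming `spark-sql` is on the test classpath; the object and app names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Local-mode SparkSession: the "cluster" is just threads inside this JVM,
// so the test code and Spark share one process and one lifecycle.
object InProcessSparkTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")        // no external cluster; executors run in-process
      .appName("in-process-test")
      .getOrCreate()

    // Exercise Spark without ever leaving the process running the test.
    assert(spark.range(10).count() == 10)

    spark.stop()
  }
}
```

The same pattern works from any JVM test framework, since the session is just an object you create and tear down.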
In a past life, I was involved in data engineering at a .NET shop, and being able to migrate parts of our process to something like Spark without having to rewrite or otherwise severely damage it would have made me very happy. Even better if I could stay inside Visual Studio, and bang on it from an F# interactive session.
Wild speculation, but if you can produce type providers that know how to tame `Dataset[Row]`, you might have some nonzero number of F# hipsters like me kissing your feet.
(Or not. Like I said, my perspective on Spark is unusual.)
FFI overhead, extra debugging layers, and weaker tooling integration aren't worth a couple of language-feature bullet points.
It's mainly when you start writing custom UDFs (IOW, fabricating your own lego blocks) that platform interop and the performance of your language of choice become a big deal.
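To make the UDF point concrete, here is what such a "custom lego block" looks like in Spark's native Scala API (a hedged sketch, assuming `spark-sql` on the classpath; the names are made up for illustration). A JVM-language UDF runs inside the executor process, whereas Python or .NET UDFs must serialize each row across an interop boundary, which is where binding performance starts to matter:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("udf-demo")
      .getOrCreate()
    import spark.implicits._

    // A trivial custom UDF; in Scala it executes directly in the executor JVM,
    // with no cross-process serialization per row.
    val plusOne = udf((x: Long) => x + 1)

    val df = Seq(1L, 2L, 3L).toDF("x")
    df.select(col("x"), plusOne(col("x")).as("y")).show()

    spark.stop()
  }
}
```

Built-in `functions.*` operations avoid this cost entirely in every binding; it's only user-written row-at-a-time logic that exposes the interop gap.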
- Mobius is .NET Framework / Mono based and x-plat isn’t great, .NET for Apache Spark is .NET Core / .NET Standard and built with x-plat as a primary concern
- Mobius only targets up to Spark 2.0, while the Spark LTS line is now at 2.4
- .NET for Apache Spark is built to take advantage of .NET Core performance improvements, showing big advantages over Python and R bindings, especially when user defined functions are a major factor
- .NET for Apache Spark is driven by lessons learned and customer demand, including major big data users inside and outside Microsoft
Disclaimer: I know people who worked on this and helped from the .NET Foundation side, but the above is my non-official summary from READMEs and such.
>> Mobius: C# and F# language bindings and extensions to Apache Spark, a precursor project to .NET for Apache Spark from the same Microsoft group.