
Microsoft Releases Open Source ML Library for Distributed Search Engine Creation - mhamilton723
https://aka.ms/spark
======
mhamilton723
MMLSpark is Microsoft’s open source initiative for advancing distributed
computing in Apache Spark. MMLSpark provides deep-learning, intelligent
microservices, model deployment, model interpretability, image-processing, and
many other tools to help you build scalable, production-quality machine
learning applications. MMLSpark is usable from Python, Scala, Java, and R and
can be used on any Spark cluster. For more information and setup instructions,
see our GitHub page:

[https://github.com/Azure/mmlspark](https://github.com/Azure/mmlspark)
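For a taste, here is a minimal sketch of what training looks like from Python, adapted from the README linked above (it assumes the `mmlspark` package is available on the cluster and that `train_df` and `test_df` are existing Spark DataFrames):

```python
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier, ComputeModelStatistics

# TrainClassifier wraps a SparkML estimator and handles featurization
# of mixed-type columns automatically.
model = TrainClassifier(model=LogisticRegression(),
                        labelCol="label").fit(train_df)

# Score the held-out set and compute standard evaluation metrics.
metrics = ComputeModelStatistics().transform(model.transform(test_df))
metrics.show()
```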

Thank you!

~~~
mlthoughts2018
Having evaluated Spark extensively for my company’s ML use cases, I came away
deeply disappointed.

One thing that bugs me in particular is that it essentially presumes all
workflows involve huge, distributed datasets. But most model development, even
for projects that will eventually train on huge, distributed data sets in
production, has to begin its life cycle as a small-data prototype with an
extremely low-overhead development cycle. I’ve never found a way to organize
Spark so that it can handle both tiny WIP prototyping and production ML
workloads. Inevitably you end up with two separate tool stacks and lots of
kludgy processes to migrate a successful prototype over to a Spark
implementation, because the repeated overhead of Spark in the small prototype
phase is far too costly and limiting.

Other gripes include the lack of “bring your own container” solutions for
controlling the surrounding runtime details of different models on a
project-by-project basis, and, for Python at least, the lack of any way to
entirely bypass py4j and eliminate all JVM overhead, especially in cases that
rely heavily on custom Python extension modules and extension types.

On top of that, for use cases where you have to write your own likelihood
functions for models that don’t already exist in Spark libraries, and
potentially need to get down to the level of tuning the actual optimizer
you’ll use to train, Spark is extremely opaque and full of incomprehensible
errors and debugging issues. Ideally you’d like to write those routines once
and use them across any development environment (whether executing via Spark
or not); instead, this leads to “vendor lock-in” incentives where you don’t
even try things at all unless they come out of the box in Spark.

All this is to say, I wish that initiatives working toward the very generous
goal of open-sourcing ML tools to the community would treat portability across
different runtime environments, cluster computing frameworks, and backend
storage models as a first-class requirement.

~~~
mhamilton723
Hey, thanks for your interesting point of view on Spark. I do a lot of my
small-data prototyping in Spark in single-machine (local) mode and it has
worked for me thus far :).
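For reference, the local-mode setup being described needs no cluster at all. Assuming only that `pyspark` is pip-installed, something like this is enough (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark inside this one process, using all local cores.
# The same code later points at a real cluster by changing only the master.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("small-data-prototype")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
```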

For the likelihood functions comment, I would totally agree. Autograd
libraries make it much easier to build custom likelihood models, which is why
we created CNTK on Spark and Databricks created TensorFlow on Spark. These
give you the flexibility of modern deep-learning stacks with the elasticity
of Spark.

In the end, though, Spark is a single tool in a collection of tools and might
not be right for your project, but it's been good for a lot of our work here
at MSFT :)!

~~~
mlthoughts2018
On the point about autograd tools: they’re unfortunately not always helpful
for models that aren’t amenable to that kind of framework, like gradient-free
methods, custom Bayesian inference models, or modified versions of some
traditional models (like bias-corrected logistic regression).
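For concreteness, the workflow being described, writing a custom likelihood and fitting it with a gradient-free optimizer, can be sketched in plain Python with no framework at all. The toy data, the random-search optimizer, and all names below are illustrative (a real project might use Nelder-Mead or CMA-ES instead):

```python
import math
import random

def neg_log_likelihood(w, b, xs, ys):
    """Negative log-likelihood of a simple logistic model; any custom
    likelihood (e.g. a bias-corrected variant) could be swapped in here."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def random_search(loss, dim, iters=2000, scale=0.5, seed=0):
    """Tiny gradient-free optimizer: perturb the current point and keep
    improvements. Stands in for any derivative-free method."""
    rng = random.Random(seed)
    best = [0.0] * dim
    best_loss = loss(best)
    for _ in range(iters):
        cand = [v + rng.gauss(0, scale) for v in best]
        cand_loss = loss(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss

# Toy, linearly separable data: positive x goes with label 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
(w, b), nll = random_search(
    lambda p: neg_log_likelihood(p[0], p[1], xs, ys), dim=2)
```

The point is that this entire loop is ordinary single-process code; the friction arises when the same likelihood then has to be re-expressed inside a distributed framework.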

Here again, if you tie machine learning to a big system like Spark, which is
typically a huge IT cost at a lot of companies, and if you commit to an
underlying data model suitable for Spark, it necessitates orienting
everything around Spark (someone with the scale of Microsoft might not suffer
this problem like everybody else). All together, that makes Spark usually
such a limiting choice that it’s totally impractical to standardize on it and
give up all the other types of models, data storage, and cluster computing
techniques you might need on a project-by-project basis.

~~~
mhamilton723
I definitely would agree that you should pick the best tool for the job and
not limit yourself to one ecosystem if it's too difficult.

One way to make Spark a bit easier to work with is through Kubernetes, or a
tool like Databricks that provides it as a service. Kubernetes, in
particular, gives you a great deal of flexibility and composability when
designing systems. One thing we created to fill the gap of having to
integrate system X with Spark is HTTP on Spark, which makes it easy to
integrate Spark with other tools in a microservice architecture. When you
couple this with containers, you can do a lot very quickly.
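The pattern HTTP on Spark enables, calling an external microservice for each batch of rows, can be illustrated framework-free with the standard library. The endpoint, payload shape, and stub model below are all made up for the example; inside Spark the same body would run in `mapPartitions` so the HTTP round-trip is amortized over a whole partition:

```python
import http.server
import json
import threading
from urllib import request

def score_partition(rows, url):
    """Send one batch of rows to a scoring microservice and return its
    predictions. In Spark this would run once per partition, not per row."""
    payload = json.dumps({"instances": list(rows)}).encode("utf-8")
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

# --- A stub "model service" standing in for the real microservice. ---
class _StubModel(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        out = json.dumps(
            {"predictions": [2 * x for x in body["instances"]]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), _StubModel)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d" % server.server_address[1]

predictions = score_partition([1.0, 2.0, 3.0], url)  # one "partition"
server.shutdown()
```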

For datatypes, I would look into the different Spark connectors; these days
there is one for almost every database, streaming service, and cloud store
under the sun.
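For example, with the relevant connector package on the classpath, reading from an external store is a few lines (the connection details below are placeholders, and `spark` is assumed to be an existing SparkSession):

```python
# JDBC connector (ships with Spark): read a table from a relational DB.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://host:5432/db")
           .option("dbtable", "events")
           .load())

# Kafka connector (the spark-sql-kafka package): stream from a topic.
kafka_df = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "host:9092")
            .option("subscribe", "events")
            .load())
```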

That being said, Spark is a large piece of software that uses many different
programming concepts, which can be daunting. Our goal is to listen to
feedback like this so we can make the Spark ecosystem a bit easier to use for
everyone.

~~~
mlthoughts2018
This is actually the part I most disagree with. The overhead of connectors to
Spark, particularly _any_ use of py4j, is far too limiting except in cases
where the data workload is so large that it effectively amortizes the
overhead. For small-scale prototypes it’s a disaster, and then there are
separate concerns about data-type marshaling through the JVM when you may
have a Python-only data model.
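For context, the marshaling cost through py4j is what Spark's Arrow-based pandas UDFs were added to reduce: data crosses the JVM boundary in columnar Arrow batches rather than pickled rows. A sketch, assuming a running SparkSession named `spark` and a recent pyspark:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one(col: pd.Series) -> pd.Series:
    # Executes on whole Arrow batches, not row-by-row through py4j.
    return col + 1.0

df = spark.range(1000).withColumn("y", plus_one("id"))
```

This mitigates, but does not eliminate, the boundary-crossing cost being complained about, and it does not help with Python-only extension types.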

When I evaluated it, I also found Databricks had extremely limited support
for runtime environments defined by arbitrary containers: you have to select
cluster nodes according to their prescribed images and choices.

Say you need TensorFlow or CUDA compiled with a weird set of optimization
flags, or other special provisions in the runtime environment. In fact,
variations of the runtime environment may themselves be part of some
reproducible experiments, so you need to execute across a variety of
parameters that govern _how the runtime is built_.
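One common way to make “how the runtime is built” itself a parameter is to template the container definition from the experiment config. A minimal stdlib sketch; the base images, package versions, and flags below are purely illustrative placeholders:

```python
def render_dockerfile(tf_version: str, cuda_tag: str, cc_flags: str) -> str:
    """Render a container definition whose contents are experiment
    parameters, so the runtime build is reproducible and sweepable."""
    return "\n".join([
        f"FROM nvidia/cuda:{cuda_tag}",
        f"RUN pip install --no-cache-dir tensorflow=={tf_version}",
        # Record the build flags in the image so the environment is part
        # of the experiment's provenance.
        f'ENV CUSTOM_CC_FLAGS="{cc_flags}"',
    ])

def render_sweep(configs):
    """One Dockerfile per runtime configuration in a sweep."""
    return {name: render_dockerfile(**cfg) for name, cfg in configs.items()}

dockerfiles = render_sweep({
    "baseline": {"tf_version": "1.13.1",
                 "cuda_tag": "10.0-cudnn7-runtime",
                 "cc_flags": "-O2"},
    "tuned":    {"tf_version": "1.13.1",
                 "cuda_tag": "10.0-cudnn7-devel",
                 "cc_flags": "-O3 -march=native"},
})
```

Each rendered file would then be fed to an ordinary `docker build`, making the runtime variant just another swept parameter alongside the model's hyperparameters.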

Anything that can’t support this type of stuff out of the box is just not
worth it. Anybody can hook a notebook environment up to analyze data from some
data warehouse or distributed file system.

The hard part is always how to make that setup configurable and
parameterizable across the needs of different projects, especially arbitrary
runtime environments.

