AutoML toolkit for neural architecture search and hyper-parameter tuning (github.com/microsoft)
147 points by msalvaris on March 1, 2019 | 59 comments



I manage a machine learning team for a large financial services company and AutoML tools, Microsoft’s NNI included, are on our radar.

I think the "future of work" for machine learning practitioners will quickly separate into two groups: a very small, elite group that performs research, and a much larger group that uses AutoML but whose jobs also deal more with data preparation (which gets automated too) and ML devops, supporting models in production.


This sounds like parody to me. There are so many problems in applied statistics, and neural networks are not helpful for most of them. Consider Bayesian analysis for very small data sets as an example (just the tip of the iceberg).

In financial services in particular, there are tons of time series and regression problems on small data such that a neural network (beyond perhaps some super small MLP) would be a ridiculous thing to try.

I think the breakdown of workload you described will only happen in business departments where there is a need for large scale embedding models, enhanced multi-modal search indices, computer vision and natural language applications, and maybe a handful of things that eventually productize reinforcement learning. I could also see this happening in businesses that can benefit from synthetically generated content, like stock photography, essays / news summaries / some fiction, website generators, probably more.

What I described above is a tiny drop in the ocean of applied statistics problems that businesses have to solve.


It's another example of the FAANG + Bay Area Startups world versus the other 99% of Corporate America. In the latter world, most of the "machine learning" in production is traditional stuff like Random Forest, SVM, and more recently Gradient Boosting. Hell, Marketing departments across the country are still running old school decision tree (CART and CHAID) models and logistic regression models written in SAS 20+ years ago. DL/NN is a minuscule proportion of production ML in the enterprise space.


I think there is good reason that "old" machine learning models are more popular than DNNs in the enterprise space. Most of the data is in tabular format. What's more, an "old" and simple decision tree or linear model is easy to understand and deploy, and it is fast. There is a clear advantage to having even a simple decision tree in the system rather than making decisions at random.


The main reason though is that these other methods outperform neural nets in tons of different situations. Even just from an accuracy / business success metric point of view, many problems are just better solved with other classes of models, domain-specific feature engineering, etc. It will probably remain so for many decades at least.


DNNs make good features though, especially if you have time series data or lots of text.

I agree that the final model should be a random forest / XGBoost / LightGBM for typical tabular data.


I meant that extracting an intermediate layer as a feature embedding and then sticking a classical model on top of it performs worse than curating features through domain-specific expert tuning, for a ton of diverse application domains.


Deep learning also works on very small data sets by means of embeddings: a large model trained on a large data set can be used as a feature extraction tool when training on a small data set.
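Rough sketch of what I mean, with toy stand-in data (assumes a torchvision pretrained ResNet as the feature extractor; the logistic regression on top is the small-data model):

    import torch
    import torchvision.models as models
    from sklearn.linear_model import LogisticRegression

    backbone = models.resnet18(pretrained=True)
    backbone.fc = torch.nn.Identity()          # keep the 512-d penultimate features
    backbone.eval()

    # stand-in for a tiny labelled data set: 100 random "images" with binary labels
    x = torch.rand(100, 3, 224, 224)
    y = torch.randint(0, 2, (100,))

    with torch.no_grad():
        feats = backbone(x)                    # (100, 512) embedding matrix

    clf = LogisticRegression(max_iter=1000).fit(feats.numpy(), y.numpy())
    print("accuracy on the toy batch:", clf.score(feats.numpy(), y.numpy()))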


Re-using an existing model to generate embeddings doesn't work well for auxiliary tasks with very small data. Even if you do no fine-tuning at all, you still need a big data set for the auxiliary task too.

For example, consider needing to train hundreds of unique small models every day, based on new customer inputs affecting the causal effects for that day (I had to do this for ad forecasting in a past job).

Generating embeddings via pre-trained models essentially produced gibberish and performed far worse than custom feature engineering + simple logistic models.


I’ve seen this mentioned before, including a blog post by the fast.ai folks. Any idea where I can get details? If my tabular data set is small, what kind of embedding can I get out of it? Or is the idea that a larger data set is used for embeddings of categorical data?


Pre-trained embeddings are only helpful if they are trained on a different (ideally larger) dataset or even a different task, but with the same kind of input data. So you would need to find out where else something similar to the data in your tables appears. If some of the data is text, word embeddings may be applicable. Or if you're trying to analyze user activity by time and location, you might try to transfer what can be learned about the influence of holidays from publicly observable activity e.g. on Twitter (just a random idea that popped into my head, no guarantee that it can actually work).

Of course if all you have are numbers without context, there isn't a lot you can do to improve the situation.


I think this is mainly a thing for perception (images and sounds). Tabular data would have to match up with the training dataset, and "most" interesting tabular models are the sorts of things guarded like piles of gold by the businesses that build them...


The parent did not specifically talk about NNs. As I understand it, AutoML could apply to all statistical endeavours that involve estimation (classical or Bayesian).


> “AutoML could apply to all statistical endeavours that involve estimation”

Yes, this is the part that sounds like parody to me. At least, as a working statistician, I can tell you that the concept of AutoML could not apply to the vast majority of things I work on.


Could you give an example? I have a hard time understanding what you could mean, as Algorithm Configuration & Selection is such a general framework. If you are solely talking about the current state of the art, I would agree that techniques from AutoML do not have the generality and autonomy of an expert human.


For example, look into Chapter 5 on logistic regression from the Gelman & Hill book on hierarchical models & regression.

It walks through an example with arsenic data in wells and a problem of estimating how distance, education and some other factors relate to a person’s willingness to travel to a clean well for water.

Deciding how to standardize the input features, how to rescale so the regression coefficients are interpretable in meaningful human units, how to interpret statistics of the fitted model to decide whether adding a feature is helping or hurting (since this cannot be deduced from raw accuracy metrics alone), how to interpret deviance residual plots for outlier analysis, etc.
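Concretely, the kind of hand-work I mean looks more like the sketch below than like a search loop (assuming the wells data sits in a wells.csv with dist/arsenic/educ/switch columns, roughly as in the book):

    import pandas as pd
    import statsmodels.formula.api as smf

    wells = pd.read_csv("wells.csv")     # assumed file and column names

    # rescale by hand so coefficients are interpretable in human units:
    # distance in units of 100 m, education in units of 4 years
    wells["dist100"] = wells["dist"] / 100.0
    wells["educ4"] = wells["educ"] / 4.0

    fit = smf.logit("switch ~ dist100 + arsenic + educ4 + dist100:arsenic", data=wells).fit()

    # judge each term by its sign, magnitude and standard error in context,
    # and by residual plots -- not by a single accuracy number
    print(fit.summary())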

All those things have nothing to do with changing the architecture of the model, except possibly including or excluding features. In that example there were no hyperparameters to tune, and the inference problem would not make sense for hyperparameter tuning on raw accuracy outputs anyway: the goal was not optimizing prediction but understanding the impact of features that have semantic meaning in the context of possible policy choices that could be adopted.

By way of contrast, applying an automated subset selection algorithm to automatically choose the features would be a naive idea with likely bad results in that case, and setting up an optimization framework that would optimize over possible transformations or standardizations of the inputs seems equally dubious compared with expert, context-aware human judgment.

And this is a very trivial example. If you modify a problem like this to address causal inference goals, or add some type of cost optimization on top of it, it becomes more and more complex, but exactly in a way that a tool like AutoML can’t help with.

In other words, making an AutoML that can truly apply to all types of estimation or inference problems is no easier than solving strong AI computer vision and natural language problems entirely, since you need contextual reasoning and creative proposals for inventing features and sleuthing the goodness of fit of a certain model architecture in light of the human-level inference goal you’re trying to reach.


So you never tune hyperparameters or try different models to see which works better?


I do plenty of that, and AutoML could help with a small fraction of that.


The problem is "Applied Statistics" became "Machine Learning" which became "AI" which became "Deep Learning".

Throw away all the BS and, yes, it's obvious.


Google, Facebook & MS have already even automated research, i.e. automated selection of a loss function, network architecture, individualized network topology, etc. Amazon is not there yet. The rest of industry is still in the "stone age", just "considering" using something like AutoML for basic hyperparameter tuning.


If you automate it, is it still research? Research implies some sort of hypothesis testing, yes?

I suppose OP means there will be two groups: people who use AutoML and people who try to make AutoML better.


There should be at least 3 groups, because making AutoML better != making ML better.


Why? The concept of AutoML does include the design of novel algorithms.


What do you mean? I thought AutoML was a tool to do neural architecture search, and hyperparameter tuning.


The field of automatic machine learning (abbreviated as AutoML) concerns all endeavours to automate the process of machine learning. To provide a sense of what could constitute AutoML, let me post a list from the "Call for Papers" of the International Workshop on Automatic Machine Learning (ICML 2018) [1]:

    * Model selection, hyper-parameter optimization, and model search
    * Neural architecture search
    * Meta learning and transfer learning
    * Automatic feature extraction / construction
    * Demonstrations (demos) of working AutoML systems
    * Automatic generation of workflows / workflow reuse
    * Automatic problem "ingestion" (from raw data and miscellaneous formats)
    * Automatic feature transformation to match algorithm requirements
    * Automatic detection and handling of skewed data and/or missing values
    * Automatic acquisition of new data (active learning, experimental design)
    * Automatic report writing (providing insight on automatic data analysis)
    * Automatic selection of evaluation metrics / validation procedures
    * Automatic selection of algorithms under time/space/power constraints
    * Automatic prediction post-processing and calibration
    * Automatic leakage detection
    * Automatic inference and differentiation
    * User interfaces and human-in-the-loop approaches for AutoML
[1] https://sites.google.com/site/automl2018icml/


> I don't see "Automatic design of novel algorithms" in this list. Can AutoML produce something as novel as a GAN, CapsNet, WaveNet, Transformer, Neural ODE, etc? Is that even considered to be one of its goals? In my opinion, there's a clear separation between a group of people trying to improve AutoML so that it's more useful in doing all those tasks on the list, and a group of people trying to invent next gen ML algorithms or DL architectures.

I agree with you from the view of the current state-of-the-art methods and the current state of the AutoML / fundamental ML research communities. Current methods are very limited, but I cannot think of a reason why a sufficiently general search space of architectures/pipelines could not produce something like a GAN or a WaveNet.

I do not think that designing algorithms as novel as the ones you listed is currently a goal of AutoML, as that is not something we have a line of attack for. However, I do think that with increasing capabilities, the field of AutoML will seek to automate every step of the machine learning pipeline - including the design of algorithms. E.g., once/if there are ways to apply NAS to yield truly novel architectures, I think NAS researchers will be happy to do just that -- wouldn't you call that AutoML then?


> sufficiently general search space

But that would require enormous computing resources!


I don't see "Automatic design of novel algorithms" in this list.

Can AutoML produce something as novel as a GAN, CapsNet, WaveNet, Transformer, Neural ODE, etc? Is that even considered to be one of its goals?

In my opinion, there's a clear separation between a group of people trying to improve AutoML so that it's more useful in doing all those tasks on the list, and a group of people trying to invent next gen ML algorithms or DL architectures.


Hasn't this always been the case? Actually fitting a model has always been a pretty small part of an applied statistician's job. The real work is everything before and after that point.


I'd be interested in the creator's thoughts on this paper, "Random Search and Reproducibility for Neural Architecture Search", https://arxiv.org/abs/1902.07638, posted on the arxiv last week. Among other conclusions, they find:

"Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks"

ENAS, the specific algorithm that they find does no better than random search, is in this library. My understanding is that the results are pretty generic though, i.e. NAS is very far from a solved problem. (Hyperparameter tuning for "classical" models is another matter. That's commoditized and available as a service at this point, see TPOT, DataRobot, etc., etc.)
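For the "commoditized" classical case in that last parenthetical, a random-search baseline really is just a few lines of scikit-learn (a sketch with toy data, nothing to do with NNI or the paper):

    from scipy.stats import loguniform, randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)

    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_distributions={
            "learning_rate": loguniform(1e-3, 3e-1),
            "max_depth": randint(2, 8),
            "n_estimators": randint(50, 400),
        },
        n_iter=20, cv=3, random_state=0,
    )
    search.fit(X, y)                      # 20 random configs, 3-fold CV each
    print(search.best_params_, search.best_score_)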


> We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) in our current stage.

No Windows support in a Microsoft product. Curious.

This looks very useful for tuning hyper-parameters, and the fact that the tuned algorithm is treated as a black box makes this very flexible.


Actually, they will support Windows later. Since many developers train their deep learning models on Linux, they support Linux and Mac first.


Their example with LightGBM (https://nni.readthedocs.io/en/latest/gbdt_example.html) is very cool - I wanted to put together a custom script with mlflow + catboost + mlrMBO to do something similar, but this puts everything together in one package.
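For anyone who hasn't clicked through: the trial-script pattern is roughly the sketch below (my own stripped-down version, not the exact code from the docs; the search space lives in a separate JSON file, and the tuner's optimize_mode has to be set to minimize for an error metric like this):

    import lightgbm as lgb
    import nni
    from sklearn.datasets import make_regression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # toy data standing in for the real problem
    X, y = make_regression(n_samples=2000, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    params = {"objective": "regression", "verbosity": -1}
    params.update(nni.get_next_parameter())          # tuner-suggested hyper-parameters

    booster = lgb.train(params, lgb.Dataset(X_tr, label=y_tr),
                        valid_sets=[lgb.Dataset(X_val, label=y_val)])

    mse = mean_squared_error(y_val, booster.predict(X_val))
    nni.report_final_result(mse)                     # the metric the tuner optimizes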

I think this does everything MLFlow does and more (besides maybe helping with deployment?)


I'm working on auto hyper-parameter tuning and network optimization, and I think people have put too much focus on NAS, which aims to create a whole new network from scratch, and not nearly enough on hyper-parameter tuning and local structural optimizations for an existing network, which is more in demand, at least in industry. It looks less cool than NAS though; maybe that's the reason.


I don't understand - isn't this model fishing? How is it different?


With training, test and validation sets.

In good old fashioned statistics there's the idea of the jackknife: for the i-th sample run a regression on all the data except i, and store statistics of interest (coefficients, predictions, etc). This gives you an ipso facto sampling distribution for the statistics of interest.

Similar and more common in econometrics is the bootstrap: run your model on, say, 1999 subsamples of the data (drawn with replacement) and get sampling distributions.

With said sampling distributions, whether from the jackknife or the bootstrap, you're able to test whether your model is valid -- what's the probability that it'll have significant coefficients or an r2/mae/mape score indicating predictive capacity.
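Rough sketch of the bootstrap version with toy data and plain numpy:

    import numpy as np

    rng = np.random.default_rng(0)

    # toy data standing in for whatever regression you'd actually run
    n = 500
    X = rng.normal(size=(n, 2))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

    def fit_ols(Xs, ys):
        A = np.c_[np.ones(len(Xs)), Xs]              # add intercept column
        beta, *_ = np.linalg.lstsq(A, ys, rcond=None)
        return beta

    coefs = []
    for _ in range(1999):                            # 1999 bootstrap replicates
        idx = rng.integers(0, n, size=n)             # resample rows with replacement
        coefs.append(fit_ols(X[idx], y[idx]))
    coefs = np.array(coefs)

    lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0)
    print("bootstrap 95% intervals per coefficient:", list(zip(lo, hi)))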

Cross-validation (and even scikit-learn is starting to default to five folds, not three) is a "lazy" version of this. You don't get a sampling distribution, but at least you're able to tell when a given model only appears good because it grips the data with all its might, and doesn't actually work out-of-sample.

sklearn even offers the jackknife under an ML-y name: leave-one-out cross-validation (LeaveOneOut).


Yes, but that's not necessarily bad. You want a model that effectively captures the structure present in your dataset. There are currently only rules of thumb for model architecture, so it makes sense to explore the model space to determine which architecture and hyperparameters suit the needs at hand. Two things save this from being a statistical sin. First, the final evaluation set is typically different from the validation set, and evaluation is only performed at the end of the "fishing expedition", providing a reliable measure of the model's ability to generalize. Second, we're doing engineering here, not science: the goal is to capture the structure of the observations, not to make a scientific claim about the values of latent parameters.
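Sketch of that protocol with a toy dataset (the model class and split sizes are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)

    # hold out a final evaluation set that the search never sees
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

    # the "fishing expedition": keep whatever scores best on the validation set
    best_score, best_model = -1.0, None
    for depth in [2, 4, 8, None]:
        model = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score > best_score:
            best_score, best_model = score, model

    # evaluate exactly once on the untouched test set
    print("final held-out accuracy:", best_model.score(X_test, y_test))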


Interesting - there's no scikit-learn support, which has long been the mainstay for data scientists everywhere.

Are people migrating from scikit-learn to TensorFlow in production for non-deep-learning use cases?


I think it should probably support scikit as well as any other library, since it's only making suggestions of hyper-parameters based on recorded/historical observations or random evaluations.

At least that's the behaviour of the platform[1] I am working on.

[1]: https://github.com/polyaxon/polyaxon#hyperparameters-tuning


I think it all depends on the purpose of the library and who the target user is. NNI is a package for tuning neural network models, so it will mostly be used in cases that require deep neural networks, like image classification or voice recognition.

BTW, I think all autoML solutions forget about end users. They all require too much engineering knowledge from the user. It would be nice to have an autoML solution that can be used by a citizen data scientist.


What about approaches like auto-sklearn [1]? With these it is basically:

  >>> import autosklearn.classification
  >>> automl = autosklearn.classification.AutoSklearnClassifier()
  >>> automl.fit(X_train, y_train)
  >>> y_hat = automl.predict(X_test)
[1] https://automl.github.io/auto-sklearn/stable/


> BTW, I think all autoML solutions forget about end users. They all require too much engineering knowledge from the user. It would be nice to have an autoML solution that can be used by a citizen data scientist.

This is the approach of a project I am currently working on. (and am now explicitly making clear in the README!)


Could you provide some link to the project?


UPDATE: Looking at the docs, there's an example[1] using this library with scikit-learn.

[1]: https://nni.readthedocs.io/en/latest/sklearn_examples.html


At a previous gig we tried to do this: port a computational graph that wasn't a neural network to TensorFlow. It was a disaster. TensorFlow is very tightly optimized for the things Google thinks are important. If you fall off those paths, TensorFlow is a god-awful slow tool to use. We saw a ~20x regression in performance.

In contrast, when we wrote bespoke GPU code for the graph, we saw a ~25x performance increase over relying on CPU plus MKL. I am being deliberately vague here and I cannot give further detail.


You are somewhat uniquely qualified to do so:

> possibly the world's first or second (full-time) CUDA programmer, with 14 filed patents, and the world's fastest implementations of molecular Dynamics (CUDA ports of Folding@Home and AMBER).


Yes, compared to someone who insists on doing all of their computation from python alone, I have a unique (and in my opinion absurd) advantage.

Because I think that's insane. It's one thing if you don't care about speed and you care more about time-to-market. It's another thing if you're complaining about things being too slow but you're not willing to learn about anything that would let you do anything about it. I run into far more of the latter.


There is scikit support. There's an example in the docs.

https://nni.readthedocs.io/en/latest/sklearn_examples.html


I think that for neural networks scikit-learn has not been the "go to" library; in particular, this toolkit advertises automated neural architecture search, which I don't think scikit-learn offers much flexibility for.


Have you seen TPOT [0]? AutoML library that uses genetic algorithms to write scikit code for you. So fun.

[0] https://github.com/EpistasisLab/tpot
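The whole API is basically fit / score / export (toy sketch; the generations/population settings are just illustrative):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    tpot = TPOTClassifier(generations=5, population_size=20, random_state=0, verbosity=2)
    tpot.fit(X_tr, y_tr)                      # evolves scikit-learn pipelines
    print(tpot.score(X_te, y_te))
    tpot.export("best_pipeline.py")           # writes the winning pipeline as plain sklearn code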


There's also auto scikit-learn https://github.com/automl/auto-sklearn if you haven't already come across that.


scikit-learn is a different type of search, hence it will not be supported by this tool or any DNN search tool.

DNNs require an architecture search, i.e. the building blocks are full layers, the depth of the network, the optimizer, etc.

scikit-learn searches a parameter space, i.e. the algorithm's parameters are much simpler and fewer.

So to sum up, DNN search involves big building blocks, while a scikit-learn search (or, for that matter, a search over any "classical ML" algorithm) is more of a parameter search.

[The actual scikit-learn search would also include preprocessing steps, which can be seen as a separate block. Also, note that DNN search is much more expensive than scikit-learn search (~100x).]
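To make the contrast concrete, here are two illustrative search spaces as plain Python dicts (not any tool's actual format):

    # "classical ML": a handful of scalar hyper-parameters for a fixed algorithm
    sklearn_space = {
        "n_estimators": [100, 300, 1000],
        "max_depth": [3, 6, 12, None],
        "min_samples_leaf": [1, 5, 20],
    }

    # NAS-style: the building blocks are whole layers and architectural choices,
    # with the usual training hyper-parameters on top; each point defines a
    # different network to train from scratch, hence the ~100x cost
    nas_space = {
        "num_layers": [4, 8, 16],
        "layer_types": ["conv3x3", "conv5x5", "depthwise_sep", "skip"],
        "width_multiplier": [0.5, 1.0, 2.0],
        "optimizer": ["sgd", "adam"],
        "learning_rate": (1e-4, 1e-1),        # log-uniform range
    }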


Automatically building a scikit-learn estimator can involve a very large number of hyperparameters (on the order of 100), many of them conditional [1]. However, joint architecture and hyperparameter search can be framed on a much simpler search space; e.g., for a recent paper that aims to automate the design of RNA molecules, we formulated a 14-dimensional search space with very few conditional hyperparameters [2].

The tools included in the repository are very broadly applicable and only a few of them are specifically targeted at neural architecture search.

[1] https://www.kdnuggets.com/2016/08/winning-automl-challenge-a...
[2] https://openreview.net/forum?id=ByfyHh05tQ


This tool absolutely supports scikit-learn - please see the docs: https://nni.readthedocs.io/en/latest/sklearn_examples.html


Do we need a hyper-parameter tuner tuner for this?


Stuart Geman (one of the inventors of Gibbs Sampling) always used to say, “Parameters are the death of an algorithm.”


Environmental constraints (like width and height) are not bad, I would have argued to Mr. Geman.


Does it test against and prevent over-fitting?


This is very cool.



