
An End-to-End AutoML Solution for Tabular Data at KaggleDays - antgoldbloom
https://ai.googleblog.com/2019/05/an-end-to-end-automl-solution-for.html
======
ipsum2
> Erkut Aykutlug and Mark Peng used XGBoost with creative feature engineering
> whereas AutoML uses both neural network and gradient boosting tree (TFBT)
> with automatic feature engineering and hyperparameter tuning.

It's hilarious that gradient boosted decision trees beat Google's fancy
AutoML-generated neural networks.

~~~
ptah
yes, and i bet they didn't use 2500 CPUs either

~~~
pplonski86
The first thing is, neural networks are not a golden ML algorithm that always
works best. I'm not surprised to see GBDT methods working better than NNs.
NNs are very powerful because they can accept a wide range of data types
(tabular, images, voice), which is not possible with GBDT (maybe not yet).
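
To make the GBDT side of this concrete, here is a minimal, pure-Python sketch of gradient boosting with depth-1 regression trees ("stumps"), the family of model (like TFBT/XGBoost) under discussion. The toy data and helper names are invented for illustration; real libraries add regularization, second-order gradients, and multi-feature trees.

```python
# Minimal gradient boosting with decision stumps (squared-error loss).
# Toy single-feature data; invented for illustration only.

def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, rounds=50, lr=0.1):
    """Each stump fits the current residuals; predictions accumulate."""
    base = sum(ys) / len(ys)            # start from the mean
    pred = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# A noisy step function: trivial for trees, awkward for a small NN.
xs = [i / 10 for i in range(100)]
ys = [1.0 if x > 5.0 else 0.0 for x in xs]
model = gradient_boost(xs, ys)
print(round(model(2.0), 2), round(model(8.0), 2))  # 0.0 1.0
```

The point of the sketch: each round only needs a sort and a threshold search over raw feature values, which is why GBDT handles unscaled, mixed-magnitude tabular columns so naturally.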

The second thing: I think architecture search for NNs is very inefficient
right now. Most methods train a new NN on every attempt. There is a lack of
methods that can start with any NN architecture and grow/prune it dynamically.
Take a look at this poster on dynamic topology adjustment for MLPs:
[http://www.ire.pw.edu.pl/~rsulej/NetMaker/icaisc/icaisc_post...](http://www.ire.pw.edu.pl/~rsulej/NetMaker/icaisc/icaisc_poster.pdf)

~~~
mark_l_watson
You might like AdaNet, which does architecture search in one TensorFlow
session and seems efficient (at least in my use of it).

------
kmax12
I think it’s a bit of an overstatement to call this an end-to-end solution.

What they are starting with here is a single table of data with all the
features already defined and an existing binary label column. Typically, when
this type of data is collected in the field it is much more fine-grained (i.e.
many observations collected over time) and unlabeled (e.g. how do we define a
true example? How many false examples do we select?).

The competition description even goes so far as to say “We have chosen a
dataset that you can get started with easily”.

So, yes, this is a cool demonstration of Google's product, but the success in
the competition might not extend to the problems real businesses face when
trying to apply ML to a problem like this.

That being said, I do think AutoML can help with these problems as it is
extended to handle data that isn’t in a single table already.

For example, I'm a developer of an open source library called Featuretools
([https://github.com/Featuretools/featuretools](https://github.com/Featuretools/featuretools))
that tries to automate feature engineering for temporal and relational
datasets. Basically, it helps data scientists prepare real-world data into the
form this competition starts with.
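
For readers unfamiliar with the idea, here is a hypothetical, stdlib-only sketch of the kind of rollup such tools automate: turning a relational child table (transactions) into per-parent (customer) features. The table and column names are invented; Featuretools itself generalizes this across whole schemas via its `EntitySet` and `dfs` APIs.

```python
# Sketch of one level of automated feature engineering: apply a bank of
# aggregation primitives to the child rows of each parent entity.
# Data and names invented for illustration.
from collections import defaultdict
from statistics import mean

transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 5.0},
]

AGGS = {"count": len, "sum": sum, "mean": mean, "max": max}

def dfs_one_level(child_rows, parent_key, value_col):
    """Roll child rows up into one feature row per parent."""
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[parent_key]].append(row[value_col])
    return {pid: {f"{name}({value_col})": agg(values)
                  for name, agg in AGGS.items()}
            for pid, values in grouped.items()}

feats = dfs_one_level(transactions, "customer_id", "amount")
print(feats[1])
# {'count(amount)': 2, 'sum(amount)': 55.0, 'mean(amount)': 27.5, 'max(amount)': 35.0}
```

Stacking this step across multiple table relationships is essentially what produces the single flat feature matrix that competitions like this one start from.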

------
hooloovoo_zoo
Slightly misleading as it wasn't a traditional competition. The same page
shows it at about the 75th percentile in "real" ones.

~~~
ptah
still pretty good for almost zero effort on human side

------
rahimnathwani
I'm interested to know how easy this is for regular people (software engineers
with just a little knowledge of data science) to use.

This part stands out:

"our team spent most of time monitoring jobs and waiting for them to finish.
Our solution for second place on the final leaderboard required 1 hour on 2500
CPUs"

Before I got to this part, I had assumed using AutoML would involve only
reformatting the training/validation data, and then letting a single job run
its course. Why does something that's 'automatic' need people to run multiple
jobs?

Anyone know why they used CPUs instead of GPUs/TPUs? If they're distributing
the computation over 100s of CPUs, then it's clear the computations can be
done in parallel.

~~~
fnbr
(I'm a researcher in ML/DL.)

> Anyone know why they used CPUs instead of GPUs/TPUs?

Cost and resource availability.

They're distributing each computation on a different CPU, not distributing
each computation over multiple CPUs.

It would be faster to have each computation run on a CPU + GPU, but that would
be very very expensive, and hard to schedule.

GPUs/TPUs are also only faster for sufficiently large networks and
sufficiently large batch sizes. There's a large fixed cost to send data
to/from the CPU, and for smaller networks, it's often not worth running it on
a GPU. No idea if this was the case here.
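
The fixed-cost argument above can be made concrete with a back-of-envelope model: the GPU only wins once the batch is large enough that its throughput advantage outweighs the fixed per-step launch/transfer overhead. All constants below are invented illustrative numbers, not measurements of any real hardware.

```python
# Toy cost model for CPU vs GPU per training step.
# All numbers are made-up illustrative constants.

def step_time(batch_size, flops_per_example, throughput, fixed_overhead):
    """Seconds for one training step on a device."""
    return fixed_overhead + batch_size * flops_per_example / throughput

FLOPS = 1e5                                       # tiny tabular network
CPU = dict(throughput=5e9, fixed_overhead=0.0)    # no transfer cost
GPU = dict(throughput=5e12, fixed_overhead=2e-3)  # 2 ms launch + copy

for batch in (32, 1024, 65536):
    cpu_t = step_time(batch, FLOPS, **CPU)
    gpu_t = step_time(batch, FLOPS, **GPU)
    winner = "GPU" if gpu_t < cpu_t else "CPU"
    print(f"batch={batch:>6}: cpu={cpu_t*1e3:.2f} ms  "
          f"gpu={gpu_t*1e3:.2f} ms  -> {winner}")
```

With these (invented) constants the CPU wins at batch 32 and the GPU only wins at larger batches, which matches the intuition that many small, independent trials schedule most cheaply as one trial per CPU.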

~~~
yazr
Does this 2500-CPU-hour figure cover the ENTIRE learning process?

Let's say even your first run is crap, and you try again with RGB instead of
YUV or whatever. So you do 4 runs.

So 10,000 CPU-hours replace a week of work of a qualified ML engineer. That is
pretty amazing, if I understand correctly.

~~~
fnbr
I'm not sure about the exact details, but that's my understanding too. It is
very exciting. This is directly replacing a week of work (if not more!) of a
qualified ML engineer. Additionally, as all of the experiments are cheap
(being run on a single CPU), you can run them on cheap, interruptible cloud
instances, and it's not the end of the world if you need to restart some
experiments.

This is also not mature science - there's still a lot of active research being
done on AutoML - so there's still a lot of potential for improvement.

------
filleokus
How do these AutoML solutions (like H2O) work in practice? Anyone willing to
share their experience?

I wonder how automatic machine learning tools like these will shape the "data
science" roles in the future. Obviously, the most cutting edge research will
always be done by specialised human experts, but perhaps tools like these will
lower the bar required for the bulk of mainstream ML work.

~~~
tixocloud
AutoML is great as long as there is some rationale for why a particular model
is developed. However, in real-world applications, there are so many
constraints and considerations beyond just having the most accurate model.

Furthermore, as a VP of data science, selling the business on the value and
benefits of data science is non-trivial, as many are not even aware of what's
possible with data science, so my personal opinion is that ML isn't quite
suitable for mainstream work. Practitioners still need to have a strong grasp
of the business context, data features and nuances, etc., which is still quite
technical in nature. Asking a data analyst to build models with AutoML is
something I consider a leap too far and risky to put into production. The only
way it could work is to have an experienced data scientist supervise.

~~~
mlthoughts2018
Exactly this. The combination of selling a modeling strategy to higher ups,
customizing a model for super weird deployment or resource limitations that
exist for political or historical reasons, and all your standard trade-offs vs
feature engineering and model selection just means that AutoML (and similar
tools) has very little applicability in most product companies.

If you work in Google & you can remove the political blockers & guarantee the
model’s space of deployed resources is parametrizable with a clean set of
parameters that AutoML can consider as part of the optimization, then by all
means use it.

That’s just decades away from being viable at any given product company.

------
pplonski86
Anthony, can Kaggle make this dataset public, or make the competition public
and enable post-competition submissions? It would be beneficial for AutoML
research.

------
villux
Does someone know what process they use for feature engineering?

~~~
ptah
from the diagram on that page it looks like it is automated
[https://2.bp.blogspot.com/-kIut-J_oCmI/XNRSSXIyZ3I/AAAAAAAAE...](https://2.bp.blogspot.com/-kIut-J_oCmI/XNRSSXIyZ3I/AAAAAAAAEHs/CiJIqr24OrY3srh_1FI2xlugEGvjjV06QCLcBGAs/s1600/image2.png)

~~~
villux
Yes, I would be curious to get more insight into that automated process.

------
martingoodson
Looks like H2O achieved a score of 0.61312 against Google's 0.61598, just
training on a single machine:

[https://twitter.com/ledell/status/1116533416155963392](https://twitter.com/ledell/status/1116533416155963392)

------
kelvin0
Let's just hope this does not become the 'Excel' of the ML space. Then anyone
will start 'coding' some godawful models and use them in critical day to day
infrastructure ...

Don't get me wrong, I'm all for democratizing ML, but sometimes these tools
become fully-automatic high-caliber footguns.

~~~
bouk
Excel is an amazing tool, and if this becomes the Excel of the ML space then
it will be a great success

~~~
kelvin0
The horrors I've seen made in Excel still give me nightmares to this day. Its
low barrier to entry is both a blessing and its curse. But at some level this
can be said of any tool under the sun ...

~~~
cwilkes
One man’s horror is another man’s only way to get anything done.

I used to be violently anti-spreadsheet but have come around to being amazed
at what people can do with a very limited subset of tools. So instead of
looking down on them (not saying that you are), I admire what non-technical
people who just want to automate something can do.

Also, I think spreadsheets are better than half of the code I've ever written,
as they are easier to extend without having to delve into the guts of some
impossible-to-understand language.

------
ptah
Can AutoML be replicated outside of Google Cloud?

~~~
Tarq0n
There are many AutoML libraries; however, I'm not aware of any that do
production-grade neural architecture search.

~~~
pplonski86
I'm working on AutoML solution, that is available in the cloud
([https://mljar.com](https://mljar.com)). What is more, the core of my AutoML
is open source ([https://github.com/mljar/mljar-
supervised](https://github.com/mljar/mljar-supervised)) - both are easy to
use. The cloud version has user interface so you need just to upload data and
do few clicks. You don't need to have programming knowledge. For python
package, you need to know how to manipulate data in python (basic numpy and
pandas).

I've run many tests of my AutoML solution, and I observe that neural networks
don't work best on tabular datasets (maybe they're not trained long enough,
but I don't have 2500 CPU hours). I really prefer gradient boosting methods
(xgboost, catboost, lightgbm) on tabular data. They are much faster than NNs
and require less preprocessing (no feature scaling).
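
The "no feature scaling" point can be checked directly: a tree split depends only on the ordering of feature values, so any monotone rescaling leaves the learned partition unchanged. A small sketch with invented toy data:

```python
# Demonstrate that a decision-tree split is invariant to monotone
# feature transforms (here, log-scaling). Toy data invented for illustration.
import math

def best_split(xs, ys):
    """Return the threshold minimizing misclassification on one feature."""
    best_err, best_t = len(ys) + 1, None
    for t in sorted(set(xs))[:-1]:
        preds = [1 if x > t else 0 for x in xs]
        err = sum(p != y for p, y in zip(preds, ys))
        if err < best_err:
            best_err, best_t = err, t
    return best_t

xs = [1.0, 2.0, 3.0, 100.0, 200.0, 300.0]
ys = [0, 0, 0, 1, 1, 1]

t_raw = best_split(xs, ys)
t_log = best_split([math.log(x) for x in xs], ys)

# The thresholds differ numerically, but they induce the same partition:
left_raw = {i for i, x in enumerate(xs) if x <= t_raw}
left_log = {i for i, x in enumerate(xs) if math.log(x) <= t_log}
print(left_raw == left_log)  # True
```

This is exactly why GBDT skips the standardization step an NN would need; the same argument explains why trees are also indifferent to skewed feature distributions.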

