
Show HN: Provide a CSV and a target field, generate a model and code to run it - minimaxir
https://github.com/minimaxir/automl-gs
======
human_scientist
Nice project!

You use grid search for hyperparameter optimization and state that at some
point you would like to add a Bayesian approach. One simple change that could
boost performance would be to use random search in place of grid search. Grid
search is known to perform worse than random search in cases where not all
hyperparameters are of similar importance [1]. Intuitively, in grid search many
evaluations test the same setting of an important hyperparameter while only
varying the unimportant ones.

[1] Section 1.3.1; Automatic Machine Learning: Methods, Systems, Challenges;
Chapter Hyperparameter Optimization

[https://www.automl.org/wp-content/uploads/2018/11/hpo.pdf](https://www.automl.org/wp-content/uploads/2018/11/hpo.pdf)
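
For illustration (plain scikit-learn here, not automl-gs's own search code):
with the same 16-trial budget, grid search only ever tries 4 values per
parameter, while random search draws 16 fresh ones for each.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Grid search: a 4 x 4 grid = 16 trials, but the important parameter
# is only ever evaluated at 4 distinct values.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200, 400],
                "min_samples_leaf": [1, 2, 4, 8]},
    cv=3,
).fit(X, y)

# Random search: same 16-trial budget, but every trial draws a fresh
# value for each parameter, so the important one gets 16 distinct settings.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 401),
                         "min_samples_leaf": randint(1, 9)},
    n_iter=16,
    cv=3,
    random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```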

~~~
MasterScrat
What about HyperOpt?

~~~
human_scientist
HyperOpt is a library for hyperparameter optimization; my comment was about
algorithms. From the hyperopt homepage:

"Currently two algorithms are implemented in hyperopt: Random Search, Tree of
Parzen Estimators (TPE)"

(TPE is a Bayesian Optimization algorithm)
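
For reference, hyperopt's entry point lets you pick the algorithm; a minimal
example with TPE (the quadratic objective is just a stand-in):

```python
from hyperopt import Trials, fmin, hp, tpe

# Toy objective: hyperopt minimizes the returned value.
def objective(x):
    return (x - 3) ** 2

trials = Trials()
best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),  # search space for the one parameter
    algo=tpe.suggest,  # swap in hyperopt.rand.suggest for random search
    max_evals=100,
    trials=trials,
)
print(best)  # e.g. {'x': 3.002...}
```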

~~~
MasterScrat
Ah yes indeed, I thought that hyperopt was actually the name for the TPE
algorithm. TPE should outperform random search in all cases, right?

~~~
human_scientist
It depends on the number of evaluations. The more evaluations there are, the
stronger the model built by the TPE algorithm becomes. With very few
evaluations, we would expect TPE to match random sampling. This effect can be
seen, for example, in the plots of the "Bayesian Optimization and Hyperband"
paper [1, 2], where the plotted "Bayesian Optimization" approach is TPE.

Also, there might be model bias: for example, if the objective function is
stochastic (e.g., a reinforcement learning algorithm that only converges
sometimes) or not very smooth, TPE might exploit areas that are not actually
good on the basis of a single lucky evaluation. In those cases TPE can perform
worse than random search! To alleviate the effect of model bias in model-based
hyperparameter optimization (e.g., TPE) and to obtain convergence guarantees,
people often sample every k-th hyperparameter setting from a prior
distribution instead, i.e., plain random search (this is also the case for the
plots in [1, 2]).
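
Schematically, that interleaving looks like this (sample_prior, suggest_tpe,
and evaluate are hypothetical stand-ins, not any particular library's API):

```python
def run_hpo(sample_prior, suggest_tpe, evaluate, n_trials, k=3):
    """Interleave model-based proposals with draws from the prior.

    sample_prior() -> a config drawn from the prior (i.e., random search)
    suggest_tpe(history) -> a config proposed by the surrogate model
    evaluate(config) -> the observed loss (possibly noisy)
    """
    history = []
    for i in range(n_trials):
        if i % k == 0:
            # Periodic random draw: guards against model bias and
            # preserves random search's convergence guarantee.
            config = sample_prior()
        else:
            config = suggest_tpe(history)  # otherwise exploit the model
        history.append((config, evaluate(config)))
    return min(history, key=lambda pair: pair[1])
```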

If you are wondering which HPO algorithm you should use (or about HPO in
general), I would highly recommend the first part of the AutoML tutorial at
NeurIPS 2018 [3], given by my advisor.

[1] [https://www.automl.org/blog_bohb/](https://www.automl.org/blog_bohb/)

[2]
[http://proceedings.mlr.press/v80/falkner18a.html](http://proceedings.mlr.press/v80/falkner18a.html)

[3]
[https://nips.cc/Conferences/2018/Schedule?showEvent=10979](https://nips.cc/Conferences/2018/Schedule?showEvent=10979)

------
jdright
Please consider changing instances of "Generates native Python code" to
something like "Generates standard Python code without third-party library
dependencies" or equivalent. The term "native" here is not correct and is
confusing.

------
pplonski86
Congratulations on launching! I'm also working on an open source AutoML
solution, and I have a similar problem with finding good heuristics for column
type inference; I always leave the final choice to the user.
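
For what it's worth, here is a minimal sketch of the kind of heuristic I mean
(infer_column_type and its threshold are made up for illustration; real tools
check far more cases):

```python
import pandas as pd

def infer_column_type(series: pd.Series, cat_threshold: int = 20) -> str:
    """Guess a modeling type for a column; the user should be able to override."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return "datetime"
    if pd.api.types.is_numeric_dtype(series):
        # Low-cardinality integers often encode categories (e.g., Pclass).
        if series.nunique() <= cat_threshold:
            return "categorical"
        return "numeric"
    if series.nunique() <= cat_threshold:
        return "categorical"
    return "text"
```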

Have you compared the solution's accuracy with other frameworks? (Or is that
not a priority for your package right now?)

Do you have early stopping implemented?

I will play with your package and come back with more questions.

~~~
minimaxir
> Have you compared the solution's accuracy with other frameworks? (Or is that
> not a priority for your package right now?)

Still working on that.

> Do you have early stopping implemented?

Early stopping is intentionally not implemented in order to keep things
apples-to-apples between trials by doing a full run for each, but I'm open to
reconsidering that, or at least adding a user-set option for early stopping.
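
If such an option materializes, and assuming the TensorFlow backend, it could
be as small as a gated Keras callback (a sketch of one possible shape, not
actual automl-gs code; make_callbacks and its patience parameter are
hypothetical):

```python
import tensorflow as tf

def make_callbacks(patience: int) -> list:
    """Hypothetical user-set option: patience <= 0 keeps the current
    behavior (full runs, so trials stay apples-to-apples)."""
    if patience <= 0:
        return []
    return [tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=patience,
        restore_best_weights=True,
    )]

# e.g. model.fit(x, y, validation_split=0.1, epochs=20,
#                callbacks=make_callbacks(patience=3))
```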

------
nickvincent
This is extremely cool/useful. Really appreciated how well the project is
explained/motivated, e.g. the DESIGN.md file. Thanks for making it public!

Question about the long-term vision: do you see this as potentially being an
educational artifact? Seems like just reading through the code could be a
great way to get familiar with good ETL practices, good vectorization
practices, etc.

------
caseyf7
The titanic data is a nice demo, but I’m kinda disappointed you didn’t try to
predict WWMWD? Longer demo gif please!

~~~
yboris
Please ease up on acronyms or add the expanded form in parentheses. I have no
idea what WWMWD is, even after googling for it for a while.

------
ptd
Cool project! Have you compared your results to Ludwig by Uber? It would be
interesting to see where the results differ.

[https://uber.github.io/ludwig/](https://uber.github.io/ludwig/)

------
mrwebmaster
There are some discussions on whether data scientists are going to be replaced
by automatic tools in the near future.

Can this be considered an example of a tool that partially replaces the work
done by a data scientist? At least it can save a lot of time.

~~~
laingc
Whenever anyone asks this, I always wonder whether I live in a bubble or they
do.

Creating simple predictive models where your problem is already easily
narrowed down to a "given X, predict y" definition is pretty trivial. Having
it automated is nice, but it's not exactly a hard thing to do.

Genuine question: how many people have jobs where those kinds of problems form
any significant part of their workload?

I also often see a response to this sentiment along the lines of, "Yeah, but
there's also data cleaning..." etc. My reaction to this is mixed. I mean,
sure, there is also data cleaning involved, but is this really where people
spend most of their time?

My team spends most of our time doing the following:

1\. Formulating problems. Figuring out the various different ways that a real-
world problem can be expressed mathematically and feasibly attacked
computationally.

2\. Engineering software to implement the solutions to these problems,
sometimes using some of the (amazing) frameworks out there for ML or
probabilistic programming, but often having to develop our own approaches from
scratch.

3\. Doing all the management, stakeholder relationship stuff, business cases,
etc. that make your work relevant and possible.

4\. Getting data. Always an issue.

I'm very genuine in my curiosity here: are we total snowflakes, and do most
data scientists spend their time cleaning data and building "given X, predict
y" models?

~~~
MuffinFlavored
How many business analysts/low-level coders have jobs because they just
implement the same repeated CRUD screens/wireframes or maintain WordPress
themes? It's not the same as data science, but it's close.

------
malshe
This is fantastic! Thanks for putting this on GitHub. I am planning to build
an R Shiny app for AutoML. This gives me a good learning opportunity.

------
ashelmire
Very cool. Can you provide some details about how this tool architects models?

~~~
minimaxir
The best way to do that is IMO to look at the templates themselves:
[https://github.com/minimaxir/automl-gs/tree/master/automl_gs...](https://github.com/minimaxir/automl-gs/tree/master/automl_gs/templates/models/tensorflow)

tl;dr it uses the standard encoder-combiner-MLP-output architecture, but with
a lot of variability in the process.
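
Schematically, that pattern looks something like this in Keras (a sketch of
the general shape, not the generated code; layer sizes and field names are
arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Two example inputs: a numeric block and a single categorical field.
num_in = tf.keras.Input(shape=(8,), name="numeric")
cat_in = tf.keras.Input(shape=(1,), name="category", dtype="int32")

# Encoders: one branch per field type.
num_enc = layers.Dense(16, activation="relu")(num_in)
cat_enc = layers.Flatten()(layers.Embedding(input_dim=50, output_dim=8)(cat_in))

# Combiner: merge the encoded fields into one representation.
combined = layers.Concatenate()([num_enc, cat_enc])

# MLP and output head (binary target here).
hidden = layers.Dense(32, activation="relu")(combined)
output = layers.Dense(1, activation="sigmoid", name="target")(hidden)

model = tf.keras.Model(inputs=[num_in, cat_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```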

------
aymeric
So, how do you survive a Titanic accident?

~~~
steve19
You probably want "old-fashioned" regression models if you want to understand
which predictors affected survival and how they affected it.
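
For example, a plain logistic regression with statsmodels gives signed,
interpretable coefficients (using seaborn's bundled titanic dataset for
convenience):

```python
import seaborn as sns
import statsmodels.formula.api as smf

titanic = sns.load_dataset("titanic").dropna(subset=["age"])

# Coefficients are log-odds: the sign and magnitude of each one show
# how that predictor shifted the odds of survival.
model = smf.logit("survived ~ C(sex) + age + C(pclass)", data=titanic).fit()
print(model.summary())
```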

