
A Tour of the Top 10 Algorithms for Machine Learning Newbies
https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11
======
polkapolka
The image for logistic regression is hilariously wrong. It shows the sigmoid
as a decision boundary.

Also, don't get hung up on the no-free-lunch theorem. That is a great result in
computer science theory with little practical impact: just pick neural
networks for unstructured data and GBDTs for structured data. For the vast
majority of real-life problems (not all possible problems) these are the
single best algorithms.
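
As a minimal sketch of that default recipe for structured data (scikit-learn's
GBDT; the dataset and hyperparameters here are just placeholders, not a
recommendation):

```python
# Sketch: GBDT as a default for structured (tabular) data.
# Dataset and hyperparameters are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)
print("test accuracy:", gbdt.score(X_test, y_test))
```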

~~~
bunderbunder
Logistic regression isn't sexy, but it can still achieve near state-of-the-art
results, is reasonably resistant to bias^H^H^H^H variance, and generates
parameters that you can easily explain to someone with no background in math.

There's a lot of value in all that. Especially if your deliverable is
something that a business is going to use, and not just a Kaggle entry.
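
For instance, here's a rough sketch of the kind of explanation I mean
(toy dataset, illustrative only): exponentiating each coefficient gives an
odds ratio you can narrate to a non-technical audience.

```python
# Sketch: logistic regression coefficients as odds ratios.
# exp(beta) is the multiplicative change in the odds per unit
# increase in a feature, holding the others fixed.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
model = LogisticRegression(max_iter=5000).fit(data.data, data.target)

for name, beta in zip(data.feature_names, model.coef_[0]):
    print(f"{name}: odds ratio per unit increase = {np.exp(beta):.3f}")
```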

~~~
claytonjy
> generates parameters that you can easily explain to someone with no
> background in math

I know it _seems_ that way, but there's a surprising amount of nuance there
and I think we're both fooling and limiting ourselves by letting this idea
fester.

For one, unlike linear regression, logistic regression estimates aren't
collapsible, so you can NOT interpret them as "changing this input by X
changes the output by Y". That's only true if your set of covariates is
perfect, which is never true, though in practice this interpretation might not
be _that_ far off.

Another issue I see is practitioners not being aware of the difference between
scaled and unscaled estimates; I've seen real papers from AI groups use
logistic regression estimates as feature importance rankings while using
estimates on the scale of the original features, and not understand the
distinction when confronted about it.
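
Here's a quick sketch of that pitfall (scikit-learn, toy data): the same
model ranked by raw-scale versus standardized coefficients can tell two very
different stories.

```python
# Sketch of the scaled-vs-unscaled pitfall: coefficients on raw
# features depend on the features' units, so they are not a
# feature-importance ranking until the inputs are standardized.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

raw = LogisticRegression(max_iter=5000).fit(X, y)
std = LogisticRegression(max_iter=5000).fit(StandardScaler().fit_transform(X), y)

# The two coefficient vectors can rank features very differently.
print("raw-scale coefs:   ", raw.coef_[0][:5])
print("standardized coefs:", std.coef_[0][:5])
```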

From a practical standpoint, I think practitioners are much better served using
random forests as their initial exploratory models. Less effort, for results
that are in practice at least as good as a well-prepped logit. There are plenty
of issues with feature importance there, but none worse than with logistic
regression.
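
And a sketch of what I mean by a low-effort exploratory baseline (toy data,
default-ish settings; not a recommendation for any particular problem):

```python
# Sketch: a random forest as a low-effort exploratory baseline.
# Works with unscaled features and nonlinearities out of the box.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```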

~~~
bunderbunder
I don't think that's such a big deal in practice. See
[http://jakewestfall.org/blog/index.php/2018/03/12/logistic-r...](http://jakewestfall.org/blog/index.php/2018/03/12/logistic-regression-is-not-fucked/), for example.

tl;dr: Non-collapsibility means that I can't use LR coefficients for things
that I don't really need to use them for, anyway. That doesn't feel like a
crippling limitation to me.

(Well, also, I have to occasionally pause to cross my fingers and say, "
_ceteris paribus_ ," under my breath, which does admittedly make people think
I'm some sort of weird Harry Potter nut. Which is OK. They're not wrong,
they're just right for the wrong reason.)

Nor does it render its coefficients less interpretable than those of most
other models. "Less interpretable than OLS" can still be pretty darn
interpretable.

~~~
claytonjy
I had exactly that post in mind; it really raised my awareness of these
issues.

I agree with Jake's conditional interpretation of the estimates, but the
practical issue is that virtually nobody who isn't well-educated in statistics
will do that correctly. In particular, people tend to do exactly what Jake
concedes rarely makes any sense, which is comparing estimates across different
model specifications.

You and I might interpret these betas just fine, but if we show them to a less
stats-y audience, will they?

~~~
bunderbunder
I guess it depends. I have the luxury of working in a very "this is machine
learning, which is _not_ to be confused with statistical inference" problem
domain. It doesn't even really make sense to interpret most of the models
I build as describing any sort of causal relationship, and when people are
looking at the parameter estimates, they're really just trying to figure out,
"What does _this model_ think is important?"

~~~
claytonjy
That sounds nice!

Feature ranking seems like a clearly safe interpretation of betas, though I've
been bitten too often by handing glm (in R) unscaled predictors and getting
back estimates on the original scales (and thus incomparable), and I've seen
it happen to others even more. Easy to miss when your original scales aren't
all that different.

------
PLenz
The algo is almost beside the point. The real work is in getting the data,
cleaning it, and then explaining WTF is going on to lay people. In most DS
jobs, how you get there is irrelevant.

------
tw1010
If you can't explain something in a different way from how you learned it, you
probably don't understand it adequately.

------
Wiretrip
Nice to see a roundup of ML that doesn't just go straight to Deep Learning for
a change :-)

~~~
ur-whale
Or at all, for that matter.

~~~
Wiretrip
Indeed!

------
echelon
I'm a machine learning newbie, and I'd really like to speak with someone in
the field to figure out what I need to learn.

The problem I have is that I have a large labeled data set of spoken words and
phonemes from a single speaker. I'd like to train a model and generate new
phonemes (of various pitches, speeds, and intonations) with which to build a
concatenative speech engine.

What algorithms and models would I be looking to use? What are the primary
techniques? Is this something I could quickly begin to see results in, or
would it take months or years of tweaking?

To clarify what I'm doing, I own/built trumped.com, and I'm trying to improve
the speech synthesis quality by generating better fitting units of speech.

~~~
PeterisP
If you've got a model that can generate new phonemes (of various pitches,
speeds, and intonations), then you have a parametric speech synthesis engine
and can use it directly as-is instead of strapping a concatenative engine on
top.

For the techniques, WaveNet and Tacotron (e.g.
[https://google.github.io/tacotron/publications/tacotron2/](https://google.github.io/tacotron/publications/tacotron2/))
seem to be the state of the art, but they are reportedly hard to replicate.

------
didibus
I've been wondering: is nearest neighbour really ML? No part of the logic is
learned from the data. It feels like a glorified lookup table, where the
lookup is fuzzy according to some predefined definition of nearest.

~~~
5minbreak
k-nearest neighbors classification is one of the first non-linear supervised
learning algorithms. Its predictions are derived from distances between data
samples.

It _is_ basically a glorified fuzzy lookup table, but then again, you can view
deep learning the same way (a fuzzy, hierarchical, localized lookup).
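
To make the "fuzzy lookup table" concrete, here's a minimal sketch (plain
NumPy, toy data): "training" is just storing the table, and prediction is a
lookup by distance into it.

```python
# Sketch: kNN "training" stores the table; prediction is a fuzzy
# lookup by distance into that table.
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distances from the query to every stored row.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                # k closest rows
    return np.bincount(y_train[nearest]).argmax()  # majority label

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 1.1]])
y_train = np.array([0, 1, 0, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # -> 0
```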

Pure memorization can even outperform logistic regression, especially with big
data sets, so there is some recent debate as to what degree models memorize
and to what degree they generalize.

~~~
didibus
Interesting, though I don't totally see how deep learning would be similar. In
deep learning, my understanding is that the weights are learned from the data.
These are effectively constants, and they represent logical rules.

So in essence, the rules which relates input to output are learned from the
data in deep learning.

In nearest neighbour, the rule wasn't learned; we figured out the rule
ourselves: "use the nearest data point's result".

But in deep learning, the rule might be something like: when features x and y
are within some range z of each other, and so on. And this rule is not defined
by us, but by the weights, which are learned from the data.

Effectively, deep learning thus learns the rules that define the relationship
between input and class. But nearest neighbor is just a static rule that
happens to be pretty general, so it gives okay results for a lot of
problems.

Not an AI expert, so take all this as my simple current understanding.

~~~
ipsa
You could automatically encode a KNN model as a set of logical if-then rules:
"if x1 > 10 and x2 < 3, then the 4 nearest labels are [1, 1, 1, 0]", so the
information is there. For KNN you could also train weights for every variable
(how much should each count in the distance calculation?). For deep learning
you have way more parameters and architecture choices than for nearest
neighbors (where the choices are mostly the distance metric and the number of
neighbors to consider). After that, both learn a mapping from input data to a
target.
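
A tiny sketch of that weighting idea (the weights here are set by hand purely
for illustration; in practice you'd tune them, e.g. by cross-validation):

```python
# Sketch: per-feature weights in the distance metric are the (few)
# learnable parameters of a kNN model. Weights are hand-set here
# just to show where they enter.
import numpy as np

def weighted_knn_predict(X_train, y_train, x, w, k=3):
    # Weighted Euclidean distance: features with larger w count more.
    dists = np.sqrt(((X_train - x) ** 2 * w).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

X_train = np.array([[0.0, 5.0], [1.0, 5.0], [0.0, 4.0], [1.0, 6.0]])
y_train = np.array([0, 1, 0, 1])
w = np.array([10.0, 0.1])  # first feature dominates the lookup
print(weighted_knn_predict(X_train, y_train, np.array([0.2, 5.5]), w))  # -> 0
```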

------
Balgair
> The technique (linear discriminant analysis) assumes that the data has a
> Gaussian distribution (bell curve), so it is a good idea to remove outliers
> from your data beforehand.

Um, this seems very fishy to do, right? What am I missing here?

~~~
claytonjy
Depends on your goal.

Outlier removal should be done much more carefully when the goal is inference,
i.e. trying to test whether your hypothesis is true. There, outlier removal is
a super easy way to accidentally p-hack. Current best practice is to
pre-register your analysis, including how you'll define and handle outliers.

For predictive goals, where the idea is to predict the class/value of unseen
data, outlier removal is often a good way to keep your bias in check and avoid
skewing your model towards the outliers. The trade-off is that future outliers
will be predicted as though they weren't outliers, i.e. much closer to the
mean than they should be. This is what the article is trying to do.
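
As a rough sketch of the predictive case (a crude z-score filter; the
threshold of 3 is an arbitrary convention, and it assumes roughly Gaussian
columns, which is exactly what LDA wants anyway):

```python
# Sketch: crude z-score outlier filter before fitting a
# Gaussian-assuming model like LDA. Threshold is arbitrary.
import numpy as np

def drop_outliers(X, y, z_thresh=3.0):
    # Keep rows where every feature is within z_thresh standard
    # deviations of its column mean.
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    keep = (z < z_thresh).all(axis=1)
    return X[keep], y[keep]
```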

There's also a whole wide world of outlier & anomaly detection, where you want
to say e.g. "this new data point is probably an outlier".

------
ur-whale
No mention of Deep Learning in a top ML algorithm list in 2018? Kinda odd if
you ask me.

Also, in the SVM section, no mention of kernel methods (yet the picture shows
a winding boundary)? Also odd.
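
For context, that kind of winding boundary is exactly what a kernelized SVM
produces; a minimal sketch with scikit-learn (toy data, arbitrary
hyperparameters):

```python
# Sketch: an RBF-kernel SVM, the kind of model that produces a
# winding nonlinear decision boundary like the article's SVM picture.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print("train accuracy:", clf.score(X, y))
```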

~~~
joe_the_user
The first seven algorithms could be defended as "elementary" methods that
would help someone work up to neural nets and deep learning. But once he
starts talking about SVM, I think he's talking about a method as sophisticated
as neural nets and deep learning.

Neural nets and SVM were competitors - comparably applicable and comparably
difficult - in the aughts. Deep learning has since pulled away, but not
through the discovery of fundamentally more complicated methods. Rather, the
process has involved _lots and lots and lots of little refinements_, through
throwing lots of people, advanced-math intuitions, and computing power at it,
etc. Learning everything needed to create state-of-the-art results is hard (as
far as I can tell/guess), but the basics are reasonably simple.

