
On moving from statistics to machine learning, the final stage of grief (2019) - yoloswagins
https://ryxcommar.com/2019/07/14/on-moving-from-statistics-to-machine-learning-the-final-stage-of-grief/
======
laichzeit0
There's this undertone of "I should be paid as much as/more than Data Science
people because I'm better than them at statistics and data science = machine
learning = statistics".

My experience doing data science at small companies that can't afford to hire
more than 1 person for the role is that it is so much more than just building
models or doing statistics.

You have to:

1\. Build APIs and work with developers to get predictive models integrated
into the rest of the software stack

2\. Know how to handle logging, auditing, monitoring, containerization, web
scraping, data cleaning(!!), SQL scripts, dashboards, BI tools, etc.

3\. Do some basic descriptive stats, some basic inferential stats, some
predictive modeling, work on time-series data, sometimes apply survival
analysis, etc. (Python/R/Excel who cares)

4\. Set up data pipelines and CI/CD to automate all this crap

5\. Unpack vague high-level requirements along the lines of "Hey, do you
think we could use our data to build an 'AI' to do this instead of doing it
manually?" and then come up with a combination of software / statistical
models that performs at least as well as humans at the task, or better.

6\. Work with non-technical business users and be able to translate this back
to technical requirements.

Hey, if all you do all day is "build models" then that sounds like a very
cushy DS job you have. It's definitely not been my experience. I would
describe it more as a combination of software engineer, statistician, and
business analyst. That's why it pays more than just statistics. But this is
just my experience.

~~~
TrackerFF
And in all honesty, no data scientist with an ounce of self-worth should work
long-term for such companies, unless it just happens to be their own.

You're basically doing 3 jobs for the price of one: Software Engineer, Data
Engineer, Data Scientist.

Sure, you'll be a jack of all trades, as far as data goes, but it'll be at the
cost of some specialization.

I'm probably gonna get a lot of sh!t for this post - probably from data [x]
people that are in that exact position themselves, but the above description
is exactly why I'd aim for larger companies with somewhat established
analytics / data / ML teams or offices. You get to focus on the important
stuff, instead of juggling ten balls at the same time.

(And it's not only in the field of data science. Some of the traditional SE
positions I see at startups or small companies look absolutely grotesque -
basically the whole IT and Dev. department baked into one job)

~~~
itsoktocry
> _You get to focus on the important stuff, instead of juggling ten balls at
> the same time._

So on one hand, you can't build any models without the work of the engineers,
but on the other the model building is "the important stuff"?

Maybe it's just me, but I _enjoy_ working on all aspects of the data pipeline.

~~~
amznthrowaway5
Model building is often the trivial part, and often you can't build models
without a solid understanding of things like the data pipeline.

------
natalyarostova
> Machine learning is genuinely over-hyped. It’s very often statistics
> masquerading as something more grandiose. Its most ardent supporters are
> incredibly annoying, easily hated techbros

This sort of fashionable disparagement of a group of people to signal that
you’re not part of the “bad group of tech bros” is so trashy. Why are these
random people so easy for you to hate? Who are they? Why take glee in shared
hatred?

I’ve worked as a sr DS at FAANG for 4 years. I’ve recently worked through
Casella & Berger, because I wasn’t comfortable being one of those DS who didn’t
know math stats. But before I did work through it, I worked with people from
PhD stat programs who were so ineffective. Despite knowing so much more stats
than me, they would freeze up and fail every time they had to deal with any
sort of software system or IDE. It was so weird to me that my ability to use a
regression, even before I knew the theory, was more valuable than their
ability to use a regression to its full power, simply because I could fight
the intense battle to take that idea and put it into reliable production code.

But generally I hate _hate_ this war between DS and stats. It’s so stupid.
Maybe not their first year, but eventually any DS who wants to be a master of
their craft ought to learn math/theoretical stats. And some don’t want to be a
master of their craft, and instead want to go into management or whatever, and
that’s fine.

------
CrazyStat
> I’m sure you’re asking: “why allow your parameters to be biased?” Good
> question. The most straightforward answer is that there is a bias-variance
> trade-off. The Wikipedia article does a good job both illustrating and
> explaining it. For β-hat purposes, the notion of allowing any bias is crazy.
> For y-hat purposes, adding a little bias in exchange for a huge reduction in
> variance can improve the predictive power of your model.

I'm going to push back on this.

The author seems to understand the bias-variance tradeoff as applying
primarily to y-hat, and allows that if you are primarily interested in y-hat
then it can make sense to make that tradeoff (introduce bias in exchange for
lower variance). But the bias-variance tradeoff is more general than that.
There's also a bias-variance tradeoff in beta-hat, and you can make a similar
decision there to introduce some bias in beta-hat in exchange for lower
variance, lowering the overall mean square error.

There's nothing crazy about this. The entire field[1] of Bayesian statistics
does this every day--Bayesian priors introduce bias in the parameters, with
the benefit of decreasing variance. Bayesians use these biased parameter
estimates without any problems.

Classical (non-Bayesian) statistics has tended to focus heavily on unbiased
models. I suspect this is largely because restricting the class of models
you're looking at to unbiased models allows you to prove a lot of interesting
results. For example, if you restrict yourself to linear unbiased models, you
can identify one single `best` (i.e. lowest variance) estimator. As soon as
you allow bias you can't do that anymore.

[1] Except empirical Bayes, which is a dark art.
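To make that concrete, here is a minimal simulation sketch (assuming numpy;
the penalty lam=5.0 and the noise level are arbitrary illustrative choices)
of the bias-variance tradeoff in beta-hat itself:

```python
import numpy as np

# Ridge regression (equivalently, a Gaussian prior on beta) shrinks the
# estimate toward zero -- biased, but with enough variance reduction to
# lower the total MSE of beta-hat.
rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0, 0.5])

def mse_beta_hat(lam, trials=20_000, n=30, noise_sd=3.0):
    errors = []
    for _ in range(trials):
        X = rng.normal(size=(n, beta.size))
        y = X @ beta + rng.normal(scale=noise_sd, size=n)
        b_hat = np.linalg.solve(X.T @ X + lam * np.eye(beta.size), X.T @ y)
        errors.append(np.sum((b_hat - beta) ** 2))
    return np.mean(errors)

print(mse_beta_hat(lam=0.0))  # OLS: unbiased, but higher MSE
print(mse_beta_hat(lam=5.0))  # ridge: biased toward zero, lower MSE
```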

~~~
astrophysician
“Non-Bayesian” stats uses bias in the exact same way that Bayesian stats does,
because Bayesian stats and frequentist stats are mathematically equivalent.
When someone thinks they’re not using priors, they’re wrong: they’re usually
using a flat prior on the model parameters, and that adds bias just like any
other prior! A flat prior on theta is different from a flat prior on
log(theta) or some other parametrization, and flat priors are often the
wrong choice, so this notion that Bayesian stats is some “special” type of
inference and that there is some way to do inference without bias is just a
very large misconception.

~~~
CrazyStat
> “Non-Bayesian” stats uses bias in the exact same way that Bayesian stats
> does

This is incorrect.

Bias has a very specific mathematical meaning in statistics--the
difference between the expected value of the estimate (under the sampling
distribution) and the true value. There are many examples of parameter
estimates in classical statistics that have zero bias under that definition.

> Bayesian stats and frequentist stats are mathematically equivalent.

Also incorrect. Bayesian and frequentist methods focus on different
conditional probabilities and can give very divergent results even in simple
cases. See e.g. Lindley's paradox [1].

[1]
[https://en.wikipedia.org/wiki/Lindley%27s%20paradox](https://en.wikipedia.org/wiki/Lindley%27s%20paradox)
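A minimal sketch of that divergence (assuming numpy/scipy), using the
birth-ratio numbers from the linked Wikipedia article:

```python
import numpy as np
from scipy import stats

# Wikipedia's Lindley's-paradox numbers: 49,581 boys in 98,451 births.
n, x = 98451, 49581

# Frequentist side: two-sided test of H0: p = 0.5 (normal approximation).
z = (x - n * 0.5) / np.sqrt(n * 0.25)
p_value = 2 * stats.norm.sf(abs(z))   # ~0.023 -> "reject H0 at the 5% level"

# Bayesian side: P(H0) = P(H1) = 1/2, and p ~ Uniform(0, 1) under H1.
m0 = stats.binom.pmf(x, n, 0.5)       # marginal likelihood under H0
m1 = 1.0 / (n + 1)                    # Binomial integrated over the uniform prior
posterior_h0 = m0 / (m0 + m1)         # ~0.95 -> H0 strongly favored

print(p_value, posterior_h0)
```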

~~~
astrophysician
I appreciate your comments.

> Bias has a very specific mathematical meaning in statistics--the
> difference between the expected value of the estimate (under the sampling
> distribution) and the true value.

Right, I'm aware of what bias is, and how it's defined, and while the
statement you make is true, it misses the point: regardless of the camp you're
in (frequentist or Bayesian), inference involves a prior, and that prior will
affect your inferred parameter estimates (it may bias them, it may not, but
flat priors do not guarantee unbiased estimates). I agree, under various
contrived scenarios, you can show your parameter estimate is unbiased when
using a flat prior (yay!) but what happens if you're using the wrong
parameterization of your model? A flat prior on \theta is not a flat prior on
\log\theta or any other transformation of theta, but if you're a frequentist
what do you do about that? If you're not conscious of the prior choice you are
making, you can easily introduce bias you don't want _even with a flat prior_.

> There are many examples of parameter estimates in classical statistics that
> have zero bias under that definition.

Elaborate. I assume by "parameter estimates" you mean MAP (or MLE = MAP with
flat prior). E[\hat{\theta} - \theta] may be zero with a flat prior, but
E[\hat{\log\theta} - \log\theta] won't be, so the bias in your parameter
estimate depends on the parameterization you really care about.

See e.g. [1] for an example where flat priors do actually bias inferences.
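A quick simulation sketch of that Jensen-style bias (assuming numpy;
exponential data, where x-bar is the MLE of the scale theta):

```python
import numpy as np

# x-bar is unbiased for theta, but log(x-bar) is biased for log(theta):
# Jensen's inequality, since log is concave. Bias ~ -1/(2n) here.
rng = np.random.default_rng(1)
theta, n = 2.0, 10
xbar = rng.exponential(scale=theta, size=(200_000, n)).mean(axis=1)
print(xbar.mean() - theta)                   # ~0: unbiased on the theta scale
print(np.log(xbar).mean() - np.log(theta))   # ~-0.05: biased on the log scale
```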

> Also incorrect. Bayesian and frequentist methods focus on different
> conditional probabilities and can give very divergent results even in simple
> cases. See e.g. Lindley's paradox [1].

OK, so let me clarify, because I agree that my wording was wrong: while I
agree with you that there are differences between Bayesian and frequentist
statistics, they are philosophical; they answer different questions.
Ironically, the Wikipedia article you linked to actually says it pretty well:
"Although referred to as a paradox, the differing results from the Bayesian
and frequentist approaches can be explained as using them to answer
fundamentally different questions, rather than actual disagreement between the
two methods."

[1]: [https://mc-stan.org/users/documentation/case-
studies/weakly_...](https://mc-stan.org/users/documentation/case-
studies/weakly_informative_shapes.html)

"Although flat priors are often motivated as being “non-informative”, they are
actually quite informative and pull the posterior towards extreme values that
can bias our inferences."

~~~
CrazyStat
> I agree, under various contrived scenarios, you can show your parameter
> estimate is unbiased when using a flat prior (yay!) but what happens if
> you're using the wrong parameterization of your model?

I wouldn't consider estimating, say, the mean length of a population of fish
contrived (unbiased estimate: x-bar). Nor would I consider estimating the
probability of an event based on observations of the event happening or not
happening contrived (unbiased estimate: p-hat = #successes/#trials).

These kinds of simple estimation problems and the associated statistical tests
account for probably 90% of statistical practice. Dismissing them as contrived
is silly.

More generally, MLE estimates are always (under regularity conditions)
asymptotically unbiased even if not unbiased for a finite sample. This means
that the amount of bias decreases to zero as the sample size increases, no
matter what the parameterization is.
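A small numerical check of that claim (assuming numpy), using the classic
finite-sample-biased MLE of a Normal variance:

```python
import numpy as np

# The MLE of a Normal variance, (1/n) * sum((x - xbar)^2), has bias
# -sigma^2/n: nonzero for any finite n, but vanishing as n grows.
rng = np.random.default_rng(0)
for n in (5, 50, 500):
    x = rng.normal(loc=0.0, scale=1.0, size=(20_000, n))
    sigma2_mle = x.var(axis=1)            # ddof=0 is the MLE
    print(n, sigma2_mle.mean() - 1.0)     # bias ~ -1/n
```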

Finally, there is very often a natural parameterization for any given problem.
If you're interested in the arithmetic mean of a population, there's no reason
to use a log-scale parameterization. Why worry about bias in other
parameterizations when you can just use the natural parameterization, where
the estimator is unbiased? Again, I don't think such scenarios are contrived:
a very large proportion of statistical analyses deal with simple measurements
in Euclidean (or very nearly Euclidean--we can typically ignore, for example,
relativistic effects) spaces: real-world dimensions, time, etc. If you're a
Bayesian and very concerned about parameterization effects you can also use a
Jeffreys prior, which is parameterization-invariant. Notably, for the mean of
a Normal distribution, the Jeffreys prior is... the flat prior!

> OK so let me clarify because I agree that my wording is wrong: while I agree
> with you that there are differences between Bayesian and frequentist
> statistics, they are philosophical; they answer different questions:

Yes and no. The Bayesian and frequentist approaches answer different
mathematical questions, but they are used by humans to answer the same human
questions, such as "do these two populations have the same mean?"

> see e.g. [1] for an example where flat priors do actually bias inferences.

I don't consider that a good example of flat priors biasing the inference. The
posterior with flat prior is diffuse because the data doesn't provide much
information; in the lack of much prior or likelihood information, the
posterior _should_ be diffuse, so that's a perfectly reasonable result. If you
can't stand a diffuse posterior then either collect more data or (carefully!)
introduce a more informative prior. The more informative the prior you
introduce, the more biased your inference will be; that's fine as long as your
prior is chosen carefully.

This is not to say that flat priors are never a problem. The setting I'm
familiar with where diffuse priors are most dangerous is when doing model
comparison--but that issue is specific to Bayesian methods, not frequentist.
Indeed this is one of the primary reasons Lindley's paradox arises: the
Bayesian model comparison (using marginal likelihoods or Bayes factors) gets
tricked by the diffuse prior, while the frequentist model comparison (using
null hypothesis testing) does not.

~~~
astrophysician
> I wouldn't consider estimating, say, the mean length of a population of fish
> contrived (unbiased estimate: x-bar). Nor would I consider estimating the
> probability of an event based on observations of the event happening or not
> happening contrived (unbiased estimate: p-hat = #successes/#trials).

Sure, maybe not contrived; my point is that flat priors may work in many
"typical" textbook stats problems, but they are one of many choices, and that
choice is important to be explicit about and not sweep under the rug. Because
if your entire life is measuring sample means, fine, you're never going to
need to think about this very much and life will be nice. But when one fine
day you decide to do something more complex, these are the land mines that you
shouldn't really ignore.

> These kinds of simple estimation problems and the associated statistical
> tests account for probably 90% of statistical practice. Dismissing them as
> contrived is silly.

Whether it's 90% depends entirely on the types of problems you work on. I
don't mean to dismiss them; you're right that for many problems MLE is just
fine. I meant to illustrate that "unbiased" comes with many caveats, and that
in many real scenarios flat priors are not OK.

> More generally, MLE estimates are always (under regularity conditions)
> asymptotically unbiased even if not unbiased for a finite sample. This means
> that the amount of bias decreases to zero as the sample size increases, no
> matter what the parameterization is.

Is this not true of the MAP for most priors? Gaussian/Laplace priors will have
this property too, since priors become asymptotically less important the more
data you have. If your prior is zero over some of the support, you're out of
luck but this doesn't strike me as a good argument for MLE > MAP or for using
flat priors everywhere. When we have infinite data, sure, priors are
irrelevant, but we live in the real world where data is not infinite.

> Finally, there is very often a natural parameterization for any given
> problem. If you're interested in the arithmetic mean of a population,
> there's no reason to use a log-scale parameterization.

Sure, agree that parametrization isn't a problem a lot of the time, but it
_is_ something important to be mindful of, and this points towards, again, not
forgetting that you are always using a prior and that you should think about
whether or not that prior makes sense.

> Why worry about bias in other parameterizations when you can just use the
> natural parameterization, where the estimator is unbiased? Again, I don't
> think such scenarios are contrived: a very large proportion of statistical
> analyses deal with simple measurements in Euclidian (or very nearly
> Euclidian--we can typically ignore, for example, relativistic effects)
> spaces: real world dimensions, time, etc.

Yea, I mean sure: for easy problems, parametrization is obvious. That's kind
of tautological. But sometimes it's not obvious, or sometimes for
computational reasons you need to work with a log(theta) instead of theta,
etc. If you're a frequentist and you're thinking life is great because you
don't need to worry about priors, you're wrong and sooner or later you will
get into trouble; be it a parametrization issue or something else, priors are
not just something you can completely ignore. It's like saying "I always drive
without looking in my rearview mirror" \-- ok, great, you will be fine a lot
of the time, but eventually one day you will change lanes on the highway at
the exact wrong time, and you will really regret your habit of not looking in
your mirror.

> If you're a Bayesian and very concerned about parameterization effects you
> can also use a Jeffreys prior, which is parameterization-invariant.
> Notably, for the mean of a Normal distribution, the Jeffreys prior is...
> the flat prior!

Yep, totally agree, I have no problem with Jeffreys priors (when they make
sense), and that's all well and good. Just to clarify: I am _not_ saying
"don't use flat priors" \-- flat priors are extremely reasonable and a good
idea in many cases. My point is that flat priors are still priors, and you are
still making a statement by using them: "let's assume all possible values of
theta are equally likely a priori". Sometimes we don't really _believe_ that,
but it's useful to see the implications of making this assumption. But
sometimes priors are extremely important (e.g. we want a time-dependent
measurement of a Poisson rate, like conversions per dollar of ad spend, and
conversions are relatively rare: priors are your friend here, e.g. a GP prior
= Cox process or something else, even if this prior is an operational
assumption)

> Yes and no. The Bayesian and frequentist approaches answer different
> mathematical questions, but they are used by humans to answer the same human
> questions, such as "do these two populations have the same mean?"

Yes, agreed.

> Indeed this is one of the primary reasons Lindley's paradox arises: the
> Bayesian model comparison (using marginal likelihoods or Bayes factors) gets
> tricked by the diffuse prior, while the frequentist model comparison (using
> null hypothesis testing) does not.

Ah lord, but this is a terrible justification for using null hypothesis
rejection: we're almost always choosing a very simplistic distribution (e.g.
Gaussian) to do this, and reducing the question to "we reject H0 because it's
very unlikely" is part of the reason why there's a replication crisis in e.g.
the social sciences: researchers are taught this simplistic picture without
any of the necessary nuance ('here are the assumptions we make, and under
these assumptions + H0, it is a little bit unlikely that we would have
observed x'). That's a recipe for disaster. Is it not much better to discuss
the full posterior, "degrees of belief", and to be explicit about all of our
uncomfortable prior assumptions? I prefer Bayesian model selection over null
hypothesis rejection 100% of the time, especially because "Bayesian model
selection" is the only logical way to do model selection; the only caveat is
that it depends on reasonable prior assumptions, and these are the hard part
(but again, at least it is explicit!).

Also, Lindley's "paradox" example certainly seems contrived: we believe
there's a 50% chance that p = 0.5 _exactly_?? I just don't understand that
type of analysis. Come up with a prior, derive your posterior, and decide the
answer to your question yourself (what is the chance that p=0.5 exactly? Well,
it's exactly 0%. How much more likely is it that p=0.5036 vs p=0.5? That's a
better question...). By contrived, I mean that it appears designed to exploit
the fact that Bayesian stats will automatically prefer simpler models,
especially one with 0 degrees of freedom that is relatively close to the right
answer, but that's a Good Thing (TM).

Both frequentist stats and Bayesian stats are easy to abuse: Bayesian stats
gives a false sense of comfort because people don't worry enough about their
choice of prior, _but at least Bayesian stats is explicit about the prior!_. I
won't say that hypothesis testing is complete garbage, but it is quite
dangerous and frankly dishonest to reduce things to a p value and pretend
that's the end of the discussion.

~~~
CrazyStat
> But when one fine day you decide to do something more complex, these are the
> land mines that you shouldn't really ignore.

> in many real scenarios flat priors are not ok.

> eventually one day you will change lanes on the highway at the exact wrong
> time, and you will really regret your habit of not looking in your mirror.

Can you give some examples where frequentists hit these alleged flat-prior
landmines? I am admittedly a Bayesian by training, not a frequentist, so
perhaps it's just my ignorance showing, but I'm not aware of any such
situations.

Frequentist statistics generally relies on performance guarantees (bounds on
the false positive error rate for tests, in particular, and coverage for
confidence intervals) which are derived under the lack-of-explicit-prior, so
as far as I can tell they should be doing fine. I'd be interested in seeing
examples where frequentist analyses fail because of the (implicit) flat prior.

> we're almost always choosing a very simplistic distribution (e.g. Gaussian)
> to do this

The Gaussian distribution is a marvelous thing. The central limit theorem is,
in my humble opinion, one of the most beautiful and surprising results in
mathematics.

> Is it not much better to discuss the full posterior, "degrees of belief" and
> to be explicit about all of our uncomfortable prior assumptions?

Perhaps I'm just cynical, but I'd say probably not. A Bayesian decision
process is still a decision process and still subject to all the problems that
the frequentist decision process (null hypothesis significance testing) is
subject to: inflated family-wise error rates, p-hacking (except with Bayes
factors rather than p-values), publication bias, and so on. At best getting
everyone to do Bayesian analyses might be roughly equivalent to getting
everyone to use a lower default significance threshold, like 0.005 instead of
0.05 (which prominent statisticians have advocated for).

> I prefer Bayesian model selection over null hypothesis rejection 100% of the
> time, especially because "Bayesian model selection" is the only logical way
> to do model selection, the only caveat is that it depends on reasonable
> prior assumptions and these are the hard part (but again, at least it is
> explicit!).

Sadly there's a trap in Bayesian model selection (often called Bartlett's
paradox, though it's essentially the same thing as Lindley's paradox) which
can be difficult to spot. No names out of respect to the victim, but several
years ago I saw a very experienced Bayesian statistician who has _published
papers about Lindley's paradox_ fall prey to this. Explicit priors didn't
help him at all. He would not have fallen into it if he had used a frequentist
model selection method, though there are other problems with that.

> Also, the Lindley's "paradox" example certainly seems contrived:

And here we are again calling a statistical test that thousands of people do
every day "contrived." You already know how I feel about that.

Yes, it's a very simple example, because that helps illustrate what's
happening. Lindley's paradox can happen in arbitrarily complex models, any
time you're doing model selection.

> By contrived, I mean that it appears designed to exploit the fact that
> Bayesian stats will automatically prefer simpler models, especially one with
> 0 degrees of freedom that is relatively close to the right answer, but
> that's a Good Thing (TM).

Preferring simpler models is not exactly what's going on in Lindley's paradox,
at least not the way that most people talk about Bayes factors preferring
simpler models (e.g. by reference to the k*ln(n) term in the Bayesian
Information Criterion). The BIC is based on an asymptotic equivalence and
drops a constant term. That constant term is actually what is primarily
responsible for Lindley's paradox, and has only an indirect relationship to
the complexity of the model.
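For reference, a sketch of the Laplace expansion behind this point (standard
regularity conditions assumed; \pi is the prior density, \hat{I} the average
observed information, k the number of parameters):

```latex
\log m(y) \;\approx\;
\underbrace{\log p(y \mid \hat\theta) - \tfrac{k}{2}\log n}_{\text{kept by BIC}}
\;+\;
\underbrace{\tfrac{k}{2}\log 2\pi - \tfrac{1}{2}\log\lvert \hat I \rvert + \log \pi(\hat\theta)}_{\text{the dropped } O(1) \text{ constant}}
```

The \log \pi(\hat\theta) piece is the one to watch: as the prior on the
bigger model is made more diffuse, it tends to -\infty, pushing the Bayes
factor toward the smaller model no matter what the data say--which is the
diffuse-prior trap described above.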

~~~
astrophysician
Hi, sorry for the delayed response!

> Can you give some examples where frequentists hit these alleged flat-prior
> landmines? I am admittedly a Bayesian by training, not a frequentist, so
> perhaps it's just my ignorance showing, but I'm not aware of any such
> situations.

You're probably right: I myself am also a Bayesian by training, as you can
probably guess, but I went through the usual statistics education from a
frequentist standpoint, and once I learned Bayesian statistics it was almost
an epiphany, much more intuitive and understandable than the frequentist
interpretation (but that's just me). In all honesty, I think good frequentist
statisticians and good Bayesian statisticians have nothing to worry about,
since both should know exactly what they are doing and saying, as well as the
limitations of their analysis.

I wouldn't put myself in either the "good frequentist" or "good Bayesian"
categories, by the way, I am just an imperfect practitioner, but I think
that's the case for most people. My argument against frequentist statistics
for the masses is a practical one: I found myself getting into much more
trouble and having much less insight into what I was doing when I had a
frequentist background than I did when doing things from a Bayesian
standpoint, and I see many imperfect frequentist statisticians like myself
running into the same problems I used to (mostly ignoring priors when they
shouldn't or thinking a flat prior is always uninformative, etc.), but I admit
that is a wholly subjective experience. I never once thought about priors
before learning Bayesian stats, and I find many people I meet with a
frequentist background also forget the significance of priors because they
also are imperfect practitioners.

> Frequentist statistics generally relies on performance guarantees (bounds on
> the false positive error rate for tests, in particular, and coverage for
> confidence intervals) which are derived under the lack-of-explicit-prior, so
> as far as I can tell they should be doing fine. I'd be interested in seeing
> examples where frequentist analyses fail because of the (implicit) flat
> prior.

Yea, I totally agree, I just find that statistics is important in many more
contexts than just this. While you can do this sort of thing from a Bayesian
perspective (using Jeffreys priors or whatever the situation calls for), in
my experience frequentists have a tough time departing from this type of
analysis once they start diving into areas where priors are important (unless
they are also familiar with Bayesian stats!)

> The Gaussian distribution is a marvelous thing. The central limit theorem
> is, in my humble opinion, one of the most beautiful and surprising results
> in mathematics.

Agree with you, but the CLT doesn't always help you. You may not always be
interested in the statistics of averages in the limit of many samples. I
agree that when you _are_ doing this, the CLT is a godsend.

> Perhaps I'm just cynical, but I'd say probably not. A Bayesian decision
> process is still a decision process and still subject to all the problems
> that the frequentist decision process (null hypothesis significance testing)
> is subject to: inflated family-wise error rates, p-hacking (except with
> Bayes factors rather than p-values), publication bias, and so on. At best
> getting everyone to do Bayesian analyses might be roughly equivalent to
> getting everyone to use a lower default significance threshold, like 0.005
> instead of 0.05 (which prominent statisticians have advocated for).

I disagree here. Discussing the full posterior forces you _not_ to reduce the
analysis to a simple number like a significance threshold, and to acknowledge
the fact that there are actually a wide range of possibilities for different
parameter values, and it's important to do this when your posterior isn't nice
and unimodal, etc. I don't disagree that sometimes (well, many times) the
significance threshold is all you really care about (e.g. "is this treatment
effective, yes or no"), but that's still a _subset_ of where statistics is
used in the wild. E.g. try doing cosmology with just frequentist statistics
(actually, do not do that, you may be physically attacked at conferences).

But again, I want to emphasize that doing Bayesian stats can _also_ give you a
false sense of confidence in your results. I don't mean to say Bayesians are
right and frequentists are wrong or anything, I just mean to say that
sometimes priors are important and sometimes they aren't, and I personally
find that I have an easier time understanding when to use different priors in
a Bayesian framework than in a frequentist one.

> Sadly there's a trap in Bayesian model selection (often called Bartlett's
> paradox, though it's essentially the same thing as Lindley's paradox) which
> can be difficult to spot. No names out of respect to the victim, but several
> years ago I saw a very experienced Bayesian statistician who has published
> papers about Lindley's paradox fall prey to this. Explicit priors didn't
> help him at all. He would not have fallen into it if he had used a
> frequentist model selection method, though there are other problems with
> that.

Like you say, there are problems with both approaches, but my point is that
when the prior is explicit, we can all argue about its effects on the result
or lack thereof. Explicit priors don't "help" you, but they force you to make
your assumptions explicit and part of the discussion. If you're only ever
using flat priors, it's easy to forget that they're there.

> And here we are again calling a statistical test that thousands of people do
> every day "contrived." You already know how I feel about that.

I don't mean to be flippant about it or dismissive, I mean exactly what I
said:

contrived: "deliberately created rather than arising naturally or
spontaneously."

Which test in Lindley's paradox are you referring to when you say thousands
of people use it every day? Just the null rejection? Or is there another part
of it you're referring to?

> Yes, it's a very simple example, because that helps illustrate what's
> happening. Lindley's paradox can happen in arbitrarily complex models, any
> time you're doing model selection.

My point isn't that it's simple; my point is that it's incredibly awkward and
unrealistic, and not representative of how a Bayesian statistician would
answer the question "is p=0.5?", which is a very strange question to begin
with. The "prior" here treats it as equally likely that p=0.5 exactly and
p != 0.5, which, if that's your true assumption, fine, but my point is that
this is a very bizarre and unrealistic assumption. Maybe it seems realistic
to a frequentist but not to me at all. If someone was doing this analysis, I
would _expect_ to get a weird answer to a weird question.

> Preferring simpler models is not exactly what's going on in Lindley's
> paradox,

Exactly! I'm not sure _what_ is going on in Lindley's paradox to be honest; I
don't understand the controversy here: the question poses a very strange prior
that seems designed to look perfectly reasonable to a frequentist but not to a
Bayesian. But I suppose this is an important point about the way priors can
fool you!

> at least not the way that most people talk about Bayes factors preferring
> simpler models (e.g. by reference to the k*ln(n) term in the Bayesian
> Information Criterion). The BIC is based on an asymptotic equivalence and
> drops a constant term.

I'm with you so far, and the BIC is a good asymptotic result, but I'm talking
about the full solution here (which is rarely _practical_), which doesn't
drop the constant term.

> That constant term is actually what is primarily responsible for Lindley's
> paradox, and has only an indirect relationship to the complexity of the
> model.

I mean I think we're splitting hairs here? Maybe? My point was that Bayesian
model selection won't make up for a strange prior, but given the right priors,
Bayesian model selection just makes sense to me. But again, this is the
important limitation of most Bayesian analyses: the prior can do strange
things, especially the one used in the Lindley's paradox example in the
Wikipedia page.

But honestly, if you think I'm missing some important part of Lindley's
paradox, please do elaborate. I had not heard of it before you mentioned it,
and I am still confused as to why it is considered something "deep" \-- but I
assume that just means I am missing something important.

------
noelwelsh
Just a note that you can interpret regularization as placing a prior on the
weights: L2 regularization corresponds to a Gaussian prior, and L1 to a
Laplace prior. I.e., this is doing Bayesian statistics rather than an
arbitrary hack to improve predictions.
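A minimal numerical check of that correspondence (a sketch assuming numpy and
scikit-learn; lam is an arbitrary illustrative value of sigma^2/tau^2):

```python
import numpy as np
from sklearn.linear_model import Ridge

# MAP with prior w ~ N(0, tau^2 I) and noise N(0, sigma^2) is ridge
# regression with alpha = sigma^2 / tau^2: same closed-form solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)

lam = 2.0  # illustrative sigma^2 / tau^2
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(np.allclose(w_ridge, w_map))  # True
```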

Elements of Statistical Learning is firmly in the frequentist world from what
I recall, so this might not be discussed in that book.

~~~
disgruntledphd2
This is discussed in Chapter 1 (or maybe 2), I think, which suggests to me
that the author should probably read a little bit more of it.

Mind you, it's a wonderful book, and I recommend that people should just read
it in general (you may not be able to do very many of the exercises, but it's
still worth it).

------
d_burfoot
The author makes it sound like statistics is this grand beautiful mathematical
edifice and ML is just a bunch of number crunching with computers. That
contrast is just unfair; a huge portion of stats is just made up of hacks and
cookbook recipes. Statistics has probably done more damage to the world than
any other discipline, by giving a sheen of respectability to fake science in
fields like nutrition, psychology, economics, and medicine.

I'm particularly annoyed by the implication that statisticians have a better
understanding of the issue of overfitting ("why having p >> n means you can't
do linear regression"). Vast segments of the scientific literature fall
victim to a mistake that's fundamentally equivalent to overfitting, and the
statisticians either didn't understand the mistake, or liked their cushy jobs
too much to yell loudly about the problem. This is why we have fields where
half of the published research findings are wrong.

~~~
fractionalhare
_> Statistics has probably done more damage to the world than any other
discipline, by giving a sheen of respectability to fake science in fields like
nutrition, psychology, economics, and medicine._

This seems really unfair. You can misuse statistics, but it's an extremely
powerful tool when it's _properly used and understood._ Most powerful tools
can be misused - you can write terrible code and (try to) publish bad
mathematics too. But much of modern science would be intractable without
statistics, including physics, chemistry, biology, and applied math, because
we'd otherwise be unable to draw reasonable conclusions from anything less
than a totality of data.

As someone with a graduate education in probability and statistics, I think
it's fair to lay some of the blame for the reproducibility crisis at the feet
of statisticians because of poor education. Statisticians should accept at
least some responsibility if their students in non-math majors graduate
without understanding the material, for sure.

But that being said, it should definitely be noted that actual statisticians
have been talking about this crisis _for decades._ Statisticians have
basically always known that there's nothing magical about the conventional
significance threshold _p <= 0.05_, for example. And for the most part, it's
not statisticians who are causing the bad science to occur. Rather, it's a
problem of non-statisticians using statistics--which they can't be expected
to do correctly if it's not their core competency--without (qualified) peer
review.

In my opinion it's something of a philosophical problem - many fields and
journals are only now realizing that it's unreasonable to expect e.g. a
professional psychologist to also be an expert statistician. Having a
dedicated statistician - instead of another psychologist who hasn't reviewed
the material since their upper undergrad course - is a giant leap forward in
catching bad stats in new research.

------
cycrutchfield
If this topic interested you, it may also be worth reading Leo Breiman’s
“Statistical Modeling: The Two Cultures” from 2001.
[https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)

------
fxtentacle
The author appears to misunderstand the main difference between statistics and
ML. Let me quote him:

> my gut reaction is to barf when someone says “teaching the model” instead of
> “estimating the parameters.”

Typical statistics work is to use a known good model and estimate its
parameters. Typical machine learning work is to think back from what task you
want it to learn and then design a model that has a suitable structure for
learning it.

For statistics, the parameters are your bread and butter. For machine
learning, they are the afterthought to be automated away with lots of GPU
power.

A well-designed ML model can have competitive performance with randomly
initialized parameters, because the structure is far more important than the
parameters. In statistics, random parameters are usually worthless.

~~~
Rejoyce
"Typical statistics work is to use a known good model and estimate its
parameters. [...] For statistics, the parameters are your bread and butter"
Ever heard of non-parametric statistics?

"For machine learning, they are the afterthought to be automated away with
lots of GPU power." You seem to reduce statistics to undergraduate statistics
and machine learning to Deep Learning.

"A well-designed ML model can have competitive performance with randomly
initialized parameters, because the structure is far more important than the
parameters. In statistics, random parameters are usually worthless." This is
blatantly false see Frankle & Carbin, 2019 on the lottery ticket hypothesis.

~~~
fxtentacle
Yes, I have reduced both statistics and ML to the subsets that are usually
used when working in the field, because the blog post was about employment
options.

I would wager that people doing non-parametric statistics are both very rare
and most likely advertise themselves as machine learning experts, not as
statisticians.

As for the random network, I was referring to
[https://arxiv.org/abs/1911.13299](https://arxiv.org/abs/1911.13299) and I
have seen similar effects in my own work where a new architecture was
performing significantly better before training than the old one was after
training.

If you want a generally agreed-upon example, it'd be conv nets with a cost
volume for optical flow. What the conv nets do is implement a glorified
hashing function for a block of pixels. That'll work almost equally well with
random parameters. As a result, PWC-Net already has strong performance
before you even start training it.

~~~
contravariant
> As for the random network, I was referring to
> [https://arxiv.org/abs/1911.13299](https://arxiv.org/abs/1911.13299) and I
> have seen similar effects in my own work where a new architecture was
> performing significantly better before training than the old one was after
> training.

The fact that a dense neural network with 20M parameters performs as well
as a model with 20M random values and 20M _bits_ worth of parameters means
nothing more than that the parameter space is ridiculously large.

The only models that perform well given random parameters are those that are
sufficiently restrictive. Like weather forecasts, where perturbations of the
initial conditions give a distribution of possible outcomes. Machine learning
models are almost never restrictive.

~~~
fxtentacle
Of course, I agree with you that the parameter space is ridiculously large.
But sadly, that's what people do in practice. And at 20 million parameters,
their example is still small in comparison to GPT-3 with 175 billion
parameters.

I disagree with you on the restrictive part. Those ML models that are inspired
by biology tend to be restrictive, the same way that the development of mammal
brains is assumed to be restricted by genetically determined structure. Pretty
much all SOTA optical flow algorithms are restricted in what they can learn.
And those restrictions are what makes convergence and unsupervised learning
possible, because the problem by itself is very ill-posed.

------
kelvin0
It seems like the 'scientist' part of 'Data scientist' might cause this sort
of misunderstanding.

There's a lot more 'engineering' and fiddling going on than any type of
'science-y' stuff.

~~~
NPMaxwell
At one point, the word "scientist" in "data scientist" was used to distinguish
between people who took the time to develop domain expertise from statistical
consultants who applied standard methodologies without reference to what the
data was or where it came from.

------
scribu
This is one of the clearest explanations I've read on the difference between
traditional Statistics and Machine Learning.

~~~
yt-sdb
A few years ago, Michael I. Jordan did an AMA on Reddit and discussed this
distinction as well. Maybe you'll find it interesting as a counterpoint [1].

[1]
[https://old.reddit.com/r/MachineLearning/comments/2fxi6v/ama...](https://old.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan/ckelmtt/)

~~~
mturmon
Perfect link. He’s the ideal commentator.

------
Konohamaru
> Traditionally, it’s a cardinal sin in academia to use parameters like these
> because you can’t say anything interesting about the parameters, but the
> trick in machine learning is that you don’t need to say anything about the
> parameters. In machine learning, your focus is on describing y-hat, not
> β-hat.

This kind of philosophy will cause future generations to see machine learning
as something worse than a fad, almost as something in between a fad and crank
science. If this encapsulates how (generally speaking) all machine learning
operates, then we are in for big trouble, if we are not already.

> In machine learning, bad results are wrong if they catastrophically fail to
> predict the future, and nobody cares much how your crystal ball works, they
> only care that it works.

This has moved from Cargo Cult Science into numeromancy. It's leveraging the
occult (=hidden, incomprehensible parameters) for predicting the future.
Because there exist no first principles, nothing can be further interpreted.
Only more of the occult can be leveraged in order to make more predictions not
amenable to interpretation, which will in turn require MORE occult to make
MORE inscrutable predictions, until the heat death of the universe....

And appealing to 80's AI (neural networks) as precedent further harms the
author's case. If ML plays out the way that era's neural network technology
did, then this whole rigmarole will go tits up by the same precedent.

~~~
hoseja
Have you considered that useful predictive models for reality are simply
irreducible to being comprehensible by swollen savannah monkey brains? It's
not magic. It simply cannot both be explained to a human and be worth
anything.

~~~
rob74
Can a swollen savannah monkey brain drive a car safely on public roads, using
nothing but its attached eyes and ears? Yes (most of the time). Can AI do it?
Not yet...

~~~
yeellow
Sure, you can drive, but can you explain in detail HOW you do it? Probably
not, and that's why a machine cannot do it yet (until it learns by itself, as
we all did).

------
em500
This is a really nice write-up, much better than
yet-another-skin-deep-sklearn-tutorial. Skimming some of the author's other
posts, his domain understanding looks quite impressive to me.

(Judging his writing as an ex-academic econometrician Data Scientist, about to
be rebranded to Machine Learning Engineer by his megacorp employer, the author
appears to have more insight into the field than many a PhD professional Data
Scientist.)

~~~
QuesnayJr
It is basically the standard take of a statistician who tries to understand
machine learning. "It's yhat, rather than betahat" is a common slogan.

------
Uptrenda
Data science always seemed to me to be a profoundly boring job. Can anyone
shed some light on what you find the most fascinating about it?

~~~
_fourzerofour
I can tell a story. I used to work for an HVAC installation company, pretty
small in terms of staff, but we subcontracted a lot. I was initially brought
on as a mechanical engineering intern, but moved to sales engineering when I
found an interesting statistical relationship.

A large factor in quotes to clients was the underlying cost of air
conditioning equipment in our niche, and often a game of sales intel was
played between suppliers and competing contractors (like us) for a given job
site. Favorites were picked, and we could get royally screwed in a quote,
losing the sale to the end-customer.

Fortunately, we had years of purchasing information. It turns out that as
varied as air conditioners are across brands and technical dimensions, when
you have years of accounts' line items and unused quotes, you don't get a
dimensionality issue. Since we operated in a clear-cut niche, this was
especially true. We could forecast, within a margin of error of two per cent,
exactly what any of our suppliers would quote us (or our competitors!) for a
job long before they could turn it around. Huge strategic advantage.

This was the watershed moment for me when I realized even basic multiple
linear regression was a scarily powerful tool when used correctly.
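A toy sketch of the idea (all features, coefficients, and data here are
invented for illustration; the real model was fit to years of purchasing
records):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy version: predict a supplier's quote from line-item attributes.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))        # e.g. tonnage, efficiency, brand, year
quotes = (X @ np.array([800.0, -150.0, 300.0, 50.0]) + 5000.0
          + rng.normal(scale=100.0, size=500))

model = LinearRegression().fit(X, quotes)
print(model.predict(X[:3]))          # forecast what the supplier would quote
```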

~~~
Uptrenda
That is cool when you put it like that. Uncovering hidden relationships that
are useful sounds romantic. Thanks for posting.

~~~
NPMaxwell
And incredibly boring. The usual estimate is that data science is 80% data
wrangling: finding, collecting, and cleaning up data. The term "data
scientist" replaced "data miner", because miners are looking for gold.
Scientists are obsessed with finding out the nature of reality, gold or mud.
They will do seriously boring stuff to set things up so that reality is
revealed.

~~~
screye
It is only boring if you do it the boring way.

If the data cleaning follows standard patterns, you should already have
scripts to offload that kind of work to. If not, then there are some
incredibly interesting decisions hidden underneath. Like in text: Should
character casing be preserved? What should be the unit of representation
(word/character)? How should data be filtered: quality vs. quantity
trade-off?

All of those are non-trivial questions which involve a lot of thought to
reason through. You are correct that the modelling is only a small part of a
DS's day-to-day job.

But the rest of it is boring in the same way that coding is boring. It
doesn't involve grand epiphanies or discoveries, but there is joy similar to
the daily grind of "code -> get bug / violate constraints -> follow
trace/problem -> figure out a sensible solution" that a lot of software
engineers love.

------
jmount
Love the article. It inspired me to make a follow-up note on one of the memes:
[https://win-vector.com/2020/07/03/data-science-is-not-statis...](https://win-
vector.com/2020/07/03/data-science-is-not-statistics-done-wrong/)

------
ur-whale
From the article: " In statistics, bad results can be wrong, and being right
for bad reasons isn’t acceptable. In machine learning, bad results are wrong
if they catastrophically fail to predict the future, and nobody cares much how
your crystal ball works, they only care that it works."

~~~
NPMaxwell
Is that a typo? It makes more sense as "good results can be wrong, and being
right for bad reasons isn't acceptable"

------
rob74
Off topic, but if someone uses "gut reaction" and "barf" in the same sentence,
I'm tempted to think they really mean it literally...

------
xvilka
There is a big difference between ML practitioners and professional
statisticians. The former are commonly unaware[1] of the rich set of
statistical biases and the ways to tackle or mitigate them.

[1]
[https://towardsdatascience.com/survey-d4f168791e57](https://towardsdatascience.com/survey-d4f168791e57)

------
anonymousDan
Can someone elaborate on what is meant by 'estimating a parameter with a
natural experiment'? This seems to be the key difference but I don't quite get
how this would work. What would be your input data and how would the process
differ from an ML approach?

~~~
stdbrouw
A natural experiment is an experiment (an AB-test if you will) that occurs by
chance rather than conscious design. For example, two neighboring countries
contemplate banning smoking in restaurants, but in one the bill fails with 49%
of the vote, in the other the ban goes through with 51% of the vote. It's not
perfect, but you could argue that these countries can now be used to estimate
the effect of a smoking ban on mortality and health in a way that is almost as
good (but not quite as good) as a randomized clinical trial, whereas you can't
just compare two arbitrary countries with differential rates of smoking,
because they might be different on so many other counts as well and there is
no pre-intervention data to serve as a baseline.
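A minimal difference-in-differences sketch of how such a natural experiment
might be analyzed (all numbers invented for illustration):

```python
# Difference-in-differences on the smoking-ban example. All numbers are
# invented: mortality per 100k, before/after the vote, in each country.
ban_before, ban_after = 120.0, 110.0      # country where the ban passed (51%)
ctrl_before, ctrl_after = 118.0, 116.0    # country where it failed (49%)

# The control country's trend estimates the counterfactual; the remaining
# difference is attributed to the ban.
effect = (ban_after - ban_before) - (ctrl_after - ctrl_before)
print(effect)  # -8.0 deaths per 100k attributed to the ban
```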

More broadly, ML does not really answer questions like "was this death caused
by smoking, or rather by a hundred other things associated with smoking like
lower income and bad health insurance?", though it is excellent at predicting
who is likely to die prematurely. So it's great for prediction, but not so
great if you want to learn more about the underlying structure of a
phenomenon.

Statisticians are sometimes surprised to see so much interest in machine
learning given that its view of the world is not open to inspection (though
there's
[https://github.com/interpretml/interpret](https://github.com/interpretml/interpret)
I guess) so _we_ as humans learn nothing, but it turns out that in many cases
we really don't care all that much about the underlying mechanism, as long as
we can make accurate predictions.

------
Ericson2314
A pox on both their houses.

I kinda want to ban this stuff for economies like ours. Think about it, we
have many entrenched inefficient separate actors all engaging in nonsense
alchemy. Surely this ruins the convergence to economic equilibrium.

------
YeGoblynQueenne
Well, if you look at machine learning from the point of view of data science
it's inevitable to be confused about its relation to statistics, but machine
learning is a sub-field of AI and statistical techniques are only one tool in
its toolbox. Statistical techniques have dominated the field in recent-ish
years, but much work in machine learning has historically used e.g.
Probabilistic Graphical Models or symbolic logic as the "model" language. For
example, one of the most famous and well-studied classes of machine learning
algorithms, decision tree learners, comprises algorithms and systems that
learn propositional logic models, rather than statistical models.

Tom Mitchell defined machine learning as "the study of computer algorithms
that improve automatically through experience"[1]. This definition does not
rely on any particular technique, other of course than the use of a computer.
Even the nature of "experience" doesn't necessarily need to mean "data" in the
way that data scientists mean "data" \-- for example, "experience" could be
collected by an agent interacting with its environment, etc.

Unfortunately in very recent years, since the big success of Convolutional
Neural Networks in image classification tasks in 2012, interest in machine
learning has shifted from AI research to... well, let me quote the article:

>> Or you can start reading TESL and try to get some of that sweet, sweet
machine learning dough from impressionable venture capitalists who hand out
money like it’s candy to anyone who can type a few lines of code.

I suppose that's ironic. But the truth is that "machine learning" has very
much lost its meaning as industry and academia are flooded by thousands of new
entrants who do not know its history and do not understand its goals. In that
context, it makes sense to have questions along the lines of "what is the
difference between statistics and machine learning", which otherwise have a
very obvious answer.

___________

[1]
[https://www.cs.cmu.edu/~tom/mlbook.html](https://www.cs.cmu.edu/~tom/mlbook.html)

The excerpt I quote is an informal definition. The wikipedia article on
machine learning has a more formal definition:

[https://en.wikipedia.org/wiki/Machine_learning#History_and_r...](https://en.wikipedia.org/wiki/Machine_learning#History_and_relationships_to_other_fields)

~~~
pietroppeter
Tom Mitchell's book is still a great way to understand what Machine Learning
is about.

------
fnord77
this reads like something 5 or 6 years old

------
aabajian
The author's pie chart showing data science to be 60% data manipulation is
accurate. The biggest gap between good and bad data scientists is their
comfort level with data wrangling. When interviewing candidates for data
science positions, one of the simplest questions is to have them sort a 1 GB
tab-delimited file.

1\. Poor candidates will try to open the file in Excel.

2a. Marginal candidates will use R or Stata.

2b. Okay candidates will use a scripting language like Python.

3\. Good candidates will use Unix sort.

To my knowledge, there are no university courses teaching the Unix toolchain
and it remains very much a skill learned through practice.
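For the record, a sketch of the "good candidate" answer, wrapped in Python
(filenames are placeholders; GNU sort's -t/-k/-S/-o flags pick the delimiter,
sort key, memory buffer, and output file):

```python
import subprocess

# GNU sort does an external merge sort, spilling to disk as needed, so
# memory stays bounded no matter how large the file is.
subprocess.run(
    ["sort", "-t", "\t", "-k1,1", "-S", "512M", "-o", "sorted.tsv", "data.tsv"],
    check=True,
)
```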

~~~
blackbear_
Not sure why you think a candidate who uses R is inferior to one who uses
Python?

Also, a really good candidate should use the right tool for the job, so if you
expect them to use Unix sort you should somehow imply a situation where that
is the best approach.

~~~
whym
> if you expect them to use Unix sort you should somehow imply a situation
> where that is the best approach.

I think the implied question is whether the interviewee is aware that trying
to load a 1 GB text file could use up too much of the system's RAM. Unix sort
is arguably the most memory-efficient of the 4 choices there. It depends on
the amount of available resources (which was not specified), and some
companies might be willing to casually let people use 100 GB machines,
though.

~~~
blackbear_
That is a fair point. A good interview script would be to move towards this
scenario of more data than RAM and see what the candidate comes up with.

