On moving from statistics to machine learning, the final stage of grief (2019) (ryxcommar.com)
175 points by yoloswagins 36 days ago | 103 comments



There's this undertone of "I should be paid as much/more than Data Science people because I'm better than them at statistics and data science = machine learning = statistics".

My experience doing data science at small companies that can't afford to hire more than 1 person for the role is that it is so much more than just building models or doing statistics.

You have to:

1. Build APIs and work with developers to get predictive models integrated into the rest of the software stack

2. Know how to add logging, auditing, monitoring, containerizing, web scrapers, cleaning data(!!), SQL scripts, dashboards, BI tools, etc.

3. Do some basic descriptive stats, some basic inferential stats, some predictive modeling, work on time-series data, sometimes apply survival analysis, etc. (Python/R/Excel who cares)

4. Set up data pipelines and CI/CD to automate all this crap

5. Try to unpack vague high level requirements along the lines of "Hey do you think we could use our data to build an 'AI' to do this instead of manually doing it" and then come up with a combination of software / statistical models that performs at least as well as humans at the task.

6. Work with non-technical business users and be able to translate this back to technical requirements.

Hey, if all you do all day is "build models" then that sounds like a very cushy DS job you have. It's definitely not been my experience. I would describe it more like a combination of software engineering, statistics, and business analysis. That's why it pays higher than just statistics. But this is just my experience.


The author starts with

The data science world may reject me and my lack of both experience and a credential above a bachelors degree

More likely the data science world will reject him because he is so confident about a field he has so little experience with or knowledge of.

Data scientist is a profession rather than the name of an academic field. So data scientists' job is to solve practical problems. That involves a lot more than class assignments, and in some cases involves using machine learning to maximize predictive accuracy (because common ML models like gradient boosting capture interactions and non-linearities in a richer way than the GLM models the author is familiar with).

Their argument, "that's a garbage model because we can't reasonably interpret underlying parameters," puts their personal criteria above what is needed to solve some problems.

They can blame it on only having a bachelor's degree. But the real problem is the belief that a bachelor's degree taught them everything there is to know, and those in the DS field are ~ idiots who got lucky enough to be paid more.


I feel like the blog post should be read in line with how it was probably written: informal, personal and somewhat sarcastic, with a bitter note because he chose one major and now it turns out people value something else instead. Hence the title "the final stage of grief". I did not get the impression that the author thinks machine learning is stupid or that he knows everything about it.


And in all honesty, no data scientist with an ounce of self-worth should work long-term for such companies, unless it just happens to be their own.

You're basically doing 3 jobs for the price of one: Software Engineer, Data Engineer, Data Scientist.

Sure, you'll be a jack of all trades, as far as data goes, but it'll be at the cost of some specialization.

I'm probably gonna get a lot of sh!t for this post - probably from data [x] people that are in that exact position themselves, but the above description is exactly why I'd aim for larger companies with somewhat established analytics / data / ML teams or offices. You get to focus on the important stuff, instead of juggling ten balls at the same time.

(And it's not only in the field of data science. Some of the traditional SE positions I see at startups or small companies look absolutely grotesque - basically the whole IT and Dev. department baked into one job)


No one with an ounce of self-worth should work long-term for companies that expect them to do exactly what their title implies they should and not a thing more.

You're basically being arbitrarily restricted to learning and enjoying exactly one thing when it would often make more sense in context to become involved in: Customer relations, systems administration, management, software engineering, data science, etc.

Sure, you'll become really good at that one thing, but it'll be at the cost of personal growth and job satisfaction.

> I'm probably gonna get a lot of sh!t for this post...

I mean, yeah. You've basically lampooned anybody who enjoys working in ill-defined cross-disciplinary circumstances as having "[not an] ounce of self-worth".

It sounds like, from your perspective, your field is "the important stuff" and other fields are just balls to be juggled. There's nothing wrong with that, but lots of people don't think that way. To some people, the important stuff is anything that makes their customers happy. To others, it's anything that helps them learn.

And let's dispense with the notion that "doing 3 jobs for the price of one" is an accurate description of having broad rather than narrow responsibilities. One comes at the cost of the other. If you're an equally capable specialist and generalist, and you're capable of genuinely performing those 3 jobs at once, then if you were to specialize you'd be performing the work of 3 average specialists, and you'd be in the same boat as before.

Do what you're best at and try to get the best possible compensation for it, monetary or experiential. It's as simple as that.


> And in all honesty, no data scientist with an ounce of self-worth should work long-term for such companies, unless it just happens to be their own.

You're kind of saying that no one should work for a startup ever. At small companies you have to do many things. As you said about software engineers, there aren't dedicated front-end engineers or dev-ops. Marketing teams don't have content vs. growth vs. performance vs. brand vs. email marketers. You might be the first sales person, which means no sales ops support, no account manager for ongoing relationships, etc.

There are trade-offs, of course! There are trade-offs to anything. Some people value working on many aspects of a company. Some people find understanding more than their narrow field to be interesting and rewarding and, you know, self-worth-y!

So I think it might be helpful to step back and consider that not everyone has the same priorities, experiences, interests, or definitions of happiness and self-worth as you do. And that's okay.


>You get to focus on the important stuff, instead of juggling ten balls at the same time.

So on one hand, you can't build any models without the work of the engineers, but on the other the model building is "the important stuff"?

Maybe it's just me, but I enjoy working on all aspects of the data pipeline.


Model building is often the trivial part, and often you can't build models without a solid understanding of things like the data pipeline.


Some people enjoy doing full stack DS and get paid well for it.


And there's the potential risk of being mediocre in many fields and becoming less competitive in each area. My opinion is it really needs to be the field you love (biology, science etc) to be worth the effort.


to be fair, how large should a field be?

specializing in physics as a whole is way too broad. but is specializing in front end software development too broad?

Once you have learned the fundamentals of computer science and its associated fields (networking and systems engineering mainly), the difference between doing back-end and front-end work is not that large.


I had two semesters of databases at university. I wonder why they wasted so much time. I mean, you don't really need university education to understand the SELECT statement, right? /s

The problem with computer science is that it may take a lot of time to understand something thoroughly, but it only takes a little while to find a tutorial, copy some example code from the web, and build a simple version that only contains a few bugs and will be a nightmare to maintain and scale, but hey, it mostly works, as long as you use it in the predicted way and don't put in too much data and don't type or click too fast.

And because this works most of the time, and can be sold, this became the standard. People only capable of copy-paste development still get jobs. Design patterns, that's just some academic nonsense for nerds, right? Hey, my teenage nephew made a simple application in PHP over the weekend; how much more difficult can your job be, seriously? Java and JavaScript are the same thing, aren't they? Okay, one of them has optional semicolons, but now you are making a mountain out of a molehill, just admit it. What database design? Just make some tables and put the data there; if it's not fast enough, add some indexes. Web page design? Just put the button on the top; if it doesn't fit, put it on the bottom; if it still doesn't fit, put it on the side or maybe in the header, whatever. What's all this talk about technical debt? I am paying you to add new features...

Yes, software developers who don't have deep knowledge in any field, and who do mediocre work, are quite interchangeable. No need to invent specialized job positions for them. If your company develops 10 products, all the wheels will be reinvented 10 times and most of them won't rotate properly, but that's life. The advantage is that replaceable people are cheaper and more obedient.


I mostly agree but the competition in the field will dictate that. Like you stated, if it's features you are after and the customer won't mind these small details then maybe they (developers) are interchangeable. But e.g. in science mediocre work will cost a lot if certain minor details are found wrong. That's what the whole reproducibility crisis in science is about. The reputation of a whole institute / department might be at risk if the mistakes are exposed.


I talk regularly with business people who have hired data scientists, and 5 and 6 are always the biggest complaints. That, plus new hires are always unprepared to handle how messy real-world data is.


The messiness of data is something I even see creating a growing rift between academic ML/DS and real world applications.

What makes for a nice paper doesn't necessarily make for a model that will survive contact with new user generated data.


I remember some similar discussions when I worked in logistics consulting for a while. The data source was a mess, like a mess. Data was even included as screenshots of spreadsheets in other spreadsheets.

Some math PhD was in charge of that modelling. Without any domain knowledge concerning the data (logistics, consumption, maintenance) and thus unable to properly interpret the raw data to begin with. Most of the time was spent on writing some Python scripts to analyze the raw data, still full of errors. And build predictive models on top of that mess. Kind of formed my view of data science, unfairly so.


Just like almost all software work is maintenance, almost all data science is data cleaning.


I liken it to sending a chef to a grocery store. It's not just about being a good cook. Half the battle is in choosing the correct ingredients. Not just a dozen eggs, but free range where the yolks will be a vibrant orange yellow and improve the presentation. The cleanest models with the highest fidelity often fall out as the next obvious transformation of a well groomed and hygienic dataset.


Some of what you're describing in 1-4 is Data Engineering. 5-6 exist (in some form) for most software jobs.

The general breakdown I give people is:

Data Scientists:

* Get data.

* Clean data (~60% - 70% of time required).

* Research.

* Low level data analysis.

* Building models.

It's mostly the "knowing data" and the modelling.

Data Engineers:

* Data storage

* Data processing

* Automation

* Infrastructure

It's about getting the Data Scientist's output into production / making data easily available to them.

This is especially true for big ETL jobs. The more we can automate your ETL jobs, the happier you'll be!


I hate the fact that you're right. And that you've described my job in a small company.


> Machine learning is genuinely over-hyped. It’s very often statistics masquerading as something more grandiose. Its most ardent supporters are incredibly annoying, easily hated techbros

This sort of fashionable disparagement of a group of people to signal that you’re not part of the “bad group of tech bros” is so trashy. Why are these random people you easily hate? Who are they? Why take glee in shared hatred?

I’ve worked as a sr DS at FAANG for 4 years. I’ve recently worked through Casella Berger, because I wasn’t comfortable being one of those DS who didn’t know math stats. But before I did work through it, I worked with people from PhD stat programs who were so ineffective. Despite knowing so much more stats than me, they would freeze up and fail every time they had to deal with any sort of software system or IDE. It was so weird to me that my ability to use a regression, even before I knew the theory, was more valuable than their ability to use a regression to its full power, simply because I could fight the intense battle to take that idea and put it into reliable production code.

But generally I hate hate this war between DS and stats. It’s so stupid. Maybe not their first year, but eventually any DS who wants to be a master of their craft ought to learn math/theoretical stats. And some don’t want to be a master of their craft, and instead want to go into management or whatever, and that’s fine.


> I’m sure you’re asking: “why allow your parameters to be biased?” Good question. The most straightforward answer is that there is a bias-variance trade-off. The Wikipedia article does a good job both illustrating and explaining it. For β-hat purposes, the notion of allowing any bias is crazy. For y-hat purposes, adding a little bias in exchange for a huge reduction in variance can improve the predictive power of your model.

I'm going to push back on this.

The author seems to understand the bias-variance tradeoff as applying primarily to y-hat, and allows that if you are primarily interested in y-hat then it can make sense to make that tradeoff (introduce bias in exchange for lower variance). But the bias-variance tradeoff is more general than that. There's also a bias-variance tradeoff in beta-hat, and you can make a similar decision there to introduce some bias in beta-hat in exchange for lower variance, lowering the overall mean square error.

There's nothing crazy about this. The entire field[1] of Bayesian statistics does this every day--Bayesian priors introduce bias in the parameters, with the benefit of decreasing variance. Bayesians use these biased parameter estimates without any problems.
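As a rough illustration of that trade-off, here is a minimal sketch assuming a binomial proportion with a Beta(a, b) prior. The true p, the number of trials, and the prior parameters are hypothetical values picked for the example; the bias and variance expressions are the standard closed forms for the estimator (x + a) / (n + a + b).

```python
def mse_components(p, n, a=0.0, b=0.0):
    """Bias, variance, and MSE of the estimator (x + a) / (n + a + b),
    where x ~ Binomial(n, p). With a = b = 0 this is the usual p-hat = x / n."""
    denom = n + a + b
    bias = (n * p + a) / denom - p              # E[estimator] - p
    variance = n * p * (1.0 - p) / denom ** 2   # Var[x] / denom^2
    return bias, variance, bias ** 2 + variance

# Hypothetical setup: true p = 0.3, 20 trials, Beta(2, 2) prior.
p_true, n_trials = 0.3, 20
unbiased = mse_components(p_true, n_trials)           # plain p-hat
shrunk = mse_components(p_true, n_trials, a=2, b=2)   # Beta(2, 2) posterior mean

print("p-hat:          bias=%.4f var=%.4f mse=%.4f" % unbiased)
print("posterior mean: bias=%.4f var=%.4f mse=%.4f" % shrunk)
```

With these particular numbers the prior pulls the estimate toward 0.5, which costs some bias but lowers the mean square error. A prior centred far from the true p would not be so kind.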

Classical (non-Bayesian) statistics has tended to focus heavily on unbiased models. I suspect this is largely because restricting the class of models you're looking at to unbiased models allows you to prove a lot of interesting results. For example, if you restrict yourself to linear unbiased models, you can identify one single `best` (i.e. lowest variance) estimator. As soon as you allow bias you can't do that anymore.

[1] Except empirical Bayes, which is a dark art.


“Non-Bayesian” stats uses bias in the exact same way that Bayesian stats does, because Bayesian stats and frequentist stats are mathematically equivalent. When someone thinks they’re not using priors, they’re wrong, they’re usually using a flat prior on the model parameters, but that adds bias just like any other prior! A flat prior on theta is different than a flat prior on log(theta) or some other parametrization, and flat priors are often times the wrong choice, so this notion that Bayesian stats is some “special” type of inference and that there is some way to do inference without bias is just a very large misconception.


> “Non-Bayesian” stats uses bias in the exact same way that Bayesian stats does

This is incorrect.

Bias has a very specific mathematical meaning in statistics--the difference between the expected value of the estimate (under the sampling distribution) and the true value. There are many examples of parameter estimates in classical statistics that have zero bias under that definition.

> Bayesian stats and frequentist stats are mathematically equivalent.

Also incorrect. Bayesian and frequentist methods focus on different conditional probabilities and can give very divergent results even in simple cases. See e.g. Lindley's paradox [1].

[1] https://en.wikipedia.org/wiki/Lindley%27s%20paradox


I appreciate your comments.

> Bias has a very specific mathematical meaning in statistics--the difference between the expected value of the estimate (under the sampling distribution) and the true value.

Right, I'm aware of what bias is, and how it's defined, and while the statement you make is true, it misses the point: regardless of the camp you're in (frequentist or Bayesian), inference involves a prior, and that prior will affect your inferred parameter estimates (it may bias them, it may not, but flat priors do not guarantee unbiased estimates). I agree, under various contrived scenarios, you can show your parameter estimate is unbiased when using a flat prior (yay!) but what happens if you're using the wrong parameterization of your model? A flat prior on \theta is not a flat prior on \log\theta or any other transformation of theta, but if you're a frequentist what do you do about that? If you're not conscious of the prior choice you are making, you can easily introduce bias you don't want even with a flat prior.
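The reparameterization point can be written out as a change of variables. A sketch, assuming \theta > 0 and writing \phi = \log\theta (the symbol \phi and the constant c are notation introduced here, not taken from the thread):

```latex
p_\theta(\theta) = c
\quad\Longrightarrow\quad
p_\phi(\phi) = p_\theta\!\left(e^{\phi}\right)\left|\frac{d\theta}{d\phi}\right| = c\,e^{\phi}
```

So a density that is flat in \theta is not flat in \log\theta: on the log scale it grows exponentially and puts more weight on large values.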

> There are many examples of parameter estimates in classical statistics that have zero bias under that definition.

Elaborate. I assume by "parameter estimates" you mean MAP (or MLE = MAP with flat prior). E[\hat{\theta} - \theta] may be zero with a flat prior, but E[\hat{\log\theta} - \log\theta] won't be, so the bias in your parameter estimate depends on the parameter you really care about.
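A small simulation can make this concrete. This is only a sketch under an assumed model: exponential data with mean theta, where the sample mean is the MLE of theta and its log is the MLE of log(theta). The values of theta, n, the number of replications, and the seed are arbitrary.

```python
import math
import random

def simulate_bias(theta=2.0, n=5, reps=100_000, seed=0):
    """Monte Carlo estimate of the bias of x_bar for theta and of
    log(x_bar) for log(theta), with x_i ~ Exponential(mean=theta)."""
    rng = random.Random(seed)
    sum_mean = 0.0
    sum_log_mean = 0.0
    for _ in range(reps):
        # expovariate takes the rate, i.e. 1 / mean
        x_bar = sum(rng.expovariate(1.0 / theta) for _ in range(n)) / n
        sum_mean += x_bar
        sum_log_mean += math.log(x_bar)
    return sum_mean / reps - theta, sum_log_mean / reps - math.log(theta)

bias_theta, bias_log_theta = simulate_bias()
print("bias of x_bar for theta:          %+.4f" % bias_theta)
print("bias of log(x_bar) for log theta: %+.4f" % bias_log_theta)
```

The first number should sit close to zero, while the second should come out clearly negative, which is what Jensen's inequality predicts for the log of an unbiased positive estimator.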

see e.g. [1] for an example where flat priors do actually bias inferences.

> Also incorrect. Bayesian and frequentist methods focus on different conditional probabilities and can give very divergent results even in simple cases. See e.g. Lindley's paradox [1].

OK so let me clarify because I agree that my wording is wrong: while I agree with you that there are differences between Bayesian and frequentist statistics, they are philosophical; they answer different questions: ironically, the Wikipedia article you linked to actually says it pretty well: "Although referred to as a paradox, the differing results from the Bayesian and frequentist approaches can be explained as using them to answer fundamentally different questions, rather than actual disagreement between the two methods."

[1]: https://mc-stan.org/users/documentation/case-studies/weakly_...

"Although flat priors are often motivated as being “non-informative”, they are actually quite informative and pull the posterior towards extreme values that can bias our inferences."


> I agree, under various contrived scenarios, you can show your parameter estimate is unbiased when using a flat prior (yay!) but what happens if you're using the wrong parameterization of your model?

I wouldn't consider estimating, say, the mean length of a population of fish contrived (unbiased estimate: x-bar). Nor would I consider estimating the probability of an event based on observations of the event happening or not happening contrived (unbiased estimate: p-hat = #successes/#trials).
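The unbiasedness of p-hat can be checked directly by summing over the binomial distribution. A minimal sketch; the values of n and p are hypothetical.

```python
from math import comb

def expected_p_hat(n, p):
    """Exact E[x / n] for x ~ Binomial(n, p), summing over the binomial pmf."""
    return sum((k / n) * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

n_trials, p_true = 12, 0.3   # hypothetical values
# The expectation equals p_true up to floating-point rounding.
print(expected_p_hat(n_trials, p_true), p_true)
```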

These kinds of simple estimation problems and the associated statistical tests account for probably 90% of statistical practice. Dismissing them as contrived is silly.

More generally, MLE estimates are always (under regularity conditions) asymptotically unbiased even if not unbiased for a finite sample. This means that the amount of bias decreases to zero as the sample size increases, no matter what the parameterization is.
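A textbook instance of "biased in finite samples, unbiased in the limit" is the MLE of a Normal variance, which divides by n rather than n - 1. The sketch below uses the known expectation of that estimator; the value of sigma2 is arbitrary.

```python
def variance_mle_bias(sigma2, n):
    """Bias of the Normal variance MLE (1/n) * sum((x_i - x_bar)^2).
    Its expectation is (n - 1) / n * sigma2, so the bias is -sigma2 / n."""
    return (n - 1) / n * sigma2 - sigma2

# The bias shrinks toward zero as the sample size grows.
for n in (5, 50, 500, 5000):
    print(n, variance_mle_bias(sigma2=4.0, n=n))
```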

Finally, there is very often a natural parameterization for any given problem. If you're interested in the arithmetic mean of a population, there's no reason to use a log-scale parameterization. Why worry about bias in other parameterizations when you can just use the natural parameterization, where the estimator is unbiased? Again, I don't think such scenarios are contrived: a very large proportion of statistical analyses deal with simple measurements in Euclidean (or very nearly Euclidean--we can typically ignore, for example, relativistic effects) spaces: real world dimensions, time, etc. If you're a Bayesian and very concerned about parameterization effects you can also use a Jeffreys prior, which is parameterization-invariant. Notably, for the mean of a Normal distribution, the Jeffreys prior is... the flat prior!
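The last claim follows from the Fisher information. A sketch for a single observation x ~ Normal(\mu, \sigma^2), assuming \sigma is known:

```latex
I(\mu) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial\mu^2}\log p(x \mid \mu)\right] = \frac{1}{\sigma^2},
\qquad
p(\mu) \propto \sqrt{I(\mu)} = \frac{1}{\sigma} \propto 1
```

The information does not depend on \mu, so its square root is constant in \mu, i.e. flat.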

> OK so let me clarify because I agree that my wording is wrong: while I agree with you that there are differences between Bayesian and frequentist statistics, they are philosophical; they answer different questions:

Yes and no. The Bayesian and frequentist approaches answer different mathematical questions, but they are used by humans to answer the same human questions, such as "do these two populations have the same mean?"

> see e.g. [1] for an example where flat priors do actually bias inferences.

I don't consider that a good example of flat priors biasing the inference. The posterior with flat prior is diffuse because the data doesn't provide much information; in the absence of much prior or likelihood information, the posterior should be diffuse, so that's a perfectly reasonable result. If you can't stand a diffuse posterior then either collect more data or (carefully!) introduce a more informative prior. The more informative the prior you introduce, the more biased your inference will be; that's fine as long as your prior is chosen carefully.

This is not to say that flat priors are never a problem. The setting I'm familiar with where diffuse priors are most dangerous is when doing model comparison--but that issue is specific to Bayesian methods, not frequentist. Indeed this is one of the primary reasons Lindley's paradox arises: the Bayesian model comparison (using marginal likelihoods or Bayes factors) gets tricked by the diffuse prior, while the frequentist model comparison (using null hypothesis testing) does not.
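To see the disagreement numerically, here is a minimal sketch of a Lindley-style comparison for H0: p = 0.5. The counts are an assumption chosen for illustration (49,581 successes in 98,451 trials, an observed rate of about 0.5036), as are the 50/50 prior on H0 versus H1 and the flat prior on p under H1. The p-value uses a normal approximation to the binomial.

```python
import math

def lindley_example(n, x):
    """Compare a two-sided test of H0: p = 0.5 with the posterior probability
    of H0 under a 50/50 prior on H0 vs H1 and a flat prior on p under H1."""
    # Frequentist side: normal approximation to the binomial under H0.
    z = (x - n / 2) / math.sqrt(n / 4)
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Bayesian side: marginal likelihood of the data under each hypothesis.
    log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    like_h0 = math.exp(log_binom + n * math.log(0.5))
    like_h1 = 1.0 / (n + 1)   # integral of the binomial pmf over p in [0, 1]
    posterior_h0 = like_h0 / (like_h0 + like_h1)
    return p_value, posterior_h0

# Assumed counts, see the note above.
p_value, posterior_h0 = lindley_example(98451, 49581)
print("two-sided p-value:  %.4f" % p_value)
print("posterior P(H0|x):  %.4f" % posterior_h0)
```

Under these assumptions the test rejects H0 at the 5% level while the posterior still strongly favors H0. Spreading the H1 prior over all of [0, 1] is what makes its marginal likelihood so small.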


> I wouldn't consider estimating, say, the mean length of a population of fish contrived (unbiased estimate: x-bar). Nor would I consider estimating the probability of an event based on observations of the event happening or not happening contrived (unbiased estimate: p-hat = #successes/#trials).

Sure, maybe not contrived; my point is that flat priors may work in many "typical" textbook stats problems, but they are one of many choices, and that choice is important to be explicit about and not sweep under the rug. Because if your entire life is measuring sample means, fine, you're never going to need to think about this very much and life will be nice. But when one fine day you decide to do something more complex, these are the land mines that you shouldn't really ignore.

> These kinds of simple estimation problems and the associated statistical tests account for probably 90% of statistical practice. Dismissing them as contrived is silly.

Whether it's 90% is totally dependent on the types of problems you do. I don't mean to dismiss them, you're right for many problems MLE is just fine. I meant to illustrate that "unbiased" comes with many caveats, and that in many real scenarios flat priors are not ok.

> More generally, MLE estimates are always (under regularity conditions) asymptotically unbiased even if not unbiased for a finite sample. This means that the amount of bias decreases to zero as the sample size increases, no matter what the parameterization is.

Is this not true of the MAP for most priors? Gaussian/Laplace priors will have this property too, since priors become asymptotically less important the more data you have. If your prior is zero over some of the support, you're out of luck but this doesn't strike me as a good argument for MLE > MAP or for using flat priors everywhere. When we have infinite data, sure, priors are irrelevant, but we live in the real world where data is not infinite.
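For the Gaussian case this can be written in closed form. A sketch assuming Normal data with known sigma and a Normal prior on the mean; all numeric values are hypothetical.

```python
def map_weight_and_bias(n, mu_true, prior_mean, prior_sd, sigma):
    """For x_i ~ Normal(mu, sigma^2) with prior mu ~ Normal(prior_mean, prior_sd^2),
    the MAP (which equals the posterior mean here) is
    w * x_bar + (1 - w) * prior_mean. Returns w and the bias of the MAP."""
    w = n * prior_sd ** 2 / (n * prior_sd ** 2 + sigma ** 2)
    bias = (1.0 - w) * (prior_mean - mu_true)
    return w, bias

# Hypothetical setup: true mean 5, prior centred at 0 with sd 1, sigma 2.
for n in (1, 10, 100, 10_000):
    w, bias = map_weight_and_bias(n, mu_true=5.0, prior_mean=0.0,
                                  prior_sd=1.0, sigma=2.0)
    print("n=%6d  weight on data=%.4f  bias=%+.4f" % (n, w, bias))
```

The weight on the data goes to 1 and the bias goes to 0 as n grows, but for small n the prior dominates.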

> Finally, there is very often a natural parameterization for any given problem. If you're interested in the arithmetic mean of a population, there's no reason to use a log-scale parameterization.

Sure, agree that parametrization isn't a problem a lot of the time, but it is something important to be mindful of, and this points towards, again, not forgetting that you are always using a prior and that you should think about whether or not that prior makes sense.

> Why worry about bias in other parameterizations when you can just use the natural parameterization, where the estimator is unbiased? Again, I don't think such scenarios are contrived: a very large proportion of statistical analyses deal with simple measurements in Euclidean (or very nearly Euclidean--we can typically ignore, for example, relativistic effects) spaces: real world dimensions, time, etc.

Yea, I mean sure: for easy problems, parametrization is obvious. That's kind of tautological. But sometimes it's not obvious, or sometimes for computational reasons you need to work with a log(theta) instead of theta, etc. If you're a frequentist and you're thinking life is great because you don't need to worry about priors, you're wrong and sooner or later you will get into trouble; be it a parametrization issue or something else, priors are not just something you can completely ignore. It's like saying "I always drive without looking in my rearview mirror" -- ok, great, you will be fine a lot of the time, but eventually one day you will change lanes on the highway at the exact wrong time, and you will really regret your habit of not looking in your mirror.

> If you're a Bayesian and very concerned about parameterization effects you can also use a Jeffreys prior, which is parameterization-invariant. Notably, for the mean of a Normal distribution, the Jeffreys prior is... the flat prior!

Yep, totally agree, I have no problem with Jeffreys priors (when they make sense), and that's all well and good. Just to clarify: I am not saying "don't use flat priors" -- flat priors are extremely reasonable and a good idea in many cases, my point is flat priors are still priors, and you are still making a statement by using them: "let's assume all possible values of theta are equally likely a priori". Sometimes we don't really believe that but it's useful to see the implications of making this assumption. But sometimes priors are extremely important (e.g. we want a time-dependent measurement of a Poisson rate, like conversions per dollar of ad spend, and conversions are relatively rare: priors are your friend here, e.g. a GP prior = Cox process or something else, even if this prior is an operational assumption)

> Yes and no. The Bayesian and frequentist approaches answer different mathematical questions, but they are used by humans to answer the same human questions, such as "do these two populations have the same mean?"

Yes, agreed.

> Indeed this is one of the primary reasons Lindley's paradox arises: the Bayesian model comparison (using marginal likelihoods or Bayes factors) gets tricked by the diffuse prior, while the frequentist model comparison (using null hypothesis testing) does not.

Ah lord, but this is a terrible justification for using null hypothesis rejection: we're almost always choosing a very simplistic distribution (e.g. Gaussian) to do this, and reducing the question to "we reject H0 because it's very unlikely" is part of the reason why there's a replication crisis in e.g. social sciences, because they're taught this simplistic picture without any of the necessary nuance ('here are the assumptions we make, and under these assumptions + H0, it is a little bit unlikely that we would have observed x'). That's a recipe for disaster. Is it not much better to discuss the full posterior, "degrees of belief" and to be explicit about all of our uncomfortable prior assumptions? I prefer Bayesian model selection over null hypothesis rejection 100% of the time, especially because "Bayesian model selection" is the only logical way to do model selection, the only caveat is that it depends on reasonable prior assumptions and these are the hard part (but again, at least it is explicit!).

Also, the Lindley's "paradox" example certainly seems contrived: we believe there's a 50% chance that p = 0.5 exactly?? I just don't understand that type of analysis. Come up with a prior, derive your posterior, decide the answer to your question yourself (what is the chance that p=0.5 exactly? well, it's exactly 0%. How much more likely is it that p=0.5036 vs p=0.5? That's a better question...). By contrived, I mean that it appears designed to exploit the fact that Bayesian stats will automatically prefer simpler models, especially one with 0 degrees of freedom that is relatively close to the right answer, but that's a Good Thing (TM).

Both frequentist stats and Bayesian stats are easy to abuse: Bayesian stats gives a false sense of comfort because people don't worry enough about their choice of prior, but at least Bayesian stats is explicit about the prior! I won't say that hypothesis testing is complete garbage, but it is quite dangerous and frankly dishonest to reduce things to a p value and pretend that's the end of the discussion.


> But when one fine day you decide to do something more complex, these are the land mines that you shouldn't really ignore.

> in many real scenarios flat priors are not ok.

> eventually one day you will change lanes on the highway at the exact wrong time, and you will really regret your habit of not looking in your mirror.

Can you give some examples where frequentists hit these alleged flat-prior landmines? I am admittedly a Bayesian by training, not a frequentist, so perhaps it's just my ignorance showing, but I'm not aware of any such situations.

Frequentist statistics generally relies on performance guarantees (bounds on the false positive error rate for tests, in particular, and coverage for confidence intervals) which are derived under the lack-of-explicit-prior, so as far as I can tell they should be doing fine. I'd be interested in seeing examples where frequentist analyses fail because of the (implicit) flat prior.

> we're almost always choosing a very simplistic distribution (e.g. Gaussian) to do this

The Gaussian distribution is a marvelous thing. The central limit theorem is, in my humble opinion, one of the most beautiful and surprising results in mathematics.

> Is it not much better to discuss the full posterior, "degrees of belief" and to be explicit about all of our uncomfortable prior assumptions?

Perhaps I'm just cynical, but I'd say probably not. A Bayesian decision process is still a decision process and still subject to all the problems that the frequentist decision process (null hypothesis significance testing) is subject to: inflated family-wise error rates, p-hacking (except with Bayes factors rather than p-values), publication bias, and so on. At best getting everyone to do Bayesian analyses might be roughly equivalent to getting everyone to use a lower default significance threshold, like 0.005 instead of 0.05 (which prominent statisticians have advocated for).
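For a sense of scale on the family-wise error rate and the two thresholds mentioned, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is true, which is a simplification; the choice of 20 tests is arbitrary.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive across m independent
    tests of true null hypotheses, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for alpha in (0.05, 0.005):
    print("alpha=%.3f, 20 tests: FWER=%.3f" % (alpha, family_wise_error_rate(alpha, 20)))
```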

> I prefer Bayesian model selection over null hypothesis rejection 100% of the time, especially because "Bayesian model selection" is the only logical way to do model selection, the only caveat is that it depends on reasonable prior assumptions and these are the hard part (but again, at least it is explicit!).

Sadly there's a trap in Bayesian model selection (often called Bartlett's paradox, though it's essentially the same thing as Lindley's paradox) which can be difficult to spot. No names out of respect to the victim, but several years ago I saw a very experienced Bayesian statistician who has published papers about Lindley's paradox fall prey to this. Explicit priors didn't help him at all. He would not have fallen into it if he had used a frequentist model selection method, though there are other problems with that.

> Also, the Lindley's "paradox" example certainly seems contrived:

And here we are again calling a statistical test that thousands of people do every day "contrived." You already know how I feel about that.

Yes, it's a very simple example, because that helps illustrate what's happening. Lindley's paradox can happen in arbitrarily complex models, any time you're doing model selection.

> By contrived, I mean that it appears designed to exploit the fact that Bayesian stats will automatically prefer simpler models, especially one with 0 degrees of freedom that is relatively close to the right answer, but that's a Good Thing (TM).

Preferring simpler models is not exactly what's going on in Lindley's paradox, at least not the way that most people talk about Bayes factors preferring simpler models (e.g. by reference to the k*ln(n) term in the Bayesian Information Criterion). The BIC is based on an asymptotic equivalence and drops a constant term. That constant term is actually what is primarily responsible for Lindley's paradox, and has only an indirect relationship to the complexity of the model.


Hi, sorry for the delayed response!

> Can you give some examples where frequentists hit these alleged flat-prior landmines? I am admittedly a Bayesian by training, not a frequentist, so perhaps it's just my ignorance showing, but I'm not aware of any such situations.

You're probably right: I myself am also a Bayesian by training as you can probably guess but went through the usual statistics education from a frequentist standpoint, and once I learned Bayesian statistics it was almost an epiphany and much more intuitive and understandable than the frequentist interpretation (but that's just me). In all honesty, I think good frequentist statisticians and good Bayesian statisticians have nothing to worry about, since both should know exactly what they are doing and saying as well as the limitations of their analysis.

I wouldn't put myself in either the "good frequentist" or "good Bayesian" categories, by the way, I am just an imperfect practitioner, but I think that's the case for most people. My argument against frequentist statistics for the masses is a practical one: I found myself getting into much more trouble and having much less insight into what I was doing when I had a frequentist background than I did when doing things from a Bayesian standpoint, and I see many imperfect frequentist statisticians like myself running into the same problems I used to (mostly ignoring priors when they shouldn't or thinking a flat prior is always uninformative, etc.), but I admit that is a wholly subjective experience. I never once thought about priors before learning Bayesian stats, and I find many people I meet with a frequentist background also forget the significance of priors because they also are imperfect practitioners.

> Frequentist statistics generally relies on performance guarantees (bounds on the false positive error rate for tests, in particular, and coverage for confidence intervals) which are derived under the lack-of-explicit-prior, so as far as I can tell they should be doing fine. I'd be interested in seeing examples where frequentist analyses fail because of the (implicit) flat prior.

Yeah, I totally agree, I just find that statistics is important in many more contexts than just this. While you can do this sort of thing from a Bayesian perspective (using Jeffreys priors or whatever the situation calls for), in my experience frequentists have a tough time departing from this type of analysis once they start diving into areas where priors are important (unless they are also familiar with Bayesian stats!).

> The Gaussian distribution is a marvelous thing. The central limit theorem is, in my humble opinion, one of the most beautiful and surprising results in mathematics.

Agree with you, but the CLT doesn't always help you. You may not always be interested in the statistics of averages in the limit of many samples. I agree that when you are doing this, the CLT is a godsend.
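A rough simulation sketch of both halves of this point. It assumes a uniform distribution for the well-behaved case and a standard Cauchy for a case where the central limit theorem does not apply (the Cauchy has no finite mean or variance). The function names, sample sizes, and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics


def draw_uniform(rng):
    """One draw from Uniform(0, 1), which has finite variance 1/12."""
    return rng.random()


def draw_cauchy(rng):
    """One draw from a standard Cauchy via inverse-CDF sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))


def sample_means(draw, n, trials, rng):
    """Means of `trials` independent samples, each of size n."""
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]


def iqr(values):
    """Interquartile range, a spread measure that exists even for the Cauchy."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


rng = random.Random(0)
uniform_means = sample_means(draw_uniform, 100, 2000, rng)
cauchy_means = sample_means(draw_cauchy, 100, 2000, rng)

uniform_sd = statistics.stdev(uniform_means)
cauchy_iqr = iqr(cauchy_means)
print(f"sd of uniform means: {uniform_sd:.4f} (CLT predicts {math.sqrt(1 / 12 / 100):.4f})")
print(f"IQR of Cauchy means: {cauchy_iqr:.2f} (a single Cauchy draw has IQR 2)")
```

The uniform means tighten up as the CLT predicts, while the mean of 100 Cauchy draws is about as spread out as a single draw, so averaging buys nothing there.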

> Perhaps I'm just cynical, but I'd say probably not. A Bayesian decision process is still a decision process and still subject to all the problems that the frequentist decision process (null hypothesis significance testing) is subject to: inflated family-wise error rates, p-hacking (except with Bayes factors rather than p-values), publication bias, and so on. At best getting everyone to do Bayesian analyses might be roughly equivalent to getting everyone to use a lower default significance threshold, like 0.005 instead of 0.05 (which prominent statisticians have advocated for).

I disagree here. Discussing the full posterior forces you not to reduce the analysis to a simple number like a significance threshold, and to acknowledge the fact that there are actually a wide range of possibilities for different parameter values, and it's important to do this when your posterior isn't nice and unimodal, etc. I don't disagree that sometimes (well, many times) the significance threshold is all you really care about (e.g. "is this treatment effective, yes or no"), but that's still a subset of where statistics is used in the wild. E.g. try doing cosmology with just frequentist statistics (actually, do not do that, you may be physically attacked at conferences).

But again, I want to emphasize that doing Bayesian stats can also give you a false sense of confidence in your results. I don't mean to say Bayesians are right and frequentists are wrong or anything, I just mean to say that sometimes priors are important and sometimes they aren't, and I personally find that I have an easier time understanding when to use different priors in a Bayesian framework than a frequentist one.

> Sadly there's a trap in Bayesian model selection (often called Bartlett's paradox, though it's essentially the same thing as Lindley's paradox) which can be difficult to spot. No names out of respect to the victim, but several years ago I saw a very experienced Bayesian statistician who has published papers about Lindley's paradox fall prey to this. Explicit priors didn't help him at all. He would not have fallen into it if he had used a frequentist model selection method, though there are other problems with that.

Like you say, there are problems with both approaches, but my point is that when the prior is explicit, we can all argue about its effects on the result or lack thereof. Explicit priors don't "help" you, but they force you to make your assumptions explicit and part of the discussion. If you're only ever using flat priors, it's easy to forget that they're there.

> And here we are again calling a statistical test that thousands of people do every day "contrived." You already know how I feel about that.

I don't mean to be flippant about it or dismissive, I mean exactly what I said:

contrived: "deliberately created rather than arising naturally or spontaneously."

Which test in Lindley's paradox are you referring to when you say thousands of people use it every day? Just the null rejection? Or is there another part of it you're referring to?

> Yes, it's a very simple example, because that helps illustrate what's happening. Lindley's paradox can happen in arbitrarily complex models, any time you're doing model selection.

My point isn't that it's simple, my point is that it's incredibly awkward and unrealistic and not representative of how a Bayesian statistician would answer the question "is p=0.5" which is a very strange question to begin with. The "prior" here treats it as equally likely that p=0.5 exactly and p != 0.5, which if that's your true assumption, fine, but my point is that this is a very bizarre and unrealistic assumption. Maybe it seems realistic to a frequentist but not to me at all. If someone was doing this analysis, I would expect to get a weird answer to a weird question.
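For concreteness, here is a sketch of the calculation being argued about. It uses the birth counts from the Wikipedia example as I recall them (49,581 boys out of 98,451 births; treat the exact figures as an assumption). It compares a two-sided p-value for H0: p = 0.5, using a normal approximation, against the posterior probability of H0 when H0 and H1 each get prior mass 0.5 and p is uniform on [0, 1] under H1. The function names are made up for illustration.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def two_sided_p_value(k, n, p0=0.5):
    """Two-sided p-value for H0: p = p0 under the normal approximation."""
    z = (k - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return math.erfc(abs(z) / math.sqrt(2))


def posterior_prob_null(k, n, p0=0.5, prior_null=0.5):
    """Posterior probability of H0: p = p0 against H1: p ~ Uniform(0, 1)."""
    like_null = math.exp(log_binom_pmf(k, n, p0))
    # With a uniform prior on p, every count k is equally likely: 1 / (n + 1).
    like_alt = 1.0 / (n + 1)
    weighted_null = prior_null * like_null
    return weighted_null / (weighted_null + (1 - prior_null) * like_alt)


boys, births = 49581, 98451  # assumed figures from the Wikipedia example
p_value = two_sided_p_value(boys, births)
posterior = posterior_prob_null(boys, births)
print(f"two-sided p-value: {p_value:.4f}")
print(f"posterior P(H0 | data): {posterior:.3f}")
```

With these inputs the p-value comes out around 0.02, so the test rejects at 0.05, while the posterior probability of H0 is about 0.95. Whether that point-mass prior is a reasonable thing to assume is exactly the disagreement above.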

> Preferring simpler models is not exactly what's going on in Lindley's paradox,

Exactly! I'm not sure what is going on in Lindley's paradox to be honest; I don't understand the controversy here: the question poses a very strange prior that seems designed to look perfectly reasonable to a frequentist but not to a Bayesian. But I suppose this is an important point about the way priors can fool you!

> at least not the way that most people talk about Bayes factors preferring simpler models (e.g. by reference to the k*ln(n) term in the Bayesian Information Criterion). The BIC is based on an asymptotic equivalence and drops a constant term.

I'm with you so far, and BIC is a good asymptotic result, but I'm talking about the full solution here (which is rarely practical), which doesn't drop the constant term.

> That constant term is actually what is primarily responsible for Lindley's paradox, and has only an indirect relationship to the complexity of the model.

I mean I think we're splitting hairs here? Maybe? My point was that Bayesian model selection won't make up for a strange prior, but given the right priors, Bayesian model selection just makes sense to me. But again, this is the important limitation of most Bayesian analyses: the prior can do strange things, especially the one used in the Lindley's paradox example in the Wikipedia page.

But honestly, if you think I'm missing some important part of Lindley's paradox, please do elaborate. I had not heard of it before you mentioned it, and I am still confused as to why it is considered something "deep", but I assume that just means I am missing something important.


Limiting the number of models to choose from is not that useful from a practical point of view. For instance, I am yet to come across any practical use of the Vapnik–Chervonenkis dimension.


If you want to prove a result (for example the existence of a unique minimum-variance estimator) for a class of models, but can only prove it by restricting the class, then restricting the class is useful for that purpose.

It may not be useful in applications, other than if you want assurances provided by results that have been proven about the class of model you're using.


> [1] Except empirical Bayes, which is a dark art.

As a not-an-expert-in-stats, why would you say that? Is not empirical Bayes basically the same thing, but with priors stemming from the known data?


> Is not empirical Bayes basically the same thing, but with priors stemming from the known data?

Yes. The problem is that either you end up using the data twice (once in the prior and once in the likelihood) or you have to choose how to split the data between the prior and likelihood, which can lead to other problems (particularly if you want to compare different models).


Thanks!


Just a note that you can interpret regularization as placing a prior on weights. L2 regularization is a Gaussian prior, and L1 is a Laplacian prior. I.e. this is doing Bayesian statistics rather than an arbitrary hack to improve predictions.
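A minimal one-weight sketch of the L2 case, assuming Gaussian noise with standard deviation sigma and a zero-mean Gaussian prior with standard deviation tau on the weight. Under those assumptions the ridge penalty is lam = sigma^2 / tau^2. The data and function names are made up for illustration, and the grid search is only there to check the closed form numerically.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def neg_log_posterior(w, xs, ys, sigma, tau):
    """Negative log posterior (up to a constant): Gaussian noise, Gaussian prior on w."""
    neg_log_likelihood = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    neg_log_prior = w ** 2 / (2 * tau ** 2)
    return neg_log_likelihood + neg_log_prior


xs = [1.0, 2.0, 3.0, 4.0]  # toy data, not from any real source
ys = [2.1, 3.9, 6.2, 7.8]
sigma, tau = 1.0, 0.5
lam = sigma ** 2 / tau ** 2  # the penalty implied by the prior

w_ridge = ridge_slope(xs, ys, lam)

# Brute-force MAP estimate on a fine grid of candidate weights.
grid = [i / 10000 for i in range(-30000, 30001)]
w_map = min(grid, key=lambda w: neg_log_posterior(w, xs, ys, sigma, tau))

print(f"ridge estimate: {w_ridge:.4f}")
print(f"MAP estimate:   {w_map:.4f}")
```

The two estimates agree up to the grid resolution. Swapping the squared prior term for an absolute value term (a Laplace prior) would give the L1 penalty in the same way.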

Elements of Statistical Learning is firmly in the frequentist world from what I recall, so this might not be discussed in that book.


This is discussed in Chapter 1 (or maybe 2), I think, which suggests to me that the author should probably read a little bit more of it.

Mind you, it's a wonderful book, and I recommend that people should just read it in general (you may not be able to do very many of the exercises, but it's still worth it).


Additionally, when he rails against introducing bias to improve generalization, I think of some parts of statistical learning theory: expected risk can be viewed as empirical risk (fit) and model complexity (lack of bias).


Yeah, Kevin Murphy's book covers this in great detail.

It was a very satisfactory revelation.


The author makes it sound like statistics is this grand beautiful mathematical edifice and ML is just a bunch of number crunching with computers. That contrast is just unfair; a huge portion of stats is just made up of hacks and cookbook recipes. Statistics has probably done more damage to the world than any other discipline, by giving a sheen of respectability to fake science in fields like nutrition, psychology, economics, and medicine.

I'm particularly annoyed by the implication that statisticians have a better understanding of the issue of overfitting ("why having p >> n means you can't do linear regression"). Vast segments of the scientific literature fall victim to a mistake that's fundamentally equivalent to overfitting, and the statisticians either didn't understand the mistake, or liked their cushy jobs too much to yell loudly about the problem. This is why we have fields where half of the published research findings are wrong.


> Statistics has probably done more damage to the world than any other discipline, by giving a sheen of respectability to fake science in fields like nutrition, psychology, economics, and medicine.

This seems really unfair. You can misuse statistics, but it's an extremely powerful tool when it's properly used and understood. Most powerful tools can be misused - you can write terrible code and (try to) publish bad mathematics too. But much of modern science would be intractable without statistics; including physics, chemistry, biology and applied math, because we'd be otherwise unable to draw reasonable conclusions from anything less than a totality of data.

As someone with a graduate education in probability and statistics, I think it's fair to lay some of the blame for the reproducibility crisis at the feet of statisticians because of poor education. Statisticians should accept at least some responsibility if their students in non-math majors graduate without understanding the material, for sure.

But that being said, it should definitely be noted that actual statisticians have been talking about this crisis for decades. Statisticians have basically always known that there's nothing magical about the 95% significance threshold p <= 0.05, for example. And for the most part, it's not statisticians who are causing the bad science to occur. Rather it's a problem of non-statisticians using statistics without (qualified) peer review that they can't be expected to do correctly if it's not their core competency.

In my opinion it's something of a philosophical problem - many fields and journals are only realizing now that it's unreasonable to expect e.g. a professional psychologist to also be an expert statistician. Having a dedicated statistician - instead of another psychologist who hasn't reviewed the material since their upper undergrad course - is a giant leap forward in catching bad stats in new research.


> Statistics has probably done more damage to the world than any other discipline, by giving a sheen of respectability to fake science in fields like nutrition, psychology, economics, and medicine.

I think it's unfair to represent the class of people that misrepresent their findings (charlatans and liars) as a problem with statistics. I'd blame that on poor understanding of statistics and the publish or perish mindset of academia.

> the statisticians either didn't understand the mistake, or liked their cushy jobs too much to yell loudly about the problem

You're obviously not someone that considers themselves a statistician. I do, and we have been basically telling everyone that would listen that there are huge fundamental issues with the way many scientists hinge their whole careers on p-values and similar things. Whether that message has been properly received is another story. The American Statistical Association has even published multiple official statements cautioning against the use of p-values, the 0.05 cutoff, and using a single quantity to assess the impact and validity of anything.

See [1] The ASA's Statement on p-Values: Context, Process, and Purpose and [2] Moving to a World Beyond "p < 0.05".

[1] https://amstat.tandfonline.com/doi/full/10.1080/00031305.201...

[2] https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...


If this topic interested you, it may also be worth reading Leo Breiman’s “Statistical Modeling: The Two Cultures” from 2001. https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...


The author appears to misunderstand the main difference between statistics and ML. Let me cite him:

> my gut reaction is to barf when someone says “teaching the model” instead of “estimating the parameters.”

Typical statistics work is to use a known good model and estimate its parameters. Typical machine learning work is to think back from what task you want it to learn and then design a model that has a suitable structure for learning it.

For statistics, the parameters are your bread and butter. For machine learning, they are the afterthought to be automated away with lots of GPU power.

A well-designed ML model can have competitive performance with randomly initialized parameters, because the structure is far more important than the parameters. In statistics, random parameters are usually worthless.


Isn't that basically what they said?

> The main difference between machine learning and statistics is what I’d call “β-hat versus y-hat.” (I’ve also heard it described as inference versus prediction.) Basically, academia cares a lot about what the estimated parameters look like (β-hat), and machine learning cares more about being able to estimate a dependent variable given some inputs (y-hat).


I tried to argue about the different ways a prediction is made in applied statistics versus applied ML.

Your argument here is more between theoretical statistics and applied statistics, or similarly between theoretical and applied machine learning.


"Typical statistics work is to use a known good model and estimate its parameters. [...] For statistics, the parameters are your bread and butter" Ever heard of non-parametric statistics?

"For machine learning, they are the afterthought to be automated away with lots of GPU power." You seem to reduce statistics to undergraduate statistics and machine learning to Deep Learning.

"A well-designed ML model can have competitive performance with randomly initialized parameters, because the structure is far more important than the parameters. In statistics, random parameters are usually worthless." This is blatantly false see Frankle & Carbin, 2019 on the lottery ticket hypothesis.


Yes, I have reduced both statistics and ML to the subsets that are usually used when working in the field, because the blog post was about employment options.

I would wager that people doing non-parametric statistics are both very rare and most likely advertise themselves as machine learning experts, not as statisticians.

As for the random network, I was referring to https://arxiv.org/abs/1911.13299 and I have seen similar effects in my own work where a new architecture was performing significantly better before training than the old one was after training.

If you want a generally agreed upon example, it'd be conv nets with a cost volume for optical flow. What the conv nets do is implement a glorified hashing function for a block of pixels. That'll work almost equally well with random parameters. As a result, PWC-Net already has strong performance before you even start training it.


>As for the random network, I was referring to https://arxiv.org/abs/1911.13299 and I have seen similar effects in my own work where a new architecture was performing significantly better before training than the old one was after training.

The fact that a dense neural network with 20M parameters performs equally well as a model with 20M random values and 20M _bits_ worth of parameters means nothing more than that the parameter space is ridiculously large.

The only models that perform well given random parameters are those that are sufficiently restrictive. Like weather forecasts, where perturbations of the initial conditions give a distribution of possible outcomes. Machine learning models are almost never restrictive.


Of course, I agree with you that the parameter space is ridiculously large. But sadly, that's what people do in practice. And with 20 million parameters, their example is still small in comparison to GPT-3 with 175 billion parameters.

I disagree with you on the restrictive part. Those ML models that are inspired by biology tend to be restrictive, the same way that the development of mammal brains is assumed to be restricted by genetically determined structure. Pretty much all SOTA optical flow algorithms are restricted in what they can learn. And those restrictions are what makes convergence and unsupervised learning possible, because the problem by itself is very ill posed.


Non-parametric statistics blurs the lines a bit, prequential statistics (à la Dawid) blurs the lines even more, but he is not wrong. A traditional statistician will be excited about a method because it can recover the parameter (be it finite dimensional, or infinite dimensional). On the other hand an ML person will be excited about a method because, even if the method sucks at recovering the parameters, it does well on the prediction task (if it can be shown that it approaches the theoretical limit of the best that one can do, no matter what the distribution of the data, and it can do so with efficient use of compute power, ... that would be the holy grail).


It seems like the 'scientist' part of 'Data scientist' might cause this sort of misunderstanding.

There's a lot more 'engineering' and fiddling going on than any type of 'science-y' stuff it seems.


At one point, the word "scientist" in "data scientist" was used to distinguish between people who took the time to develop domain expertise from statistical consultants who applied standard methodologies without reference to what the data was or where it came from.


'science-y' stuff usually doesn't have 'science' in the name.


This is one of the clearest explanations I've read on the difference between traditional Statistics and Machine Learning.


A few years ago, Michael I. Jordan did an AMA on Reddit and discussed this distinction as well. Maybe you'll find it interesting as a counterpoint [1].

[1] https://old.reddit.com/r/MachineLearning/comments/2fxi6v/ama...


Perfect link. He’s the ideal commentator.


> Traditionally, it’s a cardinal sin in academia to use parameters like these because you can’t say anything interesting about the parameters, but the trick in machine learning is that you don’t need to say anything about the parameters. In machine learning, your focus is on describing y-hat, not β-hat.

This kind of philosophy will cause future generations to see machine learning as something worse than a fad, almost as something in between a fad and crank science. If this encapsulates how all (generally speaking) machine learning operates then we will enter big trouble, if we have not already.

> In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works.

This has moved from Cargo Cult Science into numeromancy. It's leveraging the occult (=hidden, incomprehensible parameters) for predicting the future. Because there exist no first principles, nothing can be further interpreted. Only more of the occult can be leveraged in order to make more predictions not amenable to interpretation, which will in turn require MORE occult to make MORE inscrutable predictions, until the heat death of the universe....

And appealing to 80's AI (neural networks) as precedent further harms the author's case. If ML goes the way AI neural network technology went, then this whole rigmarole will go tits up by precedent as well.


I think essentially the opposite is true, and academia's notion of cardinal sins have held back the analysis of data by decades. And I say this as an academic.

You make predictions all the time, and you don't know how. You don't know how you walk, how you drive a car, how you know the rules of English grammar. Your mind is a black box. And yet, you can do things that are still beyond the power of what we can understand using first principles. In many domains, first principles have achieved nothing. Years and years of effort by some of the smartest people in the world, and we pushed the ball one inch down the football field.

Supervised machine learning asks the question "To what extent can we predict this outcome, given these predictors?" This is a perfectly valid question, one that we can try to answer given enough data. And we can do it under the exact same conditions we can do standard statistical inference.

Ultimately, we want to answer "why" questions. But sometimes we can't even answer "what" questions. Most data is so complex that it's hard to say what the data even says, let alone why it says it. We could have been using what as a stepping stone to why, but our own provincialism as statisticians prevented us. I hope now we are learning to do better.


How does one interpret what a black box means? In truth, any black box must remain beyond interpretation as a consequence of the fact that Verificationalism failed.

There exist hidden verificationalist presuppositions embedded in your conclusions. If verificationalism doesn't hold, then by-first-principles stands as the best approach.


> This kind of philosophy will cause future generations to see machine learning as something worse than a fad

Why do you think so? Parameters are a piece of fiction, no one has actually seen them. Sometimes they are a useful piece of fiction, but not a very falsifiable notion.

Prediction accuracy, well, that I can definitely measure without resorting to pieces of fiction that are epistemologically unknowable.


I agree from the perspective that often complex models that cannot be interpreted are created because Neural Networks sound fancy, when much simpler models could be used instead.

But on the other hand complex models such as Neural Networks are being used for very wide datasets with incredible amounts of parameters, so that understanding a single parameter's contribution is not useful in the real world.

For example, using mouse movement data to classify users into age groups. Knowing that some movement vector adds 0.000067 to the probability of a user being in the 16-25 age group, and so being allowed to watch a movie rated 16+, is not very useful.


> But on the other hand complex models such as Neural Networks are being used for very wide datasets with incredible amounts of parameters, so that understanding a single parameter's contribution is not useful in the real world.

In the near future AI and statisticians will need to cooperate by:

1. Finding means for extracting rules/general principles from neural networks.

2. Creating new fields of statistics that can handle multi-dimensional data sets, such as by representing them as small-dimensional datasets that interact (?), which converges to the same multidimensional model when the modeler sutures these interactions (?) together according to a topological structure.

We know that we can do this because somehow human beings reason about complex systems successfully.


There’s a whole field around 1 called explainable AI. Interestingly, one of the SOTA techniques, SHAP values, comes from economics (game theory).
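To give a flavour of the game-theory idea, here is a sketch that computes exact Shapley values for a tiny made-up model by averaging each feature's marginal contribution over every ordering. This is the textbook definition on a toy function, not the API of any explainability library, and "absent" features are assumed to sit at a baseline value of 0.

```python
from itertools import permutations


def toy_model(x1, x2, x3):
    """Hypothetical model with one interaction term, used only for illustration."""
    return 2 * x1 + 3 * x2 + x1 * x3


def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature at its baseline
        previous_output = f(*current)
        for i in order:
            current[i] = x[i]  # switch feature i on
            output = f(*current)
            totals[i] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(*x) - toy_model(*baseline))
```

The attributions add up to the difference between the model output at x and at the baseline, and the interaction term x1 * x3 gets split evenly between the two features involved. Exact enumeration is only feasible for a handful of features, which is why practical tools rely on approximations.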

Regarding the article, I think it was a good read, but as a data scientist on a research team at Amazon, there’s a reason our interviews have shifted from stats-heavy to more CS-heavy: the ability to actually implement and maintain analytic products is just more useful. (Note, this isn’t across the board - we still hire PhD-level candidates to do research).


>human beings reason about complex systems successfully

Have you SEEN economics?


It's conceivable that human minds are not perfectly modeled by software neural networks.

If neural networks are not perfect models of our minds, then your last paragraph does not hold.


Have you considered that useful predictive models for reality are simply irreducible to being comprehensible by swollen savannah monkey brains? It's not magic. It simply cannot both be explained to a human and be worth anything.


By the Good Regulator Theorem, if models of reality were incapable of being made comprehensible, then we would have gone extinct. What we need to figure out then becomes how to externalize what we know unconsciously.


Sounds interesting, but I'm not sure if I understand.

From Wikipedia: It is stated that "every good regulator of a system must be a model of that system".

How is that related to comprehensibility? Or going extinct?

A lot of smart people seem to be surprised that anything can be comprehended at all:

https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

The lack of effectiveness of mathematics in describing intelligence seems unsurprising to me.


Can a swollen savannah monkey brain drive a car safely on public roads, using nothing but its attached eyes and ears? Yes (most of the time). Can AI do it? Not yet...


Sure you can drive, but can you explain in detail HOW you do it? Probably not, and that's why a machine cannot do it yet (until it learns by itself, as we all did).


Not an AI in a publicly available release that can be obtained by Joe consumer.

Supposedly the development versions we the general public can’t get yet are getting pretty good and would meet your “most of the time” bar on public roads.

But we don’t really know how good AI is at any given moment, because those that are leading in any AI area sometimes have reasons for not telling yet. Or because we don’t believe them when they do tell us.


Well, statistics is for those who understood that showing your work on an exam question was the whole point of the exam. ML is for those who just wrote down the final answer and dismissed showing their work as a waste of time. There's no need to understand the 'why' if you already know the 'what'.


I love this and I'm stealing this but I think you maybe didn't take it far enough. To extend your analogy,

Statistics is for places where making meaning of or taking action on the answer must (for practical or normative reasons) include the process

Machine learning is for places where meaning or action can be done without regard for the process.


So, you're saying ML is for situations in which type II errors are considered irrelevant?


Given that every semester I have at least one student who can't conceptualize type II errors and at least one who just can't accept type I error... I don't even know anymore.


This is a really nice write-up, much better than yet-another-skin-deep-sklearn-tutorial. Skimming some other posts of the author, his domain understanding looks quite impressive to me.

(Judging his writing as an ex-academic econometrician Data Scientist, about to be rebranded to Machine Learning Engineer by his megacorp employer, the author appears to have more insight in the field than many a PhD professional Data Scientist.)


It is basically the standard take of a statistician who tries to understand machine learning. "It's yhat, rather than betahat" is a common slogan.


Data science always seemed to me to be a profoundly boring job. Can anyone shed some light on what you find the most fascinating about it?


I can tell a story. I used to work for a HVAC installation company, pretty small in terms of staff but we subcontracted a lot. Initially brought on as a mechanical engineering intern, but moved on to sales engineering when I found an interesting statistical relationship.

A large factor in quotes to clients was the underlying cost of air conditioning equipment in our niche, and often a game of sales intel was played between suppliers and competing contractors (like us) for a given job site. Favorites were picked, and we could get royally screwed in a quote, losing the sale to the end-customer.

Fortunately, we had years of purchasing information. It turns out that as varied as air conditioners are across brands and technical dimensions, when you have years of accounts' line items and unused quotes, you don't get a dimensionality issue. Since we operated in a clear-cut niche, this was especially true. We could forecast, within a margin of error of two per cent, exactly what any of our suppliers would quote us (or our competitors!) for a job long before they could turn it around. Huge strategic advantage.

This was the watershed moment for me when I realized even basic multiple linear regression was a scarily powerful tool when used correctly.
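
For anyone curious what "basic multiple linear regression" looks like in code, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the two features (cooling capacity and an efficiency rating), the coefficients, and the data are all synthetic, and none of it comes from the actual purchasing records described above.

```python
import random

def fit_ols(rows, targets):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Synthetic "historical quotes" (hypothetical pricing rule plus noise).
random.seed(0)
true_coefs = [400.0, 150.0, 80.0]  # intercept, per kW, per efficiency point
rows, prices = [], []
for _ in range(500):
    capacity_kw = random.uniform(2.0, 20.0)
    efficiency = random.uniform(3.0, 6.0)
    x = [1.0, capacity_kw, efficiency]
    rows.append(x)
    noise = random.gauss(0.0, 25.0)
    prices.append(sum(c * v for c, v in zip(true_coefs, x)) + noise)

coefs = fit_ols(rows, prices)

def predict_quote(capacity_kw, efficiency):
    """Forecast a supplier quote from the fitted coefficients."""
    return coefs[0] + coefs[1] * capacity_kw + coefs[2] * efficiency

print(f"Forecast for a 10 kW unit: {predict_quote(10.0, 4.5):.0f}")
```

In practice you would reach for `numpy.linalg.lstsq` or a stats package rather than a hand-rolled solver; the point is only that the whole technique fits in a few dozen lines.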


That is cool when you put it like that. Uncovering hidden relationships that are useful sounds romantic. Thanks for posting


And incredibly boring. The usual estimate is that data science is 80% data wrangling: finding, collecting, and cleaning up data. The term "data scientist" replaced "data miner", because miners are looking for gold. Scientists are obsessed with finding out the nature of reality, gold or mud. They will do seriously boring stuff to set things up so that reality is revealed.


It is only boring if you do it the boring way.

If the data cleaning follows standard patterns, you should already have scripts to offload that kind of work to. If not, then there are some incredibly interesting decisions hidden underneath. Like in text: Should character casing be preserved? What should be the unit of representation (word/character)? How should data be filtered: quality vs quantity trade-off?

All of those are non-trivial questions which involve a lot of thought to reason through. You are correct that the modelling is only a small part of DS's day to day job.

But the rest of it is boring in the same way that coding is boring. It doesn't involve grand epiphanies or discoveries, but there is joy similar to the daily grind of "code -> get bug / violate constraints -> follow trace/problem -> figure out a sensible solution" that a lot of software engineers love.
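
To make the text-cleaning decisions above concrete, here is a small sketch of how casing and unit of representation change what a model gets to see. The `tokenize` helper and the sample sentence are made up for illustration, not taken from any particular library.

```python
def tokenize(text, unit="word", preserve_case=False):
    """Split text into words or characters, optionally lowercasing first."""
    if not preserve_case:
        text = text.lower()
    if unit == "word":
        return text.split()
    if unit == "char":
        return list(text)
    raise ValueError(f"unknown unit: {unit!r}")

sample = "Apple sued apple growers"

# Folding case merges the company and the fruit into one token.
word_vocab = set(tokenize(sample))
# Preserving case keeps them apart, at the cost of a larger vocabulary.
cased_vocab = set(tokenize(sample, preserve_case=True))
# Character units give a tiny vocabulary but much longer sequences.
char_vocab = set(tokenize(sample, unit="char"))

print(len(word_vocab), len(cased_vocab), len(char_vocab))
```

Neither choice is right in general: whether "Apple" and "apple" should collapse depends on the task, which is exactly why the decision is interesting.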


Love the article. It inspired me to make a follow-up note on one of the memes: https://win-vector.com/2020/07/03/data-science-is-not-statis...


From the article: " In statistics, bad results can be wrong, and being right for bad reasons isn’t acceptable. In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works."


Is that a typo? It makes more sense as "good results can be wrong, and being right for bad reasons isn't acceptable"


Off topic, but if someone uses "gut reaction" and "barf" in the same sentence, I'm tempted to think they really mean it literally...


There is a big difference between ML practitioners and professional statisticians. The former are commonly unaware[1] of a rich set of statistical biases and ways to tackle or mitigate them.

[1] https://towardsdatascience.com/survey-d4f168791e57


Can someone elaborate on what is meant by 'estimating a parameter with a natural experiment'? This seems to be the key difference but I don't quite get how this would work. What would be your input data and how would the process differ from an ML approach?


A natural experiment is an experiment (an AB-test if you will) that occurs by chance rather than conscious design. For example, two neighboring countries contemplate banning smoking in restaurants, but in one the bill fails with 49% of the vote, in the other the ban goes through with 51% of the vote. It's not perfect, but you could argue that these countries can now be used to estimate the effect of a smoking ban on mortality and health in a way that is almost as good (but not quite as good) as a randomized clinical trial, whereas you can't just compare two arbitrary countries with differential rates of smoking, because they might be different on so many other counts as well and there is no pre-intervention data to serve as a baseline.
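
One common way to turn a setup like this into a number is a difference-in-differences estimate. That choice of estimator is my assumption, not something the article prescribes. A minimal sketch, with mortality figures that are entirely invented:

```python
# Hypothetical mortality rates per 100,000 people; all numbers are invented.
mortality = {
    ("ban", "before"): 820.0,
    ("ban", "after"): 790.0,
    ("no_ban", "before"): 835.0,
    ("no_ban", "after"): 825.0,
}

def diff_in_diff(rates):
    """Change in the country with the ban minus change in the one without."""
    treated_change = rates[("ban", "after")] - rates[("ban", "before")]
    control_change = rates[("no_ban", "after")] - rates[("no_ban", "before")]
    return treated_change - control_change

effect = diff_in_diff(mortality)
print(f"Estimated effect of the ban: {effect:+.1f} deaths per 100,000")
```

The input data is just before/after outcomes for the two groups, and the output is a single interpretable parameter (the effect of the ban) rather than a per-person prediction.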

More broadly, ML does not really answer questions like "was this death caused by smoking, or rather by a hundred other things associated with smoking like lower income and bad health insurance?", though it is excellent at predicting who is likely to die prematurely. So it's great for prediction, but not so great if you want to learn more about the underlying structure of a phenomenon.

Statisticians are sometimes surprised to see so much interest in machine learning given that its view of the world is not open to inspection (though there's https://github.com/interpretml/interpret I guess) so we as humans learn nothing, but it turns out that in many cases we really don't care all that much about the underlying mechanism, as long as we can make accurate predictions.


A pox on both their houses.

I kinda want to ban this stuff for economies like ours. Think about it, we have many entrenched inefficient separate actors all engaging in nonsense alchemy. Surely this ruins the convergence to economic equilibrium.


Well, if you look at machine learning from the point of view of data science it's inevitable to be confused about its relation to statistics, but machine learning is a sub-field of AI and statistical techniques are only one tool in its toolbox. Statistical techniques have dominated the field in recent-ish years but much work in machine learning has historically used e.g. Probabilistic Graphical Models or symbolic logic as the "model" language. For example, one of the most famous and well-studied classes of machine learning algorithms, decision tree learners, comprises algorithms and systems that learn propositional logic models, rather than statistical models.

Tom Mitchell defined machine learning as "the study of computer algorithms that improve automatically through experience"[1]. This definition does not rely on any particular technique, other of course than the use of a computer. Even the nature of "experience" doesn't necessarily need to mean "data" in the way that data scientists mean "data"- for example, "experience" could be collected by an agent interacting with its environment, etc.

Unfortunately, in very recent years, since the big success of Convolutional Neural Networks in image classification tasks in 2012, interest in machine learning has shifted from AI research to ... well, let me quote the article:

>> Or you can start reading TESL and try to get some of that sweet, sweet machine learning dough from impressionable venture capitalists who hand out money like it’s candy to anyone who can type a few lines of code.

I suppose that's ironic. But the truth is that "machine learning" has very much lost its meaning as industry and academia are flooded by thousands of new entrants that do not know its history and do not understand its goals. In that context, it makes sense to have questions along the lines of "what is the difference between statistics and machine learning", which otherwise have a very obvious answer.

___________

[1] https://www.cs.cmu.edu/~tom/mlbook.html

The excerpt I quote is an informal definition. The wikipedia article on machine learning has a more formal definition:

https://en.wikipedia.org/wiki/Machine_learning#History_and_r...


Tom Mitchell's book is still a great book for understanding what machine learning is about.


this reads like something 5 or 6 years old


The author's pie chart showing data science to be 60% data manipulation is accurate. The biggest gap between good and bad data scientists is their comfort level with data wrangling. When interviewing candidates for data science positions, one of the simplest questions is to have them sort a 1 GB tab-delimited file.

1. Poor candidates will try to open the file in Excel.

2a. Marginal candidates will use R or Stata.

2b. Okay candidates will use a scripting language like Python.

3. Good candidates will use Unix sort.

To my knowledge, there are no university courses teaching the Unix toolchain and it remains very much a skill learned through practice.


Not sure why you think a candidate who uses R is inferior to one who uses Python?

Also, a really good candidate should use the right tool for the job, so if you expect them to use Unix sort you should somehow imply a situation where that is the best approach.


> if you expect them to use Unix sort you should somehow imply a situation where that is the best approach.

I think the implied question is whether the interviewee is aware of the fact that trying to load a 1 GB text file could use up too much RAM space of the system. Unix sort is arguably the most memory efficient among the 4 choices there. It depends on the amount of available resources (which was not specified), and some companies might be willing to casually let people use 100 GB machines, though.


That is a fair point. A good script would be to move towards this scenario of more data than RAM and see what the candidate comes up with.


I would also add: if it’s a one-time thing, I would just do it in Visual Studio Code or any other editor that doesn’t die on a 1 GB file. And I have 12+ years of experience with bash and Unix tools, so it’s not about a lack of knowledge or experience. There isn’t anything magical about “sort” versus another tool; if there isn’t a need for automation, they are equivalent.


Demonstrating again that interviewing is mostly about confirming the prejudices of the interviewer.


In my experience the real meat of data manipulation/wrangling is in extracting statistical value from the data. If the data is noisy, what aggregations and what filters produce the best signal without smudging the underlying statistical properties?

While this is a good skill to have and is a good sign of efficiency, for me the deal breaker is how well the purpose of data transformations is understood and the care with which statistical value is extracted and tested.


This may doom me as a merely okay candidate, but a simple paging strategy in Python trivializes the problem and has the added benefit of probably being the "okay-est" tool for whatever the next transform is as well.
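
Roughly what I have in mind, as a sketch rather than production code: sort fixed-size chunks in memory, spill each to a temp file, then stream a k-way merge with `heapq.merge`. The function name, the chunk size, and the choice to compare the key column as plain strings are all my own assumptions.

```python
import heapq
import os
import tempfile

def sort_tsv(src, dst, column=0, chunk_lines=100_000):
    """Sort a tab-delimited file by one column without loading it all at once."""
    def key(line):
        return line.rstrip("\n").split("\t")[column]

    chunk_paths = []

    def flush(lines):
        # Sort one chunk in memory and spill it to its own temp file.
        lines.sort(key=key)
        fd, path = tempfile.mkstemp(suffix=".tsv", text=True)
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(lines)
        chunk_paths.append(path)

    with open(src) as f:
        chunk = []
        for line in f:
            if not line.endswith("\n"):
                line += "\n"  # normalize a final line with no newline
            chunk.append(line)
            if len(chunk) >= chunk_lines:
                flush(chunk)
                chunk = []
        if chunk:
            flush(chunk)

    # Merge the sorted chunks lazily; only one line per chunk is held at a time.
    files = [open(p) for p in chunk_paths]
    try:
        with open(dst, "w") as out:
            out.writelines(heapq.merge(*files, key=key))
    finally:
        for fh in files:
            fh.close()
        for p in chunk_paths:
            os.remove(p)
```

Memory use is bounded by `chunk_lines`, and the number of simultaneously open temp files grows with file size divided by chunk size, so a very large input would need a bigger chunk or a multi-pass merge.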


Why is chunking, with Pandas, a bad choice?


It's not. Welcome to okay-land.



