On moving from statistics to machine learning, the final stage of grief (2019) | 175 points by yoloswagins on July 3, 2020 | 103 comments

 There's this undertone of "I should be paid as much as or more than Data Science people because I'm better than them at statistics, and data science = machine learning = statistics."

My experience doing data science at small companies that can't afford to hire more than one person for the role is that it is so much more than just building models or doing statistics. You have to:

1. Build APIs and work with developers to get predictive models integrated into the rest of the software stack.

2. Know how to add logging, auditing, monitoring, containerizing, web scrapers, data cleaning(!!), SQL scripts, dashboards, BI tools, etc.

3. Do some basic descriptive stats, some basic inferential stats, some predictive modeling, work on time-series data, sometimes apply survival analysis, etc. (Python/R/Excel, who cares.)

4. Set up data pipelines and CI/CD to automate all this crap.

5. Unpack vague high-level requirements along the lines of "Hey, do you think we could use our data to build an 'AI' to do this instead of doing it manually?" and then come up with a combination of software and statistical models that performs at least as well as or better than humans at the task.

6. Work with non-technical business users and be able to translate this back into technical requirements.

Hey, if all you do all day is "build models" then that sounds like a very cushy DS job you have. It's definitely not been my experience. I would describe it more like a combination of software engineering, statistics, and business analysis. That's why it pays higher than just statistics. But this is just my experience.
 The author starts with:

> The data science world may reject me and my lack of both experience and a credential above a bachelors degree

More likely the data science world will reject him because he is so confident about a field he has so little experience or knowledge of. Data scientist is a profession rather than the name of an academic field, so data scientists' job is to solve practical problems. That involves a lot more than class assignments, and in some cases involves using machine learning to maximize predictive accuracy (because common ML models like gradient boosting capture interactions and non-linearities in a richer way than the GLM models the author is familiar with).

Their argument, "that's a garbage model because we can't reasonably interpret the underlying parameters," elevates their personal criteria above what is needed to solve some problems. They can blame it on only having a bachelor's degree. But the real problem is the belief that a bachelor's degree taught them everything there is to know, and that those in the DS field are ~ idiots who got lucky enough to be paid more.
 I feel like the blog post should be read in line with how it was probably written: informal, personal and somewhat sarcastic, with a bitter note because he chose one major and now it turns out people value something else instead. Hence the title "the final stage of grief". I did not get the impression that the author thinks machine learning is stupid or that he knows everything about it.
 And in all honesty, no data scientist with an ounce of self-worth should work long-term for such companies, unless it just happens to be their own. You're basically doing three jobs for the price of one: Software Engineer, Data Engineer, Data Scientist. Sure, you'll be a jack of all trades, as far as data goes, but it'll be at the cost of some specialization.

I'm probably gonna get a lot of sh!t for this post - probably from data [x] people that are in that exact position themselves - but the above description is exactly why I'd aim for larger companies with somewhat established analytics / data / ML teams or offices. You get to focus on the important stuff, instead of juggling ten balls at the same time.

(And it's not only in the field of data science. Some of the traditional SE positions I see at startups or small companies look absolutely grotesque - basically the whole IT and Dev department baked into one job.)
 No one with an ounce of self-worth should work long-term for companies that expect them to do exactly what their title implies they should and not a thing more. You're basically being arbitrarily restricted to learning and enjoying exactly one thing when it would often make more sense in context to become involved in customer relations, systems administration, management, software engineering, data science, etc. Sure, you'll become really good at that one thing, but it'll be at the cost of personal growth and job satisfaction.

> I'm probably gonna get a lot of sh!t for this post...

I mean, yeah. You've basically lampooned anybody who enjoys working in ill-defined cross-disciplinary circumstances as having "[not an] ounce of self-worth". It sounds like, from your perspective, your field is "the important stuff" and other fields are just balls to be juggled. There's nothing wrong with that, but lots of people don't think that way. To some people, the important stuff is anything that makes their customers happy. To others, it's anything that helps them learn.

And let's dispense with the notion that "doing 3 jobs for the price of one" is an accurate description of having broad rather than narrow responsibilities. One comes at the cost of the other. If you're an equally capable specialist and generalist, and you're capable of genuinely performing those 3 jobs at once, then if you were to specialize you'd be performing the work of 3 average specialists, and you'd be in the same boat as before.

Do what you're best at and try to get the best possible compensation for it, monetary or experiential. It's as simple as that.
 > And in all honesty, no data scientist with an ounce of self-worth should work long-term for such companies, unless it just happens to be their own.

You're kind of saying that no one should ever work for a startup. At small companies you have to do many things. As you said, as a software engineer, there aren't dedicated front-end engineers or dev-ops. Marketing teams don't have content vs. growth vs. performance vs. brand vs. email marketers. You might be the first salesperson, which means no sales ops support, no account manager for ongoing relationships, etc.

There are trade-offs, of course! There are trade-offs to anything. Some people value working on many aspects of a company. Some people find understanding more than their narrow field to be interesting and rewarding and, you know, self-worth-y!

So I think it might be helpful to step back and consider that not everyone has the same priorities, experiences, interests, or definitions of happiness and self-worth as you do. And that's okay.
 > You get to focus on the important stuff, instead of juggling ten balls at the same time.

So on one hand you can't build any models without the work of the engineers, but on the other the model building is "the important stuff"? Maybe it's just me, but I enjoy working on all aspects of the data pipeline.
 Model building is often the trivial part, and often you can't build models without a solid understanding of things like the data pipeline.
 Some people enjoy doing full stack DS and get paid well for it.
 And there's the potential risk of being mediocre in many fields and becoming less competitive in each area. My opinion is it really needs to be a field you love (biology, science, etc.) to be worth the effort.
 To be fair, how large should a field be? Specializing in physics as a whole is way too broad. But is specializing in front-end software development too broad? Once you have learned the fundamentals of computer science and its associated fields (networking and systems engineering, mainly), the difference between doing back-end and front-end work is not that large.
 I mostly agree, but the competition in the field will dictate that. Like you stated, if it's features you are after and the buying party won't mind these small details, then maybe they (developers) are interchangeable. But in science, for example, mediocre work will cost a lot if certain minor details are found to be wrong. That's what the whole reproducibility crisis in science is about. The reputation of a whole institute / department might be at risk if the mistakes are exposed.
 Some of what you're describing in 1-4 is Data Engineering. 5-6 exists (in some form) for most software jobs. The general breakdown I give people is:

Data Scientists:

* Get data.
* Clean data (~60% - 70% of time required).
* Research.
* Low level data analysis.
* Building models.

It's mostly the "knowing data" and the modelling.

Data Engineers:

* Data storage
* Data processing
* Automation
* Infrastructure

It's about getting the Data Scientist's output into production / making data easily available to them. This is especially true for big ETL jobs. The more we can automate your ETL jobs, the happier you'll be!
 I talk regularly with business people who have hired data scientists, and 5 and 6 are always the biggest complaint. That, plus new hires are always unprepared to handle how messy real-world data is.
 The messiness of data is something I even see creating a growing rift between academic ML/DS and real world applications.What makes for a nice paper doesn't necessarily make for a model that will survive contact with new user generated data.
 I remember some similar discussions when I worked in logistics consulting for a while. The data source was a mess, like a mess. Data was even included as screenshots of spreadsheets in other spreadsheets.Some math PhD was in charge of that modelling. Without any domain knowledge concerning the data (logistics, consumption, maintenance) and thus unable to properly interpret the raw data to begin with. Most of the time was spent on writing some Python scripts to analyze the raw data, still full of errors. And build predictive models on top of that mess. Kind of formed my view of data science, unfairly so.
 Just like almost all software work is maintenance, almost all data science is data cleaning.
 I liken it to sending a chef to a grocery store. It's not just about being a good cook. Half the battle is in choosing the correct ingredients. Not just a dozen eggs, but free range where the yolks will be a vibrant orange yellow and improve the presentation. The cleanest models with the highest fidelity often fall out as the next obvious transformation of a well groomed and hygienic dataset.
 I hate the fact that you're right. And that you've described my job in a small company.
 > Machine learning is genuinely over-hyped. It’s very often statistics masquerading as something more grandiose. Its most ardent supporters are incredibly annoying, easily hated techbros

This sort of fashionable disparagement of a group of people to signal that you’re not part of the "bad group of tech bros" is so trashy. Why are these random people you easily hate? Who are they? Why take glee in shared hatred?

I’ve worked as a sr DS at FAANG for 4 years. I’ve recently worked through Casella & Berger, because I wasn’t comfortable being one of those DS who didn’t know math stats. But before I did, I worked with people from PhD stat programs who were so ineffective. Despite knowing so much more stats than me, they would freeze up and fail every time they had to deal with any sort of software system or IDE. It was so weird to me that my ability to use a regression, even before I knew the theory, was more valuable than their ability to use a regression to its full power, simply because I could fight the intense battle to take that idea and put it into reliable production code.

But generally I hate, hate this war between DS and stats. It’s so stupid. Maybe not in their first year, but eventually any DS who wants to be a master of their craft ought to learn math/theoretical stats. And some don’t want to be a master of their craft, and instead want to go into management or whatever, and that’s fine.
 > I’m sure you’re asking: “why allow your parameters to be biased?” Good question. The most straightforward answer is that there is a bias-variance trade-off. The Wikipedia article does a good job both illustrating and explaining it. For β-hat purposes, the notion of allowing any bias is crazy. For y-hat purposes, adding a little bias in exchange for a huge reduction in variance can improve the predictive power of your model.

I'm going to push back on this. The author seems to understand the bias-variance tradeoff as applying primarily to y-hat, and allows that if you are primarily interested in y-hat then it can make sense to make that tradeoff (introduce bias in exchange for lower variance). But the bias-variance tradeoff is more general than that. There's also a bias-variance tradeoff in beta-hat, and you can make a similar decision there to introduce some bias in beta-hat in exchange for lower variance, lowering the overall mean square error.

There's nothing crazy about this. The entire field[1] of Bayesian statistics does this every day: Bayesian priors introduce bias in the parameters, with the benefit of decreasing variance. Bayesians use these biased parameter estimates without any problems.

Classical (non-Bayesian) statistics has tended to focus heavily on unbiased models. I suspect this is largely because restricting the class of models you're looking at to unbiased models allows you to prove a lot of interesting results. For example, if you restrict yourself to linear unbiased models, you can identify one single best (i.e. lowest variance) estimator. As soon as you allow bias you can't do that anymore.

[1] Except empirical Bayes, which is a dark art.
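A toy numpy simulation of the beta-hat version of that tradeoff (the design, noise level, and penalty are illustrative values, not from the thread): ridge regression deliberately biases coefficient estimates toward zero, but the drop in variance can leave the overall mean squared error of beta-hat below that of OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 10.0
true_beta = np.ones(p)

ols_draws, ridge_draws = [], []
for _ in range(2000):
    X = rng.normal(size=(n, p))
    y = X @ true_beta + rng.normal(scale=3.0, size=n)
    # OLS: unbiased for beta, but higher variance
    ols_draws.append(np.linalg.solve(X.T @ X, X.T @ y))
    # Ridge: shrunk toward zero (biased), but lower variance
    ridge_draws.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

ols, ridge = np.array(ols_draws), np.array(ridge_draws)

bias_ols = np.abs(ols.mean(axis=0) - true_beta).mean()
bias_ridge = np.abs(ridge.mean(axis=0) - true_beta).mean()
var_ols = ols.var(axis=0).mean()
var_ridge = ridge.var(axis=0).mean()
mse_ols = ((ols - true_beta) ** 2).mean()
mse_ridge = ((ridge - true_beta) ** 2).mean()
# Ridge has more bias, less variance, and (here) lower MSE for beta-hat.
```

With this noise level the bias ridge introduces is more than paid for by the variance reduction; crank the signal-to-noise ratio up and OLS wins again, which is the tradeoff in a nutshell.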
 “Non-Bayesian” stats uses bias in the exact same way that Bayesian stats does, because Bayesian stats and frequentist stats are mathematically equivalent. When someone thinks they’re not using priors, they’re wrong: they’re usually using a flat prior on the model parameters, but that adds bias just like any other prior! A flat prior on theta is different from a flat prior on log(theta) or some other parametrization, and flat priors are oftentimes the wrong choice. So this notion that Bayesian stats is some “special” type of inference, and that there is some way to do inference without bias, is just a very large misconception.
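The reparametrization point is easy to check numerically. A toy grid sketch (made-up data, estimating an exponential rate): a flat prior on theta and a flat prior on log(theta) are both "uninformative", yet they give different posteriors for the same likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=0.5, size=10)   # true rate theta = 2

theta = np.linspace(0.01, 15, 20000)
dt = theta[1] - theta[0]
loglik = len(data) * np.log(theta) - theta * data.sum()  # exponential log-likelihood
lik = np.exp(loglik - loglik.max())

# Flat prior on theta: posterior proportional to the likelihood itself.
post_flat = lik / (lik.sum() * dt)
# Flat prior on log(theta): implies prior density 1/theta on the theta scale.
w = lik / theta
post_log = w / (w.sum() * dt)

mean_flat = (theta * post_flat).sum() * dt
mean_log = (theta * post_log).sum() * dt
# Analytically these are Gamma posteriors with means (n+1)/sum(data)
# and n/sum(data): the two "uninformative" priors disagree.
```

Neither answer is wrong; the point is that "no prior" was never on the menu, only a choice of parametrization to be flat in.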
 > “Non-Bayesian” stats uses bias in the exact same way that Bayesian stats does

This is incorrect. Bias has a very specific mathematical meaning in statistics: the difference between the expected value of the estimate (under the sampling distribution) and the true value. There are many examples of parameter estimates in classical statistics that have zero bias under that definition.

> Bayesian stats and frequentist stats are mathematically equivalent.

Also incorrect. Bayesian and frequentist methods focus on different conditional probabilities and can give very divergent results even in simple cases. See e.g. Lindley's paradox [1].
 Limiting the number of models to choose from is not that useful from a practical point of view. For instance, I have yet to come across any practical use of the Vapnik–Chervonenkis dimension.
 If you want to prove a result (for example the existence of a unique minimum-variance estimator) for a class of models, but can only prove it by restricting the class, then restricting the class is useful for that purpose.It may not be useful in applications, other than if you want assurances provided by results that have been proven about the class of model you're using.
 > [1] Except empirical Bayes, which is a dark art.As a not-an-expert-in-stats, why would you say that? Is not empirical Bayes basically the same thing, but with priors stemming from the known data?
 > Is not empirical Bayes basically the same thing, but with priors stemming from the known data?Yes. The problem is that either you end up using the data twice (once in the prior and once in the likelihood) or you have to choose how to split the data between the prior and likelihood, which can lead to other problems (particularly if you want to compare different models).
 Thanks!
 Just a note that you can interpret regularization as placing a prior on weights. L2 regularization is a Gaussian prior, and L1 is a Laplacian prior. I.e. this is doing Bayesian statistics rather than an arbitrary hack to improve predictions.Elements of Statistical Learning is firmly in the frequentist world from what I recall, so this might not be discussed in that book.
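The L2 case can be checked in a few lines of numpy (a sketch; the noise variance sigma2 and prior variance tau2 are made-up values): the ridge solution with penalty lambda = sigma2/tau2 is exactly the posterior mode under a Gaussian prior on the weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

sigma2, tau2 = 1.0, 0.25     # noise variance, prior variance (assumed)
lam = sigma2 / tau2          # implied ridge penalty

# Ridge / L2-regularized least squares: argmin ||y - Xb||^2 + lam * ||b||^2
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian MAP: posterior mode under b ~ N(0, tau2 I), y | b ~ N(Xb, sigma2 I).
# Negative log posterior is ||y - Xb||^2 / (2 sigma2) + ||b||^2 / (2 tau2).
b_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(p) / tau2, X.T @ y / sigma2)

assert np.allclose(b_ridge, b_map)   # same estimator, two vocabularies
```

So "add an L2 penalty" and "put a N(0, tau2) prior on the weights and take the MAP" are the same computation; the regularization strength just encodes how tight the prior is relative to the noise.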
 This is discussed in Chapter 1 (or maybe 2), I think, which suggests to me that the author should probably read a little bit more of it.Mind you, it's a wonderful book, and I recommend that people should just read it in general (you may not be able to do very many of the exercises, but it's still worth it).
 Additionally, when he rails against introducing bias to improve generalization, I'm reminded of some parts of statistical learning theory: expected risk can be viewed as empirical risk (fit) plus a model complexity term, and introducing bias is one way of keeping that complexity in check.
 Yeah, Kevin Murphy's book covers this in great detail.It was a very satisfactory revelation.
 The author makes it sound like statistics is this grand beautiful mathematical edifice and ML is just a bunch of number crunching with computers. That contrast is just unfair; a huge portion of stats is just made up of hacks and cookbook recipes. Statistics has probably done more damage to the world than any other discipline, by giving a sheen of respectability to fake science in fields like nutrition, psychology, economics, and medicine.

I'm particularly annoyed by the implication that statisticians have a better understanding of the issue of overfitting ("why having p >> n means you can't do linear regression"). Vast segments of the scientific literature fall victim to a mistake that's fundamentally equivalent to overfitting, and the statisticians either didn't understand the mistake, or liked their cushy jobs too much to yell loudly about the problem. This is why we have fields where half of the published research findings are wrong.
 > Statistics has probably done more damage to the world than any other discipline, by giving a sheen of respectability to fake science in fields like nutrition, psychology, economics, and medicine.

I think it's unfair to represent the class of people that misrepresent their findings (charlatans and liars) as a problem with statistics. I'd blame that on poor understanding of statistics and the publish-or-perish mindset of academia.

> the statisticians either didn't understand the mistake, or liked their cushy jobs too much to yell loudly about the problem

You're obviously not someone who considers themselves a statistician. I do, and we have been basically telling everyone that would listen that there are huge fundamental issues with the way many scientists hinge their whole careers on p-values and similar things. Whether that message has been properly received is another story. The American Statistical Association has even published multiple official statements cautioning against the use of p-values, the 0.05 cutoff, and using a single quantity to assess the impact and validity of anything.

See [1] The ASA's Statement on p-Values: Context, Process, and Purpose and [2] Moving to a World Beyond "p < 0.05".
 > Traditionally, it’s a cardinal sin in academia to use parameters like these because you can’t say anything interesting about the parameters, but the trick in machine learning is that you don’t need to say anything about the parameters. In machine learning, your focus is on describing y-hat, not β-hat.

This kind of philosophy will cause future generations to see machine learning as something worse than a fad, almost as something in between a fad and crank science. If this encapsulates how all (generally speaking) machine learning operates, then we will be in big trouble, if we are not already.

> In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works.

This has moved from Cargo Cult Science into numeromancy. It's leveraging the occult (= hidden, incomprehensible parameters) for predicting the future. Because there exist no first principles, nothing can be further interpreted. Only more of the occult can be leveraged in order to make more predictions not amenable to interpretation, which will in turn require MORE occult to make MORE inscrutable predictions, until the heat death of the universe...

And appealing to 80's AI (neural networks) as precedent further harms the author's case. If ML operates like how AI neural network technology went, then this whole rigmarole will go tits up by the same precedent as well.
 I think essentially the opposite is true, and academia's notion of cardinal sins has held back the analysis of data by decades. And I say this as an academic.

You make predictions all the time, and you don't know how. You don't know how you walk, how you drive a car, how you know the rules of English grammar. Your mind is a black box. And yet, you can do things that are still beyond the power of what we can understand using first principles. In many domains, first principles have achieved nothing. Years and years of effort by some of the smartest people in the world, and we pushed the ball one inch down the football field.

Supervised machine learning asks the question "To what extent can we predict this outcome, given these predictors?" This is a perfectly valid question, one that we can try to answer given enough data. And we can do it under the exact same conditions we can do standard statistical inference.

Ultimately, we want to answer "why" questions. But sometimes we can't even answer "what" questions. Most data is so complex that it's hard to say what the data even says, let alone why it says it. We could have been using "what" as a stepping stone to "why", but our own provincialism as statisticians prevented us. I hope now we are learning to do better.
 How does one interpret what a black box means? In truth, any black box must remain beyond interpretation, as a consequence of the fact that verificationism failed. There are hidden verificationist presuppositions embedded in your conclusions. If verificationism doesn't hold, then working from first principles stands as the best approach.
 > This kind of philosophy will cause future generations to see machine learning as something worse than a fad

Why do you think so? Parameters are a piece of fiction; no one has actually seen them. Sometimes they are a useful piece of fiction, but not a very falsifiable notion. Prediction accuracy, well, that I can definitely measure, without resorting to pieces of fiction that are epistemologically unknowable.
 I agree to the extent that complex models that cannot be interpreted are often created just because Neural Networks sound fancy, when much simpler models could be used instead. But on the other hand, complex models such as Neural Networks are being used for very wide datasets with incredible numbers of parameters, where understanding a single parameter's contribution is not useful in the real world.

For example, using mouse movement data to classify users into age groups. Knowing that some movement vector adds 0.000067 to the probability of a user being in the 16-25 age group, and so being allowed to watch a movie rated 16+, is not very useful.
 > But on the other hand complex models such as Neural Networks are being used for very wide datasets with incredible amounts of parameters, so that understanding a single parameter's contribution is not useful in the real world.In the near future AI and statisticians will need to cooperate by:1. Finding means for extracting rules/general principles from neural networks.2. Creating new fields of statistics that can handle multi-dimensional data sets, such as by representing them as small-dimensional datasets that interact (?), which converges to the same multidimensional model when the modeler sutures these interactions (?) together according to a topological structure.We know that we can do this because somehow human beings reason about complex systems successfully.
 There's a whole field around 1 called explainable AI. Interestingly, one of the SOTA techniques, SHAP values, comes from economics (game theory).

Regarding the article, I think it was a good read, but as a data scientist on a research team at Amazon, there's a reason our interviews have shifted from stats-heavy to more CS-heavy: the ability to actually implement and maintain analytic products is just more useful. (Note, this isn't across the board - we still hire PhD-level candidates to do research.)
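For reference, the game-theoretic quantity behind SHAP can be computed exactly for toy models by brute force (a sketch; filling "absent" features from a background point is one common, crude way to define a coalition's value, and all names here are illustrative):

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background, n_features):
    """Exact Shapley values by enumerating all feature subsets."""
    def coalition_value(feats):
        # Features in the coalition take their real values; the rest
        # are filled in from the background point.
        z = background.copy()
        idx = list(feats)
        z[idx] = x[idx]
        return predict(z)

    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                # Shapley weight: |S|! * (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n_features - k - 1) / factorial(n_features)
                phi[i] += w * (coalition_value(subset + (i,)) - coalition_value(subset))
    return phi

# Sanity check on a linear model, where phi_i = w_i * (x_i - background_i)
weights = np.array([1.0, 2.0, -3.0])
x = np.array([1.0, 0.5, 2.0])
bg = np.zeros(3)
phi = shapley_values(lambda z: float(weights @ z), x, bg, 3)
```

The enumeration is exponential in the number of features, which is why real SHAP implementations approximate this sum; but the additivity property (the phi's sum to the prediction minus the background prediction) is exactly what makes it popular for explaining otherwise opaque models.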
 >human beings reason about complex systems successfullyHave you SEEN economics?
 It's conceivable that human minds are not perfectly modeled by software neural networks. If neural networks are not perfect models of our minds, then your last paragraph does not hold.
 Have you considered that useful predictive models for reality are simply irreducible to being comprehensible by swollen savannah monkey brains? It's not magic. It simply cannot both be explained to a human and be worth anything.
 By the Good Regulator Theorem, if models of reality were incapable of being made comprehensible, then we would have gone extinct. What we need to figure out, then, is how to externalize what we know unconsciously.
 Sounds interesting, but I'm not sure I understand. From Wikipedia: it is stated that "every good regulator of a system must be a model of that system". How is that related to comprehensibility? Or going extinct?

A lot of smart people seem to be surprised that anything can be comprehended at all: https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

The lack of effectiveness of mathematics in describing intelligence seems unsurprising to me.
 Can a swollen savannah monkey brain drive a car safely on public roads, using nothing but its attached eyes and ears? Yes (most of the time). Can AI do it? Not yet...
 Sure you can drive, but can you explain in detail HOW you do it? Probably not, and that's why a machine cannot do it yet (until it learns by itself, as we all did).
 Not an AI in a publicly available release that can be obtained by Joe consumer.Supposedly the development versions we the general public can’t get yet are getting pretty good and would meet your “most of the time” bar on public roads.But we don’t really know how good AI is at any given moment, because those that are leading in any AI area sometimes have reasons for not telling yet. Or because we don’t believe them when they do tell us.
 Well, statistics is for those who understood that showing your work on an exam question was the whole point of the exam. ML is for those who just wrote down the final answer and dismissed showing their work as a waste of time. There's no need to understand the 'why' if you already know the 'what'.
 I love this and I'm stealing it, but I think you maybe didn't take it far enough. To extend your analogy: statistics is for places where making meaning of, or taking action on, the answer must (for practical or normative reasons) include the process. Machine learning is for places where meaning or action can be had without regard for the process.
 So, you're saying ML is for situations in which type II errors are considered irrelevant?
 Given that every semester I have at least one student who can't conceptualize type II errors and at least one who just can't accept type I error... I don't even know anymore.
 If this topic interested you, it may also be worth reading Leo Breiman’s “Statistical Modeling: The Two Cultures” from 2001. https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...
 The author appears to misunderstand the main difference between statistics and ML. Let me cite him:

> my gut reaction is to barf when someone says “teaching the model” instead of “estimating the parameters.”

Typical statistics work is to use a known good model and estimate its parameters. Typical machine learning work is to think backward from what task you want the model to learn, and then design a model that has a suitable structure for learning it. For statistics, the parameters are your bread and butter. For machine learning, they are an afterthought to be automated away with lots of GPU power.

A well-designed ML model can have competitive performance with randomly initialized parameters, because the structure is far more important than the parameters. In statistics, random parameters are usually worthless.
 Isn't that basically what they said?

> The main difference between machine learning and statistics is what I’d call “β-hat versus y-hat.” (I’ve also heard it described as inference versus prediction.) Basically, academia cares a lot about what the estimated parameters look like (β-hat), and machine learning cares more about being able to estimate a dependent variable given some inputs (y-hat).
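The distinction is easy to make concrete (a toy numpy sketch with made-up data: the same fitted linear model answers two different questions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.5, 0.0, -2.0])      # true parameters (unknown in practice)
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]
b_hat = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_tr)   # OLS fit

# beta-hat question (inference): what are the parameters, how sure are we?
resid = y_tr - X_tr @ b_hat
s2 = resid @ resid / (len(y_tr) - p)                    # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X_tr.T @ X_tr)))  # standard errors

# y-hat question (prediction): how well does it predict held-out data?
test_mse = np.mean((y_te - X_te @ b_hat) ** 2)
```

Same `b_hat` either way; the statistician stares at `b_hat` and `se`, the ML person stares at `test_mse` and would happily trade some interpretability of `b_hat` for a lower number there.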
 I tried to argue about the different ways a prediction is made in applied statistics versus applied ML. Your argument here is more about the difference between theoretical and applied statistics, or similarly between theoretical and applied machine learning.
 > Typical statistics work is to use a known good model and estimate its parameters. [...] For statistics, the parameters are your bread and butter

Ever heard of non-parametric statistics?

> For machine learning, they are the afterthought to be automated away with lots of GPU power.

You seem to reduce statistics to undergraduate statistics and machine learning to Deep Learning.

> A well-designed ML model can have competitive performance with randomly initialized parameters, because the structure is far more important than the parameters. In statistics, random parameters are usually worthless.

This is blatantly false; see Frankle & Carbin, 2019 on the lottery ticket hypothesis.
 Yes, I have reduced both statistics and ML to the subsets that are usually used when working in the field, because the blog post was about employment options. I would wager that people doing non-parametric statistics are both very rare and most likely advertise themselves as machine learning experts, not as statisticians.

As for the random network, I was referring to https://arxiv.org/abs/1911.13299 and I have seen similar effects in my own work, where a new architecture was performing significantly better before training than the old one was after training.

If you want a generally agreed-upon example, take conv nets with a cost volume for optical flow. What the conv nets do is implement a glorified hashing function for a block of pixels. That works almost equally well with random parameters. As a result, PWC-Net already has strong performance before you even start training it.
 > As for the random network, I was referring to https://arxiv.org/abs/1911.13299 and I have seen similar effects in my own work where a new architecture was performing significantly better before training than the old one was after training.

The fact that a dense neural network with 20M parameters performs equally well as a model with 20M random values and 20M _bits_ worth of parameters means nothing more than that the parameter space is ridiculously large.

The only models that perform well given random parameters are those that are sufficiently restrictive. Like weather forecasts, where perturbations of the initial conditions give a distribution of possible outcomes. Machine learning models are almost never restrictive.
 Of course, I agree with you that the parameter space is ridiculously large. But sadly, that's what people do in practice. And at 20 million parameters, their example is still small in comparison to GPT-3's 175 billion.

I disagree with you on the restrictive part. Those ML models that are inspired by biology tend to be restrictive, the same way that the development of mammal brains is assumed to be restricted by genetically determined structure. Pretty much all SOTA optical flow algorithms are restricted in what they can learn. And those restrictions are what makes convergence and unsupervised learning possible, because the problem is by itself very ill-posed.
 Non-parametric statistics blurs the lines a bit, and prequential statistics (à la Dawid) blurs them even more, but he is not wrong. A traditional statistician will be excited about a method because it can recover the parameter (be it finite-dimensional or infinite-dimensional). An ML person, on the other hand, will be excited about a method because, even if it is bad at recovering the parameters, it does well on the prediction task. (If it can be shown to approach the theoretical limit of the best one can do, no matter what the distribution of the data, and to do so with efficient use of compute power, that would be the holy grail.)
 This is one of the clearest explanations I've read on the difference between traditional Statistics and Machine Learning.
 A few years ago, Michael I. Jordan did an AMA on Reddit and discussed this distinction as well. Maybe you'll find it interesting as a counterpoint [1].
 Perfect link. He’s the ideal commentator.
 It seems like the 'scientist' part of 'data scientist' might cause this sort of misunderstanding. There's a lot more 'engineering' and fiddling going on than any 'science-y' stuff, it seems.
 At one point, the word "scientist" in "data scientist" was used to distinguish between people who took the time to develop domain expertise from statistical consultants who applied standard methodologies without reference to what the data was or where it came from.
 'science-y' stuff usually doesn't have 'science' in the name.
 This is a really nice write-up, much better than yet another skin-deep sklearn tutorial. Skimming some other posts of the author, his domain understanding looks quite impressive to me.

(Judging his writing as an ex-academic econometrician Data Scientist, about to be rebranded to Machine Learning Engineer by his megacorp employer: the author appears to have more insight into the field than many a PhD professional Data Scientist.)
 It is basically the standard take of a statistician who tries to understand machine learning. "It's yhat, rather than betahat" is a common slogan.
 Data science always seemed to me to be a profoundly boring job. Can anyone shed some light on what you find the most fascinating about it?
 I can tell a story. I used to work for an HVAC installation company, pretty small in terms of staff, but we subcontracted a lot. I was initially brought on as a mechanical engineering intern, but moved on to sales engineering when I found an interesting statistical relationship.

A large factor in quotes to clients was the underlying cost of air conditioning equipment in our niche, and a game of sales intel was often played between suppliers and competing contractors (like us) for a given job site. Favorites were picked, and we could get royally screwed in a quote, losing the sale to the end customer.

Fortunately, we had years of purchasing information. It turns out that, as varied as air conditioners are across brands and technical dimensions, when you have years of accounts' line items and unused quotes, you don't get a dimensionality issue. Since we operated in a clear-cut niche, this was especially true. We could forecast, within a margin of error of two percent, exactly what any of our suppliers would quote us (or our competitors!) for a job, long before they could turn it around. Huge strategic advantage.

This was the watershed moment for me, when I realized even basic multiple linear regression is a scarily powerful tool when used correctly.
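A toy version of that setup, with entirely made-up numbers (the real feature set and prices are not in the story), could look like this: model each supplier's quoted price as a linear function of a couple of job dimensions and fit it with ordinary least squares.

```python
import numpy as np

# Hypothetical historical line items: [cooling tons, number of units]
# for past jobs, and the supplier's quoted price for each.
X = np.array([
    [10.0, 2], [25.0, 4], [5.0, 1], [40.0, 6], [15.0, 3], [30.0, 5],
])
quotes = np.array([12_500, 31_000, 6_200, 49_500, 18_800, 37_200])

# Fit price ~ b0 + b1*tons + b2*units with ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, quotes, rcond=None)

# Predict what the supplier would quote for a new 20-ton, 3-unit job.
new_job = np.array([1.0, 20.0, 3.0])
predicted = float(new_job @ coef)
print(round(predicted, 2))
```

With years of quotes and a clear-cut niche, the design matrix stays well-conditioned, which is presumably why the "no dimensionality issue" observation held.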
 That is cool when you put it like that. Uncovering hidden relationships that are useful sounds romantic. Thanks for posting
 And incredibly boring. The usual estimate is that data science is 80% data wrangling: finding, collecting, and cleaning up data. The term "data scientist" replaced "data miner", because miners are looking for gold. Scientists are obsessed with finding out the nature of reality, gold or mud. They will do seriously boring stuff to set things up so that reality is revealed.
 It is only boring if you do it the boring way.

If the data cleaning follows standard patterns, you should already have scripts to offload that kind of work to. If not, then there are some incredibly interesting decisions hidden underneath. Like in text: Should character casing be preserved? What should be the unit of representation (word/character)? How should data be filtered: quality vs. quantity trade-off?

All of those are non-trivial questions which take a lot of thought to reason through. You are correct that the modelling is only a small part of a DS's day-to-day job. But the rest of it is boring in the same way that coding is boring. It doesn't involve grand epiphanies or discoveries, but there is joy similar to the daily grind of "code -> hit a bug / violate constraints -> follow the trace -> figure out a sensible solution" that a lot of software engineers love.
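As a small illustration of how much those decisions matter (hypothetical text, stdlib only): the same string yields very different "datasets" depending on whether casing is preserved and on the unit of representation.

```python
text = "The HVAC unit failed. THE hvac Unit FAILED."

# Decision 1: preserve casing or not? Lowercasing collapses the vocabulary.
cased_vocab = set(text.split())          # every casing variant is distinct
lower_vocab = set(text.lower().split())  # variants merge into one token

# Decision 2: word-level vs character-level representation.
word_tokens = text.lower().split()
char_tokens = list(text.lower())

print(len(cased_vocab), len(lower_vocab))
print(len(word_tokens), len(char_tokens))
```

Neither choice is "correct" in the abstract; which one you want depends on the downstream model, which is exactly why the cleaning step is not mindless.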
 Love the article. It inspired me to make a follow-up note on one of the memes: https://win-vector.com/2020/07/03/data-science-is-not-statis...
 From the article: " In statistics, bad results can be wrong, and being right for bad reasons isn’t acceptable. In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works."
 Is that a typo? It makes more sense as "good results can be wrong, and being right for bad reasons isn't acceptable"
 Off topic, but if someone uses "gut reaction" and "barf" in the same sentence, I'm tempted to think they really mean it literally...
 There is a big difference between ML practitioners and professional statisticians. The former are commonly unaware[1] of a rich set of statistical biases and of ways to tackle or mitigate them.
 Can someone elaborate on what is meant by 'estimating a parameter with a natural experiment'? This seems to be the key difference but I don't quite get how this would work. What would be your input data and how would the process differ from an ML approach?
 A natural experiment is an experiment (an A/B test, if you will) that occurs by chance rather than by conscious design. For example, two neighboring countries contemplate banning smoking in restaurants, but in one the bill fails with 49% of the vote, while in the other the ban goes through with 51%. It's not perfect, but you could argue that these countries can now be used to estimate the effect of a smoking ban on mortality and health in a way that is almost as good (but not quite as good) as a randomized clinical trial. You can't just compare two arbitrary countries with differential rates of smoking, because they might differ on so many other counts as well, and there is no pre-intervention data to serve as a baseline.

More broadly, ML does not really answer questions like "was this death caused by smoking, or rather by a hundred other things associated with smoking, like lower income and bad health insurance?", though it is excellent at predicting who is likely to die prematurely. So it's great for prediction, but not so great if you want to learn more about the underlying structure of a phenomenon.

Statisticians are sometimes surprised to see so much interest in machine learning given that its view of the world is not open to inspection (though there's https://github.com/interpretml/interpret I guess), so we as humans learn nothing. But it turns out that in many cases we really don't care all that much about the underlying mechanism, as long as we can make accurate predictions.
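One common way to turn that two-country setup into a parameter estimate (all numbers below hypothetical) is the difference-in-differences estimator: compute each country's before/after change in mortality, then difference the two changes, so that any time trend shared by both countries cancels out.

```python
# Hypothetical smoking-related mortality rates (deaths per 100k per year).
# Country A narrowly passed the smoking ban; country B narrowly rejected it.
ban_before, ban_after = 120.0, 104.0      # country with the ban
ctrl_before, ctrl_after = 118.0, 112.0    # country without the ban

# Each country's own before/after change (captures its time trend too).
ban_change = ban_after - ban_before       # -16.0
ctrl_change = ctrl_after - ctrl_before    # -6.0

# Difference-in-differences: the ban's estimated effect,
# net of whatever trend both countries shared.
did_estimate = ban_change - ctrl_change
print(did_estimate)  # -10.0
```

The identifying assumption is that, absent the ban, both countries would have followed parallel trends, which is exactly the kind of assumption a statistician argues about and a pure prediction model never surfaces.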
 A pox on both their houses.

I kinda want to ban this stuff for economies like ours. Think about it: we have many entrenched, inefficient, separate actors all engaging in nonsense alchemy. Surely this ruins the convergence to economic equilibrium.
 Well, if you look at machine learning from the point of view of data science, it's inevitable to be confused about its relation to statistics. But machine learning is a sub-field of AI, and statistical techniques are only one tool in its toolbox. Statistical techniques have dominated the field in recent years, but much work in machine learning has historically used, e.g., Probabilistic Graphical Models or symbolic logic as the "model" language. For example, one of the most famous and well-studied classes of machine learning algorithms, decision tree learners, comprises algorithms and systems that learn propositional logic models rather than statistical models.

Tom Mitchell defined machine learning as "the study of computer algorithms that improve automatically through experience"[1]. This definition does not rely on any particular technique, other of course than the use of a computer. Even the nature of "experience" doesn't necessarily mean "data" in the way that data scientists mean "data"; for example, "experience" could be collected by an agent interacting with its environment, etc.

Unfortunately, in very recent years, since the big success of Convolutional Neural Networks in image classification tasks in 2012, interest in machine learning has shifted from AI research to... well, let me quote the article:

>> Or you can start reading TESL and try to get some of that sweet, sweet machine learning dough from impressionable venture capitalists who hand out money like it's candy to anyone who can type a few lines of code.

I suppose that's ironic. But the truth is that "machine learning" has very much lost its meaning as industry and academia are flooded by thousands of new entrants who do not know its history and do not understand its goals.
In that context, it makes sense to have questions along the lines of "what is the difference between statistics and machine learning", which otherwise have a very obvious answer.

___________

The excerpt I quote is an informal definition. The Wikipedia article on machine learning gives Mitchell's more formal one: a program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
 Tom Mitchell's book is still a great book for understanding what Machine Learning is about.
 this reads like something 5 or 6 years old
 The author's pie chart showing data science to be 60% data manipulation is accurate. The biggest gap between good and bad data scientists is their comfort level with data wrangling. When interviewing candidates for data science positions, one of the simplest questions is to have them sort a 1 GB tab-delimited file.

1. Poor candidates will try to open the file in Excel.

2a. Marginal candidates will use R or Stata.

2b. Okay candidates will use a scripting language like Python.

3. Good candidates will use Unix sort.

To my knowledge, there are no university courses teaching the Unix toolchain; it remains very much a skill learned through practice.
 Not sure why you think a candidate who uses R is inferior to one who uses Python?Also, a really good candidate should use the right tool for the job, so if you expect them to use Unix sort you should somehow imply a situation where that is the best approach.
 > if you expect them to use Unix sort you should somehow imply a situation where that is the best approach.

I think the implied question is whether the interviewee is aware that naively loading a 1 GB text file could use up too much of the system's RAM. Unix sort is arguably the most memory-efficient of the four choices there. It depends on the amount of available resources (which was not specified), and some companies might be willing to casually let people use 100 GB machines, though.
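For the more-data-than-RAM scenario, the classic answer (and roughly what `sort(1)` does internally) is an external merge sort: sort bounded chunks in memory, spill them to temporary files, then stream a k-way merge. A minimal stdlib-only sketch, assuming newline-terminated records:

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, max_lines=100_000):
    """Sort a large text file line-by-line using bounded memory.

    Assumes every line, including the last, ends with a newline.
    """
    chunk_paths = []
    # Phase 1: read bounded chunks, sort each in memory, spill to disk.
    with open(in_path) as f:
        while True:
            chunk = [line for _, line in zip(range(max_lines), f)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)
    # Phase 2: stream a k-way merge of the sorted runs.
    runs = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for fh in runs:
            fh.close()
        for p in chunk_paths:
            os.remove(p)
```

Peak memory is bounded by `max_lines` lines plus one buffered line per run, regardless of file size, which is the property the interview question is really probing for.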
 That is a fair point. A good script would be to move towards this scenario of more data than RAM and see what the candidate comes up with.
 I would also add: if it's a one-time thing, I would just do it in Visual Studio Code or any other editor that doesn't die on a 1 GB file. And I have 12+ years of experience with bash and Unix tools, so it's not about a lack of knowledge or experience. There isn't anything magical about "sort" versus another tool; if there isn't a need for automation, they are equivalent.
 Demonstrating again that interviewing is mostly about confirming the prejudices of the interviewer.
 In my experience, the real meat of data manipulation/wrangling is in extracting statistical value from the data. If the data is noisy, what aggregations and what filters produce the best signal without smudging the underlying statistical properties?

While efficiency with the tools is a good skill to have, for me the deal breaker is how well the purpose of each data transformation is understood, and the care with which statistical value is extracted and tested.
 This may doom me as a merely okay candidate but a simple paging strategy in python trivializes the problem and has the added benefit of probably being the "okay-est" tool for whatever the next transform is as well.
 Why is chunking with Pandas a bad choice?
 It's not. Welcome to okay-land.
